Skip to content

t0ster/simtxt

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

28 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

README

Features

  • Uses LSI model to search for similar sentences
  • Index is updated in background by separate worker
  • Index is updated on every new text addition
  • Robust caching on client side
  • GraphQL API
  • If tests pass docker images are built automatically by GithHub action and uploaded to Docker Hub

Limitations

  • The whole index is stored in RAM, so this will not work for big datasets
  • Currently only one instance of background worker is supported
  • Index is updated on every new text addition, this will not work for high-load (in real world scenario I would update it let’s say every 10 minutes)

Architecture and Stack

Client

  • TypeScript / React
  • Apollo Client for querying GraphQL and caching

Server

  • AIOHTTP as HTTP server
  • Tartiflette for GraphQL (schema can be found here)
  • Mongo as DB
  • Gensim for LSI model
  • Spacy to split text on sentences
  • MongoDB tailable cursor and capped collection for PubSub

Running

To run production build try the following. This will fetch pre-built images from docker hub.

docker-compose up

For development (this will build docker images from source, which may take some time):

docker-compose -f docker-compose.dev.yml up

Environment Variables

Server

SIMTXT_DB_URIURI for Mongo database (default: mongodb://localhost:27017)
SIMTXT_INDEX_MIN_SCOREIf similarity score is more than this value we are including sentence to the list of similar sentences (default: 0.1)
SIMTXT_LSI_NUM_TOPICNumber of topics for LSI model (default: 100)

If you have small dataset and not getting any similar sentences try to change SIMTXT_INDEX_MIN_SCORE to 0, for example SIMTXT_INDEX_MIN_SCORE=0 docker-compose up

Client

REACT_APP_GRAPHQL_URIURI for GraphQL API (default: http://localhost:8080/graphql)

Tests & Linting

Server

pytest
pylint simtxt
mypy simtxt

Client

yarn test --watchAll=false
npx eslint src