- Uses LSI model to search for similar sentences
- Index is updated in background by separate worker
- Index is updated on every new text addition
- Robust caching on client side
- GraphQL API
- If tests pass docker images are built automatically by GithHub action and uploaded to Docker Hub
- The whole index is stored in RAM, so this will not work for big datasets
- Currently only one instance of background worker is supported
- Index is updated on every new text addition, this will not work for high-load (in real world scenario I would update it let’s say every 10 minutes)
- TypeScript / React
- Apollo Client for querying GraphQL and caching
- AIOHTTP as HTTP server
- Tartiflette for GraphQL (schema can be found here)
- Mongo as DB
- Gensim for LSI model
- Spacy to split text on sentences
- MongoDB tailable cursor and capped collection for PubSub
To run production build try the following. This will fetch pre-built images from docker hub.
docker-compose up
For development (this will build docker images from source, which may take some time):
docker-compose -f docker-compose.dev.yml up
SIMTXT_DB_URI | URI for Mongo database (default: mongodb://localhost:27017 ) |
SIMTXT_INDEX_MIN_SCORE | If similarity score is more than this value we are including sentence to the list of similar sentences (default: 0.1 ) |
SIMTXT_LSI_NUM_TOPIC | Number of topics for LSI model (default: 100 ) |
If you have small dataset and not getting any similar sentences try to change SIMTXT_INDEX_MIN_SCORE
to 0
, for example SIMTXT_INDEX_MIN_SCORE=0 docker-compose up
REACT_APP_GRAPHQL_URI | URI for GraphQL API (default: http://localhost:8080/graphql ) |
pytest
pylint simtxt
mypy simtxt
yarn test --watchAll=false
npx eslint src