PubTrends is an interactive scientific literature exploration tool that helps researchers analyze topics, visualize research trends, and discover related works.
Available online at: https://pubtrends.info/
With PubTrends, you can:
- Gain a concise overview of your research area.
- Explore popular trends and impactful publications.
- Discover new and promising research directions.
See an example of an analysis at: https://pubtrends.info/about.html
- Pubmed: ~40 mln papers and 450 mln citations
- Semantic Scholar: 170 mln papers and 600 mln citations
PubTrends is a Python / Kotlin + JavaScript web service with a PostgreSQL backend. It uses:
- Backend: Nginx + Flask + Gunicorn
- Task Queue: Celery + Redis
- Database: Postgres + Kotlin ORM + Psycopg2
- Data Analysis: Pandas, NumPy, Scikit-learn
- Semantic Search: Faiss + Postgres pgvector
- NLP: NLTK, spaCy, word2vec (Gensim), fastText, Sentence-Transformers, custom node2vec
- Visualization: Bokeh, Holoviews, Seaborn, Matplotlib
- Frontend: Bootstrap, jQuery, Cytoscape.js
- Deployment: Docker Compose
- Testing: JUnit + PyTest + Flake8 + TeamCity
See environment.yml for the full list of libraries used in the project.
Two Docker images are used for testing and deployment:
- biolabs/pubtrends - production
- biolabs/pubtrends-test - testing
We use Docker Hub to store built images.
- Copy and modify config.properties to ~/.pubtrends/config.properties.
  Ensure that file contains correct information about the database(s) (url, port, DB name, username and password).
- Conda environment pubtrends can be easily created for launching Jupyter Notebook and Web Service:
    conda env create -f env/environment.yml
    source activate pubtrends
- Build base Docker image biolabs/pubtrends and nested image biolabs/pubtrends-test for testing.
    docker build -f resources/docker/main/Dockerfile -t biolabs/pubtrends --platform linux/amd64 .
    docker build -f resources/docker/test/Dockerfile -t biolabs/pubtrends-test --platform linux/amd64 .
- Init Postgres database.
  - Launch Docker image:
      docker run --rm --name pubtrends-postgres \
        -e POSTGRES_USER=biolabs -e POSTGRES_PASSWORD=mysecretpassword \
        -v ~/postgres/:/var/lib/postgresql/data \
        -e PGDATA=/var/lib/postgresql/data/pgdata \
        -p 5432:5432 \
        -d postgres:17
  - Create a database (once a database is created use -d pubtrends argument):
      psql -h localhost -p 5432 -U biolabs
      ALTER ROLE biolabs WITH LOGIN;
      CREATE DATABASE pubtrends OWNER biolabs;
  - Configure memory params in ~/postgres/pgdata/postgresql.conf.
      # Memory settings
      effective_cache_size = 8GB  # ~ 50 to 75% (can be set precisely by referring to “top” free+cached)
      shared_buffers = 2GB        # ~ 1/4 – 1/3 total system RAM
      work_mem = 1GB              # For sorting, ordering etc
      max_connections = 4         # Total mem is work_mem * connections
      maintenance_work_mem = 1GB  # Memory for indexes, etc
      # Write performance
      checkpoint_timeout = 10min
      checkpoint_completion_target = 0.8
      synchronous_commit = off
    You can check current settings by command SHOW ALL; in psql console.
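The comments in the memory settings describe simple rules of thumb, and the arithmetic behind them can be checked with a short script. This is a minimal sketch, not part of PubTrends: the function names parse_size_gb, work_mem_total_gb and suggested_ranges_gb are hypothetical, and the work_mem * max_connections product is only the rough estimate quoted in the config comment (a single query may use work_mem more than once).

```python
def parse_size_gb(value: str) -> float:
    """Convert a PostgreSQL size string such as '8GB' or '512MB' to gigabytes."""
    units = {"kB": 1 / 1024 ** 2, "MB": 1 / 1024, "GB": 1.0, "TB": 1024.0}
    for suffix, factor in units.items():
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * factor
    raise ValueError(f"Unsupported size: {value!r}")


def work_mem_total_gb(work_mem: str, max_connections: int) -> float:
    """Rough estimate from the config comment: work_mem * connections."""
    return parse_size_gb(work_mem) * max_connections


def suggested_ranges_gb(total_ram_gb: float) -> dict:
    """Ranges taken from the config comments, as (low, high) in gigabytes."""
    return {
        # ~ 1/4 to 1/3 of total system RAM
        "shared_buffers": (total_ram_gb / 4, total_ram_gb / 3),
        # ~ 50 to 75% of total system RAM
        "effective_cache_size": (total_ram_gb * 0.5, total_ram_gb * 0.75),
    }


# Values from the settings above: work_mem = 1GB, max_connections = 4.
print(work_mem_total_gb("1GB", 4))
print(suggested_ranges_gb(16))
```

With the values shown in the settings, the work_mem estimate is 4 GB on top of the 2 GB of shared_buffers, so the sample configuration assumes a host with noticeably more than 6 GB of RAM available to Postgres.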
Use the following command to test and build the JAR package:
./gradlew clean test shadowJar
PostgreSQL should be configured and running.
Launch the crawler to download the Pubmed database and keep it up to date:
java -cp build/libs/pubtrends-dev.jar org.jetbrains.bio.pubtrends.pm.PubmedLoader --fillDatabase
Command line options supported:
- resetDatabase - clear current contents of the database (for development)
- fillDatabase - option to fill a database with Pubmed data. Can be interrupted at any moment.
- lastId - force downloading from given id from articles pack pubmed20n{lastId+1}.xml.
Updates - add the following line to crontab:
crontab -e
0 22 * * * java -cp pubtrends-<version>.jar org.jetbrains.bio.pubtrends.pm.PubmedLoader --fillDatabase | \
tee -a crontab_update.log
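The schedule 0 22 * * * means minute 0 of hour 22 on every day, so the loader runs once a day at 22:00 in the server's local time. As a small illustration only, the hypothetical helper next_daily_run below computes when such a daily job would fire next.

```python
from datetime import datetime, timedelta


def next_daily_run(now: datetime, hour: int = 22, minute: int = 0) -> datetime:
    """Return the next time strictly after `now` at which a daily job
    scheduled for hour:minute (as in '0 22 * * *') would start."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed, so the next run is tomorrow.
        candidate += timedelta(days=1)
    return candidate


print(next_daily_run(datetime(2024, 1, 1, 21, 0)))
print(next_daily_run(datetime(2024, 1, 1, 22, 30)))
```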
Download a sample or the full archive from Semantic Scholar. See Open Corpus.
The latest release can be found at: https://api.semanticscholar.org/api-docs/datasets#tag/Release-Data
curl https://api.semanticscholar.org/datasets/v1/release/
- Linux & Mac OS
    # Fail on errors
    set -euox pipefail
    DATE="2022-05-01"
    # Set to the path of the PubTrends JAR
    PUBTRENDS_JAR=
    wget https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/$DATE/manifest.txt
    echo "" > complete.txt
    N=$(cat manifest.txt | grep corpus | wc -l)
    cat manifest.txt | grep corpus | while read -r file; do
      if [[ -z $(grep "$file" complete.txt) ]]; then
        echo "Processing $file / $N"
        wget https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/$DATE/$file;
        java -cp $PUBTRENDS_JAR org.jetbrains.bio.pubtrends.ss.SemanticScholarLoader --fillDatabase $(pwd)/$file
        rm $file;
        echo "$file" >> complete.txt
      fi;
    done
    java -cp $PUBTRENDS_JAR org.jetbrains.bio.pubtrends.ss.SemanticScholarLoader --index --finish
- Windows 10 PowerShell
    $DATE = "2023-03-14"
    # Set to the path of the PubTrends JAR
    $PUBTRENDS_JAR =
    curl.exe -o .\manifest.txt https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/$DATE/manifest.txt
    echo "" > .\complete.txt
    foreach ($file in Get-Content .\manifest.txt) {
      $sel = Select-String -Path .\complete.txt -Pattern $file
      if ($sel -eq $null) {
        echo "Processing $file"
        curl.exe -o .\$file https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/$DATE/$file
        java -cp $PUBTRENDS_JAR org.jetbrains.bio.pubtrends.ss.SemanticScholarLoader --fillDatabase .\$file
        del ./$file
        echo $file >> .\complete.txt
      }
    }
    java -cp $PUBTRENDS_JAR org.jetbrains.bio.pubtrends.ss.SemanticScholarLoader --index --finish
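Both scripts follow the same pattern: read the manifest, skip every file already listed in complete.txt, process the rest one by one, and record each file only after it has been loaded. The Python sketch below shows that bookkeeping in isolation. It is an illustration under assumptions, not project code: pending_files and process_all are hypothetical names, the handle callback stands in for the download, SemanticScholarLoader call and cleanup, and completed files are matched exactly rather than by the substring match that grep performs.

```python
from pathlib import Path
from typing import Callable, List


def pending_files(manifest: Path, complete: Path, keyword: str = "corpus") -> List[str]:
    """List manifest entries containing `keyword` that are not yet recorded as done."""
    done = set(complete.read_text().split()) if complete.exists() else set()
    names = [line.strip() for line in manifest.read_text().splitlines()]
    return [name for name in names if keyword in name and name not in done]


def process_all(manifest: Path, complete: Path, handle: Callable[[str], None]) -> int:
    """Run `handle` on every pending file and record it once it succeeds.

    If `handle` raises, the file is not recorded, so the next run retries it.
    """
    processed = 0
    for name in pending_files(manifest, complete):
        handle(name)  # hypothetical: download, load into the database, delete
        with complete.open("a") as out:
            out.write(name + "\n")
        processed += 1
    return processed
```

Because a file is appended to the completion list only after its handler returns, an interrupted run can be restarted and continues from the first unprocessed file, provided the completion list is not reset in between.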
Please ensure that the embeddings Postgres DB with the vector extension is up and running:
docker run --rm --name pgvector -p 5430:5432 \
-m 32G \
-e POSTGRES_USER=biolabs -e POSTGRES_PASSWORD=mysecretpassword \
-e POSTGRES_DB=pubtrends \
-v ~/pgvector/:/var/lib/postgresql/data \
-e PGDATA=/var/lib/postgresql/data/pgdata \
-d pgvector/pgvector:pg17
Then you can update embeddings with the commands below. They compute embeddings, store them in the vector DB, and update the Faiss index for fast search.
docker build -f pysrc/preprocess/embeddings/Dockerfile -t update_embeddings --platform linux/amd64 .
docker run -v ~/.pubtrends:/config:ro \
-v ~/.pubtrends/logs:/logs \
-v ~/.pubtrends/sentence-transformers:/sentence-transformers \
-v ~/.pubtrends/nltk_data:/home/user/nltk_data \
-v ~/.pubtrends/faiss:/faiss \
-it update_embeddings /bin/bash
source activate pubtrends
export PYTHONPATH=$PYTHONPATH:$(pwd)
/bin/bash ~/pubtrends/scripts/nlp.sh
python pysrc/preprocess/update_embeddings.py
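To make the role of the stored embeddings concrete, the sketch below ranks a few vectors by cosine similarity to a query vector. It is a deliberately naive, pure-Python stand-in: it does not use Faiss or pgvector, the names cosine and top_k are hypothetical, and the tiny two-dimensional vectors are made up for illustration. An index such as Faiss exists to avoid this kind of full scan on a large collection.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float],
          embeddings: Dict[str, Sequence[float]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Brute-force search: score every stored vector and keep the k best."""
    scored = [(paper_id, cosine(query, vec)) for paper_id, vec in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up example vectors keyed by hypothetical paper ids.
papers = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print([paper_id for paper_id, _ in top_k([1.0, 0.1], papers)])
```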
Please ensure that you have a database configured, up and running.
Then launch the web service or use Jupyter Notebook for development.
- Create necessary folders with script scripts/init.sh and download prerequisites.
    bash scripts/init.sh
    bash scripts/nlp.sh
- Start Redis
    docker run -p 6379:6379 redis:7.4.2
- Configure conda environment pubtrends
    conda env create -f env/environment.yml
  Enable environment by command source activate pubtrends.
- Start Celery worker queue
    celery -A pysrc.celery.tasks worker -c 1 --loglevel=debug
- Start flask server at http://localhost:5000/
    python -m pysrc.app.pubtrends_app
- Start service for text embeddings based on either pretrained fasttext model or sentence-transformer at http://localhost:5001/
    python -m pysrc.endpoints.embeddings.fasttext.fasttext_app
  or
    python -m pysrc.endpoints.embeddings.sentence_transformer.sentence_transformer_app
- Optionally start semantic search service http://localhost:5002/
    python -m pysrc.semantic_search.semantic_search_app
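After starting the services, it can be handy to confirm that something is listening on each port mentioned above. The following is a minimal sketch using only the standard library. The names SERVICES, is_listening and check_services are hypothetical, and the check only tests that a TCP connection can be opened; it does not verify that the service behind the port is healthy.

```python
import socket
from typing import Dict

# Ports as listed in the steps above.
SERVICES = {
    "redis": 6379,
    "web app": 5000,
    "text embeddings": 5001,
    "semantic search (optional)": 5002,
}


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "localhost",
                   services: Dict[str, int] = SERVICES) -> Dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    return {name: is_listening(host, port) for name, port in services.items()}
```

Calling check_services() returns a dictionary such as one entry per service name with True or False, which is enough to spot a step that was skipped.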
Notebooks are located under the /notebooks folder. Please configure PYTHONPATH before using jupyter.
export PYTHONPATH=$PYTHONPATH:$(pwd)
jupyter notebook
- Start a Docker image with a Postgres environment for tests (Kotlin and Python development)
    docker run --rm --platform linux/amd64 --name pubtrends-test \
      --publish=5433:5432 --volume=$(pwd):/pubtrends -i -t biolabs/pubtrends-test
  NOTE: don't forget to stop the container afterward.
- Kotlin tests
    ./gradlew clean test
- Python tests with code style check for development (including integration with Kotlin DB writers)
    source activate pubtrends; pytest pysrc
- Python tests within Docker (ensure that ./build/libs/pubtrends-dev.jar file is present)
    docker run --rm --platform linux/amd64 --volume=$(pwd):/pubtrends -t biolabs/pubtrends-test /bin/bash -c \
      "/usr/lib/postgresql/17/bin/pg_ctl -D /home/user/postgres start; \
      cd /pubtrends; cp config.properties /home/user/.pubtrends/; \
      source activate pubtrends; pytest pysrc"
Deployment is done with docker-compose:
- Gunicorn serving main pubtrends Flask app
- Redis as a message proxy
- Celery workers queue
Please ensure that you have configured and prepared the database(s).
- Modify file config.properties with information about the database(s). File from the project folder is used in this case.
- Start Postgres server.
    docker run --rm --name pubtrends-postgres -p 5432:5432 \
      -m 32G \
      -e POSTGRES_USER=biolabs -e POSTGRES_PASSWORD=mysecretpassword \
      -e POSTGRES_DB=pubtrends \
      -v ~/postgres/:/var/lib/postgresql/data \
      -e PGDATA=/var/lib/postgresql/data/pgdata \
      -d postgres:17
  NOTE: stop Postgres docker image with timeout --time=300 to avoid DB recovery.
  NOTE2: for speed reasons we use materialized views, which are updated upon successful database update. In case of an emergency stop, the view should be refreshed manually to ensure sort by citations works correctly:
    psql -h localhost -p 5432 -U biolabs -d pubtrends
    refresh materialized view matview_pmcitations;
- Build ready for deployment package with script scripts/dist.sh.
    scripts/dist.sh build=build-number ga=google-analytics-id
- Launch pubtrends with docker-compose (one of the options)
    # start with local word2vec tf-idf tokens embeddings
    docker-compose -f docker-compose/word2vec.yml up --build
    # start with BioWord2Vec tokens embeddings
    docker-compose -f docker-compose/fasttext.yml up --build
    # start with Sentence Transformer for text embeddings
    docker-compose -f docker-compose/sentence-transformer.yml up --build
    # Start with Semantic Search based on Sentence Transformer
    docker-compose -f docker-compose/semantic-search.yml up --build
  Use these commands to stop compose build and check logs:
    # stop
    docker-compose -f docker-compose/semantic-search.yml down --remove-orphans
    # inspect logs
    docker-compose -f docker-compose/semantic-search.yml logs
  Pubtrends will be serving on port 5000.
Use a simple placeholder during maintenance.
cd pysrc/app; python -m http.server 5000
- Update CHANGES.md
- Update version in scripts/dist.sh
- Launch scripts/dist.sh, pubtrends-XXX.tar.gz will be created in the dist directory.
See AUTHORS.md for a list of authors and contributors.
- Shpynov, O. and Kapralov, N., 2021, August. PubTrends: a scientific literature explorer. In Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (pp. 1-1). https://doi.org/10.1145/3459930.3469501
