checkmate17/Daily-paper-using-OpenAI

ArXplorer

Recommender of daily papers from arXiv, customized with your Prompt. Minimal, hackable, no-boilerplate.


"I like innovative papers in large foundation models, multimodal methods, symbolic reasoning and automation."

What's up

arXiv is overwhelming: the cs.AI section alone gets ~300 new papers a day, and sifting through them is daunting. This project scrapes the daily feed from https://arxiv.org/list/{namespace}/new, collects author data, and performs a two-stage ranking:

  • Coarse Ranking: Use the authors' impact index and a CPU-friendly embedding model (chosen from the MTEB leaderboard 🤗) to reduce the candidate pool to ~20 papers via weighted Copeland ranking.
  • Reranking: Optionally use gpt-4 to choose the top k and write a summary (cheap, since it is just one call per day).
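
For readers unfamiliar with Copeland ranking, here is a minimal sketch of a weighted variant. It is not the project's actual code: the criterion names (`similarity`, `author_impact`), the weights, and the function name `weighted_copeland` are all assumptions chosen for illustration. Each pair of papers is compared criterion by criterion, the weighted majority decides the pairwise winner, and a paper's score is its pairwise wins minus its losses.

```python
from itertools import combinations


def weighted_copeland(scores, weights):
    """Rank candidates by Copeland score (pairwise wins minus losses).

    scores:  {candidate: {criterion: value}}, higher values are better.
    weights: {criterion: weight}, hypothetical weights for illustration.
    """
    copeland = {candidate: 0 for candidate in scores}
    for a, b in combinations(scores, 2):
        # Each criterion "votes" with its weight for the candidate it prefers.
        margin = 0.0
        for criterion, weight in weights.items():
            if scores[a][criterion] > scores[b][criterion]:
                margin += weight
            elif scores[a][criterion] < scores[b][criterion]:
                margin -= weight
        # The weighted majority decides the pairwise contest; ties change nothing.
        if margin > 0:
            copeland[a] += 1
            copeland[b] -= 1
        elif margin < 0:
            copeland[b] += 1
            copeland[a] -= 1
    return sorted(copeland, key=copeland.get, reverse=True)


# Made-up example data, not real papers or real scores.
papers = {
    "paper_a": {"similarity": 0.82, "author_impact": 10},
    "paper_b": {"similarity": 0.75, "author_impact": 40},
    "paper_c": {"similarity": 0.60, "author_impact": 5},
}
weights = {"similarity": 0.7, "author_impact": 0.3}

ranking = weighted_copeland(papers, weights)
coarse_pool = ranking[:2]  # keep the top coarse_k candidates
```

Because only the pairwise order matters, criteria on very different scales (cosine similarity vs. a citation-based index) can be combined without normalizing them first.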

Quick Start

Prepare environment

conda create -n "arxplorer" python=3.11
conda activate arxplorer
pip install -r requirements.txt

(Recommended) Use an OpenAI key for summarization and better ranking.

echo 'OPENAI_API_KEY=your_api_key_here' >> .env

GO!

python run.py

Customization

You can describe your preferences or interests by adding an INSTRUCTION line to .env:

echo 'INSTRUCTION="I like ..."' >> .env
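
To show what those `.env` lines look like once read back, here is a small standard-library sketch. The helper name `read_env` is hypothetical, and the project may well use a dedicated loader instead; this only illustrates the `KEY=value` format and the handling of the quotes around the INSTRUCTION value.

```python
def read_env(text):
    """Parse KEY=value lines into a dict (hypothetical helper, stdlib only)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything without a key/value pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        values[key.strip()] = value
    return values


# The two lines written by the echo commands in this README.
sample = 'OPENAI_API_KEY=your_api_key_here\nINSTRUCTION="I like ..."\n'
env = read_env(sample)
instruction = env["INSTRUCTION"]
```

Note that `>>` appends, so running the echo command twice leaves two INSTRUCTION lines in `.env`. In this sketch the last one wins; editing the file directly avoids the ambiguity.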

Use namespace to specify which arXiv section to scrape (make sure https://arxiv.org/list/{namespace}/new can be visited). Use top_k to set the final number of papers you want to see. coarse_k is the intermediate number kept after coarse ranking and should always be larger than top_k.

python run.py --namespace="cs.AI" --top_k=10 --coarse_k=20
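
The relationship between these options can be sketched as follows. The function names `listing_url` and `check_ks` are hypothetical and not taken from run.py; the sketch only encodes the URL pattern and the coarse_k > top_k constraint stated above.

```python
def listing_url(namespace):
    """Build the arXiv listing URL for a section such as "cs.AI"."""
    return f"https://arxiv.org/list/{namespace}/new"


def check_ks(top_k, coarse_k):
    """Validate the two pool sizes (hypothetical helper)."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    if coarse_k <= top_k:
        # Reranking needs more candidates than it finally returns.
        raise ValueError("coarse_k must be larger than top_k")
    return top_k, coarse_k


url = listing_url("cs.AI")
top_k, coarse_k = check_ks(top_k=10, coarse_k=20)
```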

fast_mode is set to True by default, which ignores author-related features. Collecting author data reliably (using scholarly and free-proxy) can be painfully slow at first, and gets faster as authors_cache.db builds up its cache. If you are deploying on a server or can let it run for ~1hr, turn fast mode off:

python run.py --fast_mode=False
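
Why the run gets faster over time can be illustrated with a small cache sketch. Everything here is an assumption: the table layout, the helper names, and the use of SQLite are guesses for illustration, not the actual contents of authors_cache.db. The idea is simply that each author is looked up through the slow path once and then served from the local database.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a cache database; the schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS authors (name TEXT PRIMARY KEY, impact REAL)"
    )
    return conn


def cached_impact(conn, name, fetch):
    """Return the author's impact index, calling fetch() only on a cache miss."""
    row = conn.execute(
        "SELECT impact FROM authors WHERE name = ?", (name,)
    ).fetchone()
    if row is not None:
        return row[0]
    impact = fetch(name)  # the slow path, e.g. a remote scholar lookup
    conn.execute(
        "INSERT OR REPLACE INTO authors (name, impact) VALUES (?, ?)",
        (name, impact),
    )
    conn.commit()
    return impact


calls = []


def slow_lookup(name):
    """Stand-in for the real lookup; returns a made-up value."""
    calls.append(name)
    return 42.0


conn = open_cache()
first = cached_impact(conn, "Ada Lovelace", slow_lookup)
second = cached_impact(conn, "Ada Lovelace", slow_lookup)  # served from cache
```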

Disclaimer

This ranker is soooo biased, and I'm pretty sure some cool papers get overlooked. But I find it helpful for catching some of the papers I would regret missing.

Next Step

I'll create a Twitter bot soon to serve this project as a daily feed. Feel free to contact me (magician1206 on Discord) with suggestions, or to contribute more features, faster pipelines, etc. :)
