Datasets available

🍻 Datasets Available in BEIR

The BEIR benchmark (originally) contains 18 retrieval datasets from diverse domains and tasks. The benchmark focuses on zero-shot evaluation (i.e., no training data available) of lexical and neural retrievers across these datasets. The dataset links and statistics are listed below:

Four private datasets: Send me an email at [email protected] for direct access to these datasets via a private Google Drive link. Please make sure you hold the necessary licenses before you send the email.

🍻 Download a BEIR dataset

To download one of the preprocessed datasets into your current directory and load it, run the following:

from beir import util
from beir.datasets.data_loader import GenericDataLoader

dataset = "scifact"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

This will download the scifact dataset under the datasets directory.
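To get a feel for what `load` returns, the sketch below computes the kind of statistics reported on this page (number of queries, corpus size, relevant documents per query). It assumes the usual shapes: `corpus` maps document IDs to dicts with `title` and `text`, `queries` maps query IDs to strings, and `qrels` maps query IDs to `{doc_id: relevance}`. The helper name `dataset_stats` and the toy data are made up for illustration, so the snippet runs without downloading anything. Counting only positive judgments is also an assumption; this page does not state how its Rel D/Q figures were computed.

```python
def dataset_stats(corpus, queries, qrels):
    """Return (num_queries, num_docs, avg_relevant_docs_per_query)."""
    # Assumption: a document counts as relevant when its score is > 0.
    relevant_counts = [
        sum(1 for score in judged.values() if score > 0)
        for judged in qrels.values()
    ]
    avg_rel = sum(relevant_counts) / len(relevant_counts) if relevant_counts else 0.0
    return len(queries), len(corpus), avg_rel


# Toy stand-ins with the same shapes as the loaded corpus, queries and qrels.
corpus = {
    "d1": {"title": "Doc one", "text": "First toy document."},
    "d2": {"title": "Doc two", "text": "Second toy document."},
    "d3": {"title": "Doc three", "text": "Third toy document."},
}
queries = {"q1": "first toy query", "q2": "second toy query"}
qrels = {"q1": {"d1": 1, "d2": 1}, "q2": {"d3": 1, "d1": 0}}

num_queries, num_docs, avg_rel = dataset_stats(corpus, queries, qrels)
print(num_queries, num_docs, avg_rel)
```

With real data, pass the three objects returned by `GenericDataLoader(...).load(split="test")` instead of the toy dictionaries.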

🍻 Available Datasets

Command to generate an md5 hash in a terminal: md5sum filename.zip (on macOS: md5 filename.zip).
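If no suitable command-line tool is at hand, the same check can be done in Python with the standard library. This is a minimal sketch: the function names `md5_of_file` and `matches_md5` are illustrative and not part of BEIR.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for multi-gigabyte archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_md5):
    """Return True if the file's MD5 equals the expected hex string."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_md5("datasets/scifact.zip", "5f7d1de60b170fc8027bb7898e2efca1")` compares a downloaded archive against the value listed for SciFact; the file path here is an assumption about where the zip was saved.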

Dataset	Website	BEIR-Name	Type	Queries	Corpus	Rel D/Q	Download	md5
MSMARCO	Homepage	`msmarco`	`train` `dev` `test`	6,980	8.84M	1.1	Link	`444067daf65d982533ea17ebd59501e4`
TREC-COVID	Homepage	`trec-covid`	`test`	50	171K	493.5	Link	`ce62140cb23feb9becf6270d0d1fe6d1`
NFCorpus	Homepage	`nfcorpus`	`train` `dev` `test`	323	3.6K	38.2	Link	`a89dba18a62ef92f7d323ec890a0d38d`
BioASQ	Homepage	`bioasq`	`train` `test`	500	14.91M	8.05	No	How to Reproduce?
NQ	Homepage	`nq`	`train` `test`	3,452	2.68M	1.2	Link	`d4d3d2e48787a744b6f6e691ff534307`
HotpotQA	Homepage	`hotpotqa`	`train` `dev` `test`	7,405	5.23M	2.0	Link	`f412724f78b0d91183a0e86805e16114`
FiQA-2018	Homepage	`fiqa`	`train` `dev` `test`	648	57K	2.6	Link	`17918ed23cd04fb15047f73e6c3bd9d9`
Signal-1M(RT)	Homepage	`signal1m`	`test`	97	2.86M	19.6	No	How to Reproduce?
TREC-NEWS	Homepage	`trec-news`	`test`	57	595K	19.6	No	How to Reproduce?
ArguAna	Homepage	`arguana`	`test`	1,406	8.67K	1.0	Link	`8ad3e3c2a5867cdced806d6503f29b99`
Touche-2020	Homepage	`webis-touche2020`	`test`	49	382K	19.0	Link	`46f650ba5a527fc69e0a6521c5a23563`
CQADupstack	Homepage	`cqadupstack`	`test`	13,145	457K	1.4	Link	`4e41456d7df8ee7760a7f866133bda78`
Quora	Homepage	`quora`	`dev` `test`	10,000	523K	1.6	Link	`18fb154900ba42a600f84b839c173167`
DBPedia	Homepage	`dbpedia-entity`	`dev` `test`	400	4.63M	38.2	Link	`c2a39eb420a3164af735795df012ac2c`
SCIDOCS	Homepage	`scidocs`	`test`	1,000	25K	4.9	Link	`38121350fc3a4d2f48850f6aff52e4a9`
FEVER	Homepage	`fever`	`train` `dev` `test`	6,666	5.42M	1.2	Link	`5a818580227bfb4b35bb6fa46d9b6c03`
Climate-FEVER	Homepage	`climate-fever`	`test`	1,535	5.42M	3.0	Link	`8b66f0a9126c521bae2bde127b4dc99d`
SciFact	Homepage	`scifact`	`train` `test`	300	5K	1.1	Link	`5f7d1de60b170fc8027bb7898e2efca1`
Robust04	Homepage	`robust04`	`test`	249	528K	69.9	No	How to Reproduce?
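The BEIR-Name column is what goes into the download URL. The sketch below builds that URL from a name and refuses the four datasets whose download column says "No". It assumes every publicly hosted dataset follows the same URL pattern shown for scifact on this page, which is not stated explicitly here; `beir_download_url` and `NOT_PUBLICLY_HOSTED` are illustrative names, not part of the BEIR library.

```python
# URL pattern taken from the scifact example on this page (assumed to hold
# for the other publicly hosted datasets as well).
BASE_URL = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip"

# BEIR names of the rows whose download column says "No" in the table.
NOT_PUBLICLY_HOSTED = {"bioasq", "signal1m", "trec-news", "robust04"}


def beir_download_url(beir_name):
    """Return the download URL for a BEIR name, or raise if it is not hosted."""
    if beir_name in NOT_PUBLICLY_HOSTED:
        raise ValueError(
            f"'{beir_name}' has no public download link; it must be reproduced "
            "or requested separately."
        )
    return BASE_URL.format(beir_name)


print(beir_download_url("nfcorpus"))
```

The returned string can be passed to `util.download_and_unzip` in place of the hand-formatted `url`.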

Disclaimer

Similar to TensorFlow Datasets or Hugging Face's datasets library, we just downloaded and prepared public datasets. We only distribute these datasets in a specific format, but we do not vouch for their quality or fairness, or claim that you have a license to use them. It remains your responsibility as a user to determine whether you have permission to use a dataset under its license and to cite the right owner of the dataset.

If you're a dataset owner and wish to update any part of it, or do not want your dataset to be included in this library, feel free to post an issue here or make a pull request!

If you're a dataset owner and wish to include your dataset or model in this library, feel free to post an issue here or make a pull request!