Find dataset references in scientific papers.
This tool extracts text from papers (given a DOI) and identifies references to datasets on multiple archives: DANDI, OpenNeuro, Figshare, and PhysioNet.
pip install -r requirements.txt

python find_reuse.py 10.1038/s41593-024-01783-4
python find_reuse.py --file dois.txt

When processing multiple DOIs, a progress bar shows the current status.
Automatically discover papers that reference datasets by searching PubMed:
# Discover papers (default: 100 results)
python find_reuse.py --discover
# Discover more papers
python find_reuse.py --discover --max-results 500
# Save results to file
python find_reuse.py --discover -o results.json

Discovery mode:
- Searches PubMed for papers mentioning DANDI, OpenNeuro, Figshare, or PhysioNet
- Retrieves full text for each paper
- Extracts specific dataset IDs
- Follows citations to data descriptor papers (Scientific Data, Data MDPI) to find indirect references
- Returns structured JSON with all findings
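
The PubMed search step could be sketched as follows. This is a minimal illustration, not the code in find_reuse.py: it assumes the search goes through the public NCBI E-utilities esearch endpoint, and the helper name build_pubmed_search_url, the ARCHIVE_TERMS list, and the exact query term are hypothetical. The sketch only builds the request URL and makes no network call.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint. Using it here is an assumption about how
# discovery might work, not a description of find_reuse.py.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Archive names taken from the discovery-mode description.
ARCHIVE_TERMS = ["DANDI", "OpenNeuro", "Figshare", "PhysioNet"]


def build_pubmed_search_url(max_results=100):
    """Build an esearch URL matching papers that mention any archive name."""
    # Quote each name and OR them together (hypothetical query shape).
    term = " OR ".join(f'"{name}"' for name in ARCHIVE_TERMS)
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": max_results,  # plays the role of --max-results
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_pubmed_search_url(max_results=500))
```

The returned PubMed IDs would still need to be resolved to DOIs and full text before any dataset IDs can be extracted.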
By default, the tool follows citations to data descriptor papers (Scientific Data, Data MDPI journals) to find datasets that are referenced indirectly. This can be disabled:
python find_reuse.py --no-follow-references 10.1038/s41593-024-01783-4

python find_reuse.py -v 10.1038/s41593-024-01783-4

Output is always JSON:
{
  "doi": "10.1038/s41593-024-01783-4",
  "archives": {
    "DANDI Archive": {
      "dataset_ids": ["000130"],
      "matches": [
        {
          "id": "000130",
          "pattern_type": "doi",
          "matched_string": "10.48324/dandi.000130"
        }
      ]
    }
  },
  "source": "europe_pmc+crossref",
  "error": null
}

DANDI Archive:
- 10.48324/dandi.{id} - DANDI DOI format
- dandiarchive.org/dandiset/{id} - URL format
- gui.dandiarchive.org/#/dandiset/{id} - GUI URL format
- DANDI: {id} or DANDI {id} - Text mentions
- dandiset/{id} - Generic dandiset reference

OpenNeuro:
- 10.18112/openneuro.{id} - OpenNeuro DOI format
- openneuro.org/datasets/{id} - URL format
- OpenNeuro: {id} or OpenNeuro {id} - Text mentions
- ds{6 digits} - Dataset ID pattern

Figshare:
- 10.6084/m9.figshare.{id} - Figshare DOI format (with optional version)
- figshare.com/articles/{name}/{id} - URL format
- figshare.com/ndownloader/files/{id} - Download URL format

PhysioNet:
- 10.13026/{id} - PhysioNet DOI format (e.g., 10.13026/C2KX0P)
- physionet.org/content/{id} - URL format
- physionet.org/physiobank/database/{id} - PhysioBank URL format
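
To make the pattern formats above concrete, here is a minimal sketch of regex-based extraction for the DOI and URL formats of two archives. The regexes, the PATTERNS table, and the find_dataset_ids helper are illustrative assumptions; the actual patterns in find_reuse.py may differ. In particular, the sketch assumes DANDI IDs are six digits (as in 000130) and lowercases matched IDs.

```python
import re

# Illustrative regexes for the documented DOI and URL formats of two archives.
# These are assumptions, not the patterns used by find_reuse.py.
PATTERNS = {
    "DANDI Archive": [
        ("doi", re.compile(r"10\.48324/dandi\.(\d{6})", re.IGNORECASE)),
        # Also covers the gui.dandiarchive.org/#/dandiset/{id} form.
        ("url", re.compile(r"dandiarchive\.org/(?:#/)?dandiset/(\d{6})", re.IGNORECASE)),
    ],
    "OpenNeuro": [
        ("doi", re.compile(r"10\.18112/openneuro\.(ds\d{6})", re.IGNORECASE)),
        ("url", re.compile(r"openneuro\.org/datasets/(ds\d{6})", re.IGNORECASE)),
    ],
}


def find_dataset_ids(text):
    """Return {archive: {"dataset_ids": [...], "matches": [...]}} for text."""
    found = {}
    for archive, patterns in PATTERNS.items():
        matches = []
        for pattern_type, regex in patterns:
            for m in regex.finditer(text):
                matches.append({
                    "id": m.group(1).lower(),
                    "pattern_type": pattern_type,
                    "matched_string": m.group(0),
                })
        if matches:
            # Deduplicate IDs that were matched by more than one pattern.
            ids = sorted({m["id"] for m in matches})
            found[archive] = {"dataset_ids": ids, "matches": matches}
    return found


if __name__ == "__main__":
    sample = "Data are available at https://doi.org/10.48324/dandi.000130."
    print(find_dataset_ids(sample))
```

The returned structure mirrors the "archives" object in the JSON output, so each match records which pattern type produced it.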
The tool queries multiple sources to maximize coverage:
- Europe PMC - Full text for open access articles (requires PMCID)
- NCBI PubMed Central - Full text for open access articles
- CrossRef - References section (always checked, often contains dataset DOIs)
- Publisher HTML - Direct scraping of open access article pages (Nature, Springer, Cell, etc.)
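
Since CrossRef reference lists often contain dataset DOIs, the check on that source might look like the sketch below. It assumes the JSON shape returned by the CrossRef REST API for a work (a message object whose reference list has entries with an optional DOI field) and matches against the DOI prefixes documented for each archive. The helper name archive_dois_from_crossref and the prefix table are hypothetical, and no network request is made.

```python
# DOI prefixes derived from the documented DOI formats for each archive.
ARCHIVE_DOI_PREFIXES = {
    "DANDI Archive": "10.48324/dandi.",
    "OpenNeuro": "10.18112/openneuro.",
    "Figshare": "10.6084/m9.figshare.",
    "PhysioNet": "10.13026/",
}


def archive_dois_from_crossref(message):
    """Group reference DOIs from a CrossRef 'message' dict by archive."""
    hits = {}
    for ref in message.get("reference", []):
        # Not every reference has a DOI; DOIs are case-insensitive,
        # so compare in lowercase.
        doi = (ref.get("DOI") or "").lower()
        for archive, prefix in ARCHIVE_DOI_PREFIXES.items():
            if doi.startswith(prefix):
                hits.setdefault(archive, []).append(doi)
    return hits


if __name__ == "__main__":
    # Hand-written example payload, not real API output.
    message = {
        "reference": [
            {"key": "ref1", "DOI": "10.48324/dandi.000130"},
            {"key": "ref2", "unstructured": "A reference without a DOI"},
        ]
    }
    print(archive_dois_from_crossref(message))
```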
The tool can also be used programmatically:
from find_reuse import ArchiveFinder
finder = ArchiveFinder(verbose=True)
result = finder.find_references("10.1038/s41593-024-01783-4")
print(result['archives']) # {'DANDI Archive': {'dataset_ids': ['000130'], ...}}
print(result['source']) # 'europe_pmc+crossref'
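
A result can then be post-processed like any other dictionary. The sketch below assumes the result has the same shape as the JSON output shown earlier; the summarize helper is hypothetical and not part of find_reuse, and the example dictionary is copied from that sample output rather than produced by a live run.

```python
def summarize(result):
    """Return sorted (archive, dataset_id) pairs from a result dict."""
    # Treat a result with a non-null "error" field as having no findings.
    if result.get("error"):
        return []
    pairs = []
    for archive, info in result.get("archives", {}).items():
        for dataset_id in info.get("dataset_ids", []):
            pairs.append((archive, dataset_id))
    return sorted(pairs)


# Example input copied from the documented JSON output.
example = {
    "doi": "10.1038/s41593-024-01783-4",
    "archives": {
        "DANDI Archive": {
            "dataset_ids": ["000130"],
            "matches": [
                {
                    "id": "000130",
                    "pattern_type": "doi",
                    "matched_string": "10.48324/dandi.000130",
                }
            ],
        }
    },
    "source": "europe_pmc+crossref",
    "error": None,
}

if __name__ == "__main__":
    for archive, dataset_id in summarize(example):
        print(f"{archive}: {dataset_id}")
```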