The softcite-extractions-oa dataset is a collection of ML-identified software mentions detected in about 24 million academic papers, all of them open access papers available circa 2024. The extractions were created from academic PDFs using the Softcite mention extraction toolchain, which is built on the Grobid model trained on the Softcite Annotations dataset v2. More details are available at the Softcite Org home page.
This work used JetStream 2 at Indiana University through allocation CIS220172 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.
The data files are hosted outside GitHub, on Zenodo at https://zenodo.org/records/15066399. This GitHub repo provides documentation and hosts the files used to convert the data from JSON into tabular Parquet format.
These extractions are the result of a machine learning model; they are probabilistic and will contain both false positives and false negatives. The model achieves F-scores around 0.8 (see https://doi.org/10.1145/3459637.3481936 for full details, and https://github.com/softcite/#papers for more, including the annotation scheme used in the underlying gold-standard dataset). See https://github.com/softcite/ for the training data, models, tools, and papers.
Please create Issues in this repository when you encounter problems in the dataset. We can't correct individual extractions manually, but any explanation you can give will help us improve the training data and the model. Please also share any transformations you have to apply to the dataset while working with it.
A paper can contain many mentions, each found in a snippet of full-text context. Each mention extracts the software name (raw and normalized), version number, creator, and URL, as well as any associated citation to the paper's reference list.
Each mention also has multiple purpose assessments about the relationship between the software and the paper: Was the software used in the research? Was it created in the course of the research? Was the software shared alongside this paper? These probabilistic assessments (ranging from 0 to 1) are made in two ways: using only the information from the specific mention, and using all the mentions within a single paper together (mention-level vs. document-level); thus each mention has six purpose assessments.
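For orientation, here is a minimal sketch (using the arrow and dplyr packages and the 5% subset file paths described in the next section) that tabulates the scope/purpose combinations; it assumes purpose_assessments.pdf.parquet stores one row per assessment, so each software_mention_id appears once per combination.

library(arrow)
library(dplyr)

# Assumes one row per assessment: two scopes x three purposes
# = six rows per software_mention_id.
purposes <- open_dataset("p05_five_percent_random_subset/purpose_assessments.pdf.parquet")

purposes |>
  count(scope, purpose) |>   # one count per scope/purpose combination
  collect()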
Parquet files are available from Zenodo at https://zenodo.org/records/15066399. There are three sub-folders:
- full_dataset
- p01_one_percent_random_subset
- p05_five_percent_random_subset
The random subsets are sampled at the paper level: each contains a random sample of papers together with all of the extractions from those papers. We created these to make prototyping analyses easier. Inside each folder are three files (a sketch after the package setup below shows how to inspect their columns):
- papers.parquet
- mentions.pdf.parquet
- purpose_assessments.pdf.parquet
For these examples, the 5% subset of the data is used. These examples require the tidyverse and arrow packages to run, but should otherwise work as-is.
library(tidyverse)
library(arrow)
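Before running the queries, it can help to check which columns each file provides. A minimal sketch, assuming arrow's Dataset interface (the $schema field of each opened dataset):

papers <- open_dataset("p05_five_percent_random_subset/papers.parquet")
mentions <- open_dataset("p05_five_percent_random_subset/mentions.pdf.parquet")
purposes <- open_dataset("p05_five_percent_random_subset/purpose_assessments.pdf.parquet")

# Print the Arrow schema (column names and types) of each table.
papers$schema
mentions$schema
purposes$schema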
- How many papers mention OpenStreetMap?
This example filters by software_normalized, as this field is less noisy than software_raw; a short sketch after the output below shows the raw variants it collapses.
> mentions <- open_dataset("p05_five_percent_random_subset/mentions.pdf.parquet")
> mentions |>
+ filter(software_normalized == "OpenStreetMap") |>
+ select(paper_id) |>
+ distinct() |>
+ count() |>
+ collect()
# A tibble: 1 × 1
n
<int>
1 376
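The follow-up sketch mentioned above lists the distinct software_raw strings that were normalized to "OpenStreetMap", which is one way to see the noise that normalization removes:

mentions <- open_dataset("p05_five_percent_random_subset/mentions.pdf.parquet")

# Distinct raw spellings that the pipeline normalized to "OpenStreetMap".
mentions |>
  filter(software_normalized == "OpenStreetMap") |>
  select(software_raw) |>
  distinct() |>
  collect()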
- How did the number of papers referencing STATA each year change from 2000 to 2020?
By joining the Mentions table with Papers, we can compute statistics requiring access to paper metadata. Analyses like these are why we include fields such as paper_id in Mentions, even though it denormalizes the tables.
> papers <- open_dataset("p05_five_percent_random_subset/papers.parquet")
> mentions <- open_dataset("p05_five_percent_random_subset/mentions.pdf.parquet")
>
> mentions |>
+ filter(software_normalized == "STATA") |>
+ select(paper_id) |>
+ distinct() |>
+ inner_join(papers, by = c("paper_id")) |>
+ filter(published_year >= 2000, published_year <= 2020) |>
+ count(published_year) |>
+ arrange(published_year) |>
+ collect()
# A tibble: 21 × 2
published_year n
<int> <int>
1 2000 11
2 2001 14
3 2002 20
4 2003 29
5 2004 51
6 2005 32
7 2006 42
8 2007 49
9 2008 77
10 2009 87
# ℹ 11 more rows
# ℹ Use `print(n = ...)` to see more rows
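The same join pattern can be extended to report these counts as a share of all papers published each year. This sketch assumes papers.parquet lists every paper in the subset, whether or not it has any mentions:

papers <- open_dataset("p05_five_percent_random_subset/papers.parquet")
mentions <- open_dataset("p05_five_percent_random_subset/mentions.pdf.parquet")

# Papers per year with at least one STATA mention.
stata_by_year <- mentions |>
  filter(software_normalized == "STATA") |>
  select(paper_id) |>
  distinct() |>
  inner_join(papers, by = c("paper_id")) |>
  filter(published_year >= 2000, published_year <= 2020) |>
  count(published_year) |>
  collect() |>
  rename(n_stata = n)

# All papers per year in the same window (assumes full paper coverage).
total_by_year <- papers |>
  filter(published_year >= 2000, published_year <= 2020) |>
  count(published_year) |>
  collect() |>
  rename(n_total = n)

# Share of each year's papers that mention STATA.
stata_by_year |>
  inner_join(total_by_year, by = "published_year") |>
  mutate(stata_share = n_stata / n_total) |>
  arrange(published_year)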
- What are the most popular software packages used since 2020, by number of distinct papers?
Answering this question requires joining all three tables.
Especially with the full dataset, we generally recommend using select statements before and after joins to reduce memory overhead.
Here we use the PurposeAssessments table to evaluate whether software was "used" in a paper, keeping assessments with a certainty_score above 0.5; a sketch after the output below shows one way to examine the score distribution before choosing a cut-off.
The "document" scope is appropriate here as we're interested in whether the software was used by the paper, not whether particular mentions of the software indicate this.
> papers <- open_dataset("p05_five_percent_random_subset/papers.parquet")
> mentions <- open_dataset("p05_five_percent_random_subset/mentions.pdf.parquet")
> purposes <- open_dataset("p05_five_percent_random_subset/purpose_assessments.pdf.parquet")
>
> papers |>
+ filter(published_year >= 2020) |>
+ select(paper_id) |>
+ inner_join(mentions, by=c("paper_id")) |>
+ select(software_mention_id, software_normalized) |>
+ inner_join(purposes, by=c("software_mention_id")) |>
+ filter(scope=="document", purpose=="used", certainty_score > 0.5) |>
+ select(paper_id, software_normalized) |>
+ distinct() |>
+ count(software_normalized) |>
+ arrange(desc(n)) |>
+ collect()
# A tibble: 79,730 × 2
software_normalized n
<chr> <int>
1 SPSS 22596
2 GraphPad Prism 8080
3 Excel 6131
4 ImageJ 5477
5 MATLAB 5117
6 SAS 3480
7 SPSS Statistics 3065
8 Stata 2545
9 script 2247
10 Matlab 2225
# ℹ 79,720 more rows
# ℹ Use `print(n = ...)` to see more rows
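The certainty_score > 0.5 cut-off above is a judgment call. As noted earlier, one way to examine the distribution of document-level "used" scores before choosing a threshold is to bin them; a sketch, binning to one decimal place:

purposes <- open_dataset("p05_five_percent_random_subset/purpose_assessments.pdf.parquet")

# Count document-level "used" assessments by score, binned to 0.1 steps.
purposes |>
  filter(scope == "document", purpose == "used") |>
  mutate(score_bin = floor(certainty_score * 10) / 10) |>
  count(score_bin) |>
  arrange(score_bin) |>
  collect()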
The Grobid extraction pipeline worked with multiple sources for each paper, including PDFs and XML sources from publishers, such as JATS and TEI XML. This produced JSON files, which were then processed into tabular Parquet format.
The tabular dataset includes only extractions from PDF sources, to avoid the complexity of handling multiple source types for a single paper. This decision was made easier by the fact that a PDF was available for every paper, while other source types were available only for smaller subsets.
Details of the full JSON data, covering all source document types, and of how those files were read and mapped to tabular data, are available in Extracting Tables.