
Softcite Extractions from Open Access Literature

The softcite-extractions-oa dataset is a collection of ML-identified mentions of software detected in roughly 24 million academic papers, covering the open access literature available circa 2024. The extractions were created from academic PDFs using the Softcite mention extraction toolchain, which is built on the Grobid model trained on the Softcite Annotations dataset v2. More details are available at the Softcite Org home page.

This work used JetStream 2 at Indiana through allocation CIS220172 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.

The data

The data files are hosted outside GitHub, on Zenodo at https://zenodo.org/records/15066399. This GitHub repo provides documentation and hosts the files used to convert the data from JSON into tabular Parquet format.

Reporting extraction errors/omissions

These extractions are the result of a machine learning model; they are probabilistic and will contain both false positives and false negatives. The model achieves F-scores around 0.8 (see https://doi.org/10.1145/3459637.3481936 for full details, and https://github.com/softcite/#papers for more, including the annotation scheme used in the underlying gold-standard dataset). See https://github.com/softcite/ for the training data, models, tools, and papers.

Please create Issues in this repository when you encounter problems in the dataset. We cannot correct individual extractions manually, but any explanation you can give will help us improve the training data and, in turn, the model. Please also share any transformations you find yourself applying to the dataset in your own work.

The data model

A paper can contain many mentions, each of which was found in a full-text snippet of context. From each mention we extract the (raw and normalized) software name, version number, creator, and URL, as well as any associated citation to the paper's reference list.

Each mention has multiple purpose assessments about the relationship between the software and the paper: Was the software used in the research? Was it created in the course of the research? Was it shared alongside the paper? These probabilistic assessments (in the 0..1 range) are made in two ways: using only the information from the specific mention, and using all the mentions within a single paper together (mention-level vs. document-level scope); thus each mention has six purpose assessments.
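As a concrete sketch of what these assessments look like as rows, the purpose assessments table can be filtered on a single mention id (paths and column names follow the examples under Getting Started below; the id here is a made-up placeholder):

library(arrow)
library(dplyr)

purposes <- open_dataset("p05_five_percent_random_subset/purpose_assessments.pdf.parquet")

# Expect six rows: 3 purposes (used / created / shared) x 2 scopes (mention / document).
# "some_mention_id" is a hypothetical placeholder, not a real id from the dataset.
purposes |>
  filter(software_mention_id == "some_mention_id") |>
  select(scope, purpose, certainty_score) |>
  collect()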

(Figure: diagram of the data model.)

Getting Started

Getting the Parquet files

Parquet files are available from Zenodo at https://zenodo.org/records/15066399. There are three sub-folders:

full_dataset
p01_one_percent_random_subset
p05_five_percent_random_subset

The random subsets are samples of papers, together with all of the extractions from those papers. We created them to make prototyping analyses easier. Inside each folder are three files, which can be opened directly with the arrow package (see the sketch after this list):

papers.parquet
mentions.pdf.parquet
purpose_assessments.pdf.parquet
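
After downloading a folder, the tables can be opened lazily with arrow, as in the sketch below; open_dataset() reads only metadata up front, so this is cheap even on the full dataset. The paths assume the 5% subset sits in the working directory.

library(arrow)

papers   <- open_dataset("p05_five_percent_random_subset/papers.parquet")
mentions <- open_dataset("p05_five_percent_random_subset/mentions.pdf.parquet")
purposes <- open_dataset("p05_five_percent_random_subset/purpose_assessments.pdf.parquet")

# Inspect column names and types without scanning any rows.
mentions$schema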

Example Analyses

These examples use the 5% subset of the data. They require the tidyverse and arrow packages, but should otherwise work as-is.
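If the packages are not already installed, both are available from CRAN:

install.packages(c("tidyverse", "arrow"))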

library(tidyverse)
library(arrow)
1. How many papers mention OpenStreetMap?

This example filters by software_normalized, as it is less noisy than software_raw.

> mentions <- open_dataset("p05_five_percent_random_subset/mentions.pdf.parquet")
> mentions |>
+   filter(software_normalized == "OpenStreetMap") |>
+   select(paper_id) |>
+   distinct() |>
+   count() |>
+   collect()
# A tibble: 1 × 1
      n
  <int>
1   376
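
To see the noise that normalization smooths over, a quick variant lists the distinct raw strings behind this normalized name (software_raw holds the raw surface form, per the data model above):

mentions |>
  filter(software_normalized == "OpenStreetMap") |>
  select(software_raw) |>
  distinct() |>
  collect()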
2. How did the number of papers referencing STATA change each year from 2000 to 2020?

By joining the Mentions table with Papers, we can compute statistics requiring access to paper metadata. Analyses like these are why we include fields such as paper_id in Mentions, even though it denormalizes the tables.

> papers <- open_dataset("p05_five_percent_random_subset/papers.parquet")
> mentions <- open_dataset("p05_five_percent_random_subset/mentions.pdf.parquet")
> 
> mentions |>
+   filter(software_normalized == "STATA") |>
+   select(paper_id) |>
+   distinct() |>
+   inner_join(papers, by = c("paper_id")) |>
+   filter(published_year >= 2000, published_year <= 2020) |>
+   count(published_year) |>
+   arrange(published_year) |>
+   collect()
# A tibble: 21 × 2
   published_year     n
            <int> <int>
 1           2000    11
 2           2001    14
 3           2002    20
 4           2003    29
 5           2004    51
 6           2005    32
 7           2006    42
 8           2007    49
 9           2008    77
10           2009    87
# ℹ 11 more rows
# ℹ Use `print(n = ...)` to see more rows
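
Because collect() returns an ordinary tibble, the result can be handed straight to ggplot2 (loaded as part of tidyverse) for plotting. A minimal sketch:

stata_by_year <- mentions |>
  filter(software_normalized == "STATA") |>
  select(paper_id) |>
  distinct() |>
  inner_join(papers, by = c("paper_id")) |>
  filter(published_year >= 2000, published_year <= 2020) |>
  count(published_year) |>
  collect()

ggplot(stata_by_year, aes(x = published_year, y = n)) +
  geom_line() +
  labs(x = "Publication year", y = "Papers mentioning STATA")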
3. What are the most popular software packages used since 2020, by number of distinct papers?

Answering this question requires joining all three tables. Especially with the full dataset, we generally recommend using select statements before and after joins to reduce memory overhead. Here we use the PurposeAssessments table to evaluate whether software was "used" in a paper. The "document" scope is appropriate here as we're interested in whether the software was used by the paper, not whether particular mentions of the software indicate this.

> papers <- open_dataset("p05_five_percent_random_subset/papers.parquet")
> mentions <- open_dataset("p05_five_percent_random_subset/mentions.pdf.parquet")
> purposes <- open_dataset("p05_five_percent_random_subset/purpose_assessments.pdf.parquet")
> 
> papers |>
+   filter(published_year >= 2020) |>
+   select(paper_id) |>
+   inner_join(mentions, by = c("paper_id")) |>
+   select(software_mention_id, software_normalized) |>
+   inner_join(purposes, by = c("software_mention_id")) |>
+   filter(scope == "document", purpose == "used", certainty_score > 0.5) |>
+   select(paper_id, software_normalized) |>
+   distinct() |>
+   count(software_normalized) |>
+   arrange(desc(n)) |>
+   collect()
# A tibble: 79,730 × 2
   software_normalized     n
   <chr>               <int>
 1 SPSS                22596
 2 GraphPad Prism       8080
 3 Excel                6131
 4 ImageJ               5477
 5 MATLAB               5117
 6 SAS                  3480
 7 SPSS Statistics      3065
 8 Stata                2545
 9 script               2247
10 Matlab               2225
# ℹ 79,720 more rows
# ℹ Use `print(n = ...)` to see more rows
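
Note that the normalized names still contain case variants (MATLAB vs. Matlab) and generic strings such as script; whether to merge or drop these depends on the analysis. As one sketch, case variants can be collapsed by lower-casing the name before counting (tolower is among the functions arrow can translate to its query engine):

papers |>
  filter(published_year >= 2020) |>
  select(paper_id) |>
  inner_join(mentions, by = c("paper_id")) |>
  select(software_mention_id, software_normalized) |>
  inner_join(purposes, by = c("software_mention_id")) |>
  filter(scope == "document", purpose == "used", certainty_score > 0.5) |>
  # Collapse case variants such as MATLAB / Matlab into a single key.
  mutate(software_key = tolower(software_normalized)) |>
  select(paper_id, software_key) |>
  distinct() |>
  count(software_key) |>
  arrange(desc(n)) |>
  collect()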

Additional details and provenance

The Grobid extraction pipeline worked with multiple sources for each paper, including PDFs and XML sources from publishers such as JATS and TEI XML. This produced JSON files, which were then converted to tabular Parquet format.

The tabular dataset includes only extractions from PDF sources, to avoid the complexity of handling multiple source types for a single paper. This decision was made easier by the fact that PDFs were available for all papers, while other source types were available only for smaller subsets.

Details of the full JSON data from all source document types, and the way those were read and mapped to tabular data, are available in Extracting Tables.
