# Methods: ePrints
Each ePrints repository is processed separately and produces its own files. These are later merged into one, disregarding any entries without detected GitHub links.
We use the ePrints advanced search tool to download an XML file with all entries for a given timeframe. The number of entries, and consequently the file size, varies significantly between repositories.
The advanced search request was:
f"https://{repo}/cgi/search/archive/advanced?screen=Search&" \
"output=XML&" \
"_action_export_redir=Export&" \
"dataset=archive&" \
"_action_search=Search&" \
"documents_merge=ALL&" \
"documents=&" \
"eprintid=&" \
"title_merge=ALL&" \
"title=&" \
"contributors_name_merge=ALL&" \
"contributors_name=&" \
"abstract_merge=ALL&" \
"abstract=&" \
f"date={date}&" \
"keywords_merge=ALL&" \
"keywords=&" \
"divisions_merge=ANY&" \
"pres_type=paper&" \
"refereed=EITHER&" \
"publication%2Fseries_name_merge=ALL&" \
"publication%2Fseries_name=&" \
"documents.date_embargo=&" \
"lastmod=&" \
"pure_uuid=&" \
"contributors_id=&" \
"satisfyall=ALL&" \
"order=contributors_name%2F-date%2Ftitle"
The resulting XML file is stored locally; in subsequent runs of the script, downloading the extract can be switched off if the file is already present.
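For illustration, a minimal sketch of how the export could be downloaded and cached, assuming the request URL is built as above; the `requests` library, the `force_download` flag and the function name are assumptions, not the project's actual code:

```python
import os
import requests

def fetch_eprints_export(search_url: str, xml_path: str,
                         force_download: bool = False) -> str:
    """Download the ePrints XML export unless a local copy already exists."""
    if os.path.exists(xml_path) and not force_download:
        return xml_path  # reuse the stored extract
    response = requests.get(search_url, timeout=120)
    response.raise_for_status()
    with open(xml_path, "wb") as fh:
        fh.write(response.content)
    return xml_path
```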
The ePrints entries are then parsed, extracting title, date and creators. One of the creators is recorded as an associated author; this does not reflect main authorship but simply makes it possible to find the publication in other tools such as CrossRef, should that be of interest later on. In our analysis we have interpreted the date field as the publication date; this will not always be entirely correct, as ePrints does not restrict the field in that way, but it gives us a rough indication of when the piece was published.
We also extract any links to downloadable files. Not every entry has files, and we disregard those that do not. The files can be images, PDFs, Word documents, etc.; we make no distinction at this stage.
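As a rough sketch, the parsing could look like the following; the element names reflect the standard ePrints 3 XML export schema and are an assumption about the repositories' actual output, as is the helper's name and structure:

```python
import xml.etree.ElementTree as ET

# Namespace used by the ePrints 3 XML export (an assumption; adjust if needed).
NS = {"ep": "http://eprints.org/ep2/data/2.0"}

def parse_eprints_export(xml_path: str):
    """Yield (title, date, author_for_reference, file_urls) for each entry with files."""
    root = ET.parse(xml_path).getroot()
    for eprint in root.findall("ep:eprint", NS):
        title = eprint.findtext("ep:title", default="", namespaces=NS)
        date = eprint.findtext("ep:date", default="", namespaces=NS)

        # Record a single creator as a reference point for later lookups
        # (e.g. in CrossRef); this does not indicate main authorship.
        name = eprint.find("ep:creators/ep:item/ep:name", NS)
        author = ""
        if name is not None:
            given = name.findtext("ep:given", default="", namespaces=NS)
            family = name.findtext("ep:family", default="", namespaces=NS)
            author = " ".join(part for part in (given, family) if part)

        # Collect links to downloadable files; entries without files are skipped.
        file_urls = [
            f.text
            for f in eprint.findall("ep:documents/ep:document/ep:files/ep:file/ep:url", NS)
            if f.text
        ]
        if file_urls:
            yield title, date, author, file_urls
```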
The resulting CSV file `extraction_pdf_urls_<eprints-repo>_<date-range>.csv` has one row for each link to a downloadable file:
| column name | type | description | comment |
|---|---|---|---|
| `title` | str | publication title | |
| `date` | date | ePrints date field | YYYY-MM-DD |
| `author_for_reference` | str | one of the individuals listed as creators | |
| `pdf_url` | str | link to a downloadable file | not necessarily a PDF, but will be assumed as such in later steps, hence the column name |
In the next step, each downloadable file is downloaded and parsed as a PDF. If it is not a PDF file, this attempt will fail and the script will move on. The text is extracted from the PDF using `pdfminer` and scanned for any occurrences of a domain specified as an argument of the script. In our case we looked for `github.com`, but the script could be used to check for other services. The regular expression used is `rf"(?P<url>https?://(www\.)?{re.escape(domain)}[^\s]+)"`. This will often not recognise the proper end of the link, but this is addressed in later processing steps.
We also record the page on which each link is found. It would be interesting to map the links to the sections they were found in, but this was outside the scope of this project and is not straightforward for links in footnotes, where they often appear.
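A minimal sketch of the per-page scan, assuming `pdfminer.six`; the function name and structure are illustrative rather than the project's actual code:

```python
import re
from pdfminer.high_level import extract_text
from pdfminer.pdfpage import PDFPage

def find_domain_links(pdf_path: str, domain: str):
    """Yield (page_no, url) for every link to `domain` found in the PDF text."""
    pattern = re.compile(rf"(?P<url>https?://(www\.)?{re.escape(domain)}[^\s]+)")
    with open(pdf_path, "rb") as fh:
        n_pages = sum(1 for _ in PDFPage.get_pages(fh))
    for page_no in range(n_pages):  # the first page is page 0
        text = extract_text(pdf_path, page_numbers=[page_no])
        for match in pattern.finditer(text):
            yield page_no, match.group("url")
```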
The resulting CSV file `extracted_urls_<eprints-repo>_<date-range>_<domain>.csv` has one row for each link to the domain found in one of the publications:
| column name | type | description | comment |
|---|---|---|---|
| `title` | str | publication title | |
| `date` | date | ePrints date field | YYYY-MM-DD |
| `author_for_reference` | str | one of the individuals listed as creators | |
| `pdf_url` | str | link to a PDF linked to the publication | |
| `page_no` | int | number of the page the link was found on | the first page is denoted as 0 |
| `domain_url` | str | extracted link to the domain | |
As a last step, the extracted GitHub links are cleaned to the expected format (`github.com/<username>/<reponame>`) and then validated using the GitHub API. Specifically, we use the GitHub API to go through all repositories of the detected user and pick the repository whose name best matches the detected repository name, requiring a Levenshtein ratio of at least 0.7. This helps to deal with text detection errors and at the same time validates that the repository is (still) reachable.
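A minimal sketch of this cleaning and validation step; the `requests` and `python-Levenshtein` libraries, the function name, the cleaning pattern and the unauthenticated, unpaginated API call are assumptions, so the actual script may well differ:

```python
import re
import requests
from Levenshtein import ratio  # assumption: python-Levenshtein is available

def validate_github_link(url: str, min_ratio: float = 0.7):
    """Clean a detected GitHub link and validate it against the GitHub API.

    Returns '<username>/<reponame>' of the best-matching repository,
    or None if no repository reaches the minimum Levenshtein ratio.
    """
    # Pattern cleaning: keep only the user and repository parts of the path.
    match = re.search(r"github\.com/([\w.-]+)/([\w.-]+)", url)
    if not match:
        return None
    user, repo = match.group(1), match.group(2)

    # List the user's repositories (authentication and pagination omitted).
    response = requests.get(f"https://api.github.com/users/{user}/repos",
                            params={"per_page": 100}, timeout=30)
    if response.status_code != 200:
        return None
    candidates = [r["name"] for r in response.json()]
    if not candidates:
        return None

    # Pick the closest repository name; accept it only above the threshold.
    best = max(candidates, key=lambda name: ratio(repo.lower(), name.lower()))
    if ratio(repo.lower(), best.lower()) >= min_ratio:
        return f"{user}/{best}"
    return None
```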
The resulting CSV file `cleaned_urls_<eprints-repo>_<date-range>_<domain>.csv` has one row for each link to the domain found in one of the publications:
| column name | type | description | comment |
|---|---|---|---|
| `title` | str | publication title | |
| `date` | date | ePrints date field | YYYY-MM-DD |
| `author_for_reference` | str | one of the individuals listed as creators | |
| `pdf_url` | str | link to a PDF linked to the publication | |
| `page_no` | int | number of the page the link was found on | the first page is denoted as 0 |
| `domain_url` | str | extracted link to the domain | |
| `pattern_cleaned_url` | str | link after pattern cleaning | uses regular expression |
| `github_user_cleaned_url` | str | repository ID (`<username>/<reponame>`) after validating through GitHub API | as described above |