
Methods: ePrints


Each ePrints repository is processed separately and produces its own files. These are later merged into one, discarding any entries without a detected GitHub link.
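For illustration, a minimal sketch of this merge step, assuming the per-repository outputs are the `cleaned_urls_*.csv` files described below and using pandas; the actual merge script may differ.

```python
import glob

import pandas as pd

# Collect the per-repository output files (file pattern is an assumption, based on
# the cleaned_urls_<eprints-repo>_<date-range>_<domain>.csv naming described below).
frames = [pd.read_csv(path) for path in glob.glob("cleaned_urls_*.csv")]
merged = pd.concat(frames, ignore_index=True)

# Discard entries for which no GitHub repository could be detected and validated.
merged = merged.dropna(subset=["github_user_cleaned_url"])
merged.to_csv("merged_github_links.csv", index=False)
```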

Searching ePrints for publication files

We use the ePrints advanced search tool to download an XML file with all entries for a given time frame. The number of entries, and hence the size of the file, varies significantly between repositories.

The advanced search request was:

f"https://{repo}/cgi/search/archive/advanced?screen=Search&" \
                "output=XML&" \
                "_action_export_redir=Export&" \
                "dataset=archive&" \
                "_action_search=Search&" \
                "documents_merge=ALL&" \
                "documents=&" \
                "eprintid=&" \
                "title_merge=ALL&" \
                "title=&" \
                "contributors_name_merge=ALL&" \
                "contributors_name=&" \
                "abstract_merge=ALL&" \
                "abstract=&" \
                f"date={date}&" \
                "keywords_merge=ALL&" \
                "keywords=&" \
                "divisions_merge=ANY&" \
                "pres_type=paper&" \
                "refereed=EITHER&" \
                "publication%2Fseries_name_merge=ALL&" \
                "publication%2Fseries_name=&" \
                "documents.date_embargo=&" \
                "lastmod=&" \
                "pure_uuid=&" \
                "contributors_id=&" \
                "satisfyall=ALL&" \
                "order=contributors_name%2F-date%2Ftitle"

The resulting XML file is stored locally; in subsequent runs of the script, downloading the extract can be switched off if the file is already present.
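A minimal sketch of this download-and-cache behaviour, assuming the `search_url` built above and the requests library; the function name and signature are illustrative, not the project's actual code.

```python
import os

import requests


def fetch_eprints_export(search_url: str, out_path: str, redownload: bool = False) -> str:
    """Download the ePrints XML export unless it is already cached locally."""
    if os.path.exists(out_path) and not redownload:
        return out_path
    response = requests.get(search_url, timeout=60)
    response.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(response.content)
    return out_path
```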

The ePrints entries are then parsed, extracting title, date and creators. One of the creators is recorded as an associated author; this does not reflect main authorship but simply makes it possible to look the publication up in other tools such as CrossRef later on, should that be of interest. In our analysis, we interpret the date field as the publication date. This will not always be entirely correct, as ePrints does not restrict the field to that meaning, but it gives us a rough indication of when the piece was published.

We also extract any links to downloadable files. Not every entry has files, and we disregard those that do not. These files can be images, PDFs, Word documents, etc.; we do not make any distinction at this stage.
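As an illustration, here is a minimal sketch of this parsing step. It assumes the ePrints 2.0 XML export schema; the element names and the helper function are assumptions, and the actual script may differ.

```python
import xml.etree.ElementTree as ET

# Namespace of the ePrints XML export (assumed: ePrints 2.0 data schema).
NS = {"ep": "http://eprints.org/ep2/data/2.0"}


def parse_eprints_export(xml_path: str):
    """Yield (title, date, author_for_reference, pdf_url) rows from an ePrints XML export."""
    root = ET.parse(xml_path).getroot()
    for eprint in root.findall("ep:eprint", NS):
        title = eprint.findtext("ep:title", default="", namespaces=NS)
        date = eprint.findtext("ep:date", default="", namespaces=NS)

        # Record one creator as the associated author, for later lookups (e.g. CrossRef).
        name = eprint.find("ep:creators/ep:item/ep:name", NS)
        author = ""
        if name is not None:
            author = " ".join(part.text for part in name if part.text)

        # Collect links to downloadable files; entries without any files are skipped.
        urls = [
            url.text
            for url in eprint.findall("ep:documents/ep:document/ep:files/ep:file/ep:url", NS)
            if url.text
        ]
        for url in urls:
            yield title, date, author, url
```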

The resulting CSV file `extraction_pdf_urls_<eprints-repo>_<date-range>.csv` has one row for each link to a downloadable file:

| column name | type | description | comment |
|---|---|---|---|
| title | str | publication title | |
| date | date | ePrints date field | YYYY-MM-DD |
| author_for_reference | str | one of the individuals listed as creators | |
| pdf_url | str | link to a downloadable file | not necessarily a PDF, but will be assumed as such in later steps, hence the column name |

Searching for GitHub links

In the next step, each downloadable file is downloaded and parsed as a PDF. If it is not a PDF file, this attempt fails and the script moves on to the next file. The text is extracted from the PDF using pdfminer and scanned for any occurrence of a domain specified as an argument to the script. In our case we looked for github.com, but the script could be used to check for other services.

The regular expression used is `rf"(?P<url>https?://(www\.)?{re.escape(domain)}[^\s]+)"`. This will often not capture the proper end of the link, but that is addressed in later processing steps. We also store the page on which each link was found. It would be interesting to map links to the section they appear in, but this was outside the scope of this project and is not straightforward for links found in footnotes, which is often the case.
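A minimal sketch of this scanning step, assuming pdfminer.six for per-page text extraction; the function name is illustrative, but the regular expression is the one given above.

```python
import re

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer


def find_domain_links(pdf_path: str, domain: str = "github.com"):
    """Yield (page_no, url) pairs for every match of the domain in the PDF text."""
    pattern = re.compile(rf"(?P<url>https?://(www\.)?{re.escape(domain)}[^\s]+)")
    for page_no, page in enumerate(extract_pages(pdf_path)):  # first page is 0
        text = "".join(
            element.get_text() for element in page if isinstance(element, LTTextContainer)
        )
        for match in pattern.finditer(text):
            yield page_no, match.group("url")
```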

The resulting CSV file `extracted_urls_<eprints-repo>_<date-range>_<domain>.csv` has one row for each link to the domain found in one of the publications:

| column name | type | description | comment |
|---|---|---|---|
| title | str | publication title | |
| date | date | ePrints date field | YYYY-MM-DD |
| author_for_reference | str | one of the individuals listed as creators | |
| pdf_url | str | link to a PDF linked to the publication | |
| page_no | int | number of the page the link was found on | the first page is denoted as 0 |
| domain_url | str | extracted link to the domain | |

Validating GitHub links

As a last step, the extracted GitHub links are cleaned to the expected format (`github.com/<username>/<reponame>`) and then validated using the GitHub API. Specifically, we query the GitHub API for all repositories of the detected user and pick the one whose name best matches the detected repository name, requiring a Levenshtein ratio of at least 0.7. This tolerates text detection errors and at the same time confirms that the repository is (still) reachable. A sketch of this lookup follows the table below.

The resulting CSV file `cleaned_urls_<eprints-repo>_<date-range>_<domain>.csv` has one row for each link to the domain found in one of the publications:

| column name | type | description | comment |
|---|---|---|---|
| title | str | publication title | |
| date | date | ePrints date field | YYYY-MM-DD |
| author_for_reference | str | one of the individuals listed as creators | |
| pdf_url | str | link to a PDF linked to the publication | |
| page_no | int | number of the page the link was found on | the first page is denoted as 0 |
| domain_url | str | extracted link to the domain | |
| pattern_cleaned_url | str | link after pattern cleaning | uses regular expression |
| github_user_cleaned_url | str | repository ID (`<username>/<reponame>`) after validating through GitHub API as described above | |
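For illustration, a minimal sketch of the validation step described above, assuming the unauthenticated GitHub REST API and the python-Levenshtein package for the ratio; the function and variable names are illustrative, and the real script may paginate, authenticate and handle errors differently.

```python
import requests
from Levenshtein import ratio


def validate_github_link(user: str, repo_name: str, min_ratio: float = 0.7) -> str | None:
    """Return '<username>/<reponame>' of the user's best-matching repository, or None."""
    response = requests.get(
        f"https://api.github.com/users/{user}/repos",
        params={"per_page": 100},
        timeout=30,
    )
    if response.status_code != 200:
        return None  # user not found or API error: the link cannot be validated

    # Pick the repository whose name is most similar to the detected one.
    best_name, best_score = None, 0.0
    for repository in response.json():
        score = ratio(repo_name.lower(), repository["name"].lower())
        if score > best_score:
            best_name, best_score = repository["name"], score

    if best_name is not None and best_score >= min_ratio:
        return f"{user}/{best_name}"
    return None
```

Matching on repository name similarity rather than requiring an exact URL match is what allows text detection errors in the extracted link to be tolerated while still confirming that the repository is reachable.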