INSPIRE harvesting #285

ntarocco · 2024-12-13T13:06:58Z

RFC
Some theses are first published in external systems rather than in CDS. In such cases, they are harvested by INSPIRE (or published there directly), for example this one, and then "pushed" to CDS.

The end goal is to be able to harvest CERN theses from INSPIRE, and add them to the new Theses community.

Pushing from INSPIRE to CDS actually means that, in INSPIRE, the Library adds such records to the ForCDS OAI-PMH set. CDS then harvests this set with a dedicated OAI-PMH job. Here is an example.
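To make the request shape concrete, here is a minimal sketch of how one page of the ForCDS set could be requested over OAI-PMH. The base URL is a placeholder and the `marcxml` metadata prefix is an assumption; neither is taken from this issue. Only the `ListRecords` verb and the `set`, `from`, `until` and `resumptionToken` arguments come from the OAI-PMH protocol itself.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Placeholder: the real INSPIRE OAI-PMH endpoint is not stated in this issue.
OAI_BASE_URL = "https://inspire.example.org/oai"


def list_records_url(from_dt=None, until_dt=None, resumption_token=None,
                     oai_set="ForCDS", metadata_prefix="marcxml"):
    """Build the OAI-PMH ListRecords URL for one page of a set.

    `metadata_prefix` is an assumption; datetimes are expected in UTC.
    """
    if resumption_token:
        # In OAI-PMH, resumptionToken is exclusive: no other argument
        # except the verb may accompany it.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {
            "verb": "ListRecords",
            "metadataPrefix": metadata_prefix,
            "set": oai_set,
        }
        if from_dt:
            params["from"] = from_dt.strftime("%Y-%m-%dT%H:%M:%SZ")
        if until_dt:
            params["until"] = until_dt.strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{OAI_BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # First page of everything changed since the given datetime.
    print(list_records_url(from_dt=datetime(2024, 12, 13, tzinfo=timezone.utc)))
```

The caller would keep requesting pages with the returned `resumptionToken` until the server stops sending one.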

When records are harvested from INSPIRE, we have a transformation/mapping module that will serialize an INSPIRE record to a CDS record (mapping rules).
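As an illustration only, a transformation step could look roughly like the sketch below. The input and output shapes are heavily simplified assumptions (the field names `titles`, `authors`, `control_number` and the output layout are illustrative, not the actual mapping rules), and the function returns a list of errors alongside the record so that problems can be reported rather than raised.

```python
def transform(inspire_record):
    """Map a simplified INSPIRE-like dict to a simplified CDS-like dict.

    Returns a (record, errors) tuple. All field names are assumptions
    made for this sketch, not the real mapping rules.
    """
    errors = []

    titles = inspire_record.get("titles") or []
    title = titles[0].get("title") if titles else None
    if not title:
        errors.append("missing title")

    creators = [
        {"person_or_org": {"type": "personal", "name": author["full_name"]}}
        for author in inspire_record.get("authors", [])
        if author.get("full_name")
    ]
    if not creators:
        errors.append("missing authors")

    identifiers = []
    if inspire_record.get("control_number") is not None:
        # Keep the source ID so later runs can match and update the record.
        identifiers.append(
            {"scheme": "inspire", "identifier": str(inspire_record["control_number"])}
        )
    else:
        errors.append("missing INSPIRE id")

    record = {
        "metadata": {
            "title": title,
            "creators": creators,
            "identifiers": identifiers,
        }
    }
    return record, errors
```

Collecting errors per record, instead of failing on the first one, is one way to feed a report that the Library can act on.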

The harvesting should be a recurrent job. In the future, we will also have to harvest preprints, articles, proceedings and other document types.

Tasks

Design how to harvest theses from INSPIRE: which protocol/endpoint to use, do we create a module in CDS-RDM or something re-usable, what UI/configuration is needed (starting from invenio-jobs...)
Prototype the harvester module:
- Fetch records, by searching by datetime range, and paginating. We should also be able to fetch single records in case of manual fixing
- Insert or update based on IDs: when inserting, the owner should be . Each thesis should be added to the Theses submission.
- Report and alert in case of merge conflicts or errors (based on current implementation in CDS). This should be done step-by-step: initially, with a very basic set of information, and then later on with all fields mapped to our data model.
- Ideally, when an error occurs, we want to notify the Library so they can correct things before the next run, without our intervention. We also need to take into account that INSPIRE sometimes fails when we download a file, so we should retry periodically.
- The job/async task should not take hours: it is better to split it into multiple async tasks.
Document the high-level transformation rules currently implemented in the harvesting-kit module.
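The prototype tasks above (fetching by datetime range with pagination, insert-or-update by ID, retrying flaky downloads, and splitting long runs) can be sketched with a few small helpers. This is a sketch under stated assumptions, not a proposed design: all function names are hypothetical, `fetch_page` stands in for whichever protocol or endpoint is chosen, and a plain dict stands in for the record store.

```python
import time
from datetime import datetime, timedelta


def split_windows(start, end, step=timedelta(days=1)):
    """Split [start, end) into consecutive windows, e.g. one per async task."""
    windows = []
    cursor = start
    while cursor < end:
        window_end = min(cursor + step, end)
        windows.append((cursor, window_end))
        cursor = window_end
    return windows


def fetch_all(fetch_page, since, until):
    """Yield every record in a window from a paginated source.

    `fetch_page(since, until, token)` is a hypothetical callable that
    returns (records, next_token); a falsy next_token ends the loop.
    """
    token = None
    while True:
        records, token = fetch_page(since, until, token)
        yield from records
        if not token:
            break


def with_retries(func, attempts=3, delay=0.0, exceptions=(IOError,)):
    """Call func(), retrying on the given exceptions with a growing delay."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # give up and let the caller report the failure
            time.sleep(delay * attempt)


def upsert(store, record_id, record):
    """Insert or update by ID; `store` is a dict standing in for the real backend."""
    action = "updated" if record_id in store else "inserted"
    store[record_id] = record
    return action
```

Each `(since, until)` window from `split_windows` could be handed to its own async task, and a single record could be re-fetched for manual fixing by calling the same transform and `upsert` steps on that one record.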

TODO:

Jobs:

Set up invenio-job (invenio-jobs: set up a new job for INSPIRE-CDS harvester #321)
Job report inveniosoftware/invenio-jobs#67

Reader:

Implement reader component (INSPIRE to CDS-RDM harvester: implement reader component #322)

Transformer:

Review mapping rules (INSPIRE to CDS-RDM harvester: review mapping rules for transformer component #323)
Create System User (INSPIRE to CDS-RDM harvester: create System User #345)
Implement transformation rules (INSPIRE to CDS-RDM harvester: implement record transformation functionality #325)
Harvest the files (INSPIRE to CDS-RDM harvester: harvest record's files #326)
Logging (INSPIRE to CDS-RDM harvester: set up logging #327)
Notifications for us and library about harvesting failures as well as successes
Transform ROR affiliations (author affiliations: transform affiliations when RORs are available on the INSPIRE record #454)
(For non-thesis records) transformer: adjust imprint isbns custom field to non-thesis cases #457

Writer:

Implement writer component (INSPIRE to CDS-RDM harvester: implement writer component #329)
job logging: register partial success for errors happened in writer #430

Separate tasks:

ntarocco added this to the CERN Thesis milestone Dec 13, 2024

ntarocco assigned anikachurilova Dec 13, 2024

anikachurilova added this to Sprint Q2/2025 🌻 Dec 16, 2024

anikachurilova moved this to In progress in Sprint Q2/2025 🌻 Dec 16, 2024

kpsherva moved this from In progress to Ready in Sprint Q2/2025 🌻 Jan 22, 2025

kpsherva added the epic label Jan 31, 2025

kpsherva moved this from Ready to Backlog 😴 in Sprint Q2/2025 🌻 Mar 19, 2025
