INSPIRE harvesting #285

Open
ntarocco opened this issue Dec 13, 2024 · 0 comments

RFC
Some theses are first published in external systems rather than in CDS. In such cases, they are harvested by INSPIRE (or published there directly), for example this one, and then "pushed" to CDS.

The end goal is to be able to harvest CERN theses from INSPIRE, and add them to the new Theses community.

Pushing from INSPIRE to CDS actually means that, in INSPIRE, the Library adds such records to the ForCDS OAI-PMH set. CDS then harvests that set with a specific OAI-PMH job. Here is an example.
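
To make the OAI-PMH flow concrete, here is a minimal sketch of how a harvester could build `ListRecords` requests for the ForCDS set and follow resumption tokens. The verbs and arguments come from the OAI-PMH 2.0 protocol. The base URL `https://inspire.example/oai` and the `oai_dc` metadata prefix are placeholders for illustration, not the values used by the current job.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def build_list_records_url(base_url, set_spec="ForCDS", metadata_prefix="oai_dc",
                           from_date=None, until_date=None, resumption_token=None):
    """Build a ListRecords URL (base_url is a hypothetical endpoint)."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it must not be combined with set, from, until or metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords",
                  "metadataPrefix": metadata_prefix,
                  "set": set_spec}
        if from_date:
            params["from"] = from_date
        if until_date:
            params["until"] = until_date
    return f"{base_url}?{urlencode(params)}"


def extract_resumption_token(xml_text):
    """Return the token for the next page, or None on the last page."""
    root = ET.fromstring(xml_text)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()
```

A harvesting loop would request the first URL, parse the response, and keep requesting with the returned token until `extract_resumption_token` returns `None`.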

When records are harvested from INSPIRE, we have a transformation/mapping module that will serialize an INSPIRE record to a CDS record (mapping rules).
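
As a rough illustration of what such a transformation step looks like, the sketch below maps a simplified INSPIRE-like dictionary to a simplified CDS-like dictionary. The field names on both sides (`titles`, `authors`, `full_name`, `control_number`, `creators`, `identifiers`) and the function name `inspire_to_cds` are assumptions for the example; the actual mapping rules are the ones linked above.

```python
def inspire_to_cds(inspire_record):
    """Map a simplified INSPIRE-like dict to a simplified CDS-like dict.

    All field names are illustrative assumptions, not the real mapping rules.
    """
    titles = inspire_record.get("titles", [])
    if not titles:
        # Surface unmappable records instead of silently creating bad ones.
        raise ValueError("INSPIRE record has no title")

    creators = []
    for author in inspire_record.get("authors", []):
        # Assumes names are stored as "Family, Given".
        family, _, given = author["full_name"].partition(",")
        creators.append({"family_name": family.strip(),
                         "given_name": given.strip()})

    return {
        "title": titles[0]["title"],
        "creators": creators,
        # Keep the source ID so later runs can update instead of insert.
        "identifiers": [{"scheme": "inspire",
                         "identifier": str(inspire_record["control_number"])}],
    }
```

Carrying the INSPIRE identifier into the CDS record is what would let the harvester decide between insert and update on later runs.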

The harvesting should be a recurrent job. In the future, we will also have to harvest preprints, articles, proceedings and other document types.

Tasks

  • Design how to harvest theses from INSPIRE: which protocol/endpoint to use, do we create a module in CDS-RDM or something re-usable, what UI/configuration is needed (starting from invenio-jobs...)
  • Prototype the harvester module:
    • Fetch records, by searching by datetime range, and paginating. We should also be able to fetch single records in case of manual fixing
    • Insert or update based on IDs: when inserting, the owner should be . Each thesis should be added to the Theses submission.
    • Report and alert in case of merge conflicts or errors (based on current implementation in CDS). This should be done step-by-step: initially, with a very basic set of information, and then later on with all fields mapped to our data model.
    • Ideally, when an error occurs, we want to notify the Library so they can correct things before the next run, without our intervention. We also need to take into account that INSPIRE sometimes fails when we download a file, so we should retry from time to time.
    • The job/async task should not take hours: it is better to split it into multiple async tasks.
  • Document the high-level transformation rules currently implemented in the harvesting-kit module.
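
The prototype tasks above could be sketched roughly as follows: split the datetime range into small windows (one per async task), retry flaky downloads, and insert or update by ID while collecting errors for a report. This is a plain-Python sketch under stated assumptions: `split_range`, `with_retries`, `upsert`, the `inspire_id` key and the in-memory `store` dict are hypothetical names, and nothing here uses the invenio-jobs API.

```python
from datetime import datetime, timedelta
import time


def split_range(start, end, step):
    """Yield (window_start, window_end) pairs covering [start, end)."""
    cursor = start
    while cursor < end:
        window_end = min(cursor + step, end)
        yield cursor, window_end
        cursor = window_end


def with_retries(func, attempts=3, delay=0.0):
    """Call func(); on failure, retry until `attempts` calls have been made."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # give up and let the caller report the error
            time.sleep(delay * attempt)  # wait a bit longer after each failure


def upsert(store, records, id_key="inspire_id"):
    """Insert new records, update known ones, and collect errors."""
    report = {"inserted": [], "updated": [], "errors": []}
    for record in records:
        record_id = record.get(id_key)
        if record_id is None:
            report["errors"].append({"record": record, "reason": "missing ID"})
            continue
        action = "updated" if record_id in store else "inserted"
        store[record_id] = record
        report[action].append(record_id)
    return report


# Example: one task per day instead of a single long-running job.
windows = list(split_range(datetime(2024, 12, 1),
                           datetime(2024, 12, 4),
                           timedelta(days=1)))
```

The `report` returned by `upsert` is the kind of structure that could feed the notification to the Library, starting with a few fields and growing as more of the data model is mapped.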

Open questions

  1. How do we harvest only CERN content? Currently, with OAI-PMH, the Library adds records to the ForCDS set, but what if we start using the JSON APIs?
  2. Who is the owner of the harvested theses?
  3. How do we handle files?