RFC
Some theses are published in external systems and not first in CDS. In such cases, they get harvested by INSPIRE (or published there directly), for example this one, and are then "pushed" to CDS.
The end goal is to be able to harvest CERN theses from INSPIRE, and add them to the new Theses community.
Pushing from INSPIRE to CDS actually means that in INSPIRE, the Library adds such records to the ForCDS OAI-PMH set. CDS then harvests them with a specific OAI-PMH job. Here is an example.
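As a rough illustration of what such an OAI-PMH job requests, the sketch below builds a `ListRecords` URL restricted to the ForCDS set. The `verb`, `set`, `metadataPrefix`, `from` and `until` parameters are standard OAI-PMH. The base URL and the `marcxml` metadata prefix are assumptions and should be checked against the actual INSPIRE endpoint, and `build_list_records_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumption: the INSPIRE OAI-PMH endpoint; verify before use.
INSPIRE_OAI_URL = "https://inspirehep.net/api/oai2d"


def build_list_records_url(
    from_date,
    until_date,
    oai_set="ForCDS",
    metadata_prefix="marcxml",  # assumption: the prefix served by INSPIRE
    base_url=INSPIRE_OAI_URL,
):
    """Build an OAI-PMH ListRecords URL for one set and one date range."""
    params = {
        "verb": "ListRecords",
        "set": oai_set,
        "metadataPrefix": metadata_prefix,
        "from": from_date,    # e.g. "2024-01-01"
        "until": until_date,  # e.g. "2024-01-31"
    }
    return f"{base_url}?{urlencode(params)}"
```

A recurrent job could call this with the date of the last successful run as `from_date`, so that each run only requests records changed since then.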
When records are harvested from INSPIRE, we have a transformation/mapping module that will serialize an INSPIRE record to a CDS record (mapping rules).
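To make the idea of a mapping step concrete, here is a minimal sketch that turns an INSPIRE JSON record into a draft-like CDS dictionary. It is not the existing mapping rules. The INSPIRE field names (`titles`, `authors[].full_name`, `control_number`), the target layout (`metadata.title`, `metadata.creators`, `resource_type`), and the `inspire` identifier scheme are all assumptions to be checked against both data models, and `inspire_to_cds` is a hypothetical function name.

```python
def inspire_to_cds(inspire_record):
    """Map a few basic INSPIRE fields to a CDS-like record dictionary.

    All field names on both sides are assumptions for illustration.
    """
    source = inspire_record.get("metadata", {})

    titles = source.get("titles", [])
    title = titles[0].get("title", "") if titles else ""

    creators = []
    for author in source.get("authors", []):
        # Assumes names are stored as "Family, Given".
        family, _, given = author.get("full_name", "").partition(",")
        creators.append(
            {
                "person_or_org": {
                    "type": "personal",
                    "family_name": family.strip(),
                    "given_name": given.strip(),
                }
            }
        )

    identifiers = []
    if "control_number" in source:
        # Keeping the INSPIRE ID lets later runs decide insert vs. update.
        identifiers.append(
            {"scheme": "inspire", "identifier": str(source["control_number"])}
        )

    return {
        "metadata": {
            "title": title,
            "creators": creators,
            "resource_type": {"id": "publication-thesis"},
            "identifiers": identifiers,
        }
    }
```

Starting with a handful of fields like this matches the step-by-step approach described in the tasks: the mapping can grow field by field once the basic pipeline works.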
The harvesting should be a recurrent job. In the future, we will also have to harvest preprints, articles, proceedings and other document types.
Tasks
Design how to harvest theses from INSPIRE: which protocol/endpoint to use, do we create a module in CDS-RDM or something re-usable, what UI/configuration is needed (starting from invenio-jobs...)
Prototype the harvester module:
Fetch records, by searching by datetime range, and paginating. We should also be able to fetch single records in case of manual fixing
Insert or update based on IDs: when inserting, the owner should be . Each thesis should be added to the Theses submission.
Report and alert in case of merge conflicts or errors (based on current implementation in CDS). This should be done step-by-step: initially, with a very basic set of information, and then later on with all fields mapped to our data model.
Ideally, when an error occurs, we want to notify the Library so they can correct things before the next run, without our intervention. We also need to take into account that INSPIRE sometimes fails when we download the file, so we should retry periodically.
The job/async task should not take hours: it is better to split it into multiple async tasks.
Document the high-level transformation rules currently implemented in the harvesting-kit module.
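The fetching tasks above (datetime range, pagination, retries, splitting long runs) could be prototyped along these lines. This is only a sketch under stated assumptions: `fetch_page` is a hypothetical callable standing in for whichever protocol/endpoint is chosen, and `split_range`, `with_retries` and `fetch_all` are illustrative names, not existing CDS or invenio-jobs APIs.

```python
import time
from datetime import datetime, timedelta


def split_range(start, end, step):
    """Yield (window_start, window_end) pairs covering [start, end).

    Each window could be handed to its own async task, so that no
    single task runs for hours.
    """
    current = start
    while current < end:
        window_end = min(current + step, end)
        yield current, window_end
        current = window_end


def with_retries(func, attempts=3, delay=0.0):
    """Call func(), retrying on any exception up to `attempts` times."""
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:  # a real job would catch narrower errors
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_error


def fetch_all(fetch_page, start, end, page_size=100):
    """Yield every record in [start, end), one page at a time.

    fetch_page(start, end, page, size) is a hypothetical callable that
    returns a list of records; a short page signals the last one.
    """
    page = 1
    while True:
        records = with_retries(lambda: fetch_page(start, end, page, page_size))
        yield from records
        if len(records) < page_size:
            break
        page += 1
```

For example, `split_range(datetime(2024, 1, 1), datetime(2024, 2, 1), timedelta(days=1))` produces one window per day, and each window can then be passed to `fetch_all` from a separate task. Fetching a single record for manual fixing would be a separate, simpler call.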
Open questions
How do we harvest only CERN content? Currently, with OAI-PMH, the Library adds them to the ForCDS set, but what if we start using the JSON APIs?