generation of resource maps with invalid characters #51
Awesome, thanks for the bug report. I'll take a look at this and see which piece of software this bug lives inside. @jeanetteclark could you elaborate on this line of your code snippet?
What does that mean? |
PS @jeanetteclark were you able to reproduce a resource map that had this bogus content?
<rdf:Description rdf:about="file:///tmp/RtmphWZjPl/_:r1510618411r30298r1">
<foaf:name rdf:resource="file:///tmp/RtmphWZjPl/"DataONE R Client"^^<http://www.w3.org/2001/XMLSchema#string"/>
</rdf:Description>
Specifically, the |
Looks like the " part of this bug is related to the custom resource map parsing routine I had to put in for arcticdatautils to support PROV a while back. This package uses that routine to update an existing resource map to the next version of the package. To do that, all the triples are loaded from the RDF/XML into a
If you take a look at the Object column, you'll see the text:
which looks to be the cause of the |
Oh, and a PS: I didn't catch this bug during development/testing because I didn't try parsing registry-created resource maps and only tested on resource maps built in R. |
@amoeba the I've been thinking about a way to 'repair' the resource maps using |
I think this is a problem with my EML code. For some reason it does not add a new paragraph to the EML. I haven't been able to reproduce anything with the |
PPS: At least the |
Oh, thanks @gothub. I had looked a few times before and not seen those methods. From their names, it sounds like they'll work nicely. |
Here's the list of 1184 resource map identifiers and their upload dates. This list has RDF documents with either
pids-with-bad-ids-and-uris.txt
Note that this ticket is a duplicate of https://github.nceas.ucsb.edu/KNB/arctic-data/issues/247 which describes the same problem. |
Does anyone have any clues where the 'file:///...' strings were introduced in this workflow? I haven't found it yet in the R client. I'm also noticing that the
Which should be
|
@gothub My guess is that there is some processing code that is inadvertently passing the file object reference to |
https://arcticdata.io/catalog/#view/doi:10.18739/A2136S |
I checked the old version of that RM and it did not have either the |
I found 154 more in the KNB: |
The R packages
The workflow in R would be:
The package relationships would then be manually inspected to verify correctness.
Then the package is uploaded, with only the resource map being updated:
Once this has been done for a representative sample of the affected resource maps, then the |
@amoeba A patch may be in order to prevent RDFs updated through arcticdatautils from inhibiting the addition of prov relationships. This may not be news to you nor very helpful info, but I am documenting my 2 cents here. |
Thanks @jagoldstein, that is news and is helpful. I'm not actively working on this but I'm watching this thread so the extra info is helpful. |
@gothub |
This commit addresses the cause of #51. The previous version of `parse_resource_map` used a SPARQL query to pull the triples out of the resource map being updated but didn't actually parse the result in an RDF-aware manner. This led to " characters ending up in various places in the RDF/XML, which causes tons of issues. The new routine simply uses datapack::parseRDF instead and just filters out cito:documents, cito:isDocumentedBy, dcterms:identifier, and the DataONE R Client statement.
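The filtering step described in the commit message can be sketched in base R. This is only an illustration, not the actual arcticdatautils code: it assumes the parsed triples are available as a data.frame with `subject`, `predicate`, and `object` columns, and `filter_triples`, `drop_predicates`, and the sample identifiers are hypothetical names made up for the example.

```r
# Predicates named in the commit message, expanded to their full URIs.
drop_predicates <- c(
  "http://purl.org/spar/cito/documents",
  "http://purl.org/spar/cito/isDocumentedBy",
  "http://purl.org/dc/terms/identifier"
)

# Hypothetical helper: `triples` is assumed to be a data.frame with character
# columns subject, predicate, and object.
filter_triples <- function(triples) {
  is_dropped_predicate <- triples$predicate %in% drop_predicates
  # The "DataONE R Client" statement is matched on its literal object text.
  is_client_statement <- grepl("DataONE R Client", triples$object, fixed = TRUE)
  triples[!(is_dropped_predicate | is_client_statement), , drop = FALSE]
}

# Made-up triples, for illustration only.
triples <- data.frame(
  subject = c("urn:uuid:rm", "urn:uuid:meta", "_:r1", "urn:uuid:data"),
  predicate = c(
    "http://purl.org/dc/terms/identifier",
    "http://purl.org/spar/cito/documents",
    "http://xmlns.com/foaf/0.1/name",
    "http://www.w3.org/ns/prov#wasDerivedFrom"
  ),
  object = c("urn:uuid:rm", "urn:uuid:data", "DataONE R Client", "urn:uuid:source"),
  stringsAsFactors = FALSE
)

kept <- filter_triples(triples)
print(kept)
```

In this sample only the PROV statement survives the filter, which is the behavior the commit message describes: the remaining relationships carry over to the next version of the resource map.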
@amoeba Edit: An MRE for this is:
Noting the output:
[original issue below]
These steps generate a resource map with this, evidently problematic line:
Link to package: https://dev.nceas.ucsb.edu/#view/urn:uuid:0c720fed-bbe9-4076-9e59-7636730b3d5a
Attached is a list of the 854 resource maps with this problem on the ADC. There are likely some on the KNB as well.
bad_rms_ADC.txt
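For anyone wanting to check a resource map for this problem, here is a rough base R sketch that flags the two symptoms seen in this thread: a `file:///` URI inside an `rdf:about` or `rdf:resource` attribute, and a typed-literal suffix embedded in an attribute value. `find_bad_rdf_markers` is a hypothetical helper, not a function from datapack or arcticdatautils, and it is not how the attached list was produced. It is a plain text scan rather than an RDF-aware check.

```r
# Hypothetical helper: flag lines of RDF/XML text that show the symptoms
# described in this issue. Takes the document as a single string.
find_bad_rdf_markers <- function(rdf_text) {
  lines <- unlist(strsplit(rdf_text, "\n", fixed = TRUE))
  # Symptom 1: a local temp-file URI leaked into rdf:about / rdf:resource.
  has_file_uri <- grepl("rdf:(about|resource)=\"file:///", lines)
  # Symptom 2: a typed-literal suffix ended up inside an attribute value.
  has_typed_literal <- grepl(
    "^^<http://www.w3.org/2001/XMLSchema#string", lines, fixed = TRUE
  )
  flagged <- data.frame(
    line = seq_along(lines),
    file_uri = has_file_uri,
    typed_literal = has_typed_literal
  )
  flagged[has_file_uri | has_typed_literal, , drop = FALSE]
}

# The bogus snippet quoted earlier in this thread.
example <- paste(
  '<rdf:Description rdf:about="file:///tmp/RtmphWZjPl/_:r1510618411r30298r1">',
  '<foaf:name rdf:resource="file:///tmp/RtmphWZjPl/"DataONE R Client"^^<http://www.w3.org/2001/XMLSchema#string"/>',
  '</rdf:Description>',
  sep = "\n"
)

bad <- find_bad_rdf_markers(example)
print(bad)
```

Both attribute lines of the snippet are flagged, while the closing tag is not. A clean resource map should return zero rows.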