
generation of resource maps with invalid characters #51

Open
jeanetteclark opened this issue Nov 18, 2017 · 19 comments

@jeanetteclark
Collaborator

jeanetteclark commented Nov 18, 2017

@amoeba Edit: An MRE for this is:

library(arcticdatautils)
library(dataone)

mn <- MNode("https://dev.nceas.ucsb.edu/knb/d1/mn/v2")
pkg <- create_dummy_package(mn)
new_pkg <- publish_update(mn, pkg$metadata, pkg$resource_map, pkg$data)
cat(rawToChar(getObject(mn, new_pkg$resource_map)))

Noting the output:

<rdf:Description rdf:about="_:r1511139237r13053r1">
    <foaf:name rdf:resource="&quot;DataONE R Client&quot;^^&lt;http://www.w3.org/2001/XMLSchema#string"/>
</rdf:Description>

[original issue below]

library(dataone)
library(datapack)
library(arcticdatautils)
library(EML)

cn <- CNode('STAGING2')
mn <- getMNode(cn,"urn:node:mnTestKNB")

#write metadata and attach a data file in registry on dev.nceas.ucsb.edu
id <- 'knb.109096.1'

#download registry EML to disk
outpath <- '~/example.xml'
writeBin(getObject(mn, id), outpath)

#make edits and save EML
eml <- read_eml(outpath)
eml@dataset@abstract@para@.Data[[2]] <- new('para', .Data = 'and edited using R') #this does not actually change anything, annoyingly
write_eml(eml, outpath)

#get ids from initial submission
ids <- get_package(mn, id)
#update EML with new version using publish_update
id_new <- publish_update(mn, metadata_pid = ids$metadata, resource_map_pid = ids$resource_map, data_pids = ids$data, metadata_path = outpath)

These steps generate a resource map with this evidently problematic line:

  <rdf:Description rdf:about="https://cn-stage-2.test.dataone.org/cn/v1/resolve/knb.109095.1">
    <dcterms:identifier rdf:resource="&quot;knb.109095.1&quot;^^&lt;http://www.w3.org/2001/XMLSchema#string"/>
  </rdf:Description>

Link to package: https://dev.nceas.ucsb.edu/#view/urn:uuid:0c720fed-bbe9-4076-9e59-7636730b3d5a

Attached is a list of the 854 resource maps with this problem on the ADC. There are likely some on the KNB as well.

bad_rms_ADC.txt

@amoeba
Contributor

amoeba commented Nov 19, 2017

Awesome, thanks for the bug report. I'll take a look at this and see which piece of software this bug lives inside.

@jeanetteclark could you elaborate on this line of your code snippet?

eml@dataset@abstract@para@.Data[[2]] <- new('para', .Data = 'and edited using R') #this does not actually change anything, annoyingly

What does that mean?

@amoeba
Contributor

amoeba commented Nov 20, 2017

PS @jeanetteclark were you able to reproduce a resource map that had this bogus content?

<rdf:Description rdf:about="file:///tmp/RtmphWZjPl/_:r1510618411r30298r1">
    <foaf:name rdf:resource="file:///tmp/RtmphWZjPl/&quot;DataONE R Client&quot;^^&lt;http://www.w3.org/2001/XMLSchema#string"/>
</rdf:Description>

Specifically, the file:///tmp part

@amoeba
Contributor

amoeba commented Nov 20, 2017

Looks like the &quot; part of this bug is related to the custom resource map parsing routine I had to put in for arcticdatautils to support PROV a while back. This package uses that routine to update an existing resource map to the next version of the package. To do that, all the triples are loaded from the RDF/XML into a data.frame, and some simple logic updates only the triples related to data packaging (basically documents/isDocumentedBy, aggregates/isAggregatedBy, and a few more) while leaving the rest (e.g. PROV) intact. The routine I'm using spits out these rows:

> statements
                 subject                                       predicate                                                       object
4  _:r1511136883r13440r1                  http://xmlns.com/foaf/0.1/name "DataONE R Client"^^<http://www.w3.org/2001/XMLSchema#string
22 _:r1511136883r13440r1 http://www.w3.org/1999/02/22-rdf-syntax-ns#type                               http://purl.org/dc/terms/Agent

If you take a look at the Object column, you'll see the text:

"DataONE R Client"^^<http://www.w3.org/2001/XMLSchema#string

which looks to be the cause of the &quot;. @gothub you recently implemented some routine(s) similar to this in datapack. I haven't looked at them yet, but perhaps yours work better and arcticdatautils should switch to using them?
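The load-then-filter approach described above can be sketched in base R. This is a hypothetical reconstruction, not the actual arcticdatautils routine; the predicate URIs are the standard CITO/ORE ones, and the sample rows mirror the output shown above.

```r
# Hypothetical sketch of the triple-filtering approach described above.
# The sample rows mirror the printout earlier in this comment; this is
# not the real arcticdatautils code.
statements <- data.frame(
  subject   = c("_:r1511136883r13440r1", "_:r1511136883r13440r1"),
  predicate = c("http://xmlns.com/foaf/0.1/name",
                "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"),
  object    = c('"DataONE R Client"^^<http://www.w3.org/2001/XMLSchema#string',
                "http://purl.org/dc/terms/Agent"),
  stringsAsFactors = FALSE
)

# Packaging-related predicates get rewritten on update; everything else
# (e.g. PROV) is carried over verbatim.
packaging_predicates <- c(
  "http://purl.org/spar/cito/documents",
  "http://purl.org/spar/cito/isDocumentedBy",
  "http://www.openarchives.org/ore/terms/aggregates",
  "http://www.openarchives.org/ore/terms/isAggregatedBy"
)
carried_over <- statements[!(statements$predicate %in% packaging_predicates), ]

# Objects that still look like serialized N-Triples typed literals are the
# smoking gun: they get XML-escaped into &quot; on re-serialization.
malformed <- grepl('^".*"\\^\\^', carried_over$object)
```

Here both sample rows survive the filter, and the foaf:name row is flagged as a serialized literal.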

@amoeba
Contributor

amoeba commented Nov 20, 2017

Oh, and a PS: I didn't catch this bug during development/testing because I didn't try parsing registry-created resource maps and only tested on resource maps built in R.
PPS: Added an MRE at the top which shows what's going on a little more clearly.

@gothub

gothub commented Nov 20, 2017

@amoeba the datapack resource map parsing routines parseRDF() and getTriples() currently handle PROV statements.

I've been thinking about a way to 'repair' the resource maps using datapack, but I don't have all the details yet.

@jeanetteclark
Collaborator Author

eml@dataset@abstract@para@.Data[[2]] <- new('para', .Data = 'and edited using R') #this does not actually change anything, annoyingly

I think this is a problem with my EML code; for some reason it does not add a new paragraph to the EML.

I haven't been able to reproduce anything with the file:// yet, but I can try this morning. I have an idea about that one.

@amoeba
Contributor

amoeba commented Nov 20, 2017

PPS: At least the &quot; regression was very likely introduced in https://github.com/NCEAS/arcticdatautils/tree/v0.5.4

@amoeba
Contributor

amoeba commented Nov 20, 2017

Oh, thanks @gothub. I had looked a few times before and not seen those methods. From their names, it sounds like they'll work nicely.

@csjx
Member

csjx commented Nov 20, 2017

Here's the list of 1184 resource map identifiers and their upload dates. The list covers RDF documents with either file:// URIs or incorrect dcterms:identifier fields containing &quot; entities.

pids-with-bad-ids-and-uris.txt

Note that this ticket is a duplicate of https://github.nceas.ucsb.edu/KNB/arctic-data/issues/247 which describes the same problem.
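Affected resource maps can be found by scanning the serialized RDF/XML for either symptom. `has_bad_content` below is a hypothetical helper, not part of any package; fetching the text via `rawToChar(getObject(mn, pid))` assumes the dataone client used in the snippets above.

```r
# Sketch: flag a resource map whose serialized RDF/XML shows either symptom
# reported in this thread. `has_bad_content` is a hypothetical helper.
# Fetch the text with e.g.:
#   rdf_text <- rawToChar(getObject(mn, pid))   # dataone client, as above
has_bad_content <- function(rdf_text) {
  grepl("&quot;", rdf_text, fixed = TRUE) ||
    grepl("file:///tmp", rdf_text, fixed = TRUE)
}
```

Running a check like this over a node's resource maps is one way lists like the attachments here could be assembled.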

@gothub

gothub commented Nov 21, 2017

Does anyone have any clues about where the 'file:///...' strings were introduced in this workflow? I haven't found the source yet in the R client.

I'm also noticing that the dcterms:identifier triples have been converted to rdf:resources, which is incorrect; they should be literals. Here is a sample of an incorrect one:

  <rdf:Description rdf:about="https://cn.dataone.org/cn/v1/resolve/arctic-data.10018.1">
    <dcterms:identifier rdf:resource="file:///tmp/RtmppO7bqc/&quot;arctic-data.10018.1&quot;^^&lt;http://www.w3.org/2001/XMLSchema#string"/>
  </rdf:Description>

Which should be

  <rdf:Description rdf:about="https://cn.dataone.org/cn/v1/resolve/arctic-data.10018.1">
    <dcterms:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#string">arctic-data.10018.1</dcterms:identifier>
  </rdf:Description>
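That rewrite can be sketched as a plain-text repair of the serialized RDF/XML. `fix_identifier` is a hypothetical helper, not part of dataone or datapack, and a real repair should go through an RDF-aware API; this only illustrates the transformation:

```r
# Hypothetical repair: turn the broken rdf:resource form (optionally
# prefixed with a file:///tmp base URI) back into a proper typed literal.
# A real fix should use an RDF-aware library; this is only a sketch.
fix_identifier <- function(xml) {
  gsub(
    '<dcterms:identifier rdf:resource="(?:file://[^"&]*/)?&quot;([^&]+)&quot;\\^\\^&lt;([^"]*)"/>',
    '<dcterms:identifier rdf:datatype="\\2">\\1</dcterms:identifier>',
    xml,
    perl = TRUE
  )
}
```

Applied to the sample above, this yields exactly the corrected element shown.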

@csjx
Member

csjx commented Nov 21, 2017

@gothub My guess is that some processing code is inadvertently passing the file object reference to arcticdatautils::update_resource_map() instead of the identifier of the object, so R serializes a string from the object as best it can and ends up spitting out the file:///... URI of the object. This is similar to calling System.out.println(myObject) in Java and printing the object's memory address rather than its contents. Of course I'm speculating, but that is the direction I would look as a start.

@amoeba
Contributor

amoeba commented Nov 21, 2017

The parts of arcticdatautils I'd blame for the bad behavior are these hacked-together functions:

parse_resource_map <- function(path) {

filter_packaging_statements <- function(statements) {

parse_resource_map is particularly hackish.

@jagoldstein

https://arcticdata.io/catalog/#view/doi:10.18739/A2136S
I was able to add prov to the package found here ^ even though the first version was submitted via the registry and I later updated it with arcticdatautils. Peter speculates that this worked because the RDF update was performed prior to June 2017.

@jeanetteclark
Collaborator Author

I checked the old version of that RM and it did not have either the file:// or the &quot; strings.

@csjx
Member

csjx commented Nov 21, 2017

I found 154 more in the KNB:

pids-with-bad-ids-and-uris-knb.txt

@gothub

gothub commented Nov 21, 2017

The R packages dataone and datapack are being updated to repair the problems that we have seen with resource maps:

  • 'file:///' strings
  • &quot; strings
  • dcterms:identifiers as rdf:resources, not literals
  • multiple creators for the aggregation or no creator for the aggregation
  • anything else?

The workflow in R would be:

d1c <- D1Client("PROD", "urn:node:ARCTIC")
pkg <- getDataPackage(d1c, id="resource_map_doi:10.18739/A2XT16", lazy=TRUE, limit="0MB", quiet=FALSE, repair=TRUE)

The package relationships would then be manually inspected to verify correctness.

pkg

Then the package is uploaded, with only the resource map being updated:

newId <- uploadDataPackage(d1c, pkg, quiet=FALSE)

Once this has been done for a representative sample of the affected resource maps, then the
process can be automated to update the rest.

@jagoldstein

@amoeba
I am experimenting with updating EMLs with arcticdatautils both before and AFTER we have added prov. It seems the issue may only apply to RDFs that were updated via that library after June 2017.

A patch may be in order to prevent RDFs updated through arcticdatautils from inhibiting the addition of prov relationships. This may not be news to you nor very helpful info, but I am documenting my 2 cents here.

@amoeba
Contributor

amoeba commented Nov 21, 2017

Thanks @jagoldstein, that is news and is helpful. I'm not actively working on this but I'm watching this thread so the extra info is helpful.

@dmullen17
Member

@gothub
I found another way that the resource map error can crop up:
I uploaded a data package to the Arctic Data Center with arcticdatautils using publish_object and create_resource_map. The error is not present in this first resource map.
However, if I use publish_update to give the data package a DOI, the new resource map exhibits the error, even though this data package did not originate from the registry. The same error comes up if I publish an XML using a pre-generated DOI.

amoeba added a commit that referenced this issue Nov 22, 2017
This commit addresses the cause of #51

The previous version of `parse_resource_map` used a SPARQL
query to pull the triples out of the resource map being updated
but didn't actually parse the result in an RDF-aware manner.
This led to &quot; characters ending up in various places in
the RDF/XML which causes tons of issues.

The new routine simply uses datapack::parseRDF instead and
just filters out cito:documents, cito:isDocumentedBy, dcterms:identifier, and the DataONE R Client statement.
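The filtering described in this commit message could be sketched as follows. This is a reconstruction from the message, not the actual code; the predicate URIs are the standard CITO/DCTERMS ones.

```r
# Sketch of the statements the new routine filters out before re-adding
# fresh packaging triples (reconstructed from the commit message above).
filtered_predicates <- c(
  "http://purl.org/spar/cito/documents",
  "http://purl.org/spar/cito/isDocumentedBy",
  "http://purl.org/dc/terms/identifier"
)

# The "DataONE R Client" foaf:name statement is matched by its object value.
is_client_statement <- function(object) {
  grepl("DataONE R Client", object, fixed = TRUE)
}
```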
laijasmine pushed a commit that referenced this issue Oct 2, 2020