Docker: Make fresh RDF load for virtuoso #10
E.g. using
I'm getting the same error using Docker and tenforce/virtuoso, which is currently on version 07.20.3212. Everything but DESCRIBE is working. So far I haven't been able to resolve the issue:
Setting sql:describe-mode fixes the issue temporarily; see http://docs.openlinksw.com/virtuoso/rdfsparql.html#rdfsqlfromsparqldescribe. However, I was unable to change the mode permanently as described in https://www.mail-archive.com/[email protected]/msg06396.html. Finally, I updated to a later build (version 07.20.3214) and DESCRIBE seems to work fine now.
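For reference, the per-query workaround looks roughly like this when run through isql (a sketch: the resource IRI is just a placeholder, and "CBD" is one of the describe modes documented at the link above):

```sql
-- Override the DESCRIBE mode for a single query with a pragma
-- ("CBD" = Concise Bounded Description; see the linked Virtuoso docs
-- for the other modes). The IRI below is a placeholder.
SPARQL
  DEFINE sql:describe-mode "CBD"
  DESCRIBE <http://example.org/resource>;
```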
Thanks for the details! I have triggered a new build on https://hub.docker.com/r/stain/virtuoso/ to see if this solves it for me as well. I keep getting confused by the many builds of the same "version" of Virtuoso.
Making a raw dump with Virtuoso likely produces an internal Virtuoso format. If that is the case, it might differ across versions (open source vs. commercial, and/or version number). We can instead load and dump quads in an open format. dumping-your-virtuoso-data-as-quads and loading-quads-in-virtuoso describe how to dump and load quads in the tenforce/virtuoso Docker image; all manual commands should translate to other Virtuoso installations, and the automatic loading is a handy extra.

I did not find quads in http://data.openphacts.org/1.5/rdf/; perhaps it would make sense to provide tar.gz dumps of the triplestore contents in quad format? I suppose that would fix the DESCRIBE queries.
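Roughly, the round trip looks like this in isql (a sketch: it assumes the dump_nquads stored procedure from the OpenLink documentation has been installed, and the paths and file mask are illustrative):

```sql
-- Dump all named graphs as gzipped N-Quads chunks (assumes the
-- dump_nquads procedure from the OpenLink docs has been created).
dump_nquads ('/data/dumps', 1, 100000000, 1);

-- Register the dump files for bulk loading; for quad files the target
-- graph is expected to come from the data itself, so no graph IRI is
-- given here (pass one when loading triple formats instead).
ld_dir ('/data/dumps', '*.nq.gz', NULL);

-- Run the bulk loader and persist the result.
rdf_loader_run ();
checkpoint;
```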
Thanks for your suggestions. We load the 1.5 data into named graphs with staging.sql. I agree it would make sense to preprocess the files and redistribute them as nquads, and thus not even be bound to Virtuoso. For instance, if you want to replicate the Open PHACTS RDF in an Apache Jena Fuseki installation, you currently need to do the equivalent of staging.sql with Java code.

I did make alternative nquads dumps using Jena. However, I found that nquads loading in Virtuoso was about as slow as parsing the other formats (Turtle, RDF/XML) - i.e. easily a couple of days on a regular disk, or many hours on an SSD raid. The backup dumps, however, load in less than 20 minutes, as they are basically just concatenated disk blocks in native format. But as we have found out, this comes at the risk of not being consistent between the commercial and open source versions. How large a dataset have you used the nquads loading with? Perhaps it is faster with a fixed number of quads per file, as in the dump procedure you describe.

The current Docker setup of Open PHACTS loads from a backup dump - dumped from the Open PHACTS developer instance running the commercial Virtuoso. The problem is that, as noted below, such a dump might not be storage-compatible with the open source Virtuoso. There is also one issue in that the current database dump from dev is not necessarily consistent with the RDF files, e.g. see openphacts/GLOBAL/#297 (private bug :-( ) - while a fresh load from fixed RDF dumps means you only get what you loaded and nothing older.

It is also nice to have the loading split per graph/datasource, so it can be customized later, e.g. a client might want to ignore the chembl dataset from Open PHACTS to use a newer or older version (see the sketch below). And thus this needs to be changed to use some from-RDF staging using Virtuoso's RDF parsing, currently using staging.sh from stain/virtuoso, which does pretty much the same thing.

To help keep install times down, a new dump can still be made and used for the Open PHACTS installation, as long as we lock down the version numbers of the open source Virtuoso and the Docker images (as we can't redistribute the commercial Virtuoso image).
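A per-datasource staging step could look roughly like this in isql (a hypothetical sketch: the directories and graph IRIs are placeholders, not the actual Open PHACTS graph names or the contents of staging.sql):

```sql
-- Register each datasource directory against its own named graph, so a
-- deployment can skip or pin an individual dataset (paths and graph
-- IRIs are illustrative placeholders).
ld_dir ('/staging/chembl',  '*.ttl.gz', 'http://example.org/graph/chembl');
ld_dir ('/staging/uniprot', '*.ttl.gz', 'http://example.org/graph/uniprot');

-- One loader run and checkpoint cover everything that was registered;
-- extra rdf_loader_run() sessions can be started in parallel isql
-- connections to speed things up.
rdf_loader_run ();
checkpoint;
```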
BTW, in case you were going to try the load.sql with the Open PHACTS data (you'll need a decent machine and a weekend), there are two versions:
After executing the second I get these errors, so that's what I'm investigating at the moment:
I don't know yet if these errors also happen with the original RDF files, or if this is a bug in Virtuoso or in the Jena tooling that produced the nq files.
We haven't experimented with data dumps of this size in the current Virtuoso. Perhaps hosting an image with the Open PHACTS data loaded into it could be a solution for those who want fast load times? Not depending on a single vendor is likely a good thing. However, if it takes excessively long to load the contents, then it might not be usable at all. It is not hard to provide the dumps in the various formats, but it is additional overhead on your end.
Thanks for the suggestion. We have talked about making an AWS image as a shortcut (#11), but part of the reason for going with Docker here was the ability to independently update and customize individual pieces of software (e.g. Virtuoso) or data (the RDF loaded); a VM image generally locks those things down.

I think a backup dump from the open source Virtuoso should load fine in a commercial install of the same or a newer version, as the open source version is basically just lagging behind the commercial one in features. Distributing the processed RDF data as new nquad dumps (with the expected graph names used by Open PHACTS) is probably still useful for platform independence; I will raise that as a separate issue.
..as the backup dump from http://data.openphacts.org/1.5/virtuoso/ is made on the commercial Virtuoso, it might not be storage-compatible with the open source Virtuoso.