Docker: Make fresh RDF load for virtuoso #10

Open
stain opened this issue Sep 10, 2015 · 9 comments

Comments

@stain
Contributor

stain commented Sep 10, 2015

As the backup dump from http://data.openphacts.org/1.5/virtuoso/ was made on commercial Virtuoso, it might not be storage-compatible with open source Virtuoso.

@stain
Contributor Author

stain commented Sep 10, 2015

E.g. using DESCRIBE in SPARQL fails:

http://heater.cs.man.ac.uk:3003/sparql?default-graph-uri=&query=DESCRIBE+%3Chttp%3A%2F%2Frdf.wikipathways.org%2FPathway%2FWP1907_r76850%3E&format=text%2Fturtle&timeout=0&debug=on

Virtuoso 37000 Error SP031: SPARQL: Internal error: corrupted metadata: PointerToCorrupted

SPARQL query:
define sql:big-data-const 0 
#output-format:text/turtle
define sql:signal-void-variables 1 DESCRIBE <http://rdf.wikipathways.org/Pathway/WP1907_r76850>
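To re-run the same request against another instance, the URL above can be rebuilt programmatically. This is only a minimal Python sketch: `describe_url` is a hypothetical helper name, and the parameter names and values are simply copied from the URL quoted in this comment.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL quoted above; replace with your own instance.
ENDPOINT = "http://heater.cs.man.ac.uk:3003/sparql"


def describe_url(resource_iri: str, endpoint: str = ENDPOINT) -> str:
    """Build a GET URL asking the endpoint to DESCRIBE one resource as Turtle."""
    params = [
        ("default-graph-uri", ""),
        ("query", f"DESCRIBE <{resource_iri}>"),
        ("format", "text/turtle"),
        ("timeout", "0"),
        ("debug", "on"),
    ]
    # urlencode() percent-encodes the IRI and turns spaces into '+'.
    return f"{endpoint}?{urlencode(params)}"


if __name__ == "__main__":
    print(describe_url("http://rdf.wikipathways.org/Pathway/WP1907_r76850"))
```

Fetching the printed URL (e.g. with a browser or `urllib.request`) should show whether a given build still raises SP031.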

@keski

keski commented Sep 30, 2015

I'm getting the same error using Docker and tenforce/virtuoso, which is currently using version 07.20.3212. Everything but DESCRIBE is working. So far I haven't been able to resolve the issue:

Virtuoso 37000 Error SP031: SPARQL: Internal error: corrupted metadata: PointerToCorrupted

SPARQL query:
define sql:big-data-const 0 
#output-format:text/turtle
define sql:signal-void-variables 1 DESCRIBE <http://example.org#Pizza>

@keski

keski commented Oct 1, 2015

Setting sql:describe-mode fixes the issue temporarily:

SQL>SPARQL
define sql:describe-mode "SPO"
DESCRIBE <http://example.org#Pizza>

See http://docs.openlinksw.com/virtuoso/rdfsparql.html#rdfsqlfromsparqldescribe

However, I was unable to change the mode permanently as described in https://www.mail-archive.com/[email protected]/msg06396.html

Finally, I updated to a later build (version: 07.20.3214) and DESCRIBE seems to work fine now.
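For anyone stuck on an affected build, the temporary workaround above could be applied per query from client code. A minimal Python sketch, assuming the `define` pragma is accepted in front of the query the same way as in the isql session shown above; `with_describe_mode` is a hypothetical helper name.

```python
def with_describe_mode(query: str, mode: str = "SPO") -> str:
    """Prefix a SPARQL query with Virtuoso's sql:describe-mode pragma."""
    return f'define sql:describe-mode "{mode}"\n{query.strip()}'


if __name__ == "__main__":
    print(with_describe_mode("DESCRIBE <http://example.org#Pizza>"))
```

This only wraps the query text; it does not change the server default, which is the part that could not be made permanent here.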

@stain
Contributor Author

stain commented Oct 1, 2015

Thanks for the details! I have triggered a new build on https://hub.docker.com/r/stain/virtuoso/ to see if this solves it for me as well. I keep getting confused by the many versions of the same "version" of Virtuoso..

@madnificent

Making a raw dump using Virtuoso likely dumps to an internal Virtuoso format. If that is the case, it might differ across editions (open source, commercial) and/or version numbers. We can load and dump quads in an open format for this.

dumping-your-virtuoso-data-as-quads and loading-quads-in-virtuoso describe how you can dump and load quads in the tenforce/virtuoso docker. All manual commands should translate to other Virtuoso installations. Automatic loading is a handy extra.

I did not find quads in http://data.openphacts.org/1.5/rdf/, perhaps it would make sense to provide tar.gz dumps of the triplestore contents in quad format? I suppose that would fix the DESCRIBE queries.

@stain
Contributor Author

stain commented Oct 8, 2015

Thanks for your suggestions, dump_nquads is probably relevant!

We load the 1.5 data to named graphs with staging.sql

I agree it would make sense to preprocess the files and redistribute them as nquads, and thus not even be bound to Virtuoso. For instance, if you want to replicate the Open PHACTS RDF in an Apache Jena Fuseki installation, you currently have to do the equivalent of staging.sql with Java code.

I did make alternative nquads dumps using Jena's riot (to test if they were faster to load), but these are all in the default graph rather than correct graphs (except in nx_np.tar).
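One way to put such dumps into the correct graphs is to convert to N-Triples first and then append the graph name to every statement. The following is a rough Python sketch under stated assumptions: the input is N-Triples with one statement per line and no trailing comments, `triples_to_quads` is a hypothetical helper, and the graph IRI in the usage example is a placeholder rather than a real Open PHACTS graph name.

```python
from typing import Iterable, Iterator


def triples_to_quads(lines: Iterable[str], graph_iri: str) -> Iterator[str]:
    """Yield N-Quads lines by appending a graph label to each N-Triples statement.

    Assumes one statement per line, terminated by '.', with no trailing comment.
    """
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and whole-line comments
        if not stripped.endswith("."):
            raise ValueError(f"not an N-Triples statement: {stripped!r}")
        statement = stripped[:-1].rstrip()  # drop the terminating '.'
        yield f"{statement} <{graph_iri}> ."


if __name__ == "__main__":
    sample = ['<http://example.org/s> <http://example.org/p> "o" .']
    for quad in triples_to_quads(sample, "http://example.org/graph"):
        print(quad)
```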

However, I found that loading nquads in Virtuoso was about as slow as parsing the other formats (Turtle, RDF/XML), i.e. easily a couple of days on a regular disk, or many hours on SSD RAID.

The backup dumps, however, load in less than 20 minutes, as loading them is basically just concatenating disk blocks in native format. But as we have found out, this comes at the risk of not being consistent between the commercial and open source versions.

How large a dataset have you used the nquads loading with? Perhaps it is faster with the fixed number of quads per file from dump_nquads, whereas the nq files I have from direct conversion are up to 1.5 GB, which means each RDF loader thread runs for a very long time once it hits the big files. Which file size have you found works well, and how many loader threads?
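To test whether smaller files help, the big nq files could be cut into fixed-size pieces; N-Quads is line-based, so splitting on line boundaries keeps every statement intact. A minimal Python sketch: `split_nquads` is a hypothetical helper, and the default of one million lines per chunk is an arbitrary assumption, not a value anyone in this thread has measured.

```python
import gzip
from pathlib import Path
from typing import List


def split_nquads(source: Path, out_dir: Path, lines_per_chunk: int = 1_000_000) -> List[Path]:
    """Split a gzipped N-Quads file into gzipped chunks of at most lines_per_chunk lines."""
    if lines_per_chunk < 1:
        raise ValueError("lines_per_chunk must be positive")
    out_dir.mkdir(parents=True, exist_ok=True)
    suffix = ".nq.gz"
    stem = source.name[: -len(suffix)] if source.name.endswith(suffix) else source.name
    chunks: List[Path] = []
    out = None
    try:
        with gzip.open(source, "rt", encoding="utf-8") as src:
            for index, line in enumerate(src):
                if index % lines_per_chunk == 0:
                    # Start a new chunk file every lines_per_chunk lines.
                    if out is not None:
                        out.close()
                    path = out_dir / f"{stem}-{len(chunks) + 1:05d}{suffix}"
                    out = gzip.open(path, "wt", encoding="utf-8")
                    chunks.append(path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return chunks
```

For example, `split_nquads(Path("drugbank.nq.gz"), Path("chunks"))` would write `chunks/drugbank-00001.nq.gz` and so on, giving the loader threads more evenly sized units of work.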

The current Docker setup of Open PHACTS loads from backup dump - dumped from the Open PHACTS developer instance running the commercial Virtuoso.

The problem is that I can't run dump_nquads on the developer instance now, as it has since moved on - although I could of course do it on a second instance on different ports and folders and start with the database dump.

There is also one issue in that the current database dump from dev is not necessarily consistent with the RDF files, e.g. see openphacts/GLOBAL/#297 (private bug :-( ) - while a fresh load from fixed RDF dumps means you only get what you loaded and nothing older.

It is also nice to have loading split per graph/datasource, so it can be customized later, e.g. a client might want to ignore chembl dataset from Open PHACTS to use a newer or older version.
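As an illustration of per-datasource loading, the bulk-loader statements could be generated from a dataset-to-graph mapping so that a client can leave one dataset out. This is a hedged Python sketch, not the actual staging.sql: it assumes Virtuoso's bulk loader functions `ld_dir()` and `rdf_loader_run()`, a `/staging/<dataset>` folder layout, and placeholder graph IRIs; `staging_sql` is a hypothetical helper.

```python
from typing import Dict, Iterable


def staging_sql(datasets: Dict[str, str], staging_root: str = "/staging",
                mask: str = "*.nq.gz", skip: Iterable[str] = ()) -> str:
    """Render ld_dir() registrations for every dataset not listed in skip.

    Values are interpolated as-is, so they must not contain single quotes.
    """
    skipped = set(skip)
    lines = []
    for name, graph in sorted(datasets.items()):
        if name in skipped:
            continue  # e.g. a client supplying its own version of this dataset
        lines.append(f"ld_dir('{staging_root}/{name}', '{mask}', '{graph}');")
    lines.append("rdf_loader_run();")
    lines.append("checkpoint;")
    return "\n".join(lines)


if __name__ == "__main__":
    # Graph IRIs below are placeholders, not the real Open PHACTS graph names.
    graphs = {
        "chembl": "http://example.org/graph/chembl",
        "drugbank": "http://example.org/graph/drugbank",
    }
    print(staging_sql(graphs, skip=["chembl"]))
```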

And thus this needs to be changed to use some from-rdf staging using Virtuoso's RDF parsing, currently using staging.sh from stain/virtuoso which does pretty much as

To help keep install times down, a new dump can still be made and used for Open PHACTS installation, as long as we lock down the version numbers of the open source Virtuoso and Docker images (as we can't redistribute the commercial Virtuoso image).

@stain
Contributor Author

stain commented Oct 8, 2015

BTW, in case you were going to try the load.sql with the Open PHACTS data (you'll need a decent machine and a weekend), there are two versions:

After executing the second I get these errors, so that's what I'm investigating at the moment:

virtuosostagingrdf_1 | 10:48:10 PL LOG:  File /staging/WP/README.nq.gz error 37000 SP029: NQuads RDF loader, line 1: syntax error processed pending to here.
virtuosostagingrdf_1 | 10:48:12 PL LOG:  File /staging/WP/WPREACTRDF/WP1811_r77053.nq.gz error 37000 SP029: NQuads RDF loader, line 1033: syntax error processed pending to here.
virtuosostagingrdf_1 | 10:48:12 PL LOG:  File /staging/WP/WPREACTRDF/WP1812_r76969.nq.gz error 37000 SP029: NQuads RDF loader, line 395: syntax error processed pending to here.
virtuosostagingrdf_1 | 10:48:12 PL LOG:  File /staging/WP/WPREACTRDF/WP1813_r77051.nq.gz error 37000 SP029: NQuads RDF loader, line 2757: Undefined namespace prefix at p://www.w3.org/1999/02/22-rdf-syntax-ns#type processed pending to here.
virtuosostagingrdf_1 | 10:48:18 PL LOG:  File /staging/WP/WPREACTRDF/WP1889_r77003.nq.gz error 37000 SP029: NQuads RDF loader, line 437: Undefined namespace prefix at tp://vocabularies.wikipathways.org/gpml#zorder processed pending to here.
virtuosostagingrdf_1 | 10:48:26 PL LOG:  File /staging/WP/WPREACTRDF/WP1928_r76893.nq.gz error 37000 SP029: NQuads RDF loader, line 7975: syntax error processed pending to here.
virtuosostagingrdf_1 | 10:48:31 PL LOG:  File /staging/WP/WPREACTRDF/WP2658_r76836.nq.gz error 37000 SP029: NQuads RDF loader, line 2035: syntax error processed pending to here.
virtuosostagingrdf_1 | 10:48:48 PL LOG:  File /staging/WP/WPREACTRDF/WP2710_r76924.nq.gz error 37000 SP029: NQuads RDF loader, line 2630: syntax error processed pending to here.
virtuosostagingrdf_1 | 10:48:48 PL LOG:  File /staging/WP/WPREACTRDF/WP2683_r76880.nq.gz error 37000 SP029: NQuads RDF loader, line 49: Undefined namespace prefix at ttp://rdf.wikipathways.org/Pathway/WP2683_r76880/GpmlLabel/c872d1c2-60fa-4
virtuosostagingrdf_1 | 10:48:48 PL LOG:  File /staging/WP/WPREACTRDF/WP2684_r76883.nq.gz error 37000 SP029: NQuads RDF loader, line 839: syntax error processed pending to here.
virtuosostagingrdf_1 | 10:48:49 PL LOG:  File /staging/WP/WPREACTRDF/WP2715_r76932.nq.gz error 37000 SP029: NQuads RDF loader, line 685: syntax error processed pending to here.
virtuosostagingrdf_1 | 10:48:52 PL LOG:  File /staging/WP/WPREACTRDF/WP2737_r76970.nq.gz error 37000 SP029: NQuads RDF loader, line 145: Undefined namespace prefix at http://identifiers.org/ncbigene/6157 processed pending to here.
virtuosostagingrdf_1 | 10:48:55 PL LOG:  File /staging/WP/voidInteractions.nq.gz error 37000 SP029: NQuads RDF loader, line 1: syntax error processed pending to here.
virtuosostagingrdf_1 | 10:48:57 PL LOG:  File /staging/caloha/caloha.nq.gz error 37000 SP029: NQuads RDF loader, line 1: syntax error processed pending to here.
virtuosostagingrdf_1 | 10:55:34 PL LOG:  File /staging/aers/faers-of-2012-generated-on-2012-07-09.nq.gz error 37000 SP029: NQuads RDF loader, line 16517015: syntax error processed pending to here.
virtuosostagingrdf_1 | 12:00:16 PL LOG:  File /staging/drugbank/drugbank.nq.gz error 37000 SP029: NQuads RDF loader, line 27164: syntax error processed pending to here.
virtuosostagingrdf_1 | 12:00:45 PL LOG:  File /staging/nx_np/nextprot-01001-02000.nq.gz error 37000 SP029: NQuads RDF loader, line 150261: syntax error processed pending to here.
virtuosostagingrdf_1 | 12:00:46 PL LOG:  File /staging/nx_np/nextprot-02001-03000.nq.gz error 37000 SP029: NQuads RDF loader, line 90637: syntax error processed pending to here.
virtuosostagingrdf_1 | 12:00:47 PL LOG:  File /staging/nx_np/nextprot-03001-04000.nq.gz error 37000 SP029: NQuads RDF loader, line 19717: syntax error processed pending to here.
virtuosostagingrdf_1 | 12:01:53 PL LOG:  File /staging/nx_np/nextprot-04001-05000.nq.gz error 37000 SP029: NQuads RDF loader, line 2133629: syntax error processed pending to here.
virtuosostagingrdf_1 | 12:01:54 PL LOG:  File /staging/nx_np/nextprot-05001-06000.nq.gz error 37000 SP029: NQuads RDF loader, line 76123: syntax error processed pending to here.
virtuosostagingrdf_1 | 12:08:03 PL LOG:  File /staging/nx_np/nextprot-07001-08000.nq.gz error 37000 SP029: NQuads RDF loader, line 54964: syntax error processed pending to here.
virtuosostagingrdf_1 | 12:26:12 PL LOG:  File /staging/nx_np/nextprot-10001-11000.nq.gz error 37000 SP029: NQuads RDF loader, line 1177063: syntax error processed pending to here.
virtuosostagingrdf_1 | 12:26:56 PL LOG:  File /staging/nx_np/nextprot-00001-01000.nq.gz error 37000 SP029: NQuads RDF loader, line 1108725: syntax error processed pending to here.
virtuosostagingrdf_1 | 12:26:57 PL LOG:  File /staging/nx_np/nextprot-12001-13000.nq.gz error 37000 SP029: NQuads RDF loader, line 123438: syntax error processed pending to here.
virtuosostagingrdf_1 | 12:27:03 PL LOG:  File /staging/nx_np/nextprot-13001-14000.nq.gz error 37000 SP029: NQuads RDF loader, line 790806: syntax error processed pending to here.
virtuosostagingrdf_1 | 12:27:18 PL LOG:  File /staging/nx_np/nextprot-11001-12000.nq.gz error 37000 SP029: NQuads RDF loader, line 1507387: syntax error processed pending to here.
virtuosostagingrdf_1 | 12:27:35 PL LOG:  File /staging/nx_np/nextprot-16001-17000.nq.gz error 37000 SP029: NQuads RDF loader, line 140761: syntax error processed pending to here.
virtuosostagingrdf_1 | 12:29:06 PL LOG:  File /staging/nx_np/nextprot-14001-15000.nq.gz error 37000 SP029: NQuads RDF loader, line 3619693: syntax error processed pending to here.
virtuosostagingrdf_1 | 12:31:27 PL LOG:  File /staging/nx_np/nextprot-19001-20000.nq.gz error 37000 SP029: NQuads RDF loader, line 373101: Undefined namespace prefix at tp://purl.uniprot.org/cosmic/349315 processed pending to here.
virtuosostagingrdf_1 | 12:31:28 PL LOG:  File /staging/nx_np/nextprot-20001-20200.nq.gz error 37000 SP029: NQuads RDF loader, line 65711: syntax error processed pending to here.
virtuosostagingrdf_1 | 13:46:00 PL LOG:  File /staging/void/WP_voidInteractions.nq.gz error 37000 SP029: NQuads RDF loader, line 1: syntax error processed pending to here.
virtuosostagingrdf_1 | 13:46:07 PL LOG:  File /staging/void/aers_void.nq.gz error 37000 SP029: NQuads RDF loader, line 1: syntax error processed pending to here.
virtuosostagingrdf_1 | 13:46:28 PL LOG:  File /staging/void/wp_void_2015_03_12.nq.gz error 37000 SP029: NQuads RDF loader, line 1: syntax error processed pending to here.

I don't know yet if these errors also happen with the original RDF files, or if this is a bug in Virtuoso, or in Jena, which produced the nq files.
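To narrow this down, the loader log can be reduced to (file, line, message) tuples and grouped by kind of failure. A rough Python sketch, assuming every relevant log line follows the `File ... error ... NQuads RDF loader, line N: ...` shape seen above; the names `LoaderError`, `parse_loader_errors` and `summarise` are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, NamedTuple


class LoaderError(NamedTuple):
    path: str
    line: int
    message: str


# Matches the part of a log line after "PL LOG:", as in the output above.
PATTERN = re.compile(
    r"File (?P<path>\S+) error \d+ \w+: NQuads RDF loader, "
    r"line (?P<line>\d+): (?P<message>.*)"
)


def parse_loader_errors(log_lines: Iterable[str]) -> List[LoaderError]:
    """Extract file, line number and message from each loader error line."""
    errors = []
    for raw in log_lines:
        match = PATTERN.search(raw)
        if match:
            errors.append(
                LoaderError(match["path"], int(match["line"]), match["message"].strip())
            )
    return errors


def summarise(errors: Iterable[LoaderError]) -> Counter:
    """Count errors by kind: undefined prefixes, failures on line 1, other syntax errors."""
    kinds: Counter = Counter()
    for err in errors:
        if err.message.startswith("Undefined namespace prefix"):
            kinds["undefined namespace prefix"] += 1
        elif err.line == 1:
            kinds["syntax error on line 1"] += 1
        else:
            kinds["syntax error"] += 1
    return kinds
```

With the tuples in hand, the reported line can be pulled out of the corresponding .nq.gz file and compared with the same statement in the original RDF, which should help tell a conversion problem from a loader problem.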

@madnificent

We haven't experimented with data dumps of this size in the current Virtuoso. Perhaps hosting an image with the openphacts data loaded into it could be a solution for those that want fast load times?

Not depending on a single vendor is likely a good thing. However if it takes excessively long to load the contents, then it might not be usable at all. It is not hard to provide the dumps in the various formats, but it is an additional overhead from your end.

@stain
Contributor Author

stain commented Oct 12, 2015

Thanks for the suggestion. We have talked about making an AWS image as a shortcut, #11 -- but part of the reason for going with Docker here was for the ability to independently update and customize individual pieces of software (e.g. Virtuoso) or data (the RDF loaded) - a VM image generally locks those things down.

I think a backup dump from the open source Virtuoso should load fine in a commercial install of the same or a newer version - as basically the open source version is just lagging behind the commercial in features - and distributing processed RDF data in new nquads dumps (with the expected graph names used by Open PHACTS) is probably still useful for platform independence. I will raise that as a separate issue.
