europeana/sparql-updater

SPARQL updater

Software that automatically fills a Virtuoso DB with Europeana datasets from the Europeana FTP server and updates those datasets regularly.

Steps for local testing

  1. Check that the configuration file /src/main/resources/sparql-updater.user.properties is present and that its values are correct.
  2. Run mvn clean install to create the file /target/sparql-updater.jar. This jar contains the code that automatically loads datasets from the Europeana FTP server and writes them to Virtuoso. It also checks regularly whether datasets were modified and, if so, updates Virtuoso by uploading the changed sets again and deleting the old ones.
  3. Run docker build . -t europeana/sparql to create a Docker image containing both Virtuoso and the sparql-updater.jar. This image will now contain the sparql-updater.user.properties file, so don't push it to Docker Hub!
  4. Start the container using the file docker-compose-localtest.yml. The Virtuoso GUI will be available at http://localhost:8890/
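
Step 1 can be partly automated. The following is a minimal sketch, not part of the repository, that checks whether the properties file exists and reports keys that have no value. It assumes plain key=value lines (no ":" separators or line continuations) and UTF-8 encoding; the function names are hypothetical, and it cannot tell whether a non-empty value is actually correct.

```python
from __future__ import annotations

from pathlib import Path

# Path taken from step 1, relative to the repository root.
PROPERTIES_FILE = Path("src/main/resources/sparql-updater.user.properties")


def parse_properties(text: str) -> dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def empty_keys(props: dict[str, str]) -> list[str]:
    """Return the keys whose value is empty, in file order."""
    return [key for key, value in props.items() if not value]


def check_properties(path: Path = PROPERTIES_FILE) -> list[str]:
    """Raise if the file is missing, otherwise return the keys with empty values."""
    if not path.is_file():
        raise FileNotFoundError(f"Missing configuration file: {path}")
    # Assumption: the file is UTF-8 encoded.
    return empty_keys(parse_properties(path.read_text(encoding="utf-8")))
```

Calling check_properties() from the repository root before running mvn clean install returns an empty list when every key has a value.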

Some things to be aware of:

  • Loading all Europeana datasets will require around 150GB of disk space!
  • For local testing we use a hard-coded password (see the DBA_PASSWORD variable in the docker-compose-localtest.yml file). For production, change the credentials in this .yml file and in the user.properties file.
  • After startup, a folder named /database is created relative to the startup location. It contains the Virtuoso database files and a folder named tmp-ingest, which holds the files downloaded from the FTP server and the files generated by the sparql-updater for ingestion. These files are deleted automatically when they are no longer needed.
  • You can check which datasets are loaded using this SPARQL query: SELECT DISTINCT ?g WHERE { GRAPH ?g {?s a ?o} }
  • You can use the DELETE_VIRTUOSO_DB=true environment variable to clear the Virtuoso database on startup.
  • Virtuoso recommends configuring the environment variables VIRT_NumberOfBuffers and VIRT_MaxDirtyBuffers according to the amount of available RAM (see also the Virtuoso performance tuning tutorial). Doing this reduces the number of messages Virtuoso emits during a full update (e.g. the ones about "Write wait on column page x...."). The settings do not seem to have any effect on performance, though.
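
The graph-listing query above can also be run from a script. This is a minimal sketch using only the Python standard library. It assumes the SPARQL endpoint is served at /sparql on the same host and port as the GUI, which this README does not state, and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the SPARQL endpoint lives at /sparql next to the GUI.
ENDPOINT = "http://localhost:8890/sparql"

# Query copied from the list above.
GRAPHS_QUERY = "SELECT DISTINCT ?g WHERE { GRAPH ?g {?s a ?o} }"


def build_url(endpoint: str = ENDPOINT, query: str = GRAPHS_QUERY) -> str:
    """Encode the query as a GET request in the style of the SPARQL 1.1 Protocol."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def graph_names(results: dict) -> list:
    """Extract the ?g bindings from a SPARQL JSON results document."""
    return [row["g"]["value"] for row in results["results"]["bindings"]]


def list_loaded_graphs(endpoint: str = ENDPOINT) -> list:
    """Run the query against a live endpoint and return the graph IRIs."""
    request = urllib.request.Request(
        build_url(endpoint),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return graph_names(json.load(response))
```

With the container running, print("\n".join(list_loaded_graphs())) prints one graph IRI per line. Parsing is kept separate from the HTTP call so it can be checked without a running database.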

If you make (configuration) changes to the sparql-updater, don't forget to:

  1. Rebuild the jar
  2. Rebuild the Docker image
  3. Recreate the container
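
The three steps can be chained in a small helper so none is forgotten. This is a sketch under assumptions: the mvn and docker build commands come from the local testing steps, but the docker compose invocation is a guess, since this README names the compose file without giving the exact command. The helper name is hypothetical.

```python
import subprocess

# Steps 1 and 2 use the commands from "Steps for local testing".
# Step 3 is an assumption: the README does not give the compose command.
REBUILD_STEPS = [
    ["mvn", "clean", "install"],  # 1. rebuild the jar
    ["docker", "build", ".", "-t", "europeana/sparql"],  # 2. rebuild the image
    ["docker", "compose", "-f", "docker-compose-localtest.yml",
     "up", "--detach", "--force-recreate"],  # 3. recreate the container
]


def rebuild(dry_run: bool = True) -> list:
    """Run the rebuild steps in order, or only list them when dry_run is True."""
    executed = []
    for argv in REBUILD_STEPS:
        executed.append(" ".join(argv))
        if not dry_run:
            # check=True raises at the first failing step, so later steps are skipped.
            subprocess.run(argv, check=True)
    return executed
```

rebuild() only returns the command lines. Call rebuild(dry_run=False) from the repository root to execute them.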
