Extracted from https://addshore.com/2019/10/your-own-wikidata-query-service-with-no-limits/.
Loading a completely fresh TTL dump into a blank query service is currently a slow process. In production (wikidata.org) it takes roughly a week, and I saw similar timings on GCE even while streamlining the process as best I could.
For a dump taken at the end of 2018 the timings for each stage were as follows:
Data dump download: 2 hours
Data Munge: 20 hours
Data load: 4.5 days
Total time: ~5.5 days
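To make the breakdown concrete, the short sketch below redoes the arithmetic on the stage timings listed above. The dictionary and the `summarize` helper are illustrative names, not part of any Wikidata or Blazegraph tooling, and the only inputs are the three durations from this issue.

```python
# Stage durations in hours, copied from the timings listed above.
STAGE_HOURS = {
    "Data dump download": 2.0,
    "Data Munge": 20.0,
    "Data load": 4.5 * 24,  # 4.5 days expressed in hours
}


def summarize(stage_hours):
    """Return the total duration and each stage's share of that total."""
    total = sum(stage_hours.values())
    shares = {name: hours / total for name, hours in stage_hours.items()}
    return total, shares


total_hours, shares = summarize(STAGE_HOURS)
print(f"Total: {total_hours:.0f} hours (~{total_hours / 24:.1f} days)")
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

The stages sum to 130 hours, or about 5.4 days, which matches the rounded total above. The load step alone accounts for roughly 83% of the wall-clock time, so it is the stage where any speed-up would matter most.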
Various parts of the process lead me to believe that this could be done faster: CPU usage was fairly low throughout, and not all memory was utilized. Loading the data into Blazegraph was by far the slowest step, but digging into this would require someone who is more familiar with the Blazegraph internals.