I use the filter on 2 ramdisks, each around 100 GB, to speed up processing. Still, my 32-core machine idles at around 5% CPU and will take 12-16 hours to filter all entries (0.5 ms average per entry).
As I don't know Node.js well, I'm not sure I can add multi-threading to this myself, but Node.js can definitely spawn child processes - is there an easy 2-3 line addition that would spawn more workers? See https://nodejs.org/docs/latest/api/cluster.html
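Something along these lines is what I imagine - a minimal, untested sketch that spawns one filter process per core with child_process (rather than the cluster module, since each worker is just another instance of the CLI) and round-robins the dump's lines to them. The `--claim P31:Q5` argument, the `filtered.N.ndjson` file names and the whole approach are only placeholders/assumptions on my part:

```js
const { spawn } = require('child_process')
const { createWriteStream } = require('fs')
const readline = require('readline')
const os = require('os')

const workerCount = os.cpus().length

// One wikibase-dump-filter instance per core; '--claim P31:Q5' is only a
// placeholder argument - replace it with the filter you actually need
const workers = Array.from({ length: workerCount }, (_, i) => {
  const child = spawn('wikibase-dump-filter', [ '--claim', 'P31:Q5' ])
  // Give each worker its own output file so result lines can't interleave
  child.stdout.pipe(createWriteStream(`filtered.${i}.ndjson`))
  child.stderr.pipe(process.stderr)
  return child
})

let next = 0
const rl = readline.createInterface({ input: process.stdin })
rl.on('line', line => {
  // Round-robin each dump line to the next worker; backpressure from the
  // children's stdin is not handled in this sketch
  workers[next].stdin.write(line + '\n')
  next = (next + 1) % workerCount
})
rl.on('close', () => workers.forEach(worker => worker.stdin.end()))
```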
I'm using server boards, but I guess lots of people doing this will be on a Ryzen system or similar.
Multicore unpacking of the archive is doable with `pbzip2 -d -c /mnt/ramdisk/latest-all.json.bz2 | wikibase-dump-filter`, which shows node at exactly 100% CPU and the unzipping at ~110%, so node is still the bottleneck. This halves the average to 0.25 ms per entry for me. With "just" 64 GB of RAM on a rented machine with lots of cores, multicore filtering could get the filter time down to under 30 minutes, greatly reducing costs for weekly updates.
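Combined with a fan-out script like the sketch above (saved as, say, fanout.js - a made-up name), the whole pipeline would become something like `pbzip2 -d -c /mnt/ramdisk/latest-all.json.bz2 | node fanout.js`, with the per-worker output files concatenated afterwards.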
Thanks for your great work, really sparing me days of processing,
R