perl port of scy's levitation
sbober/levitation-perl
This is a Perl port of scy's levitation. It reads MediaWiki dump files revision
by revision and writes a data stream to stdout suitable for git fast-import.

The first 1000 pages of the German Wikipedia and all their revisions (about
390000) can be dumped in about 15 min on relatively moderate hardware.

Dependencies
------------

You need at least Perl 5.10. The Perl interpreter has to be compiled with
threads support.

You also need a working C compiler for the inline SHA1 C function. Currently
this _must_ be gcc 4.3 callable as 'gcc-4.3'. This will be fixed soon.

You need the following modules and their dependencies from CPAN:

 - Regexp::Common
 - Inline
 - JSON::XS
 - Compress::Raw::Zlib
 - Carp::Assert

 - CDB_File
 - XML::Bare >= 0.44
 - Deep::Hash::Utils

Some Linux distributions already package the first set (Regexp::Common through
Carp::Assert). Under Debian / Ubuntu the following command should set you up:

    sudo apt-get install libregexp-common-perl \
        libinline-perl libjson-xs-perl \
        libcompress-raw-zlib-perl libcarp-assert-perl

Usage
-----

First, initialize a git repository:

    cd /tmp
    mkdir blawiki
    cd blawiki
    git init

Then, "levitate". This is a three-step process:

    cat /path/to/blawiki-dump.xml | /path/to/levitation-perl/step1.pl
    LC_ALL=C sort rev-table.txt > rev-sorted.txt
    /path/to/levitation-perl/step2.pl | /path/to/levitation-perl/gfi.pl

Alternatively, you can just change to an empty directory and call the
"levitate" helper script with the path to a dump as its parameter (may be 7z,
bz2, gz or xml):

    mkdir /tmp/blawiki
    cd /tmp/blawiki
    /path/to/levitation-perl/levitate /path/to/blawiki-dump...

Lots of progress information is printed to standard error, so it may be best to
redirect that to a file.

Have fun.
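The advice about redirecting progress output can be sketched as a small POSIX shell helper. This is only an illustration: the function name `run_logged` and the log file name used in the note below are hypothetical and not part of levitation-perl.

```shell
# run_logged LOGFILE COMMAND [ARGS...]
# Hypothetical helper (not part of levitation-perl): runs COMMAND with its
# standard error written to LOGFILE, leaving standard output untouched.
run_logged() {
    log=$1          # first argument: file that receives standard error
    shift           # remaining arguments: the command and its parameters
    "$@" 2> "$log"  # run the command; only stream 2 (stderr) is redirected
}
```

Assuming the helper script is used, a call might look like `run_logged levitate.log /path/to/levitation-perl/levitate /path/to/blawiki-dump...`. In the three-step variant, a single `2>` at the end of a pipeline only captures standard error of the last command, so each stage would need its own redirect.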