Skip to content

sbober/levitation-perl

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

89 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

This is a Perl port of scy's levitation. It reads MediaWiki dump files
revision by revision and writes a data stream to stdout suitable for 
git fast-import.

The first 1000 pages of the german Wikipedia and all their revisions
(about 390000) can be dumped in about 15 min on relatively moderate
hardware.


Dependencies
------------

You need at least Perl 5.10. The Perl interpreter has to be compiled
with threads support.

You also need a working C compiler for the inline SHA1 C function.
Currently this _must_ be gcc 4.3 callable as 'gcc-4.3'. This will be
fixed soon.

You need the following modules and their dependencies from CPAN:

- Regexp::Common
- Inline
- JSON::XS
- Compress::Raw::Zlib
- Carp::Assert

- CDB_File
- XML::Bare      >= 0.44
- Deep::Hash::Utils

Some Linux distributions will already have the first set.
Under Debian / Ubuntu the following command should set you:

  sudo apt-get install libregexp-common-perl \
                       libinline-perl libjson-xs-perl \
                       libcompress-raw-zlib-perl libcarp-assert-perl


Usage
-----

First, initialize a git repository:

  cd /tmp
  mkdir blawiki
  cd blawiki
  git init


Then, "levitate". This is a three-step process:

  cat /path/to/blawiki-dump.xml | /path/to/levitation-perl/step1.pl
  LC_ALL=C sort rev-table.txt > rev-sorted.txt
  /path/to/levitation-perl/step2.pl | /path/to/levitation-perl/gfi.pl


Alternatively, you can just change to an empty directory and call the
"levitate" helper script with a path to a dump as parameter (may be 
7z, bz2, gz or xml):

  mkdir /tmp/blawiki
  cd /tmp/blawiki
  /path/to/levitation-perl/levitate /path/to/blawiki-dump...

Lots of progress information is printed to standard error, so it may be
best to redirect that to a file.

Have fun.

About

perl port of scy's levitation

Resources

License

Stars

Watchers

Forks

Packages

No packages published