LarsJørgenSolberg edited this page Nov 29, 2012 · 20 revisions

Background

The Wikipedia Corpus Builder (WCB) is a toolkit for extracting relevant linguistic content from Wikipedia. It was used in the creation of the 2012 versions of WeScience and WikiWoods, through the MSc thesis of Lars Jørgen Solberg at the Department of Informatics at the University of Oslo.

Installation

Make sure that the following prerequisites are installed:

If the command python -c 'from mwlib import cdb' runs without an error message and your shell can find the tokenizer and ngram programs, you should be in good shape.
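The check described above can be scripted. The following is a minimal sketch, not part of WCB: the function names missing_tools and check_wcb_prereqs are made up for illustration, and the only facts it relies on are the import test and the two program names mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: print every given command that is NOT on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Hypothetical helper: run the import test and look for tokenizer and ngram.
# Returns 0 when everything was found, 1 otherwise.
check_wcb_prereqs() {
    status=0
    if ! python -c 'from mwlib import cdb' >/dev/null 2>&1; then
        echo "python cannot import cdb from mwlib" >&2
        status=1
    fi
    for tool in $(missing_tools tokenizer ngram); do
        echo "not found on PATH: $tool" >&2
        status=1
    done
    return $status
}
```

Calling check_wcb_prereqs && echo "prerequisites look fine" after sourcing the file gives a quick yes/no answer before going further.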

WCB itself can be downloaded from https://github.com/larsjsol/wcb.

Running on the English Wikipedia

The setup used in the creation of WikiWoods 2.0 is included in the wcb/enwiki-20080727 directory. It should be usable on newer snapshots as well.

First prepare a database snapshot:

  1. Download a snapshot from either WikiWoods or http://dumps.wikimedia.org/.

  2. Decompress the snapshot: bunzip2 enwiki-20080727-pages-articles.xml.bz2

  3. Create a Constant Database: mw-buildcdb --input enwiki-20080727-pages-articles.xml --output OUTDIR

  4. Change the wikiconf entry in wcb/enwiki-20080727/paths.txt so it points to the file wikiconf.txt created in the previous step.
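Steps 2 and 3 can be wrapped in a small script. This is a sketch under assumptions: the helper names run and prepare_snapshot are invented here, bunzip2 is the standard bzip2 decompressor, and wikiconf.txt is assumed to end up directly inside the output directory. Step 4 is left as a manual edit, since the layout of paths.txt is not described on this page.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: decompress a snapshot and build the constant database.
#   $1  compressed snapshot, e.g. enwiki-20080727-pages-articles.xml.bz2
#   $2  output directory for mw-buildcdb
prepare_snapshot() {
    archive=$1
    outdir=$2
    xml=${archive%.bz2}   # bunzip2 removes the .bz2 suffix from the file name
    run bunzip2 "$archive" &&
    run mw-buildcdb --input "$xml" --output "$outdir" &&
    # Assumption: wikiconf.txt is written directly into the output directory.
    echo "Next: set wikiconf in paths.txt to $outdir/wikiconf.txt"
}
```

Setting DRY_RUN=1 before calling prepare_snapshot enwiki-20080727-pages-articles.xml.bz2 OUTDIR prints the two commands instead of running them, which is a cheap way to check the file names first.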

Most of the modules in WCB need access to a paths.txt file and determine its location by examining the environment variable PATHSFILE. Set it with, for example, export PATHSFILE=./wcb/enwiki-20080727/paths.txt.
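Because a missing or mistyped PATHSFILE is easy to overlook, a small guard can be run before any of the scripts. The sketch below is not part of WCB; check_pathsfile is a made-up name, and it only tests that the variable is set and names a readable file.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if PATHSFILE is set and readable.
check_pathsfile() {
    if [ -z "${PATHSFILE:-}" ]; then
        echo "PATHSFILE is not set" >&2
        return 1
    fi
    if [ ! -r "$PATHSFILE" ]; then
        echo "cannot read PATHSFILE: $PATHSFILE" >&2
        return 1
    fi
    return 0
}
```

Exporting an absolute path, for instance export PATHSFILE="$(pwd)/wcb/enwiki-20080727/paths.txt", keeps the setting valid when the scripts are later started from another directory.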

As a test, run ./wcb/scripts/gml.py --senseg 'Context-free language', which should print GML to stdout. The first invocation of this command will take some time because it examines all templates in the snapshot.

WCB can create corpora directly from a snapshot or from files containing wiki markup by using the scripts build_corpus.py (snapshot) or build_corpus_files.py (plain files). The following example shows the creation of a corpus containing all articles in a snapshot using 20 parallel processes:

mkdir corpus
./wcb/scripts/build_corpus.py -p 20 corpus 

Information on command line parameters for these scripts can be found by using the --help switch.
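The figure of 20 processes suits the machine used for the example; on other hardware the count can be taken from the number of available cores. The following sketch assumes only the -p option and the output directory argument shown above. The function names default_jobs and build_all_articles are invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: number of online CPU cores, falling back to 1.
default_jobs() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    case $n in
        ''|*[!0-9]*) n=1 ;;   # anything that is not a plain number
    esac
    echo "$n"
}

# Hypothetical wrapper around the build_corpus.py call shown above.
#   $1  output directory (default: corpus)
#   $2  number of parallel processes (default: one per core)
build_all_articles() {
    outdir=${1:-corpus}
    jobs=${2:-$(default_jobs)}
    mkdir -p "$outdir" &&
    ./wcb/scripts/build_corpus.py -p "$jobs" "$outdir"
}
```

Running build_all_articles corpus 20 reproduces the example above, while build_all_articles alone uses one process per core.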

Adaptations to Other Languages

Construction of WeScience 2.0
