Resolve tokenization issues causing BitFunnel parser crashes #6

Open · wants to merge 1 commit into master
Conversation

hausdorff (Member)

The corpus as processed by the current version of Workbench contains
characters (mostly punctuation) that crash the BitFunnel parser. This
commit makes Workbench handle these cases correctly.

There are two issues at the root of this problem. First, the Lucene
analyzer (which we use to generate the BitFunnel chunk files) attempts
to preserve URLs, so colons are not removed from the middle of a term
such as `Wikipedia:dump`. This causes our parser to crash. Since
Lucene already removes the colon when it does not appear to be part of
a URI, we simply remove colons from all terms.
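As a minimal sketch (the helper name `stripColons` is illustrative, not Workbench's actual API), the fix amounts to:

```java
// Illustrative sketch only: strip every colon from a term so that
// URL-like tokens such as "Wikipedia:dump" never reach the BitFunnel
// parser. Workbench's real code may differ.
static String stripColons(String term) {
    return term.replace(":", "");
}
```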

Second, we are not using the Lucene tokenizer to process article titles.
This leaves a wide variety of punctuation in the corpus that crashes the
BitFunnel parser. In the new version of the corpus, the title is
tokenized to avoid such problems.
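A rough sketch of that title pass, assuming a standard Lucene `TokenStream` pipeline (the analyzer choice, field name, and class/method names here are illustrative, not the actual Workbench code):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TitleTokenizer {
    // Run an article title through the same Lucene analysis used for
    // body text, stripping colons from each emitted term.
    static List<String> tokenizeTitle(Analyzer analyzer, String title)
            throws IOException {
        List<String> terms = new ArrayList<>();
        try (TokenStream stream = analyzer.tokenStream("title", title)) {
            CharTermAttribute termAttr =
                stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                // Remove any colons Lucene preserves in URL-like tokens.
                terms.add(termAttr.toString().replace(":", ""));
            }
            stream.end();
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = new StandardAnalyzer()) {
            // Prints the analyzed title terms, with any colons removed.
            System.out.println(tokenizeTitle(analyzer, "Wikipedia:Database download"));
        }
    }
}
```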
