Resolve tokenization issues causing BitFunnel parser crashes #6

Open · wants to merge 1 commit into master
Conversation

hausdorff (Member)

The corpus as processed by the current version of Workbench contains
characters (mostly punctuation) that crash the BitFunnel parser. This
commit makes Workbench handle these cases correctly.

There are two issues at the root of this problem. First, the Lucene
analyzer (which we use to generate the BitFunnel chunk files) attempts
to preserve URLs, so colons are not removed from the middle of a term
such as `Wikipedia:dump`. This causes our parser to crash. Since
Lucene already removes the colon when it does not appear to be part of
a URI, we simply remove colons from all terms.
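As a minimal sketch (the helper name `stripColons` is illustrative, not Workbench's actual API), the fix amounts to:

```java
// Illustrative sketch only: strip every colon from a term so that
// URL-like tokens such as "Wikipedia:dump" never reach the BitFunnel
// parser. Workbench's real code may differ.
static String stripColons(String term) {
    return term.replace(":", "");
}
```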

Second, we are not using the Lucene tokenizer to process article titles.
This leaves a wide variety of punctuation in the corpus that crashes the
BitFunnel parser. In the new version of the corpus, the title is
tokenized to avoid such problems.
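A rough sketch of that title pass, assuming a standard Lucene `TokenStream` pipeline (the analyzer choice, field name, and class/method names here are illustrative, not the actual Workbench code):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TitleTokenizer {
    // Run an article title through the same Lucene analysis used for
    // body text, stripping colons from each emitted term.
    static List<String> tokenizeTitle(Analyzer analyzer, String title)
            throws IOException {
        List<String> terms = new ArrayList<>();
        try (TokenStream stream = analyzer.tokenStream("title", title)) {
            CharTermAttribute termAttr =
                stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                // Remove any colons Lucene preserves in URL-like tokens.
                terms.add(termAttr.toString().replace(":", ""));
            }
            stream.end();
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = new StandardAnalyzer()) {
            // Prints the analyzed title terms, with any colons removed.
            System.out.println(tokenizeTitle(analyzer, "Wikipedia:Database download"));
        }
    }
}
```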
