This repository has been archived by the owner on Feb 22, 2021. It is now read-only.

null as title of the article #12

Open
nick-magnini opened this issue Jan 4, 2016 · 2 comments
@nick-magnini

When the Wikipedia dump is processed into a word2vec corpus, the title of each page (the first word of each line) is null, so every page starts with "null..". Which part of the code handles this, and how can we change it so that the page title appears there instead?

keynmol added the ready label Jan 6, 2016
@nick-magnini
Author

I still get null as the first token of each line ....

@dav009
Contributor

dav009 commented Jan 15, 2016

Probably the best way to address this problem is to use https://github.com/idio/json-wikipedia to extract the text from the dumps.
I will work on a refactor for it.
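
As a stopgap for a corpus that has already been generated, the leading `null` token can be stripped in a post-processing pass. The following is only a minimal Python sketch, assuming one article per line with whitespace-separated tokens, as described in the report above. The function name `strip_null_title` is hypothetical and not part of this repository, and the sketch removes the placeholder rather than restoring the real page title.

```python
from typing import Iterable, Iterator


def strip_null_title(lines: Iterable[str], placeholder: str = "null") -> Iterator[str]:
    """Yield each corpus line without a leading placeholder token.

    Assumption: one article per line, tokens separated by whitespace.
    Runs of whitespace are collapsed to single spaces as a side effect.
    """
    for line in lines:
        tokens = line.split()
        # Drop only the first token, and only when it is exactly the
        # placeholder, so a genuine word such as "nullable" is kept.
        if tokens and tokens[0] == placeholder:
            tokens = tokens[1:]
        yield " ".join(tokens)
```

Because the function accepts any iterable of strings, it can be applied to an open file handle and the results written line by line, so the whole corpus never needs to be held in memory.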

Lugrin added backlog and removed ready labels Oct 17, 2016
Lugrin added icebox and removed backlog labels Apr 10, 2017

4 participants