Wikipedia articles coverage #4
Comments
Thanks, this is definitely an important issue to address.
Yes, the lack of explicit links is definitely one of the problems. I think that doing entity linking inside each article might lead to better coverage (by restricting the candidate entities to the ones that are already linked inside the article). This might lead to some false positives in some cases, though. Also, the first phrase in a wiki article does not have an explicit link, but it could be linked to the id of the article without much risk. :)
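A rough sketch of that idea, just to make it concrete (the function name and the (start, end, entity) annotation format are illustrative, not part of this repo): unlinked occurrences of surface forms that are already explicitly linked in the same article get annotated with the corresponding entity, and the first phrase gets linked to the article's own id.

```python
import re

def expand_intra_article_links(text, explicit_links, article_entity, first_phrase):
    """Annotate unlinked mentions, restricting candidates to entities
    already linked inside this article (plus the article's own entity
    for the first phrase)."""
    # Linking the first phrase to the article's own id is low-risk.
    annotations = [(0, len(first_phrase), article_entity)]
    for surface, entity in explicit_links.items():
        for m in re.finditer(r"\b%s\b" % re.escape(surface), text):
            annotations.append((m.start(), m.end(), entity))
    return sorted(annotations)

# Toy usage: "Germany" is linked once in the article, so its second,
# unlinked occurrence also gets annotated.
text = "Berlin is the capital of Germany. Germany has 83M inhabitants."
print(expand_intra_article_links(text, {"Germany": "Germany"}, "Berlin", "Berlin"))
```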
I was thinking of something like this:
I think the first line does indeed need some preprocessing to solve these issues, but I don't think the vectors are going to be polluted by adding the first line, as it usually contains quite useful context :) Yes, we are describing almost the same thing for intra-article entity linking :) I am proposing that you can even expand that logic to every link in the article (by considering their corresponding mentions). It would be interesting to create a small collection to evaluate this.
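For that small evaluation collection, something as simple as the following could work (a sketch only; the (start, end, entity) triple format is an assumption carried over from the sketch above):

```python
def precision_recall(predicted, gold):
    """Compare predicted annotations against a hand-labelled gold set;
    both are collections of (start, end, entity) triples."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

print(precision_recall({(0, 6, "Berlin")},
                       {(0, 6, "Berlin"), (26, 33, "Germany")}))
# (1.0, 0.5)
```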
Got it.
@nickvosk A good reference that we could use here is [1]. As the surface forms referring to the article's entity are usually in bold, just like the PR there suggests, it seems to be a more informed assumption. [1] dbpedia-spotlight/dbpedia-spotlight#356 (Edit: updated wrong link)
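Since MediaWiki marks those surface forms with `'''...'''` in wikitext, pulling them out of the first sentence is straightforward. A minimal sketch (the function name is made up):

```python
import re

BOLD = re.compile(r"'''(.+?)'''")

def bold_surface_forms(first_sentence_wikitext):
    """Return the bold spans that typically spell out the article's
    own entity in the first sentence of a Wikipedia article."""
    return BOLD.findall(first_sentence_wikitext)

print(bold_surface_forms("'''Berlin''' is the capital and largest city of Germany."))
# ['Berlin']
```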
Can you elaborate on how this would fix the coverage problem, @dav009? Also, this paper looks relevant:
@nickvosk :) good reference, I think I saw it before at ACL. As the paper suggests, we could also run some NEL with very high confidence values to add some extra links, and probably get above the minimum-count threshold imposed by gensim's word2vec implementation.
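To illustrate that threshold: gensim's Word2Vec drops every token that occurs fewer than `min_count` times (5 by default), so an entity token with only a couple of links never gets a vector at all. A toy demonstration (the `DBPEDIA_ID/` token naming is illustrative):

```python
from gensim.models import Word2Vec

# Filler sentences keep the vocabulary non-empty; the entity token
# appears only twice, below the default min_count of 5.
sentences = [["the", "city", "is", "big"]] * 10 \
          + [["DBPEDIA_ID/Berlin", "is", "a", "city"]] * 2
model = Word2Vec(sentences, min_count=5)

print("DBPEDIA_ID/Berlin" in model.wv)  # False: pruned below min_count
print("city" in model.wv)               # True: 12 occurrences
```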
@dav009 exactly :)
Looking at some old raw counts from the DBpedia Spotlight project, it seems that out of 6M topics in those counts, 4M have fewer than 5 links. Surprisingly, filtering for topics with more than 50 links gives us 268836, which is similar to our current coverage: 226319.
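A sketch of how such counts can be bucketed, assuming a tab-separated entity/link-count file (the file format here is a guess, not the actual Spotlight format):

```python
def coverage_at_thresholds(path, thresholds=(5, 50)):
    """Count how many topics survive each minimum-link-count threshold."""
    link_counts = []
    with open(path) as f:
        for line in f:
            entity, count = line.rstrip("\n").split("\t")
            link_counts.append(int(count))
    for t in thresholds:
        kept = sum(1 for c in link_counts if c > t)
        print("topics with more than %d links: %d" % (t, kept))
```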
Hi @dav009, very promising work here!
I wrote a simple script to test the coverage of the prebuilt model for English Wikipedia articles. I used the Wikipedia article titles from a preprocessed Wikipedia Miner March 2014 dump.
Out of 4342357 articles, only 226319 had a matching vector (~5%). I have noticed that the model usually covers popular entities but does not cover tail entities. I guess this might be because words below a certain count were ignored, and because of errors in preprocessing. Any ideas on this? I have noticed that your TODOs include resolving redirects and also co-reference resolution inside the articles, but I guess we would expect better coverage even without these.
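For reference, a minimal version of such a coverage check might look like this (the paths, the `DBPEDIA_ID/` token prefix, and the loading call are assumptions about the prebuilt model; it may need gensim's `Word2Vec.load` instead of the word2vec binary format):

```python
from gensim.models import KeyedVectors

# Assumed inputs: a word2vec-format model and one article title per line.
model = KeyedVectors.load_word2vec_format("en.model.bin", binary=True)

total = matched = 0
with open("wikipedia_titles.txt") as f:
    for line in f:
        title = line.strip().replace(" ", "_")
        total += 1
        if "DBPEDIA_ID/" + title in model:
            matched += 1

print("%d/%d titles have a vector (%.1f%%)"
      % (matched, total, 100.0 * matched / total))
```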
Thanks.