created or.toml #107
Thanks! You can find the sample extraction here: https://github.com/Common-Voice/cv-sentence-extractor/suites/774458492/artifacts/8112008 (see https://discourse.mozilla.org/t/scraper-automatic-sample-sentences-extracted-in-pull-request/55217 for a full explanation).
I see a few issues at first glance:
- There are English sentences in the OR Wikipedia? (Having a look at the allowed_symbols config option instead of using "disallowed_symbols" might help here.)
- There seem to be more abbreviations
In addition, I strongly suggest adding a blocklist as well: https://github.com/Common-Voice/cv-sentence-extractor#using-disallowed-words
Happy to help if you have any questions.
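To illustrate the allow-list idea from the first bullet above, here is a minimal Python sketch. It is not the extractor's implementation and says nothing about the exact syntax of the rule file; the character set (the Odia Unicode block U+0B00 to U+0B7F, the danda marks, whitespace and a few punctuation characters) and the name `is_allowed` are assumptions made for illustration only.

```python
import re

# Assumption: the allow-list is the Odia Unicode block, the danda marks
# (U+0964, U+0965), whitespace and a handful of punctuation characters.
ALLOWED = re.compile(r"[\u0B00-\u0B7F\u0964\u0965\s.,?!()-]+")

def is_allowed(sentence: str) -> bool:
    """Return True only if every character of the sentence is in the allow-list."""
    return ALLOWED.fullmatch(sentence) is not None

samples = ["ଓଡ଼ିଆ ଏକ ଭାଷା।", "He said this is a quote."]
for s in samples:
    print(is_allowed(s), s)
```

With an allow-list, any sentence containing a Latin letter is rejected without having to enumerate every unwanted symbol, which is why it may be a better fit here than a list of disallowed symbols.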
Hey Michael, thanks for flagging these. As a Wikipedia editor myself, I couldn't stop myself from fixing some of the issues that you flagged. :-) So, here goes: I have started checking the English sentences. Some are actual content, and the rest are quotes, such as someone saying something about a person, place or incident (original quotes are kept without translation in some articles), but fixing them will take longer. The good news is that many articles were due for maintenance tags and deletion (oops), and this became a good excuse for some lasting cleanup. Pat yourself on the back, as you indirectly contributed to Wikipedia!

I have yet to work on the blocklist. In the meantime, is it possible to run the code and create a sample text that contains the English sentences? Maybe something I can share with the Wikipedia community so that more helping hands can clean up.

Also, the extractor needs to be told not to collect citations or footnotes. The section is called "References" in English Wikipedia and "ଆଧାର" or "ଟୀକା" in Odia. I see some such citations included in the file that you sent.

Your comment says "requested changes". Does that mean that I need to work on both the disallowed word list and this file? I am a bit unsure what the ask is for this very file, "or.toml", and would appreciate it if you could help.
You can run as explained in the README, and use the option
Note that this will take quite some time, and we will not be able to use the resulting file, because we have a limit on the number of sentences we can take per article. It might be easier to take the extraction from WikiExtractor and extract the sentences from there; then you don't have to run this script just to identify all sentences. However, you'll need to do that anyway to generate the blocklist, so it's probably a win-win if you do.
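As a rough illustration of working on the WikiExtractor output directly, the following Python sketch collects sentences that contain Latin-script words, grouped by article title, so the list could be shared with the Wikipedia community. It assumes the usual WikiExtractor text layout of `<doc id=... url=... title=...>` blocks closed by `</doc>`; the function name `latin_sentences_by_article`, the naive sentence splitting on the danda and `.?!`, and the "two or more Latin letters" heuristic are all hypothetical choices, not part of this repository.

```python
import re
from typing import Dict, List

# Assumption: each article is wrapped as <doc ... title="..."> ... </doc>.
DOC_RE = re.compile(r'<doc\b[^>]*\btitle="([^"]*)"[^>]*>(.*?)</doc>', re.DOTALL)
# Naive sentence boundary: whitespace that follows a danda or . ? !
SENTENCE_END = re.compile(r"(?<=[।.?!])\s+")
# Heuristic: a run of two or more Latin letters marks a suspicious sentence.
LATIN_WORD = re.compile(r"[A-Za-z]{2,}")

def latin_sentences_by_article(extracted: str) -> Dict[str, List[str]]:
    """Map article title -> sentences that contain Latin-script words."""
    hits: Dict[str, List[str]] = {}
    for title, body in DOC_RE.findall(extracted):
        for line in body.splitlines():
            for sentence in SENTENCE_END.split(line.strip()):
                if sentence and LATIN_WORD.search(sentence):
                    hits.setdefault(title, []).append(sentence)
    return hits

sample = (
    '<doc id="1" url="https://or.wikipedia.org/wiki?curid=1" title="ଉଦାହରଣ">\n'
    "ଉଦାହରଣ\n\n"
    "ଏହା ଏକ ବାକ୍ୟ। This quote was left untranslated. ଆଉ ଏକ ବାକ୍ୟ।\n"
    "</doc>\n"
)
for title, sentences in latin_sentences_by_article(sample).items():
    print(title, sentences)
```

In practice the extracted files would be read from disk and passed to the function one at a time; the per-article grouping makes it easier to hand specific pages to other editors.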
As we're using WikiExtractor before running our script, we do not have that info. And as far as I can see there is no such option in WikiExtractor?
In the end we can merge this PR and run the extraction once the following is achieved:
If it's achievable to get the error rate down to an acceptable level with only the rule file and no blocklist, that's an option too, but I heavily doubt it, as we've seen a lot of improvement for other languages once a blocklist was added (as described in the README).
No description provided.