WikiCorpus.filter_wiki/remove_markup don't remove heading-markup #2561
Comments
Many users will want headings preserved: they include great text for full-text indexing, learning word-vectors, and other purposes. As the headings are very recognizable in the text, users who don't want them should probably filter them out themselves, in a post-WikiCorpus step. Alternatively, WikiCorpus could include an option to strip headings, but both for continuity with current behavior and for the many users who will want them retained, I would suggest that any such option default to 'off'.
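As a rough illustration of the post-WikiCorpus filtering step mentioned above, here is a minimal sketch. It assumes the article text is available as a plain string with one heading per line; the helper names `strip_heading_markup` and `drop_headings` are hypothetical and are not part of gensim. The first keeps the heading text and removes only the `=` markers, the second drops heading lines altogether.

```python
import re

# Hypothetical helpers for illustration only; not part of gensim.
# A heading line looks like "== Title ==" with 2 to 6 '=' on each side,
# and the same number of '=' on both sides (enforced by the backreference).
HEADING_RE = re.compile(
    r"^[ \t]*(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE
)


def strip_heading_markup(text):
    """Keep the heading text but remove the surrounding '=' markers."""
    return HEADING_RE.sub(lambda m: m.group(2), text)


def drop_headings(text):
    """Remove heading lines entirely, keeping all other lines."""
    kept = [line for line in text.split("\n") if not HEADING_RE.match(line)]
    return "\n".join(kept)


if __name__ == "__main__":
    sample = "Intro.\n==History==\nBody.\n=== Early years ===\nMore."
    print(strip_heading_markup(sample))
    print(drop_headings(sample))
```

Which variant is appropriate depends on whether the heading words themselves are wanted downstream, for example for indexing or word-vector training.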
Names like Also look at documentation of
The same for PS: The title of this ticket may be wrong. It is not about removing headings but heading markup. So instead of
Thanks for clarifying, I understand much better now! Yes, the methods/comments indicate 'markup' will be removed, which creates an expectation problem. And based on both the issue title, and the patch-submitter's initial implementation, I initially thought you wanted the headings entirely stripped – which seemed wrong to me, because even though the headings often aren't proper sentences, they've still always been of interest (even necessary!) when I've used wiki text. How about:
For completeness, as this markup slipped by, a quick visual check of a handful of longer article texts could be done to see if any other present-but-likely-unwanted markup persists. If so, depending on the complexity and desirability of removing it, that could either be fixed now or tracked in a new issue.
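To support the visual check suggested above, a small script could flag texts that still contain markup-like patterns. This is only a heuristic sketch: the pattern set and the function name `count_leftover_markup` are assumptions made for illustration, not an exhaustive list of wiki markup and not part of gensim.

```python
import re
from collections import Counter

# Heuristic patterns chosen for illustration; real wiki markup has more forms.
LEFTOVER_PATTERNS = {
    "heading": re.compile(r"^[ \t]*={2,6}[^=\n].*?={2,6}[ \t]*$", re.MULTILINE),
    "template": re.compile(r"\{\{|\}\}"),
    "link": re.compile(r"\[\[|\]\]"),
    "html_tag": re.compile(r"</?[a-zA-Z][^<>]*>"),
    "bold_italic": re.compile(r"'{2,}"),
}


def count_leftover_markup(texts):
    """Count occurrences of each markup-like pattern across all texts."""
    counts = Counter()
    for text in texts:
        for name, pattern in LEFTOVER_PATTERNS.items():
            counts[name] += len(pattern.findall(text))
    return counts


if __name__ == "__main__":
    texts = [
        "==History==\nSome '''bold''' text with a [[link]].",
        "Clean text.",
    ]
    for name, n in count_leftover_markup(texts).items():
        print(name, n)
```

Non-zero counts only point at texts worth inspecting by eye; a pattern such as doubled apostrophes can also occur in legitimate prose.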
Sounds good. Probably in some new major version (where behaviour changes are acceptable) this flag can be set to
Drawn in by the hacktoberfest label. I would be interested in working on this if it still needs to be done.
Hi @gojomo, sure, let's get it done. I'll adjust the PR during the week according to the discussion.
Hi there, please find the adjustments at #2622
Problem description
I am trying to get clean wiki texts, but I am still getting heading markup.
Steps/code/corpus to reproduce
Just create a WikiCorpus and call get_texts. Some texts will contain ==headings== (with a different number of = and different headings, of course).
Simple test-case:
Versions