fix: ensure short strings of legitimate content are not excluded #867
Thanks! Just one nit, but I think this otherwise looks good. We really appreciated the thorough evaluation. Like you, we feel it's a balancing act in terms of the impact, but we think it's generally a positive change so let's roll with it. :-)
Thank you!
This ports "fix: ensure short strings of legitimate content are not excluded" (mozilla/readability#867)
This PR changes one of the `haveToRemove` checks to allow for short paragraphs, such as the written dialog in the linked issue, by adding `linkDensity` to the primary "short content" check. The rationale is that short strings which don't contain any links are likely to be text the user would want to read, provided that the initial preprocessing has already removed the bulk of the page elements.

I regenerated all test cases to check for regressions and added a couple of new checks to deal with those:

- `adWords` and `loadingWords` regexes to help identify ad blocks and loading indicators.
- `textDensity` + image count to exclude elements without useful content.

Test case changes:

- `citylab-1` now includes the published time in the Readability content.
- `ehow-1` lost an extraneous "Other People Are Reading" header but gained an extraneous "Found This Helpful" header
- `ehow-2` lost its "Other People Are Reading" header and gained the word "Save" previously used for a button
- `engadget` now shows the base price and score originally present in the product review
- `firefox-nightly-blog` no longer shows "Leave a Reply"
- `mercurial` now correctly shows previously excluded code snippets and commands
- `qq` now shows the page header with published date
- `toc-missing` now incorrectly shows "Interactive Editor"
- `wikipedia` now shows image captions wrapped in `<div><p>...</p></div>`

The changes which are definitely regressions seem like an acceptable trade for the improvements gained. I would like some input on whether the `adWords` and `loadingWords` regexes are acceptable though. They are scoped so that they will only produce a match against a node's entire `innerText` string, so they are unlikely to impact real content, but it still feels like a slippery slope.

Closes #861
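For reviewers weighing the regex question, here is a rough sketch of the two ideas in the description: a short-content check that only fires when link text is present, and word patterns anchored so they match only when they make up a node's entire text. It is not the code in this PR. The function names, the simplified patterns, the length threshold of 25, and the plain-object stand-in for a DOM node (`innerText`, `linkTextLength`) are all assumptions made for illustration.

```javascript
// Illustrative sketch only. Names, patterns, and the threshold below are
// hypothetical and are NOT the actual Readability.js implementation.

// Anchored with ^ and $ so the pattern must match the whole string,
// not a word that merely appears inside real content.
const AD_WORDS = /^(ad|advertisement|advertising)$/i;
const LOADING_WORDS = /^loading(\.\.\.|…)?$/i;

// Fraction of the text that sits inside links (0 means no link text).
function linkDensity(textLength, linkTextLength) {
  return textLength === 0 ? 0 : linkTextLength / textLength;
}

// `node` is a plain object standing in for a DOM element:
// { innerText: string, linkTextLength: number } (assumed shape).
function shouldRemove(node) {
  const text = node.innerText.trim();

  // Whole-string matches for ad labels and loading indicators.
  if (AD_WORDS.test(text) || LOADING_WORDS.test(text)) {
    return true;
  }

  // Short content is only treated as removable when it contains link text.
  const density = linkDensity(text.length, node.linkTextLength);
  return text.length < 25 && density > 0;
}

// A short line of dialog with no links is kept.
console.log(shouldRemove({ innerText: "Who are you?", linkTextLength: 0 }));
// A node whose entire text is an ad label is removed.
console.log(shouldRemove({ innerText: "Advertisement", linkTextLength: 0 }));
```

Because the patterns are anchored at both ends, a sentence that merely mentions "loading" or "ad" does not match in this sketch. Only a node whose whole text is one of those labels does, which is the scoping argument made above.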