-
Notifications
You must be signed in to change notification settings - Fork 632
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: ensure short strings of legitimate content are not excluded #867
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Just one nit, but I think this otherwise looks good. We really appreciated the thorough evaluation. Like you, we feel it's a balancing act in terms of the impact, but we think it's generally a positive change so let's roll with it. :-)
6e44d6b
to
f418741
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
This PR changes one of the
haveToRemove
checks to allow for short paragraphs, such as the written dialog in the linked issue, by addinglinkDensity
to the primary "short content" check. The rationale being that short strings which don't contain any links are likely to be text the user would want to read, provided that the initial preprocessing has already removed the bulk of the page elements.I regenerated all test cases to check for regressions and added a couple of new checks to deal with those:
adWords
andloadingWords
regexes to help identify ad blocks and loading indicators.textDensity
+ image count to exclude elements without useful content.Test case changes:
citylab-1
now includes the published time in the Readability content.ehow-1
lost an extraneous "Other People Are Reading" header but gained an extraneous "Found This Helpful" headerehow-2
lost its "Other People Are Reading" header and gained the word "Save" previously used for a buttonengadget
now shows the base price and score originally present in the product reviewfirefox-nightly-blog
no longer shows "Leave a Reply"mercurial
now correctly shows a previously excluded code snippets and commandsqq
now shows the page header with published datetoc-missing
now incorrectly shows "Interactive Editor"wikipedia
now shows image captions wrapped in<div><p>...</p></div>
The changes which are definitely regressions seem like an acceptable trade for the improvements gained. I would like some input on whether the
adWords
andloadingWords
regexes are acceptable though. They are scoped so that they will only produce a match against a node's entireinnerText
string, so it's unlikely to impact real content, but it still feels like a slippery slope.Closes #861