Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: ensure short strings of legitimate content are not excluded #867

Merged
merged 9 commits into from
May 20, 2024

Conversation

inhumantsar
Copy link
Contributor

@inhumantsar inhumantsar commented May 4, 2024

This PR changes one of the haveToRemove checks to allow for short paragraphs, such as the written dialog in the linked issue, by adding linkDensity to the primary "short content" check. The rationale being that short strings which don't contain any links are likely to be text the user would want to read, provided that the initial preprocessing has already removed the bulk of the page elements.

I regenerated all test cases to check for regressions and added a couple of new checks to deal with those:

  • adWords and loadingWords regexes to help identify ad blocks and loading indicators.
  • textDensity + image count to exclude elements without useful content.

Test case changes:

  • citylab-1 now includes the published time in the Readability content.
  • ehow-1 lost an extraneous "Other People Are Reading" header but gained an extraneous "Found This Helpful" header
  • ehow-2 lost its "Other People Are Reading" header and gained the word "Save" previously used for a button
  • engadget now shows the base price and score originally present in the product review
  • firefox-nightly-blog no longer shows "Leave a Reply"
  • mercurial now correctly shows a previously excluded code snippets and commands
  • qq now shows the page header with published date
  • toc-missing now incorrectly shows "Interactive Editor"
  • wikipedia now shows image captions wrapped in <div><p>...</p></div>

The changes which are definitely regressions seem like an acceptable trade for the improvements gained. I would like some input on whether the adWords and loadingWords regexes are acceptable though. They are scoped so that they will only produce a match against a node's entire innerText string, so it's unlikely to impact real content, but it still feels like a slippery slope.

Closes #861

Copy link
Contributor

@gijsk gijsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Just one nit, but I think this otherwise looks good. We really appreciated the thorough evaluation. Like you, we feel it's a balancing act in terms of the impact, but we think it's generally a positive change so let's roll with it. :-)

@inhumantsar inhumantsar force-pushed the fix-short-paras-861 branch from 6e44d6b to f418741 Compare May 19, 2024 16:17
Copy link
Contributor

@gijsk gijsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@gijsk gijsk merged commit c7f0ef1 into mozilla:main May 20, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants