fix: ensure short strings of legitimate content are not excluded #867

inhumantsar · 2024-05-04T21:47:23Z

This PR changes one of the haveToRemove checks to allow short paragraphs, such as the written dialog in the linked issue, by adding linkDensity to the primary "short content" check. The rationale is that short strings which contain no links are likely to be text the user would want to read, provided that the initial preprocessing has already removed the bulk of the page elements.
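To make the idea concrete, here is a minimal sketch of a "short content" check that also looks at link density. It is an illustration only, not the code in Readability.js: the function name, the plain-object node shape (text, linkTextLength), and the 25-character threshold are all assumptions made for this example.

```javascript
// Hypothetical sketch, not the actual Readability.js implementation.
// A "node" here is a plain object: { text: string, linkTextLength: number }.
const SHORT_CONTENT_LENGTH = 25; // assumed threshold, for illustration only

function linkDensityOf(node) {
  const length = node.text.trim().length;
  // Avoid dividing by zero for empty nodes.
  return length === 0 ? 0 : node.linkTextLength / length;
}

function shouldRemoveShortNode(node) {
  const isShort = node.text.trim().length < SHORT_CONTENT_LENGTH;
  // Without the link-density condition, every short node would be removed.
  // With it, a short node is removed only when some of its text is link text.
  return isShort && linkDensityOf(node) > 0;
}

// A short line of dialog with no links is kept.
const dialog = { text: '"No," she said.', linkTextLength: 0 };
// A short navigation-style snippet made up entirely of link text is removed.
const navLink = { text: "Read more", linkTextLength: 9 };
// A long paragraph is not affected by this check at all.
const paragraph = { text: "x".repeat(200), linkTextLength: 20 };

console.log(shouldRemoveShortNode(dialog), shouldRemoveShortNode(navLink), shouldRemoveShortNode(paragraph));
```

The trade-off is that any short, link-free leftover that survives preprocessing is now kept too, which is what the extra checks described next are meant to catch.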

I regenerated all test cases to check for regressions and added a couple of new checks to deal with the ones that turned up:

adWords and loadingWords regexes to help identify ad blocks and loading indicators.
textDensity + image count to exclude elements without useful content.
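As a rough illustration of the textDensity + image count check, the sketch below treats text density as the share of a node's text that sits inside text-like child tags. The node shape, the TEXT_TAGS set, and the helper names are assumptions for this example and may not match how Readability.js computes it.

```javascript
// Hypothetical sketch, not the actual Readability.js implementation.
// A "node" here is a plain object: { tag: string, text: string, children: node[] },
// where `text` holds only the node's own direct text.
const TEXT_TAGS = new Set(["p", "span", "li", "td", "blockquote", "pre"]); // assumed list

function textLength(node) {
  return node.text.length + node.children.reduce((sum, c) => sum + textLength(c), 0);
}

function countTag(node, tag) {
  // Counts descendants (not the node itself) with the given tag.
  return node.children.reduce(
    (n, c) => n + (c.tag === tag ? 1 : 0) + countTag(c, tag),
    0
  );
}

function textDensity(node) {
  const total = textLength(node);
  if (total === 0) return 0;
  let inTextTags = 0;
  const visit = (n) => {
    for (const child of n.children) {
      if (TEXT_TAGS.has(child.tag)) {
        inTextTags += textLength(child); // do not descend, to avoid double counting
      } else {
        visit(child);
      }
    }
  };
  visit(node);
  return inTextTags / total;
}

function lacksUsefulContent(node) {
  // No images and no text inside text-like tags: likely a leftover widget.
  return countTag(node, "img") === 0 && textDensity(node) === 0;
}

const widget = { tag: "div", text: "Share", children: [] };
const article = {
  tag: "div",
  text: "",
  children: [{ tag: "p", text: "A real sentence.", children: [] }],
};
const figure = { tag: "div", text: "", children: [{ tag: "img", text: "", children: [] }] };

console.log(lacksUsefulContent(widget), lacksUsefulContent(article), lacksUsefulContent(figure));
```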

Test case changes:

citylab-1 now includes the published time in the Readability content.
ehow-1 lost an extraneous "Other People Are Reading" header but gained an extraneous "Found This Helpful" header
ehow-2 lost its "Other People Are Reading" header and gained the word "Save" previously used for a button
engadget now shows the base price and score originally present in the product review
firefox-nightly-blog no longer shows "Leave a Reply"
mercurial now correctly shows previously excluded code snippets and commands
qq now shows the page header with published date
toc-missing now incorrectly shows "Interactive Editor"
wikipedia now shows image captions wrapped in <div><p>...</p></div>

The changes which are definitely regressions seem like an acceptable trade for the improvements gained. I would like some input on whether the adWords and loadingWords regexes are acceptable though. They are scoped so that they will only produce a match against a node's entire innerText string, so they're unlikely to affect real content, but it still feels like a slippery slope.
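For reviewers weighing that question, this is roughly what "match against the entire innerText string" means in practice. The patterns below are simplified stand-ins chosen for illustration, not the regexes in this PR; the point is only the ^...$ anchoring.

```javascript
// Hypothetical, simplified patterns; the real adWords / loadingWords regexes differ.
const AD_WORDS = /^(ad|ads|advertisement|advertising|sponsored)$/i;
const LOADING_WORDS = /^loading(\.\.\.|…)?$/i;

function isAdOrLoadingText(innerText) {
  // Anchoring with ^ and $ means the whole (trimmed) string must be the keyword,
  // so a sentence that merely contains "ad" or "loading" does not match.
  const text = innerText.trim();
  return AD_WORDS.test(text) || LOADING_WORDS.test(text);
}

console.log(isAdOrLoadingText("Advertisement"));
console.log(isAdOrLoadingText("  Loading...  "));
console.log(isAdOrLoadingText("Loading the dishwasher is an art."));
console.log(isAdOrLoadingText("This ad was everywhere last year."));
```

Only the first two calls match. The risk that remains is a legitimate node whose entire text is one of the keywords, for example a one-word heading.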

Closes #861

gijsk

Thanks! Just one nit, but I think this otherwise looks good. We really appreciated the thorough evaluation. Like you, we feel it's a balancing act in terms of the impact, but we think it's generally a positive change so let's roll with it. :-)


gijsk

Thank you!

This ports "fix: ensure short strings of legitimate content are not excluded" (mozilla/readability#867)

inhumantsar mentioned this pull request May 4, 2024

fix: relax filtering of heading elements with classnames that include the word "header" #868

Open

gijsk requested changes May 17, 2024


inhumantsar added 4 commits May 19, 2024 11:10

fix: capture short paragraphs

9164961

clean up and fix some of the regressions

10164f4

update changelog, fix linting errors

cc4f534

address review comments

f418741

inhumantsar force-pushed the fix-short-paras-861 branch from 6e44d6b to f418741 May 19, 2024 16:17

inhumantsar added 5 commits May 19, 2024 11:19

derp

63a5f36

yeah nvm

8264aef

hm ok

b45c505

not sure how i managed to delete these and not notice...

8ef68c7

another embarassing commit to fix a linting issue

7836fab

gijsk approved these changes May 20, 2024


gijsk merged commit c7f0ef1 into mozilla:main May 20, 2024

mislav added a commit to mislav/go-readability that referenced this pull request Jun 19, 2025

Ensure short strings of legitimate content are not excluded

0c9420f

This ports "fix: ensure short strings of legitimate content are not excluded" (mozilla/readability#867)

mislav added a commit to mislav/go-readability that referenced this pull request Jun 23, 2025

Ensure short strings of legitimate content are not excluded

19fa999

This ports "fix: ensure short strings of legitimate content are not excluded" (mozilla/readability#867)
