Skip to content

Commit c7f0ef1

Browse files
authored
fix: ensure short strings of legitimate content are not excluded (#867)
* fix: capture short paragraphs * clean up and fix some of the regressions * update changelog, fix linting errors
1 parent f5e8701 commit c7f0ef1

File tree

16 files changed

+3509
-30
lines changed

16 files changed

+3509
-30
lines changed

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,7 @@ reasonable).
1515

1616
- [Add Parsely tags as a fallback metadata source](https://github.com/mozilla/readability/pull/865)
1717
- [Fix the case that jsonld parse process is ignored when context url include the trailing slash](https://github.com/mozilla/readability/pull/833)
18+
- [Fixed situations where short paragraphs of legitimate content would be excluded](https://github.com/mozilla/readability/pull/867)
1819

1920
## [0.5.0] - 2023-12-15
2021

Readability.js

Lines changed: 54 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -145,7 +145,10 @@ Readability.prototype = {
145145
// see: https://en.wikipedia.org/wiki/Comma#Comma_variants
146146
commas: /\u002C|\u060C|\uFE50|\uFE10|\uFE11|\u2E41|\u2E34|\u2E32|\uFF0C/g,
147147
// See: https://schema.org/Article
148-
jsonLdArticleTypes: /^Article|AdvertiserContentArticle|NewsArticle|AnalysisNewsArticle|AskPublicNewsArticle|BackgroundNewsArticle|OpinionNewsArticle|ReportageNewsArticle|ReviewNewsArticle|Report|SatiricalArticle|ScholarlyArticle|MedicalScholarlyArticle|SocialMediaPosting|BlogPosting|LiveBlogPosting|DiscussionForumPosting|TechArticle|APIReference$/
148+
jsonLdArticleTypes: /^Article|AdvertiserContentArticle|NewsArticle|AnalysisNewsArticle|AskPublicNewsArticle|BackgroundNewsArticle|OpinionNewsArticle|ReportageNewsArticle|ReviewNewsArticle|Report|SatiricalArticle|ScholarlyArticle|MedicalScholarlyArticle|SocialMediaPosting|BlogPosting|LiveBlogPosting|DiscussionForumPosting|TechArticle|APIReference$/,
149+
// used to see if a node's content matches words commonly used for ad blocks or loading indicators
150+
adWords: /^(ad(vertising|vertisement)?|pub(licité)?|werb(ung)?|广|Реклама|Anuncio)$/iu,
151+
loadingWords: /^((loading||Загрузка|chargement|cargando)(|\.\.\.)?)$/iu,
149152
},
150153

151154
UNLIKELY_ROLES: [ "menu", "menubar", "complementary", "navigation", "alert", "alertdialog", "dialog" ],
@@ -2154,17 +2157,57 @@ Readability.prototype = {
21542157
embedCount++;
21552158
}
21562159

2160+
var innerText = this._getInnerText(node);
2161+
2162+
// toss any node whose inner text contains nothing but suspicious words
2163+
if (this.REGEXPS.adWords.test(innerText) || this.REGEXPS.loadingWords.test(innerText)) {
2164+
return true;
2165+
}
2166+
2167+
var contentLength = innerText.length;
21572168
var linkDensity = this._getLinkDensity(node);
2158-
var contentLength = this._getInnerText(node).length;
2159-
2160-
var haveToRemove =
2161-
(img > 1 && p / img < 0.5 && !this._hasAncestorTag(node, "figure")) ||
2162-
(!isList && li > p) ||
2163-
(input > Math.floor(p/3)) ||
2164-
(!isList && headingDensity < 0.9 && contentLength < 25 && (img === 0 || img > 2) && !this._hasAncestorTag(node, "figure")) ||
2165-
(!isList && weight < 25 && linkDensity > 0.2) ||
2166-
(weight >= 25 && linkDensity > 0.5) ||
2167-
((embedCount === 1 && contentLength < 75) || embedCount > 1);
2169+
var textishTags = ["SPAN", "LI", "TD"].concat(Array.from(this.DIV_TO_P_ELEMS));
2170+
var textDensity = this._getTextDensity(node, textishTags);
2171+
var isFigureChild = this._hasAncestorTag(node, "figure");
2172+
2173+
// apply shadiness checks, then check for exceptions
2174+
const shouldRemoveNode = () => {
2175+
const errs = [];
2176+
if (!isFigureChild && img > 1 && p / img < 0.5) {
2177+
errs.push(`Bad p to img ratio (img=${img}, p=${p})`);
2178+
}
2179+
if (!isList && li > p) {
2180+
errs.push(`Too many li's outside of a list. (li=${li} > p=${p})`);
2181+
}
2182+
if (input > Math.floor(p/3)) {
2183+
errs.push(`Too many inputs per p. (input=${input}, p=${p})`);
2184+
}
2185+
if (!isList && !isFigureChild && headingDensity < 0.9 && contentLength < 25 && (img === 0 || img > 2) && linkDensity > 0) {
2186+
errs.push(`Suspiciously short. (headingDensity=${headingDensity}, img=${img}, linkDensity=${linkDensity})`);
2187+
}
2188+
if (!isList && weight < 25 && linkDensity > 0.2) {
2189+
errs.push(`Low weight and a little linky. (linkDensity=${linkDensity})`);
2190+
}
2191+
if (weight >= 25 && linkDensity > 0.5) {
2192+
errs.push(`High weight and mostly links. (linkDensity=${linkDensity})`);
2193+
}
2194+
if ((embedCount === 1 && contentLength < 75) || embedCount > 1) {
2195+
errs.push(`Suspicious embed. (embedCount=${embedCount}, contentLength=${contentLength})`);
2196+
}
2197+
if (img === 0 && textDensity === 0) {
2198+
errs.push(`No useful content. (img=${img}, textDensity=${textDensity})`);
2199+
}
2200+
2201+
if (errs.length > 0) {
2202+
this.log("Checks failed", errs);
2203+
return true;
2204+
}
2205+
2206+
return false;
2207+
};
2208+
2209+
var haveToRemove = shouldRemoveNode();
2210+
21682211
// Allow simple lists of images to remain in pages
21692212
if (isList && haveToRemove) {
21702213
for (var x = 0; x < node.children.length; x++) {

test/debug-testcase.js

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,19 @@
1+
/* eslint-env node */
2+
3+
var Readability = require("../Readability");
4+
var {JSDOM} = require("jsdom");
5+
var fs = require("fs");
6+
var path = require("path");
7+
8+
var testcaseRoot = path.join(__dirname, "test-pages");
9+
10+
if (process.argv.length < 3) {
11+
console.log("No testcase provided.");
12+
process.exit(1);
13+
}
14+
15+
var src = fs.readFileSync(`${testcaseRoot}/${process.argv[2]}/source.html`, {encoding: "utf-8"}).trim();
16+
17+
var doc = new JSDOM(src, {url: "http://fakehost/test/page.html"}).window.document;
18+
19+
new Readability(doc, {debug: true}).parse();

test/test-pages/bug-1255978/expected.html

Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -14,6 +14,32 @@
1414
<p>1. Take any blankets or duvets off the bed</p>
1515
<p>Forrest Jones said that anything that comes into contact with any of the previous guest’s skin should be taken out and washed every time the room is made, but that even the fanciest hotels don’t always do so. "Hotels are getting away from comforters. Blankets are here to stay, however. But some hotels are still hesitant about washing them every day if they think they can get out of it," he said.</p>
1616
<div>
17+
<div data-video-id="4685984084001" data-embed="default" data-player="2d3d4a83-ba40-464e-9bfb-2804b076bf67" data-account="624246174001" id="4685984084001" role="region" aria-label="video player"><video id="4685984084001_html5_api" data-account="624246174001" data-player="2d3d4a83-ba40-464e-9bfb-2804b076bf67" data-embed="default" data-video-id="4685984084001" preload="none" poster="http://brightcove.vo.llnwd.net/e1/pd/624246174001/624246174001_4685986878001_4685984084001-vs.jpg?pubId=624246174001&amp;videoId=4685984084001" src="blob:http://www.independent.co.uk/112e1cb2-b0b1-e146-be22-fc6d052f7ddd"></video>
18+
<p><span>Play Video</span></p>
19+
<div dir="ltr" role="group">
20+
<p><span>Play</span></p>
21+
<div>
22+
<p><span>Current Time </span>0:00</p>
23+
</div>
24+
<div>
25+
<p><span>/</span></p>
26+
</div>
27+
<div>
28+
<p><span>Duration Time</span> 0:00</p>
29+
</div>
30+
<div tabindex="0" role="slider" aria-valuenow="NaN" aria-valuemin="0" aria-valuemax="100" aria-label="progress bar" aria-valuetext="0:00">
31+
<p><span><span>Loaded</span>: 0%</span>
32+
</p>
33+
<p><span><span>Progress</span>: 0%</span>
34+
</p>
35+
</div>
36+
<div>
37+
<p><span>Remaining Time</span> -0:00</p>
38+
</div>
39+
<p><span>Share</span></p>
40+
<p><span>Fullscreen</span></p>
41+
</div>
42+
</div>
1743
<p>Video shows bed bug infestation at New York hotel</p>
1844
</div>
1945
<div>

test/test-pages/citylab-1/expected.html

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,10 @@
2020
</figure>
2121
<div>
2222
<h2 itemprop="headline"> Why Neon Is the Ultimate Symbol of the 20th Century </h2>
23+
<div>
24+
<p><span><time>1:39 PM ET</time></span>
25+
</p>
26+
</div>
2327
</div>
2428
<h2 itemprop="description"> The once-ubiquitous form of lighting was novel when it first emerged in the early 1900s, though it has since come to represent decline. </h2>
2529
<section id="article-section-1">

test/test-pages/ehow-1/expected.html

Lines changed: 3 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1,17 +1,15 @@
11
<div id="readability-page-1" class="page">
22
<div>
33
<header>
4+
<div>
5+
<p><span></span> <span></span> <span>Found This Helpful</span> </p>
6+
</div>
47
</header>
58
<div>
69
<p>Glass cloche terrariums are not only appealing to the eye, but they also preserve a bit of nature in your home and serve as a simple, yet beautiful, piece of art. Closed terrariums are easy to care for, as they retain much of their own moisture and provide a warm environment with a consistent level of humidity. You won’t have to water the terrariums unless you see that the walls are not misting up. Small growing plants that don’t require a lot of light work best such as succulents, ferns, moss, even orchids.</p>
710
<figure> <img src="http://img-aws.ehowcdn.com/640/cme/photography.prod.demandstudios.com/16149374-814f-40bc-baf3-ca20f149f0ba.jpg" alt="Glass cloche terrariums" title="Glass cloche terrariums" data-credit="Lucy Akins " longdesc="http://s3.amazonaws.com/photography.prod.demandstudios.com/16149374-814f-40bc-baf3-ca20f149f0ba.jpg" /> </figure>
811
<figcaption class="caption"> Glass cloche terrariums (Lucy Akins) </figcaption>
912
</div>
10-
<div id="relatedContentUpper" data-module="rcp_top">
11-
<header>
12-
<h3>Other People Are Reading</h3>
13-
</header>
14-
</div>
1513
<div>
1614
<p><span>What You'll Need:</span></p>
1715
<ul>

test/test-pages/ehow-2/expected.html

Lines changed: 12 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -1,16 +1,22 @@
11
<div id="readability-page-1" class="page">
2-
<div data-type="AuthorProfile">
3-
<div>
4-
<p><a id="img-follow-tip" href="http://fakehost/contributor/gina_robertsgrey/" target="_top">
5-
<img src="http://img-aws.ehowcdn.com/60x60/cme/cme_public_images/www_demandstudios_com/sitelife.studiod.com/ver1.0/Content/images/store/9/2/d9dd6f61-b183-4893-927f-5b540e45be91.Small.jpg" data-failover="//img-aws.ehowcdn.com/60x60/ehow-cdn-assets/test15/media/images/authors/missing-author-image.png" onerror="var failover = this.getAttribute(&apos;data-failover&apos;);
2+
<div>
3+
<div data-type="AuthorProfile">
4+
<div>
5+
<p><a id="img-follow-tip" href="http://fakehost/contributor/gina_robertsgrey/" target="_top">
6+
<img src="http://img-aws.ehowcdn.com/60x60/cme/cme_public_images/www_demandstudios_com/sitelife.studiod.com/ver1.0/Content/images/store/9/2/d9dd6f61-b183-4893-927f-5b540e45be91.Small.jpg" data-failover="//img-aws.ehowcdn.com/60x60/ehow-cdn-assets/test15/media/images/authors/missing-author-image.png" onerror="var failover = this.getAttribute(&apos;data-failover&apos;);
67
if (failover) failover = failover.replace(/^https?:/,&apos;&apos;);
78
var src = this.src ? this.src.replace(/^https?:/,&apos;&apos;) : &apos;&apos;;
89
if (src != failover){
910
this.src = failover;
1011
}" /> </a></p>
12+
</div>
13+
<p><time datetime="2016-09-14T07:07:00-04:00" itemprop="dateModified">Last updated September 14, 2016</time>
14+
</p>
15+
</div>
16+
<div data-score="true" data-url="http://www.ehow.com/how_4851888_throw-graduation-party-budget.html">
17+
<p><span> Save</span>
18+
</p>
1119
</div>
12-
<p><time datetime="2016-09-14T07:07:00-04:00" itemprop="dateModified">Last updated September 14, 2016</time>
13-
</p>
1420
</div>
1521
<div>
1622
<article data-type="article">
@@ -116,11 +122,6 @@
116122
</figure>
117123
<figcaption class="caption"> Mark Stout/iStock/Getty Images </figcaption>
118124
</div>
119-
<div id="relatedContentUpper" data-module="rcp_top">
120-
<header>
121-
<h3>Other People Are Reading</h3>
122-
</header>
123-
</div>
124125
</span>
125126
</span>
126127
<span>

test/test-pages/engadget/expected.html

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,15 @@ <h4> Gallery: Xbox One X | 14 Photos </h4>
1111
</div>
1212
</section>
1313
<div>
14+
<div>
15+
<div>
16+
<p><span>from</span>&nbsp;<span>$610.00</span>
17+
</p>
18+
</div>
19+
<div>
20+
<p> 87 </p>
21+
</div>
22+
</div>
1423
<div>
1524
<div>
1625
<ul>

test/test-pages/firefox-nightly-blog/expected.html

Lines changed: 0 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -246,9 +246,6 @@ <h4>
246246
</p>
247247
</li>
248248
</ol>
249-
<div id="respond">
250-
<h3 id="reply-title"> Leave a Reply </h3>
251-
</div>
252249
</div>
253250
</div>
254251
</div>

test/test-pages/mercurial/expected.html

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -90,6 +90,10 @@ <h3>
9090
<a href="#id7">Setting up</a><a href="#setting-up" title="Permalink to this headline"></a>
9191
</h3>
9292
<p> We’ll work through an example with three local repositories, although in the real world they’d most likely be on three different computers. First, the <tt><span>public</span></tt> repository is where tested, polished changesets live, and it is where you synchronize with the rest of your team. </p>
93+
<div>
94+
<pre>$ hg init public
95+
</pre>
96+
</div>
9397
<p> We’ll need two clones where work gets done, <tt><span>test-repo</span></tt> and <tt><span>dev-repo</span></tt>: </p>
9498
<div>
9599
<pre>$ hg clone public test-repo
@@ -124,6 +128,11 @@ <h3>
124128
</pre>
125129
</div>
126130
<p> and add </p>
131+
<div>
132+
<pre>[extensions]
133+
evolve =
134+
</pre>
135+
</div>
127136
<p> Keep in mind that in real life, these repositories would probably be on separate computers, so you’d have to login to each one to configure each repository. </p>
128137
<p> To start things off, let’s make one public, immutable changeset: </p>
129138
<div>
@@ -228,6 +237,11 @@ <h3>
228237
</pre>
229238
</div>
230239
<p> and add </p>
240+
<div>
241+
<pre>[extensions]
242+
evolve =
243+
</pre>
244+
</div>
231245
<p> Then edit Bob’s repository configuration: </p>
232246
<div>
233247
<pre>$ hg -R bob config --edit --local
@@ -523,6 +537,10 @@ <h3>
523537
<p> [figure SG07: 2:e011 now public not obsolete, 4:fe88 now bumped] </p>
524538
</blockquote>
525539
<p> As usual when there’s trouble in your repository, the solution is to evolve it: </p>
540+
<div>
541+
<pre>$ hg evolve --all
542+
</pre>
543+
</div>
526544
<p> Figure 8 illustrates Bob’s repository after evolving away the bumped changeset. Ignoring the obsolete changesets, Bob now has a nice, clean, simple history. His amendment of Alice’s bug fix lives on, as changeset 5:227d—albeit with a software-generated commit message. (Bob should probably amend that changeset to improve the commit message.) But the important thing is that his repository no longer has any troubled changesets, thanks to <tt><span>evolve</span></tt>. </p>
527545
<blockquote>
528546
<p> [figure SG08: 5:227d is new, formerly bumped changeset 4:fe88 now hidden] </p>

0 commit comments

Comments
 (0)