fix: ensure short strings of legitimate content are not excluded #867

Merged 9 commits on May 20, 2024
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -15,6 +15,7 @@ reasonable).

- [Add Parsely tags as a fallback metadata source](https://github.com/mozilla/readability/pull/865)
- [Fix the case that jsonld parse process is ignored when context url include the trailing slash](https://github.com/mozilla/readability/pull/833)
- [Fixed situations where short paragraphs of legitimate content would be excluded](https://github.com/mozilla/readability/pull/867)

## [0.5.0] - 2023-12-15

65 changes: 54 additions & 11 deletions Readability.js
@@ -145,7 +145,10 @@ Readability.prototype = {
// see: https://en.wikipedia.org/wiki/Comma#Comma_variants
commas: /\u002C|\u060C|\uFE50|\uFE10|\uFE11|\u2E41|\u2E34|\u2E32|\uFF0C/g,
// See: https://schema.org/Article
jsonLdArticleTypes: /^Article|AdvertiserContentArticle|NewsArticle|AnalysisNewsArticle|AskPublicNewsArticle|BackgroundNewsArticle|OpinionNewsArticle|ReportageNewsArticle|ReviewNewsArticle|Report|SatiricalArticle|ScholarlyArticle|MedicalScholarlyArticle|SocialMediaPosting|BlogPosting|LiveBlogPosting|DiscussionForumPosting|TechArticle|APIReference$/
jsonLdArticleTypes: /^Article|AdvertiserContentArticle|NewsArticle|AnalysisNewsArticle|AskPublicNewsArticle|BackgroundNewsArticle|OpinionNewsArticle|ReportageNewsArticle|ReviewNewsArticle|Report|SatiricalArticle|ScholarlyArticle|MedicalScholarlyArticle|SocialMediaPosting|BlogPosting|LiveBlogPosting|DiscussionForumPosting|TechArticle|APIReference$/,
// used to see if a node's content matches words commonly used for ad blocks or loading indicators
adWords: /^(ad(vertising|vertisement)?|pub(licité)?|werb(ung)?|广告|Реклама|Anuncio)$/iu,
loadingWords: /^((loading|正在加载|Загрузка|chargement|cargando)(…|\.\.\.)?)$/iu,
},

UNLIKELY_ROLES: [ "menu", "menubar", "complementary", "navigation", "alert", "alertdialog", "dialog" ],
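
Both patterns added to `REGEXPS` in this hunk are anchored with `^` and `$`, so they should only fire when a node's entire text is an ad label or a loading indicator, not when those words appear inside longer content. A minimal sketch of that behaviour follows; the helper name `isOnlyAdOrLoadingText` and the sample strings are illustrative assumptions, and the `trim()` call is this sketch's own choice rather than something shown in the diff.

```javascript
// Copies of the two patterns added to REGEXPS in this hunk.
const adWords = /^(ad(vertising|vertisement)?|pub(licité)?|werb(ung)?|广告|Реклама|Anuncio)$/iu;
const loadingWords = /^((loading|正在加载|Загрузка|chargement|cargando)(…|\.\.\.)?)$/iu;

// Hypothetical helper (not part of Readability.js): true when the text is
// nothing but an ad label or a loading indicator. Trimming is an assumption
// of this sketch so that stray whitespace does not defeat the anchors.
function isOnlyAdOrLoadingText(text) {
  const trimmed = text.trim();
  return adWords.test(trimmed) || loadingWords.test(trimmed);
}

// Illustrative inputs: the first three are whole-string labels, the last two
// merely start with a suspicious word.
const samples = ["Advertisement", "Loading...", "Werbung", "Add to cart", "Loading the dishwasher"];
for (const sample of samples) {
  console.log(JSON.stringify(sample), isOnlyAdOrLoadingText(sample));
}
```

Neither pattern uses the `g` flag, so repeated `test()` calls carry no `lastIndex` state between nodes.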
@@ -2154,17 +2157,57 @@ Readability.prototype = {
embedCount++;
}

var innerText = this._getInnerText(node);

// toss any node whose inner text contains nothing but suspicious words
if (this.REGEXPS.adWords.test(innerText) || this.REGEXPS.loadingWords.test(innerText)) {
return true;
}

var contentLength = innerText.length;
var linkDensity = this._getLinkDensity(node);
var contentLength = this._getInnerText(node).length;

var haveToRemove =
(img > 1 && p / img < 0.5 && !this._hasAncestorTag(node, "figure")) ||
(!isList && li > p) ||
(input > Math.floor(p/3)) ||
(!isList && headingDensity < 0.9 && contentLength < 25 && (img === 0 || img > 2) && !this._hasAncestorTag(node, "figure")) ||
(!isList && weight < 25 && linkDensity > 0.2) ||
(weight >= 25 && linkDensity > 0.5) ||
((embedCount === 1 && contentLength < 75) || embedCount > 1);
var textishTags = ["SPAN", "LI", "TD"].concat(Array.from(this.DIV_TO_P_ELEMS));
var textDensity = this._getTextDensity(node, textishTags);
var isFigureChild = this._hasAncestorTag(node, "figure");

// apply shadiness checks, then check for exceptions
const shouldRemoveNode = () => {
const errs = [];
if (!isFigureChild && img > 1 && p / img < 0.5) {
errs.push(`Bad p to img ratio (img=${img}, p=${p})`);
}
if (!isList && li > p) {
errs.push(`Too many li's outside of a list. (li=${li} > p=${p})`);
}
if (input > Math.floor(p/3)) {
errs.push(`Too many inputs per p. (input=${input}, p=${p})`);
}
if (!isList && !isFigureChild && headingDensity < 0.9 && contentLength < 25 && (img === 0 || img > 2) && linkDensity > 0) {
errs.push(`Suspiciously short. (headingDensity=${headingDensity}, img=${img}, linkDensity=${linkDensity})`);
}
if (!isList && weight < 25 && linkDensity > 0.2) {
errs.push(`Low weight and a little linky. (linkDensity=${linkDensity})`);
}
if (weight >= 25 && linkDensity > 0.5) {
errs.push(`High weight and mostly links. (linkDensity=${linkDensity})`);
}
if ((embedCount === 1 && contentLength < 75) || embedCount > 1) {
errs.push(`Suspicious embed. (embedCount=${embedCount}, contentLength=${contentLength})`);
}
if (img === 0 && textDensity === 0) {
errs.push(`No useful content. (img=${img}, textDensity=${textDensity})`);
}

if (errs.length > 0) {
this.log("Checks failed", errs);
return true;
}

return false;
};

var haveToRemove = shouldRemoveNode();

// Allow simple lists of images to remain in pages
if (isList && haveToRemove) {
for (var x = 0; x < node.children.length; x++) {
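
The core behavioural change in this hunk is the "Suspiciously short" check: the old expression removed any short non-list node regardless of links, while the new one additionally requires `linkDensity > 0`. The sketch below isolates just that one condition to show why short, link-free blocks now survive. The function names and the sample node values are hypothetical, and the other checks in `shouldRemoveNode` (weight, embeds, text density) are deliberately left out.

```javascript
// The short-content condition as it appeared in the old haveToRemove expression.
function oldShortCheck({ isList, isFigureChild, headingDensity, contentLength, img }) {
  return !isList && headingDensity < 0.9 && contentLength < 25 &&
    (img === 0 || img > 2) && !isFigureChild;
}

// The same condition after this change: it now also requires at least one link.
function newShortCheck({ isList, isFigureChild, headingDensity, contentLength, img, linkDensity }) {
  return !isList && !isFigureChild && headingDensity < 0.9 && contentLength < 25 &&
    (img === 0 || img > 2) && linkDensity > 0;
}

// Hypothetical measurements for a short, link-free block (16 characters of text).
const shortPlainBlock = {
  isList: false, isFigureChild: false, headingDensity: 0,
  contentLength: 16, img: 0, linkDensity: 0,
};

// The same block, but with half of its text inside links.
const shortLinkyBlock = { ...shortPlainBlock, linkDensity: 0.5 };

console.log(oldShortCheck(shortPlainBlock)); // flagged by the old rule
console.log(newShortCheck(shortPlainBlock)); // not flagged by the new rule
console.log(newShortCheck(shortLinkyBlock)); // still flagged when links are present
```

This matches the updated expectations further down, where short blocks with no links are restored to the output.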
19 changes: 19 additions & 0 deletions test/debug-testcase.js
@@ -0,0 +1,19 @@
/* eslint-env node */

var Readability = require("../Readability");
var {JSDOM} = require("jsdom");
var fs = require("fs");
var path = require("path");

var testcaseRoot = path.join(__dirname, "test-pages");

if (process.argv.length < 3) {
console.log("No testcase provided.");
process.exit(1);
}

var src = fs.readFileSync(`${testcaseRoot}/${process.argv[2]}/source.html`, {encoding: "utf-8"}).trim();

var doc = new JSDOM(src, {url: "http://fakehost/test/page.html"}).window.document;

new Readability(doc, {debug: true}).parse();
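
As written, the script passes `process.argv[2]` straight to `fs.readFileSync`, so a mistyped testcase name surfaces as a raw `ENOENT` exception. One possible refinement, sketched here under assumptions, is to resolve and check the path first. The helper `findTestcaseSource` is hypothetical and not part of this PR; the demo runs against a throwaway temp directory rather than the real `test-pages` folder.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper (not in the PR): returns the path to a testcase's
// source.html, or null when no name was given or the file does not exist.
function findTestcaseSource(testcaseRoot, name) {
  if (!name) {
    return null;
  }
  const candidate = path.join(testcaseRoot, name, "source.html");
  return fs.existsSync(candidate) ? candidate : null;
}

// Demo against a temporary directory standing in for test/test-pages.
const root = fs.mkdtempSync(path.join(os.tmpdir(), "testcases-"));
fs.mkdirSync(path.join(root, "demo-case"));
fs.writeFileSync(path.join(root, "demo-case", "source.html"), "<p>hi</p>");

console.log(findTestcaseSource(root, "demo-case"));    // path ending in demo-case/source.html
console.log(findTestcaseSource(root, "missing-case")); // null
```

The script itself takes the testcase directory name as its only argument, for example `node test/debug-testcase.js bug-1255978`.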
26 changes: 26 additions & 0 deletions test/test-pages/bug-1255978/expected.html
@@ -14,6 +14,32 @@
<p>1. Take any blankets or duvets off the bed</p>
<p>Forrest Jones said that anything that comes into contact with any of the previous guest’s skin should be taken out and washed every time the room is made, but that even the fanciest hotels don’t always do so. "Hotels are getting away from comforters. Blankets are here to stay, however. But some hotels are still hesitant about washing them every day if they think they can get out of it," he said.</p>
<div>
<div data-video-id="4685984084001" data-embed="default" data-player="2d3d4a83-ba40-464e-9bfb-2804b076bf67" data-account="624246174001" id="4685984084001" role="region" aria-label="video player"><video id="4685984084001_html5_api" data-account="624246174001" data-player="2d3d4a83-ba40-464e-9bfb-2804b076bf67" data-embed="default" data-video-id="4685984084001" preload="none" poster="http://brightcove.vo.llnwd.net/e1/pd/624246174001/624246174001_4685986878001_4685984084001-vs.jpg?pubId=624246174001&amp;videoId=4685984084001" src="blob:http://www.independent.co.uk/112e1cb2-b0b1-e146-be22-fc6d052f7ddd"></video>
<p><span>Play Video</span></p>
<div dir="ltr" role="group">
<p><span>Play</span></p>
<div>
<p><span>Current Time </span>0:00</p>
</div>
<div>
<p><span>/</span></p>
</div>
<div>
<p><span>Duration Time</span> 0:00</p>
</div>
<div tabindex="0" role="slider" aria-valuenow="NaN" aria-valuemin="0" aria-valuemax="100" aria-label="progress bar" aria-valuetext="0:00">
<p><span><span>Loaded</span>: 0%</span>
</p>
<p><span><span>Progress</span>: 0%</span>
</p>
</div>
<div>
<p><span>Remaining Time</span> -0:00</p>
</div>
<p><span>Share</span></p>
<p><span>Fullscreen</span></p>
</div>
</div>
<p>Video shows bed bug infestation at New York hotel</p>
</div>
<div>
4 changes: 4 additions & 0 deletions test/test-pages/citylab-1/expected.html
@@ -20,6 +20,10 @@
</figure>
<div>
<h2 itemprop="headline"> Why Neon Is the Ultimate Symbol of the 20th Century </h2>
<div>
<p><span><time>1:39 PM ET</time></span>
</p>
</div>
</div>
<h2 itemprop="description"> The once-ubiquitous form of lighting was novel when it first emerged in the early 1900s, though it has since come to represent decline. </h2>
<section id="article-section-1">
8 changes: 3 additions & 5 deletions test/test-pages/ehow-1/expected.html
@@ -1,17 +1,15 @@
<div id="readability-page-1" class="page">
<div>
<header>
<div>
<p><span></span> <span></span> <span>Found This Helpful</span> </p>
</div>
</header>
<div>
<p>Glass cloche terrariums are not only appealing to the eye, but they also preserve a bit of nature in your home and serve as a simple, yet beautiful, piece of art. Closed terrariums are easy to care for, as they retain much of their own moisture and provide a warm environment with a consistent level of humidity. You won’t have to water the terrariums unless you see that the walls are not misting up. Small growing plants that don’t require a lot of light work best such as succulents, ferns, moss, even orchids.</p>
<figure> <img src="http://img-aws.ehowcdn.com/640/cme/photography.prod.demandstudios.com/16149374-814f-40bc-baf3-ca20f149f0ba.jpg" alt="Glass cloche terrariums" title="Glass cloche terrariums" data-credit="Lucy Akins " longdesc="http://s3.amazonaws.com/photography.prod.demandstudios.com/16149374-814f-40bc-baf3-ca20f149f0ba.jpg" /> </figure>
<figcaption class="caption"> Glass cloche terrariums (Lucy Akins) </figcaption>
</div>
<div id="relatedContentUpper" data-module="rcp_top">
<header>
<h3>Other People Are Reading</h3>
</header>
</div>
<div>
<p><span>What You'll Need:</span></p>
<ul>
23 changes: 12 additions & 11 deletions test/test-pages/ehow-2/expected.html
@@ -1,16 +1,22 @@
<div id="readability-page-1" class="page">
<div data-type="AuthorProfile">
<div>
<p><a id="img-follow-tip" href="http://fakehost/contributor/gina_robertsgrey/" target="_top">
<img src="http://img-aws.ehowcdn.com/60x60/cme/cme_public_images/www_demandstudios_com/sitelife.studiod.com/ver1.0/Content/images/store/9/2/d9dd6f61-b183-4893-927f-5b540e45be91.Small.jpg" data-failover="//img-aws.ehowcdn.com/60x60/ehow-cdn-assets/test15/media/images/authors/missing-author-image.png" onerror="var failover = this.getAttribute(&apos;data-failover&apos;);
<div>
<div data-type="AuthorProfile">
<div>
<p><a id="img-follow-tip" href="http://fakehost/contributor/gina_robertsgrey/" target="_top">
<img src="http://img-aws.ehowcdn.com/60x60/cme/cme_public_images/www_demandstudios_com/sitelife.studiod.com/ver1.0/Content/images/store/9/2/d9dd6f61-b183-4893-927f-5b540e45be91.Small.jpg" data-failover="//img-aws.ehowcdn.com/60x60/ehow-cdn-assets/test15/media/images/authors/missing-author-image.png" onerror="var failover = this.getAttribute(&apos;data-failover&apos;);
if (failover) failover = failover.replace(/^https?:/,&apos;&apos;);
var src = this.src ? this.src.replace(/^https?:/,&apos;&apos;) : &apos;&apos;;
if (src != failover){
this.src = failover;
}" /> </a></p>
</div>
<p><time datetime="2016-09-14T07:07:00-04:00" itemprop="dateModified">Last updated September 14, 2016</time>
</p>
</div>
<div data-score="true" data-url="http://www.ehow.com/how_4851888_throw-graduation-party-budget.html">
<p><span> Save</span>
</p>
</div>
<p><time datetime="2016-09-14T07:07:00-04:00" itemprop="dateModified">Last updated September 14, 2016</time>
</p>
</div>
<div>
<article data-type="article">
@@ -116,11 +122,6 @@
</figure>
<figcaption class="caption"> Mark Stout/iStock/Getty Images </figcaption>
</div>
<div id="relatedContentUpper" data-module="rcp_top">
<header>
<h3>Other People Are Reading</h3>
</header>
</div>
</span>
</span>
<span>
9 changes: 9 additions & 0 deletions test/test-pages/engadget/expected.html
@@ -11,6 +11,15 @@ <h4> Gallery: Xbox One X | 14 Photos </h4>
</div>
</section>
<div>
<div>
<div>
<p><span>from</span>&nbsp;<span>$610.00</span>
</p>
</div>
<div>
<p> 87 </p>
</div>
</div>
<div>
<div>
<ul>
3 changes: 0 additions & 3 deletions test/test-pages/firefox-nightly-blog/expected.html
@@ -246,9 +246,6 @@ <h4>
</p>
</li>
</ol>
<div id="respond">
<h3 id="reply-title"> Leave a Reply </h3>
</div>
</div>
</div>
</div>
18 changes: 18 additions & 0 deletions test/test-pages/mercurial/expected.html
@@ -90,6 +90,10 @@ <h3>
<a href="#id7">Setting up</a><a href="#setting-up" title="Permalink to this headline">¶</a>
</h3>
<p> We’ll work through an example with three local repositories, although in the real world they’d most likely be on three different computers. First, the <tt><span>public</span></tt> repository is where tested, polished changesets live, and it is where you synchronize with the rest of your team. </p>
<div>
<pre>$ hg init public
</pre>
</div>
<p> We’ll need two clones where work gets done, <tt><span>test-repo</span></tt> and <tt><span>dev-repo</span></tt>: </p>
<div>
<pre>$ hg clone public test-repo
@@ -124,6 +128,11 @@ <h3>
</pre>
</div>
<p> and add </p>
<div>
<pre>[extensions]
evolve =
</pre>
</div>
<p> Keep in mind that in real life, these repositories would probably be on separate computers, so you’d have to login to each one to configure each repository. </p>
<p> To start things off, let’s make one public, immutable changeset: </p>
<div>
@@ -228,6 +237,11 @@ <h3>
</pre>
</div>
<p> and add </p>
<div>
<pre>[extensions]
evolve =
</pre>
</div>
<p> Then edit Bob’s repository configuration: </p>
<div>
<pre>$ hg -R bob config --edit --local
@@ -523,6 +537,10 @@ <h3>
<p> [figure SG07: 2:e011 now public not obsolete, 4:fe88 now bumped] </p>
</blockquote>
<p> As usual when there’s trouble in your repository, the solution is to evolve it: </p>
<div>
<pre>$ hg evolve --all
</pre>
</div>
<p> Figure 8 illustrates Bob’s repository after evolving away the bumped changeset. Ignoring the obsolete changesets, Bob now has a nice, clean, simple history. His amendment of Alice’s bug fix lives on, as changeset 5:227d—albeit with a software-generated commit message. (Bob should probably amend that changeset to improve the commit message.) But the important thing is that his repository no longer has any troubled changesets, thanks to <tt><span>evolve</span></tt>. </p>
<blockquote>
<p> [figure SG08: 5:227d is new, formerly bumped changeset 4:fe88 now hidden] </p>
3 changes: 3 additions & 0 deletions test/test-pages/qq/expected.html
@@ -1,5 +1,8 @@
<div id="readability-page-1" class="page">
<div id="C-Main-Article-QQ">
<div bosszone="titleDown">
<p><span><span bosszone="jgname">TNW中文站</span></span><span>2016年10月14日07:17</span></p>
</div>
<div id="Cnt-Main-Article-QQ" bosszone="content" accesskey="3" tabindex="-1">
<div>
<p><span>转播到腾讯微博</span></p>
10 changes: 10 additions & 0 deletions test/test-pages/royal-road/expected-metadata.json
@@ -0,0 +1,10 @@
{
"title": "ONE HUNDRED TWO: What kind of wordchain? - Super Supportive",
"byline": "Follow Author",
"dir": null,
"lang": "en",
"excerpt": "102 “Were you expecting the competition for the showers to be the highest drama part of gym class?” Alden asked Haoyu as the two of them headed (...)",
"siteName": "Royal Road",
"publishedTime": null,
"readerable": true
}