Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crashes on all Pinterest and many other websites, minimal reproduction #836

Closed
koresar opened this issue Jan 19, 2024 · 6 comments
Closed

Comments

@koresar
Copy link

koresar commented Jan 19, 2024

const html = `
<div><div><div>
More than 25 characters!!
</div></div></div>
<div><meta name="twitter:title" content="1"></div>`;

const doc = new DOMParser().parseFromString(html, 'text/html')
const base = doc.createElement('base')
base.setAttribute('href', url)
doc.head.appendChild(base)
const reader = new Readability(doc, {
    keepClasses: true,
})

const result = reader.parse(); // crashes here

Stack trace:

TypeError: Cannot read properties of null (reading 'tagName')
    at Readability._grabArticle (node_modules/@extractus/article-extractor/node_modules/@mozilla/readability/Readability.js:1147:37)
    at Readability.parse (node_modules/@extractus/article-extractor/node_modules/@mozilla/readability/Readability.js:2265:31)

Pages to test on:

I couldn't fix myself. Sorry.

@gijsk
Copy link
Contributor

gijsk commented Jan 23, 2024

@koresar Thanks for the report. What environment do you run your minimal testcase in? In nodejs, it produces DOMParser is not defined errors.

For the pinterest urls, I see that readability doesn't output anything (when running npm run generate-testcase pinterest "https://www.pinterest.ca/variamsingh87/") but I don't see "crashes" or an error with the stack trace you're mentioning. Are you sure it's the same problem?

What version of readability are you using?

@koresar
Copy link
Author

koresar commented Jan 24, 2024

Apologies @gijsk
I forgot to bring more details.

That's Node.js. All Readability versions affected (I tried them all).

npm i @mozilla/readability linkedom

test.mjs:

import { Readability } from "@mozilla/readability";
import { DOMParser } from "linkedom";

const url = "https://www.pinterest.ca/variamsingh87/";
const html = `
<div><div><div>
More than 25 characters!!
</div></div></div>
<div><meta name="twitter:title" content="1"></div>`;

const doc = new DOMParser().parseFromString(html, "text/html");
const base = doc.createElement("base");
base.setAttribute("href", url);
doc.head.appendChild(base);

const reader = new Readability(doc, {
  keepClasses: true,
});
const result = reader.parse() ?? {};

console.log(result.textContent ? result.content : null);

Run:

node test.mjs

Output:

/Users/vasyl/code/killme/node_modules/@mozilla/readability/Readability.js:1150
        while (parentOfTopCandidate.tagName !== "BODY") {
                                    ^

TypeError: Cannot read properties of null (reading 'tagName')
    at Readability._grabArticle (/Users/vasyl/code/killme/node_modules/@mozilla/readability/Readability.js:1150:37)
    at Readability.parse (/Users/vasyl/code/killme/node_modules/@mozilla/readability/Readability.js:2277:31)
    at file:///Users/vasyl/code/killme/readability.mjs:18:23
    at ModuleJob.run (node:internal/modules/esm/module_job:193:25)
    at async Promise.all (index 0)
    at async ESMLoader.import (node:internal/modules/esm/loader:530:24)
    at async loadESM (node:internal/process/esm_loader:91:5)
    at async handleMainPromise (node:internal/modules/run_main:65:12)

Reading your comment I am starting to think that it's the linkedom bug. :(

@gijsk
Copy link
Contributor

gijsk commented Jan 24, 2024

No worries. It's possible it's a linkedom problem. I'm a bit swamped at the moment and don't have a lot of time to look myself - if it's easy for you to check, you could see if the same thing happens if you run the same testcase but use DOMParser from jsdom? That would help narrow down the problem.

linkedom has a very... opinionated... approach to the DOM, and basically claims that it'd be better if some DOM mechanisms worked differently in order to improve performance. It requires consumers to not rely in any way on those DOM mechanisms working the way the browsers, jsdom and the spec say they do (and in exchange, in principle performance should be better). We've previously fixed some issues with linkedom compatibility and if it's straightforward to do so here I would be open to doing it, but we'd need to understand a bit more what exactly is happening differently with linkedom vs all the other DOM implementations.

@koresar
Copy link
Author

koresar commented Jan 25, 2024

Looks like it doesn't make much sense to fix linkedom issues in readability. Closing now.
Thank you!

@koresar koresar closed this as not planned Won't fix, can't repro, duplicate, stale Jan 25, 2024
@SettingDust
Copy link

After some research. It's caused by the input HTML. WebReflection/linkedom#147
As linkedom side points out, you need to wrap your element in the body.
The input in your example(#836 (comment)) will be parsed to html like below

`
<div><head><base href="https://www.pinterest.ca/variamsingh87/"></head><body></body><div><p>
More than 25 characters!!
</p></div></div>
<P><meta name="twitter:title" content="1"></P>`

So that readability won't accept that since the body isn't actual body

@SettingDust
Copy link

But for Pinterest it's a bit different. I doubt readability can handle it properly.
图片

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants