Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crashes on Pinterest and a lot of other websites #380

Closed
koresar opened this issue Jan 24, 2024 · 16 comments
Closed

Crashes on Pinterest and a lot of other websites #380

koresar opened this issue Jan 24, 2024 · 16 comments

Comments

@koresar
Copy link

koresar commented Jan 24, 2024

Pages to test on:

Code:

import { extract } from '@extractus/article-extractor'
const input = 'https://www.pinterest.ca/variamsingh87/'
await extract(input)

Error:

TypeError: Cannot read properties of null (reading 'tagName')
    at Readability._grabArticle (/Users/vasyl/code/killme/node_modules/@mozilla/readability/Readability.js:1150:37)
    at Readability.parse (/Users/vasyl/code/killme/node_modules/@mozilla/readability/Readability.js:2277:31)
    at default (file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/utils/extractWithReadability.js:18:25)
    at file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/utils/parseFromHtml.js:88:14
    at file:///Users/vasyl/code/killme/node_modules/bellajs/src/utils/pipe.js:4:38
    at file:///Users/vasyl/code/killme/node_modules/bellajs/src/utils/pipe.js:4:40
    at file:///Users/vasyl/code/killme/node_modules/bellajs/src/utils/pipe.js:4:40
    at default (file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/utils/parseFromHtml.js:98:19)
    at extract (file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/main.js:24:10)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)

I presume that the bug is somewhere inside the linkedom package, DOMParser class.

@ndaidong
Copy link
Collaborator

@koresar unfortunately, this library works best with simple article formats. Complex layouts like lists and grids can be tricky for it to handle.

@koresar
Copy link
Author

koresar commented Jan 24, 2024

I believe it would work fine if the linkedom was less buggy.

@SettingDust
Copy link
Member

I believe it would work fine if the linkedom was less buggy.

maybe you can try to use happydom for testing

@ndaidong
Copy link
Collaborator

@SettingDust @koresar glad to hear about happydom, I will try it.

@koresar
Copy link
Author

koresar commented Jan 25, 2024

People generally recommend using jsdom with readability. FYI

@SettingDust
Copy link
Member

People generally recommend using jsdom with readability. FYI

It's too heavy. We used to use it. The Happy DOM slogan is JSDOM alternative

Happy DOM focuses heavily on performance and can be used as an alternative to JSDOM.

@SettingDust
Copy link
Member

After some tests. Happy DOM doesn't support many pseudo classes like :not, :is make it isn't an option for now. I'll look into the linkedom and readability for this issue

@SettingDust
Copy link
Member

Like mozilla/readability#836 (comment) said. The pinterest returned html is wrong
图片

Readability won't work as expected I think.

I'm going to try catch the content parser and don't include content property when error. How are you thinking about this? @ndaidong

@ndaidong
Copy link
Collaborator

@SettingDust thank you. I've tried happydom in another project but not happy with it!

Anw, if the HTML structrure is not well-formed, no library can help. And yes, I think we should catch the error like this.

@SettingDust
Copy link
Member

SettingDust commented Jan 26, 2024

Anw, if the HTML structrure is not well-formed, no library can help. And yes, I think we should catch the error like this.

Browser, JSDOM, or happy dom will handle them. But not linkedom
JSDOM and happydom will wrap the tags in body XD

linkedom just keep them

I've tried happydom in another project but not happy with it!

Aw. I'm a little curious, what's the problem?

@koresar
Copy link
Author

koresar commented Jan 29, 2024

I would like to use JSDOM, if it behaves well with pinterest.
(Not sure how heavy it is, but I do not care much. I'd better have reliable slow code rather unreliable but fast.)

Or can I pass my DOM object instead of the HTML as a string?

@ndaidong
Copy link
Collaborator

Aw. I'm a little curious, what's the problem?

@SettingDust I'm crawling content from some websites. When I replace linkedom with happydom, the rate of failed cases will increase about 30%. So I had to continue going with linkedom. No time to try new solution in this short project.

Screenshot from 2024-01-29 09-34-54

@koresar the main purpose of this lib is extract article content only. If you want to extract a special kind of content, such as Pinterest, it's better to fetch the content directly and parse them with any DOM utils you can.

@koresar
Copy link
Author

koresar commented Jan 29, 2024

I am not after special kind of content. I'm after very random pages on the internet. :) It fails not only on the pinterest, but a large amount of other websites.

I don't see JSDOM as a performance bottleneck. Do you?

@SettingDust
Copy link
Member

Or can I pass my DOM object instead of the HTML as a string?

That's a point I mentioned before. I'm going to do that in new extractus project.

@SettingDust
Copy link
Member

SettingDust commented Jan 29, 2024

I am not after special kind of content. I'm after very random pages on the internet. :) It fails not only on the pinterest, but a large amount of other websites.

Can you provide the list of failed websites plz?

I don't see JSDOM as a performance bottleneck. Do you?

JSDOM is slower and larger memory alloc than linkedom. https://github.com/WebReflection/linkedom#benchmarks

 linkedom  benchmark for  ./html.html             

initial heap: 7.73 MB

 parsing cold: 1.282s

document heap: 347.37 MB

 html.normalize() cold: 31.981ms

 crawling childNodes cold: 91.346ms
 crawling childNodes  hot: 77.313ms
 total childNodes 580234

 crawling children cold: 54.539ms
 crawling children  hot: 50.959ms
 total children 270313

after crawling heap: 350.01 MB

 html.cloneNode(true) cold: 1.129s
 cloning: OK
 outerHTML: OK
 outcome: OK

after cloning heap: 829.71 MB

 querySelectorAll("div") cold: 37.924ms
 querySelectorAll("div")  hot: 35.58ms
 total div 3714

 getElementsByTagName("p") cold: 34.354ms
 getElementsByTagName("p")  hot: 32.025ms
 total p 18800

after querying heap: 839.56 MB

 removing divs: 2.116ms

 div count cold: 19.291ms
 total div 0

 p count cold: 22.322ms
 total p 15212

after removing divs heap: 837.12 MB

 html.cloneNode(true) cold: 600.18ms
 cloneNode: OK

 html.outerHTML cold: 190.948ms
 html.outerHTML  hot: 177.362ms
 outcome: OK

 html.innerHTML: 5.756s

total benchmark time: 11.510s
total heap memory: 1383.01 MB
 linkedom cached  benchmark for  ./html.html             

initial heap: 7.06 MB

 parsing cold: 1.340s

document heap: 348.8 MB

 html.normalize() cold: 57.576ms

 crawling childNodes cold: 212.709ms
 crawling childNodes  hot: 100.948ms
 total childNodes 580234

 crawling children cold: 151.29ms
 crawling children  hot: 50.785ms
 total children 270313

after crawling heap: 415.36 MB

 html.cloneNode(true) cold: 1.329s
 cloning: OK
 outerHTML: OK
 outcome: OK

after cloning heap: 891.2 MB

 querySelectorAll("div") cold: 43.107ms
 querySelectorAll("div")  hot: 0.02ms
 total div 3714

 getElementsByTagName("p") cold: 49.835ms
 getElementsByTagName("p")  hot: 0.092ms
 total p 18800

after querying heap: 894.28 MB

 removing divs: 3.51ms

 div count cold: 26.572ms
 total div 0

 p count cold: 31.736ms
 total p 15212

after removing divs heap: 901.92 MB

 html.cloneNode(true) cold: 642.068ms
 cloneNode: OK

 html.outerHTML cold: 211.345ms
 html.outerHTML  hot: 202.012ms
 outcome: OK

 html.innerHTML: 9.180s

total benchmark time: 15.569s
total heap memory: 1468.02 MB

 JSDOM  benchmark for  ./html.html             

initial heap: 21.84 MB

 parsing cold: 5.250s

document heap: 1144.18 MB

 html.normalize() cold: 440.291ms

 crawling childNodes cold: 1.408s
 crawling childNodes  hot: 602.199ms
 total childNodes 580243

 crawling children cold: 631.032ms
 crawling children  hot: 311.591ms
 total children 270322

after crawling heap: 1542.26 MB

 html.cloneNode(true) cold: 2.320s
 cloning: OK
 outerHTML: OK
 outcome: OK

after cloning heap: 2642.37 MB

 querySelectorAll("div") cold: 180.664ms
 querySelectorAll("div")  hot: 38.099ms
 total div 3714

 getElementsByTagName("p") cold: 177.417ms
 getElementsByTagName("p")  hot: 4.433ms
 total p 18800

after querying heap: 2549.03 MB

 removing divs: 8.520s

 div count cold: 61.379ms
 total div 0

 p count cold: 66.741ms
 total p 15212

after removing divs heap: 2561.58 MB

 html.cloneNode(true) cold: 1.646s
 cloneNode: OK

 html.outerHTML cold: 278.583ms
 html.outerHTML  hot: 276.089ms
 outcome: OK

 html.innerHTML: 6.017s

total benchmark time: 30.761s
total heap memory: 3368.16 MB

JSDOM is slower than happydom https://github.com/capricorn86/happy-dom#performance

@koresar
Copy link
Author

koresar commented Jan 29, 2024

Indeed. The numbers are wild. Okay. Got ya.

@ndaidong ndaidong closed this as completed May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants