Crashes on Pinterest and a lot of other websites #380
Comments
@koresar unfortunately, this library works best with simple article formats. Complex layouts like lists and grids can be tricky for it to handle.
I believe it would work fine if the |
Maybe you can try to use happydom for testing.
@SettingDust @koresar glad to hear about happydom, I will try it.
People generally recommend using |
After some tests. Happy DOM doesn't support many pseudo classes like |
As mozilla/readability#836 (comment) said, the HTML that Pinterest returns is malformed, so I think Readability won't work as expected. I'm going to wrap the content parser in a try/catch and leave out the content property when it throws. What do you think about this, @ndaidong?
@SettingDust thank you. I've tried happydom in another project but was not happy with it! Anyway, if the HTML structure is not well-formed, no library can help. And yes, I think we should catch the error like this.
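For reference, the try/catch idea agreed on above could look roughly like the sketch below. This is only an illustration under assumptions: `buildArticle` and the `parseContent` argument are hypothetical names, not part of this library's API, and the two stand-in parsers only simulate a working parser and a crashing one.

```javascript
// Hypothetical helper, not part of this library's API.
// `parseContent` stands in for whatever DOM-based content parser is in use.
function buildArticle (meta, html, parseContent) {
  const article = { ...meta }
  try {
    const content = parseContent(html)
    // Only attach `content` when the parser produced something usable.
    if (typeof content === 'string' && content.length > 0) {
      article.content = content
    }
  } catch (err) {
    // The parser crashed (e.g. on malformed HTML):
    // keep the metadata and skip the `content` property.
  }
  return article
}

// Stand-in parsers, assumptions for illustration only.
const okParser = (html) => html.trim()
const crashingParser = () => { throw new TypeError('parser crashed') }

const good = buildArticle({ url: 'https://example.com/a' }, ' <p>Hi</p> ', okParser)
const bad = buildArticle({ url: 'https://example.com/b' }, '<broken', crashingParser)
```

With this shape, a caller still gets the metadata for a page whose markup crashes the parser and can check for the presence of `content` before using it.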
A browser, JSDOM, or happy dom will handle them, but linkedom will not: linkedom just keeps them.
Aw. I'm a little curious, what's the problem?
I would like to use JSDOM, if it behaves well with Pinterest. Or can I pass my DOM object instead of the HTML as a string?
@SettingDust I'm crawling content from some websites. When I replace linkedom with happydom, the failure rate increases by about 30%, so I had to keep using linkedom. There is no time to try a new solution in this short project. @koresar the main purpose of this lib is to extract article content only. If you want to extract a special kind of content, such as Pinterest, it's better to fetch the content directly and parse it with whatever DOM utils you like.
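A failure rate like the one mentioned above could be compared across DOM backends with a small harness along these lines. It is a sketch under assumptions: `failureRate` is a hypothetical helper, and the sample pages and the two parsers are stand-ins, not real linkedom or happydom calls.

```javascript
// Hypothetical helper: the fraction of pages on which `parse` throws.
function failureRate (pages, parse) {
  if (pages.length === 0) return 0
  let failed = 0
  for (const html of pages) {
    try {
      parse(html)
    } catch (err) {
      failed += 1
    }
  }
  return failed / pages.length
}

// Stand-in data and parsers, assumptions for illustration only.
const samplePages = ['<p>a</p>', '<div>', '<p>b</p>', '']
const strictParser = (html) => {
  if (!html.endsWith('</p>')) throw new Error('unbalanced markup')
  return html
}
const lenientParser = (html) => html

const strictRate = failureRate(samplePages, strictParser)
const lenientRate = failureRate(samplePages, lenientParser)
```

Running both backends over the same saved pages keeps the comparison fair, since live sites can return different HTML on each request.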
I am not after a special kind of content. I'm after very random pages on the internet. :) It fails not only on Pinterest, but on a large number of other websites. I don't see JSDOM as a performance bottleneck. Do you?
That's a point I mentioned before. I'm going to do that in the new extractus project.
Can you provide the list of failed websites, please?
JSDOM is slower and allocates more memory than linkedom: https://github.com/WebReflection/linkedom#benchmarks
JSDOM is also slower than happydom: https://github.com/capricorn86/happy-dom#performance
Indeed. The numbers are wild. Okay. Got ya.
Pages to test on:
Code:
Error:
I presume that the bug is somewhere inside the linkedom package, DOMParser class.