Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don’t remove <html> element from DOM tree #835

Closed
danburzo opened this issue Jan 18, 2024 · 1 comment
Closed

Don’t remove <html> element from DOM tree #835

danburzo opened this issue Jan 18, 2024 · 1 comment

Comments

@danburzo
Copy link
Contributor

On this page, the <html> element has the class sidebar-visible, which flags it for removal from DOM. Due to the way the code happens to work, this seems inconsequential to fetching the readable contents of the page, but when using Readability programmatically, in causes documentElement on the passed Document become null.

I believe the <html> element should probably be spared in most (all?) cases.

@gijsk
Copy link
Contributor

gijsk commented Feb 9, 2024

Unfortunately, what readability does/doesn't do to the original document is not really a documented/supported part of the library. Arbitrary web pages can also do this (ie remove documentElement from the DOM and keep running script in the background, and put it back at some later point). It's a little strange, but browsers and libraries have to cope with this (and occasionally it trips us up because it's such a weird situation).

I'm a little hesitant to add logic to try to avoid this situation. It's not a given that the html node is the root element anyway (for the same reason), and it will be a relatively involved change to update all the code in readability that strips content out to doublecheck that it's not removing the root node. (and then, why stop there? Should we also keep body? Or other parts?)

The fact that you are apparently relying on this under-documented bit of how readability behaves (ie what it does/doesn't strip out) also means that there may be others relying on this behaviour, and so touching it feels risky.

I say "under" rather than "un" because the readme does actually state:

This removes some elements in the web page, which may be undesirable. You can avoid this by passing the clone of the document object to the Readability constructor

and this would probably fix your problem.

All in all, I think I would prefer not to touch this, so I'm going to close this out. I could be convinced otherwise, but for that it'd be helpful to understand more about why you think this is important and worth doing.

@gijsk gijsk closed this as completed Feb 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants