Description
In the README, it says:
We expect to be supplied with well-formatted HTML (closing elements for every applicable open element, nested correctly) and so we do not focus on repairing badly nested or incomplete HTML.
I'd like to sanitize Markdown (rendered with goldmark) in which users are allowed to include raw HTML, but if they provide unmatched end tags such as </div></div>, those tags can close elements of the surrounding page and break its rendering.
Given that bluemonday is already processing a token stream and tracking elements, I believe the necessary logic would be opt-in and quite simple:
- maintain a stack of open elements
- when a start tag is emitted, push it to the stack
- when an end tag is reached, check if it matches the top of stack. If not, check if there is a matching start tag in the stack at all. If there is, pop start tags from the stack and emit end tags until it matches, then emit the end tag. If there isn't a matching start tag at all, don't emit the end tag.
- if the stack is non-empty at the end of the document, pop from the stack and emit end tags until it's empty.
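To make the steps above concrete, here is a minimal sketch in Go. It is not bluemonday code: `Token`, `Balance`, and `Render` are hypothetical names, and the token type is a simplified stand-in for what a real HTML tokenizer would emit. It only illustrates the stack logic.

```go
package main

import "strings"

// TokenKind is a simplified, hypothetical stand-in for the token types
// produced by a real HTML tokenizer.
type TokenKind int

const (
	Text TokenKind = iota
	StartTag
	EndTag
)

// Token holds a tag name (for StartTag/EndTag) or raw text (for Text).
type Token struct {
	Kind TokenKind
	Data string
}

// Balance returns a token stream in which every start tag is closed and
// end tags without a matching start tag are dropped.
func Balance(tokens []Token) []Token {
	var stack []string // names of currently open elements
	out := make([]Token, 0, len(tokens))

	for _, tok := range tokens {
		switch tok.Kind {
		case StartTag:
			stack = append(stack, tok.Data)
			out = append(out, tok)
		case EndTag:
			// Find the nearest open element with the same name.
			match := -1
			for i := len(stack) - 1; i >= 0; i-- {
				if stack[i] == tok.Data {
					match = i
					break
				}
			}
			if match < 0 {
				continue // no matching start tag: do not emit
			}
			// Close everything opened after the match, innermost first.
			for i := len(stack) - 1; i > match; i-- {
				out = append(out, Token{Kind: EndTag, Data: stack[i]})
			}
			stack = stack[:match]
			out = append(out, tok)
		default:
			out = append(out, tok)
		}
	}

	// End of document: close whatever is still open.
	for i := len(stack) - 1; i >= 0; i-- {
		out = append(out, Token{Kind: EndTag, Data: stack[i]})
	}
	return out
}

// Render serializes tokens back to markup (attributes omitted for brevity).
func Render(tokens []Token) string {
	var b strings.Builder
	for _, tok := range tokens {
		switch tok.Kind {
		case StartTag:
			b.WriteString("<" + tok.Data + ">")
		case EndTag:
			b.WriteString("</" + tok.Data + ">")
		default:
			b.WriteString(tok.Data)
		}
	}
	return b.String()
}
```

With this sketch, the stray `</div></div>` from the example is dropped entirely, and an unclosed `<b>` inside a `<div>` is closed before the `</div>` is emitted. Void elements such as `<br>` would need to be skipped rather than pushed onto the stack; the sketch leaves that out.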
One minor complication is that this would not handle HTML's optional tags properly (for example, a <p> or <li> whose end tag may be omitted), but leaving that support out is probably the better option for a first version. The full HTML tree construction algorithm is considerably more complex, but could probably be implemented in a streaming mode in the future.
Would a PR for this functionality be theoretically acceptable? I'm imagining a very simple p.RequireMatchingElements(true) interface to opt-in.