Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some HTML is parsed incorrectly #53

Closed
shonfeder opened this issue Jul 12, 2020 · 5 comments
Closed

Some HTML is parsed incorrectly #53

shonfeder opened this issue Jul 12, 2020 · 5 comments

Comments

@shonfeder
Copy link

I hit this while working on transforming Omd to Tyxml (noted here ocaml-community/omd#211 (comment)).

It looks like Lambdasoup is giving invalid results for some HTML. E.g.

# let parsed = Soup.parse {|<p><a href="foo\nbar"></p>
|};;
val parsed : Soup.soup Soup.node = <abstr>
# Soup.to_string parsed;;
- : string = "<p><a href=\"foo\\nbar\"></a></p><a href=\"foo\\nbar\">\n</a>"

If the newline in the href attribute is removed, the duplication doesn't occur.

I'm happy to put in some work towards a fix, if that would be useful.

@shonfeder
Copy link
Author

I should also note: I have yet dug in far enough to know whether it might not be an upstream issue (with markup).

@aantron aantron transferred this issue from aantron/lambdasoup Jul 12, 2020
@aantron
Copy link
Owner

aantron commented Jul 12, 2020

I would be happy to review a fix. The only thing to be careful of is that it complies with the HTML5 spec. Here are some relevant pieces:

Are you sure that it is the newline inside the href specifically that is causing the problem? There is also a newline at the end of your input. Is that newline present in both of your tests?

My first guess as to what is happening here is that the unclosed <a> spanning the end of <p> is triggering the "reconstruct the active formatting elements" algorithm (called on by the in body insertion mode rules I linked to above). That causes formatting elements like <a> to be duplicated in multiple block elements like <p> or <body> if the formatting elements are unclosed, and there is any more input, like that final newline.

If that's correct, then this result is according to the spec (i.e. Chrome and other browsers would produce the same interpretation). Have you tried your inputs in a browser, inspecting the resulting DOMs?

You can also try parse5 online on this input: https://astexplorer.net/#/gist/e4a59515f46e8d7a1ae7ea5286730c30/341f1f298e3ce58e377ff13f497424c36fce818d. If you expand the AST, you will see that parse5 produces the same output, and the "duplication" depends on the final newline. The choice of parse5 is important, because most other JS HTML parsers (all that I have checked) don't actually comply with HTML5, and only parse an HTML-like input language without the full error recovery specified in HTML5.

@shonfeder
Copy link
Author

Good cal! It does appear to be the newline at the end that triggers it:

# Soup.(parse {|<p><a href="foo bar"></p>\n|} |> to_string);;
- : string = "<p><a href=\"foo bar\"></a></p><a href=\"foo bar\">\\n</a>"

Thanks very much for the pointers and detailed response. I am not an HTML expert (obviously :)). If I understand correctly, this may actually be the "correct" behavior, at least as per the spec? If so, I'm happy with just closing the issue. It may at least help instruct future users who hit this behavior!

@aantron
Copy link
Owner

aantron commented Jul 12, 2020

Yes, this is the correct behavior according to the spec :)

The spec asks a compliant parser to sometimes produce outputs that may be counterintuitive. That's the result of trade-offs that were made in order to intuitively parse certain other malformed inputs that the authors of HTML5 found to be more common or otherwise important. This issue is, unfortunately, one of the cases that got "penalized" by these trade-offs.

@shonfeder
Copy link
Author

Thanks very much for helping me walk through this :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants