Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace()-ing or remove()-ing self-closing tags breaks outputted markup #215

Open
ssokolow opened this issue Apr 26, 2024 · 9 comments
Open

Comments

@ssokolow
Copy link

ssokolow commented Apr 26, 2024

I'm using lol-html as a way to postprocess some custom tags in pulldown-cmark output that's less fragile than quick-xml with pairing enforcement turned off (which is, itself, an order of magnitude faster than markup5ever) and I find it very frustrating that lol-html is so wedded to "once a self-closing tag, always a self-closing tag".

Even if I pull in my templating engine early and manually generate a properly escaped <span class="currency" title="$77.84 CAD, including tax">87.95 CAD + tax</span> string to .replace() my <price amount="77.84" currency="cad_tax" />, I wind up with broken markup and a spurious &gt; at the next occurrence of the tag.

Isn't this a bit of a hazard? Shouldn't it assume that all bets are off about whether something is self-closing when you replace() it?

Hell, if it weren't so wedded to that and had elem.set_self_closing(false); (or just automatically toggled it when you do elem.set_inner_content), then I wouldn't need the templating engine at all and could do something like this:

elem.set_self_closing(false);
elem.set_tag_name("span")?;
let attrs: Vec<_> = elem.attributes().iter().map(|x| x.name()).collect();
for attr in attrs {
    elem.remove_attribute(&attr);
}
elem.set_attribute("class", "currency")?;
elem.set_attribute("title", &title)?;
elem.set_inner_content(&body, ContentType::Text);

...or, if it were smart enough to parse the contents of .replace() and re-attach elem to the first element in the replacement string:

elem.replace("<span class='currency'></span>", ContentType::Html);
elem.set_attribute("title", &title)?;
elem.set_inner_content(&body, ContentType::Text);

As-is, my only choice appears to be to bail on the input file with an error message that blames lol-html and points to this issue (which is really terrible UX) or switch back to quick-xml and pay the runtime cost to run an Ammonia pass first to ensure that the user hasn't included any <script>, <style>, or <textarea> tags that would confuse it.

@ssokolow
Copy link
Author

ssokolow commented Apr 26, 2024

OK, this is looking like a more serious problem. I haven't made a minimal reproducer yet, but it looks like calling el.remove() on a self-closing tag discards the entire rest of the document.

...and using elem.replace() on it too, apparently. I really do have no option but to panic!("Using <price /> self-closing tag syntax trips bug #215 in lol-html");.

@ssokolow
Copy link
Author

ssokolow commented Apr 26, 2024

OK, here's a minimal reproducer based on the README example:

use lol_html::{element, HtmlRewriter, Settings};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut output = vec![];
    let mut rewriter = HtmlRewriter::new(
        Settings {
            element_content_handlers: vec![element!("hello", |el| {
                eprintln!("Is Self-Closing: {}", el.is_self_closing());
                el.replace("Hello,", lol_html::html_content::ContentType::Text);
                Ok(())
            })],
            ..Settings::default()
        },
        |c: &[u8]| output.extend_from_slice(c),
    );

    rewriter.write(b"<hello></hello> House!\n<hello /> Mouse!\nHello, World!\n")?;
    rewriter.end()?;

    assert_eq!(
        String::from_utf8(output)?,
        "Hello, House!\nHello, Mouse!\nHello, World!\n"
    );
    Ok(())
}
ssokolow@monolith-tng lolhtml_215_reproducer [master] % cargo run      
   Compiling lolhtml_215_reproducer v0.1.0 (/home/ssokolow/src/lolhtml_215_reproducer)
    Finished dev [optimized] target(s) in 3.33s
     Running `target/debug/lolhtml_215_reproducer`
Is Self-Closing: false
Is Self-Closing: true
thread 'main' panicked at src/main.rs:20:5:
assertion `left == right` failed
  left: "Hello, House!\nHello,"
 right: "Hello, House!\nHello, Mouse!\nHello, World!\n"
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@ssokolow ssokolow changed the title replace()-ing self-closing tags breaks outputted markup replace()-ing or remove()-ing self-closing tags breaks outputted markup Apr 26, 2024
@bglw
Copy link

bglw commented Apr 29, 2024

👋 I'm not a maintainer here, but jumping in with some info.

(you can see a similar case and reply from me here: #207 (comment))

The crux is that in HTML, "self-closing" elements aren't denoted by the /> syntax, despite what React & co would have you believe. A fixed set of elements are always self closing regardless of syntax (void elements, e.g. <br>, often styled as <br />). Everything else must always be paired.

Even more specifically, the / character is ignored altogether when parsing HTML — <price> and <price /> are identical, the / is " unnecessary and has no effect of any kind".

All of that to say, in this case lol-html sees you open a <price> element and never close it, thus that element consumes the rest of the document and everything is removed.

Browsers have more error protection and will force an end tag onto the <price> element at some point, but lol-html can't do that while streaming the document — it assumes and relies on valid HTML passing through.


tl;dr <price /> is not a self-closing element and lol-html's handling of it is not a bug.

@ssokolow
Copy link
Author

ssokolow commented Apr 29, 2024

I'd find that more convincing if lol-html didn't have a functioning Element::is_self_closing method with no deprecation warnings or other warnings of any kind in its API documentation.

In its current state, that just comes across as "lol-html knows what you mean, but it's going to engage in malicious spec-compliance and not even warn you that it will do so because it just doesn't like you".

I know it's functioning because I currently use it to implement panic!("Please replace your <price /> with a <price></price>").

@bglw
Copy link

bglw commented Apr 29, 2024

is_self_closing is essentially a synonym for "is void element". The job of that function is to tell you whether it's a self closing element type — it doesn't use syntax at all. (e.g. <br> is always self closing and can never have content).

@ssokolow
Copy link
Author

is_self_closing is essentially a synonym for "is void element". The job of that function is to tell you whether it's a self closing element type — it doesn't use syntax at all. (e.g. <br> is always self closing and can never have content).

Note the example output from my reproducer. It's returning true for <price />.

@bglw
Copy link

bglw commented Apr 29, 2024

Oh got you — yes looks like it is using the slash. That's not a great implementation from lol-html.

@bglw
Copy link

bglw commented Apr 29, 2024

Tracing this back a little more:

lol-html is pretty directly implementing the tokenizer state machine, which is where that flag is coming from: https://www.w3.org/TR/2011/WD-html5-20110113/tokenization.html#self-closing-start-tag-state — which does "Set the self-closing flag of the current tag token" — though that flag should only be for tokenization afaict.

In lol-html it looks to be setting that flag on the element, which is then exposed via Element::is_self_closing, which is wrong, since it's passing through the meaning of "element's start tag was self-closing when tokenized", which doesn't actually affect the element in HTML's sense.

@ssokolow
Copy link
Author

ssokolow commented Apr 29, 2024

As far as lol-html goes, my biggest concern is ensuring that it's not such a footgun.

Given that my own use is working around Markdown's lack of a generic inline "lang extension" syntax akin to what reStructuredText supports (I already use custom "languages" like svgbob on code blocks for block-level extensions), I'm sure I could switch back from lol-html to quick-xml with pairing enforcement turned off, as I used to use, and just have it bail on encountering the handful of tags that require embedding a parser for a different language to parse correctly, since I see no valid reason for them to show up in post/article bodies.

Hell, I know it can work that way without taking advantage of how pulldown-cmark's token stream lets me run it only on chunks of embedded HTML without having to deal with the HTML generated from the Markdown constructs. (I just wish I could hook into the token stream before the built-in optional syntax extensions run so I don't have to choose between the smart punctuation and strikethrough extensions, the ability to implement Compose key shortcodes like :e': → é, and maintaining my own fork of either the whole parser or just the extensions that I need to hook in before.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants