-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WP_HTML_Tag_Processor: Inject dynamic data to block HTML markup in PHP #42485
Conversation
do you think we really need the updater function for |
It's either this or a more sophisticated behavior for remove_class/add_class to support multiple class updates for the same element. I think you have a point so I'll play with the latter. |
@dmsnell @gziolo I updated this quite a bit! Test, docs, code standards are still missing. On the up side I made it more internally consistent, simplified a bunch of code paths, limited the available public methods, and added a fresh usage example with a more involved HTML snippet at the bottom of the file. This one is ready for more in-depth comments now. |
I have some general feedback to share. I checked existing issues and tickets related to HTML scanning in PHP for block attributes stored as block content – aka sourced attributes. In the past, it was primarily discussed in relation to REST API so you could get the block data as part of the post content object. The original issue is #4763 which I moved to WordPress Trac at some point ticket#53603. Two existing implementations try a different approach to scanning the HTML of the block, which you might find interesting:
The most complex attributes sourcing in core blocks is probably used with the Table block: gutenberg/packages/block-library/src/table/block.json Lines 14 to 124 in 4ed1c11
I don’t know if that’s something you considered when discussing HTML processing on the server, but it feels very close despite being on a completely different level of abstraction. I checked the current implementation, and it looks like the |
@gziolo Two pieces are missing for us to be able to do that:
Perhaps it wouldn't be too hard to add both in one of the future iterations. |
f6ef968
to
5bf0c28
Compare
This PR needs your feedback to progress!It proposes a lean HTML processor for PHP: $w = new WP_HTML_Walker('<div id="first"><img /></div>');
$w->next_tag('img');
$w->set_attribute('src', '/wp-content/logo.png');
echo $w;
// <div id="first"><img src="/wp-content/logo.png" /></div> It will be a tremendous help for working with blocks in PHP. See the description for more details and use-cases. The only way to move forward now is to discuss the proposed public interface and the idea itself. Please comment with your thoughts on both! @dmsnell @spacedmonkey @aristath @schlessera @fabiankaegy @carolinan @ntsekouras @gziolo @scruffian @draganescu @getdave @swissspidy @azaozz @hellofromtonya @peterwilsoncc @youknowriad @mcsf @ellatrix @Mamaduka @noisysocks @mtias @youknowriad @andronocean @ciampo @aaronrobertshaw @c4rl0sbr4v0 @jorgefilipecosta @paaljoachim @talldan @oandregal @ntsekouras @ndiego @MaggieCabrera @sarayourfriend @ajlende |
Is there more background about the design of this API and why it is needed? What other alternatives have been considered? What about the security implications of writing your own HTML modifier like this? If the primary use case of this is HTML modifications in blocks, why does (a tiny wrapper around) Aside: Instead of pinging dozens of people I'd suggest bringing this up in dev chat and writing a make/core post to get more feedback :) |
@swissspidy Here's just a few instances where this would be useful:
Using DOMDocument was extensively discussed – it's not installed on every host so you'd have to provide a polyfill. Even if it was available everywhere, it can't match and modify imperfect HTML that well. Ad-hoc per-use-case regular expressions were also discussed and explored. It is hard to come up with a correct one and it's easy to run into problems like worrying about the specific sequence of whitespace used in the markup. They're also a maintenance and duplication burden and prone to errors.
Outputting a malformed HTML would create an XSS attack vector so we better test for all kinds of corner cases, e.g. using the same test examples as KSES. To stay safe, the API is restricted to adding, removing, and replacing HTML attributes. We know where they start and end by following the HTML parsing specification to the letter. This API can inject a string like In my testing, this PR handles weird inputs in the same way a browser would. Still, we need an extensive test suite to be extra sure about that. It is also worth noting that a typical input to |
@adamziel opened #44410 to track follow-up issues as we plan to land this PR in its current shape as soon as CI becomes green. We have been working on this task for over two months and got a ton of great feedback here and in https://make.wordpress.org/core/2022/08/19/a-new-system-for-simply-and-reliably-updating-html-attributes/ that helped to shape the current solution. The plan now is to include this code in the Gutenberg plugin that's going to be included in 14.3 release (in 3 weeks) with the intent to ship it in WordPress 6.2 (Q1 of the next year). |
Props to everyone who worked so hard to land this one 👏 |
As I commented a month ago (in a comment that GH very unhelpfully hides) this is not ready for "prime time" yet. It's lacking clarity and documentation about what security features this new API has or doesn't have. Imho this is a blocker. There are several From what I understand it looks like this new API is going to require extenders to properly escape all new and replacement strings. If that's the case, there at least should be a sufficient section in the docblock(s) explaining what and how to do that, and perhaps mentioning the "danger zone" / edge cases. These sections will also need examples of how exceptions should e handled, strings like |
This commit pulls in the HTML Tag Processor from the Gutenbeg repository. The Tag Processor attempts to be an HTML5-spec-compliant parser that provides the ability in PHP to find specific HTML tags and then add, remove, or update attributes on that tag. It provides a safe and reliable way to modify the attribute on HTML tags. ```php // Add missing `rel` attribute to links. $p = new WP_HTML_Tag_Processor( $block_content ); if ( $p->next_tag( 'A' ) && empty( $p->get_attribute( 'rel' ) ) ) { $p->set_attribute( 'noopener nofollow' ); } return $p->get_updated_html(); ``` Introduced originally in WordPress/gutenberg#42485 and developed within the Gutenberg repository, this HTML parsing system was built in order to address a persistent need (properly modifying HTML tag attributes) and was motivated after a sequence of block editor defects which stemmed from mismatches between actual HTML code and expectectations for HTML input running through existing naive string-search-based solutions. The Tag Processor is intended to operate fast enough to avoid being an obstacle on page render while using as little memory overhead as possible. It is practically a zero-memory-overhead system, and only allocates memory as changes to the input HTML document are enqueued, releasing that memory when flushing those changes to the document, moving on to find the next tag, or flushing its entire output via `get_updated_html()`. Rigor has been taken to ensure that the Tag Processor will not be consfused by unexpected or non-normative HTML input, including issues arising from quoting, from different syntax rules within `<title>`, `<textarea>`, and `<script>` tags, from the appearance of rare but legitimate comment and XML-like regions, and from a variety of syntax abnormalities such as unbalanced tags, incomplete syntax, and overlapping tags. The Tag Processor is constrained to parsing an HTML document as a stream of tokens. It will not build an HTML tree or generate a DOM representation of a document. It is designed to start at the beginning of an HTML document and linearly scan through it, potentially modifying that document as it scans. It has no access to the markup inside or around tags and it has no ability to determine which tag openers and tag closers belong to each other, or determine the nesting depth of a given tag. It includes a primitive bookmarking system to remember tags it has previously visited. These bookmarks refer to specific tags, not to string offsets, and continue to point to the same place in the document as edits are applied. By asking the Tag Processor to seek to a given bookmark it's possible to back up and continue processsing again content that has already been traversed. Attribute values are sanitized with `esc_attr()` and rendered as double-quoted attributes. On read they are unescaped and unquoted. Authors wishing to rely on the Tag Processor therefore are free to pass around data as normal strings. Convenience methods for adding and removing CSS class names exist in order to remove the need to process the `class` attribute. ```php // Update heading block class names $p = new WP_HTML_Tag_Processor( $html ); while ( $p->next_tag() ) { switch ( $p->get_tag() ) { case 'H1': case 'H2': case 'H3': case 'H4': case 'H5': case 'H6': $p->remove_class( 'wp-heading' ); $p->add_class( 'wp-block-heading' ); break; } return $p->get_updated_html(); ``` The Tag Processor is intended to be a reliable low-level library for traversing HTML documents and higher-level APIs are to be built upon it. Immediately, and in Core Gutenberg blocks it is meant to replace HTML modification that currently relies on RegExp patterns and simpler string replacements. See the following for examples of such replacement: WordPress/gutenberg@1315784 https://github.com/WordPress/gutenberg/pull/45469/files#diff-dcd9e1f9b87ca63efe9f1e834b4d3048778d3eca41aa39c636f8b16a5bb452d2L46 WordPress/gutenberg#46625 Co-Authored-By: Adam Zielinski <[email protected]> Co-Authored-By: Bernie Reiter <[email protected]> Co-Authored-By: Grzegorz Ziolkowski <[email protected]>
Description
Introduces
WP_HTML_Tag_Processor
: A PHP class to update HTML markup (overview issue).Dynamic blocks often need to inject a CSS class name or set
<img src />
in the rendered block HTML markup but lack the means to do so:At the moment, every block comes up with a custom-tailored regular expression. It is limiting, error-prone, and demands code duplication. It also leads to PRs such as strip whitespaces in render_block_core_cover.
Here's just a few instances where this would be useful:
This PR reuses a lot of the great @dmsnell's explorations from #42507.
Architecture explained
WP_HTML_Tag_Processor
provides a standard and highly-restricted HTML modification API. It tokenizes an HTML document to find specific tags. Then, it transforms those tags by adding, removing, or updating the values of the HTML attributes within that tag (opener). Importantly, it does not fully parse HTML or recurse into the HTML structure. Instead,WP_HTML_Tag_Processor
moves forward linearly through a document and only parses the HTML tag openers.In practical terms this API purposefully tries to address only the subset of HTML modifications that are commonly needed when modifying stored HTML from server-side filters and code, such as when adding class names to block markup.
The most significant benefits of the approach taken:
WP_HTML_Tag_Processor
works fast and has a tiny memory footprint (for a PHP program). Adding a class and an attribute to 300k tags in a 12MB HTML document takes ~800ms and 24 MB of memory on my Mac Pro.The following constraints were applied to satisfy all of the above:
WP_HTML_Tag_Processor
doesn't try to understand the relationship between tags. This prevents us from extrapolating into more complex queries on the HTML structure. This is a perk because it allows us to avoid handling all the complexity and overhead of semantic HTML parsing (as needed when building a DOM tree).<p>
tag) would require more overhead in the parser and would require handling the array of situations where we find malformed HTML, such as with unmatched or overlapping tags).Feedback request
Please reply to this PR with your thoughts about:
WP_HTML_Tag_Processor
to corePlease ignore the failing CI checks, missing docstrings, and implementation details for now.
Can this really parse HTML like a browser?
This PR includes an HTML tokenizer and an HTML tag parser. It doesn't build a DOM tree, reason about entities, or handle mismatched, overlapping, or malformed tags.
The walker linearly scans the input HTML and visits every tag opener (e.g.
<p>
or<div class="wp-block-group">
). By implementing the WHATWG HTML parsing spec we maintain parity with how browsers will parse the HTML, both in how they decode normal HTML as well as how they handle abnormalities, such as with unexpected HTML attributes like<div =5 title="<br />" a="yes"b=no>
The public API proposed in this PR
Here's what an interface would look like:
Usage examples
Setting an attribute:
Updating multiple tags:
Finding a div with a specific class name:
No match found:
Updating every tag:
Alternative ideas considered
Using DOMDocument was extensively discussed – it's not installed on every host so you'd have to provide a polyfill. Even if it was available everywhere, it's based on libxml2 designed to parse XML. Libxml2 does not implement the WHATWG HTML parsing spec and does not support HTML5.
Ad-hoc per-use-case regular expressions were also discussed and explored. It is hard to come up with a correct one and it's easy to run into problems like worrying about the specific sequence of whitespace used in the markup. They're also a maintenance and duplication burden and prone to errors.
Alternative API shapes considered
There was an attempt at a fluid iterface:
But the streaming nature of this code made it inconvenient. This class must communicate whether it matched a tag and there are two ways to do it:
In the end, @dmsnell, @gziolo, and I realized that a fluid interface is about as verbose as the lack of it:
Since the latter makes for a simpler code, it became the version included in this proposal.
Security implications
Outputting a malformed HTML would create an XSS attack vector so we better test for all kinds of corner cases, e.g. using the same test examples as KSES.
To stay safe, the API is restricted to adding, removing, and replacing HTML attributes in tag openers. We know where they start and end by following the HTML parsing specification to the letter. This API processes the data in a linear fashion and does not recurse or reason about the tree structure at all. It can inject a string like
id="$escaped_id_value"
after a tag start, but it cannot do anything else like change the content or create new tags.In my testing, this PR handles weird inputs in the same way a browser would. Still, we need an extensive test suite to be extra sure about that.
It is also worth noting that a typical input to
WP_HTML_Tag_Processor
will have been already filtered through KSES.Related resources
Testing
Run unit tests with:
npm run test-unit-php -- --group html # Or without docker: ./vendor/bin/phpunit --no-configuration ./phpunit/html/wp-html-tag-processor-test.php
Next steps
on*
attributes such asonClick
. As @peterwilsoncc mentioned: