Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML-API: Introduce minimal HTML Processor. #4519

Closed

Conversation

dmsnell
Copy link
Member

@dmsnell dmsnell commented May 29, 2023

Trac ticket: 58517-trac

Prior work:

For review, please focus on significant matters about the design and implementation of this new subsystem. Many things could change between now and merge. Here are some things that could use your insightful feedback:

  • We want to prevent accidental support being added to the class, whereas someone might add a new HTML element to the "big switch statement" and overlook that it also needs to pull in support in other functions. I've created a special test file that links the functions needing that support with the HTML tags/elements which demand it. Are there other ways to do this?
  • This class inherits what I believe is the minimum interface to fully support the HTML5 spec (apart from a few exceptions where it won't add support: does not support quirks mode, does not support pausing the parser to execute JavaScript, etc…). That doesn't mean it supports the entire spec. Because of this, much of the code looks superfluous, but it has to support enough of the spec to know when it's okay to bail. The long-term vision is to fill out all the functionality but not change any interfaces. Some function arguments exist that are ignored. All of this "hollow" code is scaffolding to communicate and guide further work, which can happen incrementally.

Status

  • Finish documenting the main HTML Processor class
  • Can we inhibit people from using the constructor directly?
    • Can't make it private because the tag processor constructor isn't private
    • Added _doing_it_wrong() to warn people
  • Prepare explanation of what it supports and how development is planned.
    • Block seeking after modifying? We can add this later, but for now it seems good before we can provide the reliability guarantees needed to ensure it.
    • No set_inner_html() or set_outer_html() for release because it introduces new problems (mostly around seeking and bookmarking)
    • Only support using breadcrumbs to find a tag, basically an extension to the Tag Processor's query language.
  • Try and find HTML cases that break existing support.
  • Review bookmark garbage collection, I think there are some leaks when creating them outside of the stack, which might only exist in the inner/outer HTML work.
    • Could we finally create an internal bookmark class, that when it disappears from scope can release the bookmark?
    • If we do this, I think we would need to mirror the bookmarking system, but it would result in natural garbage collection, or would we have to hook in to the bookmarking so that the outer class calls into the bookmarks.
    • Derp, rely on the WP_HTML_Token class and release the bookmark in the destructor.
  • Formalize after_push and after_pop, as they will be valuable for turning nested iteration into a "free" flag
    • Removed for now, but these will be a handy addition when we get back to implementing things like working with inner/outer content. They can come in their own explained changeset.
  • Add a last_error property to determine why a failed run failed.

Lines of code counts as of 76c4ab8:

  • Source lines 678
  • Comment lines 1211
  • Files 6
  • Test lines 567

@adamziel
Copy link
Contributor

adamziel commented May 31, 2023

To get a list of all tags mentioned in rules starting with A start tag whose tag name is, run this in dev tools while on https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody

function extractQuotedStrings(inputString) {
    const regex = /"([^"]*)"/g;
    let match;
    let matches = [];

    while ((match = regex.exec(inputString)) !== null) {
        matches.push(match[1]);
    }

    return matches;
}
let startTagRules = Array.from(document.querySelectorAll('dt'))
    .map(e=>e.innerText)
    .filter(t=>t.includes('A start tag whose tag name is'))
    .flatMap(extractQuotedStrings)

startTagRules = Array.from(new Set(startTagRules))
startTagRules.map(t=>t.toUpperCase()).sort().join(", ")

It yields:

A, ADDRESS, APPLET, AREA, ARTICLE, ASIDE, B, BASE, BASEFONT, BGSOUND, BIG, BLOCKQUOTE, BODY, BR, BUTTON, CAPTION, CENTER, CODE, COL, COLGROUP, COLOR, DD, DETAILS, DIALOG, DIR, DIV, DL, DT, EM, EMBED, FACE, FIELDSET, FIGCAPTION, FIGURE, FONT, FOOTER, FORM, FRAME, FRAMESET, H1, H2, H3, H4, H5, H6, HEAD, HEADER, HGROUP, HR, HTML, I, IFRAME, IMAGE, IMG, INPUT, KEYGEN, LI, LINK, LISTING, MAIN, MARQUEE, MATH, MENU, META, NAV, NOBR, NOEMBED, NOFRAMES, NOSCRIPT, OBJECT, OL, OPTGROUP, OPTION, P, PARAM, PLAINTEXT, PRE, RB, RP, RT, RTC, RUBY, S, SCRIPT, SEARCH, SECTION, SELECT, SIZE, SMALL, SOURCE, SPAN, STRIKE, STRONG, STYLE, SUB, SUMMARY, SUP, SVG, TABLE, TBODY, TD, TEMPLATE, TEXTAREA, TFOOT, TH, THEAD, TITLE, TR, TRACK, TT, U, UL, VAR, WBR, XMP

I also did end tags for good measure:

A, ADDRESS, APPLET, ARTICLE, ASIDE, B, BIG, BLOCKQUOTE, BODY, BR, BUTTON, CAPTION, CENTER, CODE, COL, COLGROUP, DD, DETAILS, DIALOG, DIR, DIV, DL, DT, EM, FIELDSET, FIGCAPTION, FIGURE, FONT, FOOTER, FORM, FRAMESET, H1, H2, H3, H4, H5, H6, HEAD, HEADER, HGROUP, HTML, I, LI, LISTING, MAIN, MARQUEE, MENU, NAV, NOBR, NOSCRIPT, OBJECT, OL, OPTGROUP, OPTION, P, PRE, S, SARCASM, SCRIPT, SEARCH, SECTION, SELECT, SMALL, STRIKE, STRONG, SUMMARY, TABLE, TBODY, TD, TEMPLATE, TFOOT, TH, THEAD, TR, TT, U, UL

@dmsnell dmsnell force-pushed the html-api/minimal-html-processor branch 2 times, most recently from 924961a to 216bcfc Compare June 2, 2023 10:46
@dmsnell
Copy link
Member Author

dmsnell commented Jun 11, 2023

I've now explored wrap/unwrap functions and set_inner_html/set_outer_html functions. Both are not that bad with the current state of the PR. In fact, it's quite incredible how these functions emerge after having the stack of open elements properly built and maintained.

public function set_inner_html( $html ) {
	parent::set_bookmark( 'start' );
	$end = $this->step_while_token_open( $this->current_token );

	$this->lexical_updates[] = new WP_HTML_Text_Replacement(
		$this->bookmarks['start']->end + ( $this->bookmarks['start']->start === $this->bookmarks['start']->end ? 0 : 1),
		$this->bookmarks[ $end ]->start,
		$html
	);
}

public function step_while_token_open( $token ) {
	if ( self::is_void( $token->node_name ) ) {
		return $this->bookmark_tag();
	}

	$after_pop = $this->state->stack_of_open_elements->after_pop;
	$removed_token = false;
	$this->state->stack_of_open_elements->after_pop = function ( $bookmark_name ) use ( $after_pop, &$removed_token, $token ) {
		$this->arc_dec( $bookmark_name );
		if ( $bookmark_name === $token->bookmark_name ) {
			$removed_token = true;
			$this->state->stack_of_open_elements->after_pop = $after_pop;
		}
	};
	while ( $this->step() ) {
		if ( $removed_token ) {
			if ( $this->get_tag() !== $token->node_name || ! $this->is_tag_closer() ) {
				$name = $this->bookmark_tag();
				$this->bookmarks[ $name ]->end = $this->bookmarks[ $name ]->start;
				return $name;
			}
			return $this->bookmark_tag();
		}
	}

	return null;
}

After exploring this, however, I want to defer this functionality, primarily because of the way it interact with bookmarks and seeking. We might actually be fine with how it is, or we might want to impose some additional arbitrary limits to ensure safety, and so for now I'd rather just not have to focus on that.

@dmsnell dmsnell force-pushed the html-api/minimal-html-processor branch from fd23c97 to 7d781a8 Compare June 12, 2023 13:30
Copy link
Contributor

@costdev costdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @dmsnell! I've left mostly documentation feedback in this review, as well as some I18N feedback, and a question regarding assertions not called directly within test methods.

*
* @param WP_HTML_Token $token Look for this node in the stack.
* @return bool Whether the referenced node is in the stack of open elements.
*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
*

src/wp-includes/html-api/class-wp-html-open-elements.php Outdated Show resolved Hide resolved
}

/**
* @ticket 58517
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Needs a test description 🔢 Applies elsewhere in the PR
  • Needs an @covers annotation here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question from above: given that that test name itself forms the test description, and that the vast majority of existing tests rely on this, do we really need a duplicated description in every test?

private function ensure_support_is_added_everywhere( $tag_name ) {
$p = WP_HTML_Processor::createFragment( "<$tag_name>" );

$this->assertFalse( $p->step(), "Must support terminating elements in specific scope check before adding support for the {$tag_name} element." );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hellofromtonya Correct me if wrong, but should assertions only be performed inside the test methods / closures within test methods?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this would be a good place to examine, as this is a fairly novel kind of test file. I have explored a few situations. the point here is to actively warn someone who wants to add additional support to the HTML parser class.

given that we have hundreds of tests that all do the same thing, I figured a minor abstraction here would be valuable to have, despite generally agreeing that test abstractions are things to be wary of.

if there's a way to ensure this won't be called outside of a test method that works; this was my intent on not naming it with the test_ prefix.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alternatively we could create a new assertion, which is essentially what this does, so I will look into that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the more I look into this the more I'd rather special-case this from the linting rules than create a new test assertion. creating a new assertion is going to pull in more complexity and make things more convoluted than they already are to debug and understand. having this single function, while different than in other cases, at least is self-contained and straightforward.

so I'd rather we do whatever we need to do to make sure this doesn't get abstracted into something much greater in scope than is required and allow this unique class to be unique.

more important than enforcing this styling rule, I'd love some feedback on automating the support rules, to ensure that nobody unintentionally partially adds support for HTML tags that require additional support. for example, we don't want someone to simply add a new BUTTON clause to the big switch statement in the html processor. if they do that, they will subvert other rules that have to be added to the adoption agency algorithm.

*/
public function push( $token ) {
/*
* > If there are already three elements in the list of active formatting elements after the last marker,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticed the * > format through a lot of the inline comments in the PR. This commenting style only appears to exist within the HTML API, and should likely be removed for consistency with the rest of Core.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are quotes from the HTML specification and are important for associating the logic with the spec document, for being able to quickly find where in the spec this particular code is providing its functionality.

How should docblocks represent quotes? I will update as long as we can retain the clarity that they are quotes and as long as we don't reword them, since that would break the ability to link the spec with the code.

Of note, not every part of the spec we reference has an associated link, unfortunately, so these are incredibly valuable for quickly cross-referencing between the two sides.

Copy link
Contributor

@costdev costdev Jun 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I understand now. In that case, I'd suggest we either keep the >, or remove them and start the quote with As stated by the HTML specification:. I don't have a preference there, so if you want to keep it as-is, that works for me!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to continue exploring ways to present these, but I think the > is succinct and helpful. Given how often we want to quote the spec I think As stated by the HTML specification: could get a bit distracting.

Where possible we have a link to the spec. Unfortunately in some of these places there are large sections that we split up in the logic, where each subsection doesn't have a unique URL.

Gutenberg mini-documents instead of Markdown docblocks? Could eliminate all the hassle over ordering and styles.

public function remove_node( $token ) {
foreach ( $this->walk_up() as $position => $item ) {
if ( $token->bookmark_name === $item->bookmark_name ) {
array_splice( $this->stack, $this->count() - $position - 1, 1 );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Recurring) nit -- could do negative offset instead of subtraction:

Suggested change
array_splice( $this->stack, $this->count() - $position - 1, 1 );
array_splice( $this->stack, - $position - 1, 1 );

Not sure it's that much nicer 🤷‍♂️

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm. maybe if others chime in. I'm confused more by negative positions, particularly when they also include math. I with the function would reject negative offsets so that subtraction and negative offsets couldn't overlap.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a negative offset with - $position - 1 is harder to read. What about 1 - $position? It would avoid the call to $this->count(), and while it's personally easier for me to read, I'm not sure if it would be as clear to others.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 - $position is even less clear to me. I'm not worried about $this->count() as a call. although in this case it's a function call, we could decide if we measure that it's a problem to switch it to count( $this->stack ) which is essentially the same as a normal variable lookup.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've rearranged this a bit to try and help. the loop index is now called $position_from_end to better indicate that it's not the normal array index. we're also using a helper variable $position_from_start to give a better name to the computation.

Comment on lines +110 to +121
public function has_element_in_specific_scope( $tag_name, $termination_list ) {
foreach ( $this->walk_up() as $node ) {
if ( $node->node_name === $tag_name ) {
return true;
}
}

return false;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the $termination_list arg for? Are we missing some code (or a TODO note)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might need to copy a comment from another function. it's unsupported by in the spec, and will eventually be filled out. thanks for noticing.

Comment on lines 139 to 185
/**
* Returns whether a particular element is in BUTTON scope.
*
* @since 6.4.0
*
* @see https://html.spec.whatwg.org/#has-an-element-in-button-scope
*
* @param string $tag_name Name of tag to check.
* @return bool Whether given element is in scope.
*/
public function has_element_in_button_scope( $tag_name ) {
return $this->has_element_in_specific_scope(
$tag_name,
array(
// No termination elements are currently supported.
)
);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I've totally grasped this, but shouldn't this be something like

Suggested change
/**
* Returns whether a particular element is in BUTTON scope.
*
* @since 6.4.0
*
* @see https://html.spec.whatwg.org/#has-an-element-in-button-scope
*
* @param string $tag_name Name of tag to check.
* @return bool Whether given element is in scope.
*/
public function has_element_in_button_scope( $tag_name ) {
return $this->has_element_in_specific_scope(
$tag_name,
array(
// No termination elements are currently supported.
)
);
}
/**
* Returns whether a particular element is in BUTTON scope.
*
* @since 6.4.0
*
* @see https://html.spec.whatwg.org/#has-an-element-in-button-scope
*
* @return bool Whether given element is in BUTTON scope.
*/
public function has_element_in_button_scope() {
return $this->has_element_in_specific_scope(
'BUTTON',
array(
// No termination elements are currently supported.
)
);
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good guess; it's confusing. We want to check if a given element, for example, a P, is in BUTTON scope. The BUTTON part is implied in the linked spec to define the set of termination elements, so BUTTON would actually end up in the second parameter, the array.

As it stands, it's not possible for the processor to encounter this because if it encounters a BUTTON it currently bails.

This odd tension between wanting something that will eventually support the full spec and something that can be reviewed leads to these oddities, and to things like the tests that ensure nobody adds BUTTON support without adjusting the corresponding functions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think it makes sense to add a PHPDoc @TODO or something of that sort, or would that also become unwieldy quickly due to the nature of these classes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is one of the things I've been trying to explore here. people don't seem to like @TODO (I personally find them useful in cases like this), I have tried ### HTML Support segments in docblocks, but those also don't seem great. this is just a small note attempting to indicate why the thing isn't wrong even though it appears wrong.

as you are pointing out, the wording is lacking, but that's a different issue IMO, something I will clean up

@dmsnell dmsnell force-pushed the html-api/minimal-html-processor branch from 8be04d3 to 8e418d3 Compare June 13, 2023 16:44
@dmsnell
Copy link
Member Author

dmsnell commented Jun 13, 2023

@costdev thanks for your thorough code style scrutiny. I appreciate it! given the state of this PR right now, I would also be extremely grateful for higher-level feedback about the design and implementation and overall structure given that lots of code might change between now and merge. I've done my best to try and get ahead of having a hundred comments about code style, but obviously I missed a lot of cases. still, all the comments about small styling differences can make it harder to review the parts that matter with respect to function, performance, security, design.

would it help next time if I add something in the description to say this isn't ready for meticulous styling scrutiny? what would be a good way to communicate this?

@costdev
Copy link
Contributor

costdev commented Jun 13, 2023

@dmsnell No problem and I totally understand! When folks are already weighing in on the design/implementation side of things in a PR, I tend to look for the rest so that the PR gets good coverage in reviews, but I get that when there's a lot of iteration anticipated, this can make things hard to follow.

As we don't have three stages of PR (such as "Draft", "Ready for technical review" and "Ready for additional review"), it may be worth keeping a PR in "Draft" if a lot of iteration is anticipated prior to merge. That way, design discussions and such can take priority until the PR is marked "Ready for Review", then final reviews (including code style) can be carried out when it's marked "Ready for Review", so nothing gets missed before commit.

What do you think?

@dmsnell
Copy link
Member Author

dmsnell commented Jun 13, 2023

What do you think?

Maybe something to iterate on. I don't anticipate a lot of iteration because I have been careful and we went through a lot of internal review already, but if someone were to point out something at this time it could change things is more what I meant.

I've added a note at the top. Thanks again for your thoroughness; it was actually that attention to detail with the Tag Processor that led most of the functions here to have the appropriate docblock ordering already.

@dmsnell dmsnell force-pushed the html-api/minimal-html-processor branch 9 times, most recently from a03bd17 to 5ad776e Compare July 4, 2023 11:10
@dmsnell dmsnell force-pushed the html-api/minimal-html-processor branch 2 times, most recently from 4c43bc2 to a7a058e Compare July 6, 2023 17:25
@dmsnell dmsnell force-pushed the html-api/minimal-html-processor branch from a7a058e to b44869f Compare July 13, 2023 12:11
return false;
}

if ( ! parent::set_bookmark( ++$this->bookmark_counter ) ) {
Copy link
Contributor

@ockham ockham Jul 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably doesn't matter much, but to aid debugging, we could use a naming convention that includes the tag name? Something like

$tag_name = $this->get_tag;
$i = 0;
do {
	$bookmark_name = $tag_name . '-' . $i++;
} while ( $this->has_bookmark( $bookmark_name ) );

Copy link
Contributor

@ockham ockham Jul 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could add the tag name, but there's a tradeoff: when I've been debugging the number name is clear and terse while longer names take up more space, and if you have a slide of 20 DIVs the tag name isn't that helpful.

having a number was surprisingly helpful because the index corresponds to the tag in the sequence they appear.

what is your "oops" comment and link? is it the recommendation against programmatic names? that's not to warn against creating tag names in the bookmark name, but to warn against the very thing we're doing anyway with numbers.

I've taken the stance here that in this very controlled environment there's not a practical way to avoid doing that if we want to parse HTML without creating a second bookmarking system (which was explored). it's more like, "avoid doing this because it carries these risks. if you are aware of the risks and know what you are doing, fine."

case 'IMAGE':
/*
* > A start tag whose tag name is "image"
* > Change the token's tag name to "img" and reprocess it. (Don't ask.)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤐

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A thing to keep in mind when coming across these is that they are direct quotes from the HTML spec. There are some good ones (look for An end tag whose tag name is "sarcasm")

Comment on lines +436 to +441
$this->current_token = new WP_HTML_Token(
$this->bookmark_tag(),
$this->get_tag(),
$this->is_tag_closer(),
$this->release_internal_bookmark
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$this->current_token seems to be internal state that's only ever set here (in step()) for subsequent usage in step_in_body() (which is private, and only called from here); plus it is read in seek() to determine direction.

Any chance we can avoid this? Getting all the bookkeeping right for $lexical_updates and $classname_updates in the Tag Processor was surprisingly complex (even though the concepts seemed deceptively simple), so I'd love if we could avoid introducing this piece of internal state here (and make more explicit who is going to be the consumer of current_token). For step_in_body() (and its prospective friends for other contexts), it seems easy enough to pass $current_token as an arg. Can we also find some sort of workaround for seek()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a tough one, and I'm not sure I entirely follow the association with $lexical_updates and $classname_updates

I've always had a version of the current token because so much of the parsing algorithm depends on properties of the current element/node/token. it has made sense to make it globally available within the processor.

of note, it only gets set in the one place that all scanning takes place (everything flows through $this->step()) and ultimately it's just a bookmark.

I'd like to better understand the ways this could get out of sync. maybe it does when seeking backwards? I will try and think up a test to confirm that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in f67e051 I added a new test which I actually though would break, but it seems to be fine. I don't have solid confidence that it's not still possible to break though.

Comment on lines +245 to +246
$this->assertTrue( $p->get_attribute( 'supported' ), 'Did not find required supported element.' );
$this->assertFalse( $p->step(), "Didn't properly reject unsupported markup: {$description}" );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smart 👍

Comment on lines 323 to 351
'Simple IMG tag' => array( '<img target>', array( 'HTML', 'BODY', 'IMG' ), 1 ),
'Two sibling IMG tags' => array( '<img><img target>', array( 'HTML', 'BODY', 'IMG' ), 2 ),
'Three sibling IMG tags, an IMAGE in last place' => array( '<img><img><image target>', array( 'HTML', 'BODY', 'IMG' ), 3 ),
'IMG inside a DIV' => array( '<div><img target></div>', array( 'HTML', 'BODY', 'DIV', 'IMG' ), 1 ),
'DIV inside a DIV' => array( '<div><div target></div>', array( 'HTML', 'BODY', 'DIV', 'DIV' ), 1 ),
'IMG inside many DIVS' => array( '<div><div><div><div><img target></div></div></div></div>', array( 'HTML', 'BODY', 'DIV', 'DIV', 'DIV', 'DIV', 'IMG' ), 1 ),
'DIV inside DIV after IMG' => array( '<div><img><div target></div></div>', array( 'HTML', 'BODY', 'DIV', 'DIV' ), 1 ),
'IMG after DIV' => array( '<div></div><img target>', array( 'HTML', 'BODY', 'IMG' ), 1 ),
'IMG after two DIVs' => array( '<div></div><div></div><img target>', array( 'HTML', 'BODY', 'IMG' ), 1 ),
'IMG after two DIVs with nesting' => array( '<div><div><img></div></div><div></div><img target>', array( 'HTML', 'BODY', 'IMG' ), 1 ),
'IMG after invalid DIV closer' => array( '</div><img target>', array( 'HTML', 'BODY', 'IMG' ), 1 ),
'EM inside DIV' => array( '<div>The weather is <em target>beautiful</em>.</div>', array( 'HTML', 'BODY', 'DIV', 'EM' ), 1 ),
'EM after closed EM' => array( '<em></em><em target></em>', array( 'HTML', 'BODY', 'EM' ), 2 ),
'EM after closed EMs' => array( '<em></em><em><em></em></em><em></em><em></em><em target></em>', array( 'HTML', 'BODY', 'EM' ), 6 ),
'EM after unclosed EM' => array( '<em><em target></em>', array( 'HTML', 'BODY', 'EM', 'EM' ), 1 ),
'EM after unclosed EM after DIV' => array( '<em><div><em target>', array( 'HTML', 'BODY', 'EM', 'DIV', 'EM' ), 1 ),
// This should work for all formatting elements, but if two work, the others probably do too.
'CODE after unclosed CODE after DIV' => array( '<code><div><code target>', array( 'HTML', 'BODY', 'CODE', 'DIV', 'CODE' ), 1 ),
'P after unclosed P' => array( '<p><p target>', array( 'HTML', 'BODY', 'P' ), 2 ),
'Unclosed EM inside P after unclosed P' => array( '<em><p><p><em target>', array( 'HTML', 'BODY', 'EM', 'P', 'EM' ), 1 ),
'P after closed P' => array( '<p><i>something</i></p><p target>This one</p>', array( 'HTML', 'BODY', 'P' ), 2 ),
'A after unclosed A' => array( '<a><a target>', array( 'HTML', 'BODY', 'A' ), 2 ),
'A after unclosed A, after a P' => array( '<p><a><a target>', array( 'HTML', 'BODY', 'P', 'A' ), 2 ),
'Large HTML document with deep P' => array(
'<div><div><div><div><div><div><div><div><p></p><p></p><p><div><strong><em><code></code></em></strong></div></p></div></div></div></div></div></div></div></div><div><div><div><div><div><div><div><div><p></p><p></p><p><div><strong><em><code target></code></em></strong></div></p></div></div></div></div></div></div></div></div>',
array( 'HTML', 'BODY', 'DIV', 'DIV', 'DIV', 'DIV', 'DIV', 'DIV', 'DIV', 'DIV', 'DIV', 'STRONG', 'EM', 'CODE' ),
2,
),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤯

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this deserves a comment. I intentionally want to push in the tests situations that are deeper than specified in some rules relating to adoption and reconstruction, even if they don't relate specifically to each tag in the test.

with all the tests otherwise being small I figured we might miss certain things on a real document if we don't push the stack depth.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a comment in 639a01f

Copy link
Contributor

@ockham ockham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a number of comments, but nothing major.

This is an impressive feat. The inline documentation and comments help a great deal, but I will say that it's still not trivial to grasp the underlying logic always -- which is probably natural due to the inherent complexity of the spec (that I'm still not that intimately familiar with). Also nice job on the naming of helper methods (reconstruct_active_formatting_elements , run_adoption_agency_algorithm, etc), and on the code structure of step_in_body which uses all those and is quite straight-forward to read.

Short of attempting to fully understand algorithms such as Adoption Agency and how it's implemented, it's the impressive test coverage that gives me the confidence to give this my 👍 .

As is reflected in some of my earlier comments, I've struggled a bit with understanding some of the methods that are still in stub state -- possibly since I started by reviewing the helper classes first -- notably e.g. has_element_in_button_scope, whose implementation currently makes no reference to button at all 😅

I'm happy to give the benefit of the doubt that they will make more sense once the handling of the relevant element is implemented, and there's plenty of documentation in the HTML Processor class and test, but I wonder if a cursory reader of e.g. the Open Elements class might feel similarly puzzled as I did. I'm not sure I have a perfect solution to this -- I'd maybe just suggest even more inline docs/comments/@todos/@sees.

Anyway: LGTM 👍

@dmsnell dmsnell force-pushed the html-api/minimal-html-processor branch 3 times, most recently from fe81fae to 63ea501 Compare July 19, 2023 23:11
@dmsnell
Copy link
Member Author

dmsnell commented Jul 19, 2023

@ockham I think your comments are all valid. How about we get this in trunk so we can start iterating on it, and before 6.4.0 is cut, make sure that everything we want is in place, be it state simplification, TODO items (which I intentionally left out because I was afraid people would reject @TODO comments), and additional tests.

Thanks for all your help with this.

@ockham ockham force-pushed the html-api/minimal-html-processor branch from 63ea501 to b6a517b Compare July 20, 2023 13:13
@ockham
Copy link
Contributor

ockham commented Jul 20, 2023

Committed to Core trunk in https://core.trac.wordpress.org/changeset/56274.

@ockham ockham closed this Jul 20, 2023
@dmsnell dmsnell deleted the html-api/minimal-html-processor branch July 21, 2023 19:04
@luisherranz
Copy link
Member

Hey, feel free to ignore it, and I'm sure this is not the bottleneck, but wouldn't it be preferable from a performance point of view to order the tags by how common they are?

return (
  'ADDRESS' === $tag_name ||
  'APPLET' === $tag_name ||
  'AREA' === $tag_name ||
  'ARTICLE' === $tag_name ||
  // ...
switch ( $op ) {
  case '+BLOCKQUOTE':
  case '+DIV':
  case '+FIGCAPTION':
  case '+FIGURE':
  case '+P':
  // ...

Maybe also store those tag name strings in variables? (I'm not so sure if that one makes any difference).

@dmsnell
Copy link
Member Author

dmsnell commented Jul 26, 2023

wouldn't it be preferable from a performance point of view to order the tags by how common they are?

I don't know. Once this is spec-complete I'd like to run some benchmarks to better understand. A switch is also possible, but we need to measure and see. For the moment I want to keep things clear from an implementation standpoint, preferring a correspondence to the HTML spec's organization over conjectured micro performance. If we discover this is a significant penalty then I'd reconsider.

Maybe also store those tag name strings in variables? (I'm not so sure if that one makes any difference).

PHP interns string literals so there's no benefit to storing them in variables.

Originally I had a more performant solution here, which involved a WP_HTML_Spec class, but for the minimal processor I removed that again for reasons of clarity and maintainability over raw performance.

I'm still not sure how fast or slow this is because it's hard to create a meaningful test document that's fully supported. My guess is that this is marginally slower than the Tag Processor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants