-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Introduce class for sourcing block attributes from HTML #46345
base: trunk
Are you sure you want to change the base?
Conversation
} | ||
|
||
public static function parse_selector( $s, $at = 0 ) { | ||
$selectors = explode( ',', $s ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd treat the separator more carefully in case this gets applied to a selector like a[href*=","]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the explode will be executed every time this function runs, including the call in line 254 below where we already expect to deal with a single selector. only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, thanks. still very rough, and in fact I want to move a lot of code around here, but I wanted to get some basic structure together to see how to build this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved support for this into the main parser function and am aborting on unsupported attribute queries, so we shouldn't have this bug anymore
} | ||
|
||
// @TODO: Hashes don't have to start with `nmstart` so this might reject valid hash names. | ||
$identifier = self::parse_css_identifier( $s, $at ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it only consumes the first css identifier at the moment, like figure
in figure video
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks right 😉
though to be fair if you look up at line 275 above it, it's acknowledged that we don't support the descendant selector yet
while ( $tags->next_tag() ) { | ||
foreach ( $selector as $s ) { | ||
if ( 'element' === $s['type'] && $tags->get_tag() === strtoupper( $s['identifier'] ) ) { | ||
return $tags; | ||
} | ||
|
||
// @TODO: $tags->has_class() would be _really_ handy here. | ||
if ( 'class' === $s['type'] && preg_match( "~\b{$s['identifier']}\b~", $tags->get_attribute( 'class' ) ) ) { | ||
return $tags; | ||
} | ||
|
||
if ( 'hash' === $s['type'] && $s['identifier'] === $tags->get_attribute( 'id' ) ) { | ||
return $tags; | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This effectively reimplements $tags->next_tag
. I'm tempted to say it could be included in the next_tag API as follows:
$criteria = [];
foreach ( $selectors as $s ) {
$criterion = [];
if ( 'element' === $s['type'] ) {
$criteria[] = [ 'tag_name' => strtoupper( $s['identifier'] ) ];
} else if ( 'class' === $s['type'] ) {
$criteria[] = [ 'class_name' => $s['identifier'] ];
} else if ( 'hash' === $s['type'] ) {
$criteria[] = [ 'id' => $s['identifier'] ];
}
}
$tags->next_tag( [
'or' => $criteria
] );
However, I can see how this may need more CSS-specific matchers that don't make sense in the core API.
At the same time, this method has duplicate logic for a special case when only one selector is received. I'd do this instead:
public static function select( $selector, $html ) {
$tags = new WP_HTML_Naive_Processor( $html );
if ( ! array_is_list( $selector ) ) {
$selector = [ $selector ];
}
while ( $tags->next_tag() ) {
foreach ( $selector as $s ) {
// ...
}
}
return null;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on one note, it's not duplicate logic, as in the case of the individual selector we can tell the tag processor to jump to the next tag you specify.
on another note, if you squint…
there's something very potentially powerful here that could be moved into the tag processor. spent some time today reading the CSS selector spec…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for another reason I've actually collapsed it to your suggestion; getting has_attribute
support (such as with a[download]
) was going to start getting cumbersome, so now it's just one big loop.
14a5626
to
d81f912
Compare
ab3591a
to
89a3b32
Compare
} | ||
list( $inner_selector, $next_at ) = $inner; | ||
$inner_selector['combinator'] = $combinator; | ||
$selector['then'] = $inner_selector; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the output like a linked list of selectors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's right. we need to represent selectors in some way that captures the recursive or monadic behavior of the combinators. we have to use the first segment, or simple selector sequence, to find a given location in the document, and then from that starting point and in whatever relation to it is appropriate given the combinator, find the next one.
we could probably do this as a list, but I already had set the top level as a list to represent the comma-separated selectors_group
, which is special because the OR operation here only exists at the top level.
recursing seemed easiest since I could presumably pass a single object to the query generator and it would contain the next simple selector sequence in the chain, or nothing at all.
haven't built that part of the query generator yet though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about maybe a list of lists, as in
[
[$segment1, $segment2, $segment3],
[$segment1, $segment2, $segment3]
]
We'll still need to use nesting, though, for things like :not(span, a)
, so maybe better to embrace it right away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not yet sold on supporting :not()
at this stage.
The list works, but the nesting is convenient because we don't have to track where in the list we are.
What do you see as the benefit of the flat list?
* nonascii [\240-\377] | ||
* escape {unicode}|\\[^\r\n\f0-9a-f] | ||
* unicode \\{h}{1,6}(\r\n|[ \t\r\n\f])? | ||
* h [0-9a-f] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
h -> hex? For readability – I got confused for a moment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is just a direct copy from the spec. yes, h
is there for a hex digit
* not, that `<p>` itself is a special element. | ||
*/ | ||
|
||
class WP_HTML_Processor extends WP_HTML_Tag_Processor { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also works with tags, it's even in the method names – I'd be slightly confused to see two classes with such similar names doing similar things. If there isn't anything Unsafe
here, should we just bring these methods to the parent class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not unsafe right now because the functions have been renamed to imply they are making assumptions (and those names might change)
but there's still a difference between this and the tag processor that I wrestle with. this is "safe" in that it will preserve the HTML syntax by relying on the tag processor. this is "unsafe" in that it allows one to break the semantic of the HTML by performing operations that would require multiple steps in a DOM, such as removing a range of HTML tags or tag openers or tag closers. it's also "unsafe" in that even though it preserves the syntax, it may not do what you expect it to, even in the renamed functions.
<p>Before<p>Inside</p>
For this input there's nothing that will trap the first paragraph and any balanced tag operation will fail after reaching the end of input. Hopefully that will change in time, and we'd properly find <p>Before
as one outer content and <p>Inside</p>
as another, but even then our performance domain has shifted greatly from the tag processor.
I'm thinking that as this class develops people may not generally call the tag processor and it will remain as a low-level library. That's why I renamed it to WP_HTML_Processor
, because I think this could end up as the "entry point" interface for working with HTML. Sistering this is the as-of-yet-still-un-renamed-to WP_PHPQ
(for lack of a better name) which might go inside here or still might live separately as a class for turning a CSS selector into an HTML query. The HTML class could expose that query instead of our next_tag()
interface.
$p = new WP_HTML_Processor( $html );
while ( $p->find( 'section + h2 > a[href^="#"]' ) ) {
$title = $p->get_inner_text();
}
It's still not a DOM, we can't call ->parent_node()
or walk backwards (even though we can manually do this with bookmarking), but we can jump to specific places in the document, mangle the attributes, and mangle the content inside it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's also "unsafe" in that even though it preserves the syntax, it may not do what you expect it to, even in the renamed functions.
And then it allows you to find an element by a class name, even if it's table
, and then insert a div
inside it – which the browsers can handle, but is likely not what you want.
That's why I renamed it to WP_HTML_Processor, because I think this could end up as the "entry point" interface for working with HTML.
Oh I see, with this insight I'd probably rename the Tag_Processor to something else just eo avoid the confusion – but at this point I don't think it's worth it. 🤷
The HTML class could expose that query instead of our next_tag() interface.
I like that!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd probably rename the Tag_Processor
I don't know about this. I think based on how careful we were with naming that it will retain the meaning we gave it.
What do you find contradictory about that class, if we introduce an HTML processor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not as much that I see a contradiction, as that I worry about confusing everyone with two very similarly named classes (in a similar way as folks get confused about slice vs splice in JS)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's keep thinking about it then. I want to keep the classes separate because of the different performance implications, but I don't know how to address what you're suggesting (apart from Unsafe
which I'm currently skeptical of using)
👋 I'm now using One thing that caught me a bit by surprise is that |
Given that I thought that this might also unblock work on a (I can file that PR if you're okay with this plan @dmsnell @adamziel) |
@ockham consider everything in this PR tentative at best and more-than-likely to change. I don't even know if we'll end up with `get_content_inside_balanced_tags() or have something else instead. Feel free to write your own PRs and explore different approaches and share here. This PR is more like my own notes as I work through the problem and try to understand where things belong and what the different parts of the system are. |
@@ -87,6 +87,7 @@ public function get_content_inside_balanced_tags() { | |||
} | |||
|
|||
$content = $this->content_inside_bookmarks( $start_name, $end_name ); | |||
$this->seek( $start_name ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
$tags->seek( 'start' ); | ||
$max = 100; | ||
while ( --$max > 0 ) { | ||
$next = $s; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an idea – take a selector like #main > .wrapper a span
.
This class is now going top to bottom and matching #main
, then the immediate descendant .wrapper
, then a descendant a
and then a descendant span
.
What if it went bottom to top? So we start by looking for the first span
in the document and then investigate the parents stack to see if any matches a
. Now we don't need to jump around the document to decide if it's a match.
There's still the issue of ~
and +
combinators, but we could keep track of a tree of relevant siblings. For example in a HTML like
<div id="main">
<div class="wrapper">
<a>
<div></div>
</a>
<a>
<span></span>
<span></span> <--- the cursor is here
</a>
</div>
</div>
We'd only store the following data structure (without the children of the initial a
)
div#main
div.wrapper
a
a
span
span
If we want to support matching arbitrary attributes beyond ids and classes, we could keep a bookmark for each parent and backtrack if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this seems worth a try. If we aren't going to fully parse the HTML then I don't see the same value add because quickly eliminating potential matches would happen before diving deep into the structure (and maybe that can still happen). If on the other hand, we're going to do a full document parse (and carry the weight of all that implies) then this method seems approachable.
The one major difference I see at this point is that if we're looking for the inner-most element first then we have to run $tags->matches()
on every node and run a full check on every node. In the approach I've started taking we prune out most nodes out the door and are able to iterate with less careful checking, because at that point we're mostly just checking tag name and whether it's an opener or closer. In practice it may not matter, but I did want to point out something different from our HTML-based approach vs. what might be happening in the DOM land in the browser.
9e752e5
to
01d31ac
Compare
Tests are broken because the attribute sourcer needs rework given this different approach for navigating tags.
For sibling navigation we have to continue processing outside of balanced_next().
679dce7
to
6103139
Compare
Flaky tests detected in 6103139. 🔍 Workflow run URL: https://github.com/WordPress/gutenberg/actions/runs/3858581592
|
All the test cases that I thought up are passing as of 79c24bd, but if you pressed me to explain how or why I wouldn't be able to say 🙃 Current thought is to continue exploring cases and see how this now interacts with other tag processor operations and modifications, see how deeply we can nest. See about how this does when given example Core block output. Please for all who review this consider it a draft that's likely to be unlike a final version. Would like to convert the iteration into recursion and remove the |
What?
Part of #44410.
Makes it possible to source some block attributes from a chunk of HTML given the attribute definitions (as passed in
registerBlock()
)Why?
Who doesn't want sourced block attributes from the PHP side?
How?
Uses the
WP_HTML_Tag_Processor
and some basic assumptions about HTML structure to source attributes iff: