XPath3.1: mimic handling of multiple root element nodes #2351
base: master
Changes from 12 commits
@@ -201,3 +201,61 @@ def test_trips(html_content, xpath, answer):
     html_content = html_tools.xpath_filter(xpath, html_content, append_pretty_line_formatting=True)
     assert type(html_content) == str
     assert answer in html_content
+
+DOM_violation_two_html_root_element = """<!DOCTYPE html>
+<html>
+<body>
+<h1>Hello world</h1>
+<p>First paragraph.</p>
+</body>
+</html>
+<html>
Review comment: This is the second html root element.
+<body>
+<h1>Hello world</h1>
+<p>Browsers parse this part by fixing it but lxml doesn't and returns two root element node</p>
+<p>Therefore, if the path is /html/body/p[1], lxml(libxml2) returns two element nodes not one.</p>
+</body>
+</html>"""
+@pytest.mark.parametrize("html_content", [DOM_violation_two_html_root_element])
+@pytest.mark.parametrize("xpath, answer", [
+    ("/html/body/p[1]", "First paragraph."),
+    ("/html/body/p[1]", "Browsers parse this part by fixing it but lxml doesn't and returns two root element node"),
Review comment: This is the critical point. Why do I select one element in the browser's inspector window while lxml returns two? Because the document contains two html elements and two body elements.
+    ("//html/body/p[1]", "First paragraph."),
+    ("//html/body/p[1]", "Browsers parse this part by fixing it but lxml doesn't and returns two root element node"),
+    ("//body/p[1]", "First paragraph."),
+    ("//body/p[1]", "Browsers parse this part by fixing it but lxml doesn't and returns two root element node"),
+    ("/html[2]/body/p[1]", "First paragraph."),
+    ("/html[2]/body/p[1]", "Browsers parse this part by fixing it but lxml doesn't and returns two root element node"),
+    ("//html[2]/body/p[1]", "First paragraph."),
+    ("//html[2]/body/p[1]", "Browsers parse this part by fixing it but lxml doesn't and returns two root element node"),
+])
+def test_trips(html_content, xpath, answer):
+
+
+    # In normal situation, DOM's root element node is only one. So when DOM violation happens, Exception occurs.
+    with pytest.raises(Exception):
Review comment: I intentionally added this test to reproduce the problem.
+        from lxml import etree, html
+        import elementpath
+        from elementpath.xpath3 import XPath3Parser
+        parser = etree.HTMLParser()
+        tree = html.fromstring(bytes(html_content, encoding='utf-8'), parser=parser)
+        # just example xpath
+        # Error will occur.
+        r = elementpath.select(tree, xpath.strip(), namespaces={'re': 'http://exslt.org/regular-expressions'}, parser=XPath3Parser)
+
+    html_content = html_tools.xpath_filter(xpath, html_content, append_pretty_line_formatting=True)
+    assert type(html_content) == str
+    assert answer in html_content
+
+@pytest.mark.parametrize("html_content", [DOM_violation_two_html_root_element])
+@pytest.mark.parametrize("xpath, answer", [
+    ("/html[2]/body/p[1]", "First paragraph."),
+    ("//html[2]/body/p[1]", "First paragraph."),
+])
+def test_trips(html_content, xpath, answer):
+    # In normal situation, DOM's root element node is only one. So when DOM violation happens, Exception occurs.
+
+    html_content = html_tools.xpath_filter(xpath, html_content, append_pretty_line_formatting=True)
+    assert type(html_content) == str
+    # check the answer is not in the html_content
+    assert answer not in html_content
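To make the "two root elements" condition in the fixture above concrete, here is a minimal sketch that uses only the standard library's `html.parser`. It is not the PR's implementation, and it is not how lxml or elementpath detect the condition; `RootCounter` and `count_root_elements` are hypothetical names introduced for illustration. It assumes every non-void tag is explicitly closed, as in the fixture.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class RootCounter(HTMLParser):
    """Hypothetical helper: record every start tag seen at nesting depth 0."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.roots = []

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            self.roots.append(tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def count_root_elements(markup):
    """Return the tag names of all top-level elements found in markup."""
    counter = RootCounter()
    counter.feed(markup)
    counter.close()
    return counter.roots


# Shortened stand-in for the fixture: two complete <html> documents in a row.
two_roots = """<!DOCTYPE html>
<html><body><p>First paragraph.</p></body></html>
<html><body><p>Second document.</p></body></html>"""

print(count_root_elements(two_roots))
```

A check like this only flags the malformed input. Which of the matching nodes an XPath such as `/html/body/p[1]` should return is the behavior the tests above pin down.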
@@ -55,7 +55,7 @@ beautifulsoup4
 lxml >=4.8.0,<6
 
 # XPath 2.0-3.1 support - 4.2.0 broke something?
-elementpath==4.1.5
+elementpath==4.4.0
Review comment: Is it time to upgrade?
Review comment: Sure, if the tests pass it's OK.
Review comment: Was this change required to fix this PR?
Review comment: Since this PR (#2351) uses the fragment=True option, versions up to 4.1.5 won't work, and 4.2.0 has another problem. So the minimum is 4.2.1.
 
 selenium~=4.14.0
 
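The version reasoning in the review thread (4.1.5 too old, 4.2.0 problematic, minimum 4.2.1) can be sketched as a small check. This is only an illustration under the assumption that versions are plain dotted integers; `parse_version` and `meets_minimum` are hypothetical helpers, not part of the project or of elementpath.

```python
def parse_version(text):
    """Turn a dotted version such as '4.2.1' into a comparable tuple (4, 2, 1)."""
    return tuple(int(part) for part in text.split("."))


# Minimum taken from the review comment above; treated here as an assumption.
MINIMUM = parse_version("4.2.1")


def meets_minimum(text):
    """Return True when the given version is at least the stated minimum."""
    return parse_version(text) >= MINIMUM


for candidate in ("4.1.5", "4.2.0", "4.2.1", "4.4.0"):
    print(candidate, meets_minimum(candidate))
```

Under this reading, the old pin (4.1.5) fails the check and the new pin (4.4.0) passes it.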