"<" character in regexp body breaks parsing of XML #414

Closed
sindy39 opened this issue Aug 25, 2019 · 12 comments

@sindy39

sindy39 commented Aug 25, 2019

Hello,
when trying to isolate the URI alone (i.e. no display name), I used to use regexp="<(.*)>" on the To:, From:, Contact: etc. headers. With 3.6 stable, the opening < is treated as the start of a new tag, so when the closing "/>" of the "<ereg" tag (or the ">", if trying it that way) is reached, I get an error like "</(.*)> was expected".

Examples:

<action>
<ereg regexp=" *<(sip:.*)>" search_in="hdr" header="Contact:" assign_to="blackhole,call_contact" />
</action>
yields
Unexpected </action> (expected </ereg>)

<action>
<ereg regexp=" *<(sip:.*)>" search_in="hdr" header="Contact:" assign_to="blackhole,call_contact" ></ereg>
</action>
yields
Unexpected </ereg> (expected </(sip:.*)>)

@wdoekes
Member

wdoekes commented Aug 26, 2019

Ouch! That is an unforeseen effect of the quick fixes to tackle invalid-xml detection.

Thanks for the report.

wdoekes added the bug label Aug 26, 2019
@sergey-safarov

I think you need to encode the XML entities here.
Please try:

 *&lt;(sip:.*)&gt;

Encoder online
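
The same escaping can be produced programmatically instead of with an online encoder. A minimal sketch using only the Python standard library; the element and attribute values are copied from the report, the variable names are made up for illustration, and nothing here reflects how SIPp itself reads scenario files:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# The regular expression as the scenario author wants the application to see it.
raw_regexp = " *<(sip:.*)>"

# quoteattr() escapes &, < and > and wraps the result in quotes.
attr = quoteattr(raw_regexp)

snippet = (
    "<action>"
    f'<ereg regexp={attr} search_in="hdr" header="Contact:" '
    'assign_to="blackhole,call_contact" />'
    "</action>"
)

# An XML parser hands the unescaped value back to the application,
# so the regexp that gets compiled is unchanged.
parsed = ET.fromstring(snippet).find("ereg").get("regexp")
print(attr)
print(parsed == raw_regexp)
```

The round trip is the point: the entities exist only in the file, and the value the parser returns is the original regexp.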

@sergey-safarov

I think it is not a bug.
That is expected behavior, because it really is broken XML.

@wdoekes
Member

wdoekes commented Aug 27, 2019

Hm. Thanks for clarifying @sergey-safarov. You are right.

The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.
https://www.w3.org/TR/REC-xml/#syntax
https://www.w3.org/TR/REC-xml/#NT-AttValue

So indeed, & and < were never allowed there. (But > is, although it would be confusing to use.)

If I amend some docs/FAQ then that should resolve this issue.
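
The rule quoted above can be checked against any conforming parser. A small sketch, assuming Python's standard-library `xml.etree.ElementTree` as a stand-in (it is not the parser SIPp uses, and `is_well_formed` is a hypothetical helper name):

```python
import xml.etree.ElementTree as ET


def is_well_formed(snippet):
    """Return True if the snippet parses as XML, False otherwise."""
    try:
        ET.fromstring(snippet)
        return True
    except ET.ParseError:
        return False


results = {
    # Literal < inside an attribute value: rejected.
    "literal <": is_well_formed('<ereg regexp="<(.*)>"/>'),
    # Literal & inside an attribute value: rejected.
    "literal &": is_well_formed('<ereg regexp="a & b"/>'),
    # Literal > inside an attribute value: accepted.
    "literal >": is_well_formed('<ereg regexp="(.*)>"/>'),
    # Escaped form: accepted.
    "escaped": is_well_formed('<ereg regexp="&lt;(.*)&gt;"/>'),
}

for case, ok in results.items():
    print(case, "->", "well-formed" if ok else "rejected")
```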

@sindy39
Author

sindy39 commented Aug 27, 2019

Yeah, another case of "but it has been working that way ever since the stone age!", so although I know that I have to escape them in html, I didn't give a thought to this being potentially related, given that it's between quotes and given that in SIP context the "<" is used quite often.

So I think a fat red WARNING in the part of the manual which explains the use of regexp will resolve that (to the extent of people reading manuals).

Off topic, when talking about the manual, I think the remark that an IPv6 extension of cygwin from win6.jp is required for successful compilation is obsolete given that it happily compiles without it (and that the extension in question went missing), most likely because the extension has become an intrinsic part of cygwin in the meantime.

@sergey-safarov

As a suggestion, we could call an XML validator before parsing the XML files.

xmllint --noout ./docs/phrase/phrase_nl.xml

If the provided XML file is broken, the user will get a reference to the error.
If the XML file is correct, the command runs without any output.
More details

As an option, we can use the same XML validator library that is used in xmllint.
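
To illustrate what such a pre-check reports, here is a rough sketch in Python. It uses the standard-library parser as a stand-in for xmllint, and `find_xml_error` is a hypothetical helper, not part of SIPp or xmllint:

```python
import xml.etree.ElementTree as ET


def find_xml_error(text):
    """Return None if text is well-formed XML, else (line, column, message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        line, column = exc.position
        return line, column, str(exc)
    return None


# The failing example from the original report, and its escaped counterpart.
broken = '<action>\n<ereg regexp=" *<(sip:.*)>" />\n</action>\n'
fixed = broken.replace(" *<(sip:.*)>", " *&lt;(sip:.*)&gt;")

for name, text in (("broken", broken), ("fixed", fixed)):
    error = find_xml_error(text)
    if error is None:
        print(name, "is well-formed")
    else:
        line, column, message = error
        print(f"{name}: line {line}, column {column}: {message}")
```

For a scenario file on disk, the same function could be called on the file's contents before handing the file to the real parser.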

@sindy39
Author

sindy39 commented Aug 27, 2019

That sounds great, as it will pinpoint the actual error rather than giving a confusing output.

@league55

Good day, we also encountered this problem recently and it was a little bit unexpected for us.

My personal point of view is that, whether this is a bug or not, it would be very nice of you to mention it in the release notes with the other breaking changes, as this worked in a previous version.

@wdoekes
Member

wdoekes commented Sep 10, 2019

A reasonable request 👍

@hmoghani

@wdoekes can you please point me to the file/line for this change in 3.6.0 (or the PR for this fix)? I would like to revert that part back to 3.5. We have over 200 broken xmls which were working fine before this fix. I know that wasn't the right way to write the xmls, but I really don't want to spend time changing each xml one by one. Though I would start creating new ones with this in mind.

@wdoekes
Member

wdoekes commented Oct 22, 2019

I don't expect you to rewrite 200 files manually, no. But perhaps a one-liner would help.

$ cat bad.xml 
<!-- blah -->
<this is bad="regex <.*>"/><this is not/><but this="<is>"></but>
<blah/>
<and this="<is>"></and>

And:

$ python3 -c 'import html,re,sys;r=re.compile('\''"([^"]*)"'\'');print(r.sub((lambda x:html.escape(x.group(0),quote=False)),sys.stdin.read()),end="")' < bad.xml 
<!-- blah -->
<this is bad="regex &lt;.*&gt;"/><this is not/><but this="&lt;is&gt;"></but>
<blah/>
<and this="&lt;is&gt;"></and>
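
The one-liner escapes every & it finds inside quotes, so running it a second time over an already converted file would double-escape the entities. A slightly longer sketch that leaves existing entities alone is shown below. The helper names are hypothetical, and it keeps the one-liner's simplifying assumption that every double-quoted run is an attribute value, so quoted text inside comments, CDATA, or element content would be rewritten too:

```python
import re

# Any run of text between double quotes is treated as an attribute value.
ATTR_VALUE = re.compile(r'"([^"]*)"')

# An ampersand that does not already start an entity or character reference.
BARE_AMP = re.compile(r'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)')


def escape_value(value):
    """Escape bare &, then < and >, inside one attribute value."""
    value = BARE_AMP.sub("&amp;", value)
    return value.replace("<", "&lt;").replace(">", "&gt;")


def escape_quoted(text):
    """Apply escape_value() to every double-quoted run in text."""
    return ATTR_VALUE.sub(lambda m: '"' + escape_value(m.group(1)) + '"', text)


# The bad.xml sample from the comment above.
sample = (
    "<!-- blah -->\n"
    '<this is bad="regex <.*>"/><this is not/><but this="<is>"></but>\n'
    "<blah/>\n"
    '<and this="<is>"></and>\n'
)

converted = escape_quoted(sample)
print(converted, end="")
```

Keeping a backup of the originals and diffing the result before overwriting 200 files is a sensible precaution with any regex-based rewrite like this.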

wdoekes added a commit that referenced this issue Nov 11, 2019
wdoekes pinned this issue Apr 29, 2020
@wdoekes
Member

wdoekes commented Apr 29, 2020

I think the commits got this covered. Leaving it pinned for now.
