PrettyPrintWriter fails to serialize characters in the Unicode Supplementary Multilingual Plane in XML 1.0 mode and XML 1.1 mode #337

Open
wants to merge 1 commit into
base: master
from

Conversation

basil
Contributor

@basil basil commented May 2, 2023

PrettyPrintWriter fails to properly serialize characters in the Unicode Supplementary Multilingual Plane (SMP) in XML 1.0 mode and XML 1.1 mode (quirks mode works) with the following exception:

com.thoughtworks.xstream.io.StreamException: Invalid character 0xd83e in XML stream
        at com.thoughtworks.xstream.io.xml.PrettyPrintWriter.writeText(PrettyPrintWriter.java:250)
        at com.thoughtworks.xstream.io.xml.PrettyPrintWriter.writeText(PrettyPrintWriter.java:205)
        at com.thoughtworks.xstream.io.xml.PrettyPrintWriter.setValue(PrettyPrintWriter.java:187)
        at com.thoughtworks.xstream.io.xml.PrettyPrintWriterTest.testSupportsSupplementaryMultilingualPlaneInXml1_0Mode(PrettyPrintWriterTest.java:310)

The root cause of the problem is incorrect iteration over Unicode code points. The current implementation iterates over the UTF-16 code units of the text (Java char values) rather than over its code points. Characters in the Supplementary Multilingual Plane are encoded in UTF-16 as a surrogate pair, that is, two code units. For example, U+1F98A is encoded in UTF-16 as 0xD83E 0xDD8A. Java provides a dedicated API to iterate over code points, but XStream makes the erroneous assumption that a code point and a char are equivalent, likely because it was never tested outside of quirks mode with characters in the Supplementary Multilingual Plane. This PR fixes the problem by using the Java API for iterating over code points, thus removing that faulty assumption.
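The difference between the two kinds of iteration can be reproduced outside XStream. The following is a minimal sketch, not code from this PR; the class name `CodePointIteration` and its helper methods are hypothetical and exist only for illustration. It walks the same one-character string first by `char` and then by code point.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointIteration {
    // Collects each UTF-16 code unit, which is what a charAt-based loop sees.
    static List<Integer> codeUnits(String text) {
        List<Integer> units = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            units.add((int) text.charAt(i));
        }
        return units;
    }

    // Collects each Unicode code point, which is what codePoints() sees.
    static List<Integer> codePoints(String text) {
        List<Integer> points = new ArrayList<>();
        text.codePoints().forEach(cp -> points.add(cp));
        return points;
    }

    public static void main(String[] args) {
        // U+1F98A, the example character from the description above.
        String fox = new String(Character.toChars(0x1F98A));
        for (int unit : codeUnits(fox)) {
            System.out.printf("code unit  0x%X%n", unit);
        }
        for (int point : codePoints(fox)) {
            System.out.printf("code point 0x%X%n", point);
        }
    }
}
```

The `char`-based loop yields the two surrogates 0xD83E and 0xDD8A, and the first of these is the value reported in the exception message above. The code point loop yields the single value 0x1F98A.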

The new quirks mode test passes before and after the changes to PrettyPrintWriter. The new XML 1.0 mode and XML 1.1 mode tests fail before the changes to PrettyPrintWriter with the exception given above. The new XML 1.0 mode and XML 1.1 mode tests pass after the changes to PrettyPrintWriter.

Fixes #336

…mentary Multilingual Plane in XML 1.0 mode and XML 1.1 mode
-final int length = text.length();
-for (int i = 0; i < length; i++) {
-final char c = text.charAt(i);
+text.codePoints().forEach(c -> {
Contributor

I guess 1fcfa0b makes this (@since 9) safe.

Contributor Author

I have no idea what you are talking about in this review comment. The method is present in Java 8.

Contributor

@jglick jglick May 3, 2023

Perhaps. Was just going by https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/String.html#codePoints() which says 9. At any rate I would hope the CI build would fail if this were not permitted.

Contributor Author

> Perhaps

Do you have any evidence for this claim which is casting doubt on the correctness of this change and potentially making it harder for subsequent reviewers to approve? If you do not, I would suggest that you refrain from making such review comments.

https://docs.oracle.com/javase/8/docs/api/java/lang/CharSequence.html#codePoints--

Contributor

The above link. It seems the @since tags are contradictory, unless the JDK team has a policy of noting when an override of a default method was added (which would seem strange to me since that should not change the API surface).

Contributor Author

https://docs.oracle.com/javase/8/docs/api/java/lang/CharSequence.html#codePoints-- is present in Java 8 and this code compiles successfully on Java 8. As far as I can tell there is no action item here, and this whole review comment was unnecessary and served only to chew up some of my time to refute an unverified claim as well as potentially confusing future reviewers.

Member

XStream 1.5.x will target Java 11. No point any longer to use Java 8 as minimum.

Contributor Author

But XStream 1.4 still uses Java 8, and we want this critical bug fix in that line. Anyway, this change works in Java 8, so this whole thread is pointless. I have no idea why this review feedback was left in the first place.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

codePoints() was added to CharSequence interface as a default method in Java 8.
In Java 9, an override of this method was added to String (which implements CharSequence).

So, it should work for both Java 8 and 9, but it can be slightly faster for Strings in Java 9+ due to optimised version added to String in Java 9.
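The point about where `codePoints()` is declared can be checked with reflection. This is a minimal sketch with hypothetical class and method names; it relies only on `java.lang.reflect.Method`, and the declaring class it reports for `String` depends on the JDK it runs on, so no particular output is assumed for that part.

```java
import java.lang.reflect.Method;

public class CodePointsOrigin {
    // True when CharSequence.codePoints() is a default interface method.
    static boolean isDefaultOnCharSequence() {
        try {
            Method method = CharSequence.class.getMethod("codePoints");
            return method.isDefault();
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    // Name of the class that supplies codePoints() for String on this JDK.
    static String declaringClassForString() {
        try {
            return String.class.getMethod("codePoints").getDeclaringClass().getName();
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    // Calls codePoints() through the interface, so any CharSequence works.
    static long countCodePoints(CharSequence text) {
        return text.codePoints().count();
    }

    public static void main(String[] args) {
        System.out.println("default on CharSequence: " + isDefaultOnCharSequence());
        System.out.println("declared for String by:  " + declaringClassForString());
        StringBuilder builder = new StringBuilder("a").appendCodePoint(0x1F98A);
        System.out.println("code points in builder:  " + countCodePoints(builder));
    }
}
```

Because the call compiles against the `CharSequence` method, source that calls `text.codePoints()` on a `String` should build on Java 8 regardless of whether `String` later gained its own override.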

@@ -238,7 +236,7 @@ private void writeText(final String text, final boolean isAttribute) {
case '\t':
case '\n':
if (!isAttribute) {
-writer.write(c);
+writer.write(Character.toChars(c));
Contributor

(Unnecessary in this case I think.)

Contributor Author

> Unnecessary in this case I think.

How would it compile without this hunk?

Contributor

Right, I just meant in this case we know the character will be a single char. Not important.

Contributor Author

Right, and I knew that when deciding to use Character.toChars(c) in this case and the case below rather than prematurely optimizing by casting the int to a char.

This review comment was unnecessary in this case I think.
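The trade-off discussed in this thread, `Character.toChars(c)` versus casting the `int` to a `char`, can be illustrated in isolation. This is a sketch with hypothetical names, not code from the PR. For a code point in the Basic Multilingual Plane, such as a newline, both forms give the same single `char`; for a supplementary code point, only `toChars` preserves the value.

```java
public class ToCharsVersusCast {
    // Converts a code point to its UTF-16 form: one char, or a surrogate pair.
    static char[] viaToChars(int codePoint) {
        return Character.toChars(codePoint);
    }

    // Narrowing cast: keeps only the low 16 bits of the code point.
    static char viaCast(int codePoint) {
        return (char) codePoint;
    }

    public static void main(String[] args) {
        int newline = '\n';
        int fox = 0x1F98A;

        // Same result for a BMP code point.
        System.out.println(viaToChars(newline).length == 1
                && viaToChars(newline)[0] == viaCast(newline));

        // A supplementary code point needs two chars; the cast drops the high bits.
        System.out.println(viaToChars(fox).length);
        System.out.printf("0x%X%n", (int) viaCast(fox));
    }
}
```

A cast is therefore only safe in branches where the code point is already known to fit in one `char`, which is the situation described for the '\t' and '\n' cases.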

@@ -251,7 +249,7 @@ private void writeText(final String text, final boolean isAttribute) {
+ " in XML stream");
}
}
-writer.write(c);
+writer.write(Character.toChars(c));
Contributor

Note that this could be slightly less efficient since it allocates a char[]. It does not seem that the method overall is optimized.

Contributor Author

Is there an action item here? If not, then what is the purpose of this comment?

Contributor

It is not an action item, solely to note for any other reviewers that this change could affect performance, if that is even a consideration.
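If the `char[]` allocation ever did matter, one possible alternative is to write the surrogates directly. The sketch below is an assumption for illustration only: it uses `java.io.Writer` as a stand-in for whatever writer PrettyPrintWriter actually holds, the class and method names are hypothetical, and it needs Java 7 or newer for `Character.isBmpCodePoint`, `highSurrogate`, and `lowSurrogate`. It is not part of this PR and has not been benchmarked.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class SurrogateWrite {
    // Writes one code point without allocating a char[].
    static void writeCodePoint(Writer writer, int codePoint) throws IOException {
        if (Character.isBmpCodePoint(codePoint)) {
            // Fits in a single UTF-16 code unit.
            writer.write(codePoint);
        } else {
            // Supplementary code point: emit the surrogate pair.
            writer.write(Character.highSurrogate(codePoint));
            writer.write(Character.lowSurrogate(codePoint));
        }
    }

    // Round-trips a string through writeCodePoint, for demonstration.
    static String render(String text) {
        StringWriter out = new StringWriter();
        try {
            for (int i = 0; i < text.length(); ) {
                int codePoint = text.codePointAt(i);
                writeCodePoint(out, codePoint);
                i += Character.charCount(codePoint);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "a" + new String(Character.toChars(0x1F98A)) + "b";
        System.out.println(render(sample).equals(sample));
    }
}
```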

@joehni joehni self-assigned this May 3, 2023
@joehni joehni added this to the 1.5.x milestone May 3, 2023
Contributor Author

basil commented May 3, 2023

Why is this not assigned to the 1.4 milestone? This is a critical bug fix that we want in 1.4.

Member

joehni commented May 13, 2023

Because 1.5.x is dropping compatibility with Java 1.4 through Java 10.

Contributor Author

basil commented May 15, 2023

I think it would make more sense either for the 1.4.x line to require Java 8 or newer, or to backport this fix to the 1.4.x line with a for-loop-based implementation that can run on Java 7 or earlier.
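A for-loop-based iteration of the kind suggested here could look roughly like the following. This is a hedged sketch, not the proposed backport: the class and method names are hypothetical, and it only demonstrates the loop shape using `String.codePointAt`, `String.codePointCount`, and `Character.charCount`, all of which have been available since Java 5, with no streams or lambdas.

```java
public class LegacyCodePointLoop {
    // Returns the code points of the text using only pre-Java-8 APIs.
    static int[] toCodePoints(String text) {
        int[] result = new int[text.codePointCount(0, text.length())];
        int next = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            result[next++] = codePoint;
            // Advance by one char for BMP code points, two for surrogate pairs.
            i += Character.charCount(codePoint);
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "a" + new String(Character.toChars(0x1F98A)) + "b";
        int[] points = toCodePoints(sample);
        for (int i = 0; i < points.length; i++) {
            System.out.println(Integer.toHexString(points[i]));
        }
    }
}
```

The loop index advances by `Character.charCount(codePoint)` rather than by one, which is the step the original `charAt` loop was missing.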

Contributor Author

basil commented Nov 8, 2024

Any plans to merge this PR and release version 1.5.x requiring Java 11 or newer?

Successfully merging this pull request may close these issues.

PrettyPrintWriter cannot write emoji in XML 1.1 mode
4 participants