PrettyPrintWriter fails to serialize characters in the Unicode Supplementary Multilingual Plane in XML 1.0 mode and XML 1.1 mode #337
- final int length = text.length();
- for (int i = 0; i < length; i++) {
- final char c = text.charAt(i);
+ text.codePoints().forEach(c -> {
There was a problem hiding this comment.
I have no idea what you are talking about in this review comment. The method is present in Java 8.
There was a problem hiding this comment.
Perhaps. Was just going by https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/String.html#codePoints() which says 9. At any rate I would hope the CI build would fail if this were not permitted.
There was a problem hiding this comment.
> Perhaps
Do you have any evidence for this claim which is casting doubt on the correctness of this change and potentially making it harder for subsequent reviewers to approve? If you do not, I would suggest that you refrain from making such review comments.
https://docs.oracle.com/javase/8/docs/api/java/lang/CharSequence.html#codePoints--
There was a problem hiding this comment.
The above link. It seems the @since tags are contradictory, unless the JDK team has a policy of noting when an override of a default method was added (which would seem strange to me since that should not change the API surface).
There was a problem hiding this comment.
https://docs.oracle.com/javase/8/docs/api/java/lang/CharSequence.html#codePoints-- is present in Java 8 and this code compiles successfully on Java 8. As far as I can tell there is no action item here, and this whole review comment was unnecessary and served only to chew up some of my time to refute an unverified claim as well as potentially confusing future reviewers.
There was a problem hiding this comment.
XStream 1.5.x will target Java 11. There is no longer any point in using Java 8 as the minimum.
There was a problem hiding this comment.
But XStream 1.4 still uses Java 8, and we want this critical bug fix in that line. Anyway, this change works in Java 8, so this whole thread is pointless. I have no idea why this review feedback was left in the first place.
There was a problem hiding this comment.
codePoints() was added to the CharSequence interface as a default method in Java 8.
In Java 9, an override of this method was added to String (which implements CharSequence).
So it should work on both Java 8 and Java 9, but it can be slightly faster for Strings on Java 9+ because of the optimised override added to String in Java 9.
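For anyone following this thread, a minimal sketch of the point above, using only the JDK: calling codePoints() through the CharSequence type relies on the Java 8 default method, so the same source compiles on Java 8 and later. The class name CodePointsOnCharSequence and the helper codePointsOf are hypothetical and exist only for illustration.

```java
final class CodePointsOnCharSequence {
    // Hypothetical helper: decodes any CharSequence into its code points.
    // The call goes through CharSequence.codePoints(), available since Java 8.
    static int[] codePointsOf(CharSequence text) {
        return text.codePoints().toArray();
    }

    public static void main(String[] args) {
        // U+1F98A is built from its code point to avoid source-encoding issues.
        String fox = new String(Character.toChars(0x1F98A));

        // Works for String and for other CharSequence implementations alike.
        int[] fromString = codePointsOf("a" + fox);
        int[] fromBuilder = codePointsOf(new StringBuilder("a").append(fox));

        System.out.println(fromString.length);                    // 2
        System.out.println(Integer.toHexString(fromBuilder[1]));  // 1f98a
    }
}
```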
  case '\n':
  if (!isAttribute) {
- writer.write(c);
+ writer.write(Character.toChars(c));
There was a problem hiding this comment.
(Unnecessary in this case I think.)
There was a problem hiding this comment.
> Unnecessary in this case I think.
How would it compile without this hunk?
There was a problem hiding this comment.
Right, I just meant in this case we know the character will be a single char. Not important.
There was a problem hiding this comment.
Right, and I knew that when deciding to use Character.toChars(c) in this case and the case below rather than prematurely optimizing by casting the int to a char.
This review comment was unnecessary in this case I think.
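To make the trade-off in this thread concrete, here is a small sketch comparing Character.toChars with a plain cast. For a code point in the Basic Multilingual Plane such as '\n' the two agree, while a cast silently truncates a supplementary code point. The class and method names are hypothetical.

```java
final class ToCharsVersusCast {
    // Always correct: returns one char for BMP code points, two for supplementary ones.
    static char[] viaToChars(int codePoint) {
        return Character.toChars(codePoint);
    }

    // Only correct for BMP code points: keeps just the low 16 bits.
    static char viaCast(int codePoint) {
        return (char) codePoint;
    }

    public static void main(String[] args) {
        // BMP case: both approaches yield the same single char.
        System.out.println(viaToChars('\n').length);                   // 1
        System.out.println(viaToChars('\n')[0] == viaCast('\n'));      // true

        // Supplementary case: toChars yields a surrogate pair, the cast truncates.
        System.out.println(viaToChars(0x1F98A).length);                // 2
        System.out.println(Integer.toHexString(viaCast(0x1F98A)));     // f98a
    }
}
```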
  }
  }
- writer.write(c);
+ writer.write(Character.toChars(c));
There was a problem hiding this comment.
Note that this could be slightly less efficient since it allocates a char[]. It does not seem that the method overall is optimized.
There was a problem hiding this comment.
Is there an action item here? If not, then what is the purpose of this comment?
There was a problem hiding this comment.
It is not an action item; it is solely a note for other reviewers that this change could affect performance, if that is even a consideration.
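If the per-call char[] allocation ever did matter, one possible alternative (not what this PR does) is the three-argument Character.toChars overload, which writes into a caller-supplied buffer. The sketch below uses java.io.Writer as a stand-in for the writer in PrettyPrintWriter, which is an assumption. The class BufferedCodePointWriter and its methods are hypothetical.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

final class BufferedCodePointWriter {
    // Hypothetical helper: encodes one code point into a reusable scratch buffer
    // (length >= 2) and writes only the chars that were filled in.
    static void writeCodePoint(Writer out, int codePoint, char[] scratch) throws IOException {
        int count = Character.toChars(codePoint, scratch, 0);
        out.write(scratch, 0, count);
    }

    // Copies text code point by code point, reusing a single two-char buffer.
    static String copy(String text) {
        StringWriter out = new StringWriter();
        char[] scratch = new char[2];
        try {
            for (int i = 0; i < text.length(); ) {
                int codePoint = text.codePointAt(i);
                writeCodePoint(out, codePoint, scratch);
                i += Character.charCount(codePoint);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "a" + new String(Character.toChars(0x1F98A)) + "b";
        System.out.println(copy(text).equals(text)); // true
    }
}
```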
Why is this not assigned to the 1.4 milestone? This is a critical bug fix that we want in 1.4.
Because 1.5.x is dropping compatibility with Java 1.4 through Java 10.
I think it would make more sense for the 1.4.x line to require Java 8 or newer or to backport this fix to the 1.4.x line with a for-loop based implementation that can run on Java 7 or earlier.
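A for-loop based variant of the kind suggested here could look roughly like the sketch below. It relies only on String.codePointAt, String.codePointCount and Character.charCount, all of which have been in the JDK since Java 5, so it avoids streams and lambdas. The class LegacyCodePointLoop is hypothetical and is not taken from the XStream code base.

```java
final class LegacyCodePointLoop {
    // Hypothetical helper: collects the code points of text without using streams.
    static int[] codePointsOf(String text) {
        int[] result = new int[text.codePointCount(0, text.length())];
        int next = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            result[next++] = codePoint;
            // Advance by one char for BMP code points, two for surrogate pairs.
            i += Character.charCount(codePoint);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "a" + new String(Character.toChars(0x1F98A)) + "b";
        int[] codePoints = codePointsOf(text);
        System.out.println(codePoints.length);                    // 3
        System.out.println(Integer.toHexString(codePoints[1]));   // 1f98a
    }
}
```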
Any plans to merge this PR and release version 1.5.x requiring Java 11 or newer?
Sorry for the long delay...
PrettyPrintWriter fails to properly serialize characters in the Unicode Supplementary Multilingual Plane (SMP) in XML 1.0 mode and XML 1.1 mode (quirks mode works); serialization fails with an exception.

The root cause of the problem is incorrect iteration over Unicode code points. The current implementation iterates over the UTF-16 code units of the text rather than over each code point. Characters in the Supplementary Multilingual Plane are encoded in UTF-16 as two code units (a surrogate pair). For example, U+1F98A is encoded in UTF-16 as 0xD83E 0xDD8A. Java provides a dedicated API to iterate over code points, but XStream makes the erroneous assumption that a code point and a character are equivalent, likely because it was never tested outside of quirks mode with characters in the Supplementary Multilingual Plane. This PR fixes the problem by using the Java API for iterating over code points, thus removing the faulty assumption that a code point and a character are equivalent.

The new quirks mode test passes before and after the changes to PrettyPrintWriter. The new XML 1.0 mode and XML 1.1 mode tests fail before the changes to PrettyPrintWriter with that exception. The new XML 1.0 mode and XML 1.1 mode tests pass after the changes to PrettyPrintWriter.

Fixes #336
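The mismatch described in this PR can be reproduced with the JDK alone. The sketch below is only an illustration and is not part of the PR's tests; the class SurrogatePairDemo is hypothetical. It shows that U+1F98A occupies two chars in a String but counts as one code point.

```java
final class SurrogatePairDemo {
    // Number of UTF-16 code units, which is what a charAt-based loop visits.
    static int countCodeUnits(String text) {
        return text.length();
    }

    // Number of Unicode code points, which is what codePoints() visits.
    static long countCodePoints(String text) {
        return text.codePoints().count();
    }

    public static void main(String[] args) {
        // Build U+1F98A from its code point to avoid source-encoding issues.
        String fox = new String(Character.toChars(0x1F98A));

        System.out.println(countCodeUnits(fox));                   // 2
        System.out.println(countCodePoints(fox));                  // 1
        System.out.println(Integer.toHexString(fox.charAt(0)));    // d83e
        System.out.println(Integer.toHexString(fox.charAt(1)));    // dd8a
    }
}
```

A loop that handles each char separately therefore sees two lone surrogates, neither of which is a valid XML character on its own.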