Ambiguity on what a character is (regardless of encoding, in two usages) #31

@cigitia

The specification refers to “characters” in two separate places:

  • Usage 1: The c scalar type, which is an extension type of the s type.
  • Usage 2: “Scalar values have single-character tags and composite values have multi-character tags.”

But there are many definitions of “character”, so both of these usages are ambiguous.

Usage 1

Usage 1 has the following ambiguities/issues:

  • Many programming languages (Java, C#, Dart, etc.) are strongly coupled to 16-bit code units, a legacy of the time before the Unicode code space was extended beyond 16 bits. Since those languages’ native char types can hold only Basic Multilingual Plane (BMP) code points (those being what a single 16-bit code unit can represent), their Transit libraries’ current implementations interpret Transit c scalars as those char types when reading Transit data.
  • Allowing only single BMP code points to be “characters” unfortunately excludes many now-common characters in the Supplementary Planes (abbreviated “SMP” here) from being interchanged as c values. SMP characters include many symbols and emoji, such as 𝄫, 😀, 🐴, and 📞. They are supported by languages such as Go (with its native rune type) and quasi-supported by any language that uses strings for characters (JavaScript, Python, Ruby, etc.). Support for SMP characters may be especially important in internationalization projects (including many of my own).
  • Also of note is that some languages do not support even BMP code points, let alone the Unicode code space in general. OCaml’s Character and String types, for instance, use eight-bit bytes, which essentially covers only ASCII. Ruby’s String class also splits into eight-bit bytes, but since it has no concept of a “Character” class, this is mostly moot anyway.
  • Lastly, the EDN specification also leaves this ambiguous, but Clojure itself, being hosted on the JVM (whose char type is BMP-only / strongly coupled to 16 bits), does not allow Supplementary Plane characters.
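
To make the distinction above concrete, here is a minimal Python sketch (not taken from any Transit library) that counts the same character two ways. The helper name `utf16_code_units` is hypothetical. Python's `len` on a `str` counts code points, so the UTF-16 count is obtained by encoding first.

```python
def utf16_code_units(text: str) -> int:
    """Return how many 16-bit code units UTF-16 needs to encode `text`."""
    # "utf-16-le" emits exactly two bytes per code unit and no byte-order mark.
    return len(text.encode("utf-16-le")) // 2


# One BMP letter, then two supplementary characters (𝄫 and 😀), as escapes.
for ch in ["A", "\U0001D12B", "\U0001F600"]:
    print(f"U+{ord(ch):04X}: code points={len(ch)}, "
          f"UTF-16 code units={utf16_code_units(ch)}")
```

A supplementary character is a single code point but a surrogate pair in UTF-16, which is why it cannot fit into a single 16-bit char.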

So, question 1 is: What values is the c scalar Transit type allowed to contain (by its read and write handlers)? I see at least three options:

  • Option A: A Transit c value is any single Unicode code point (from U+0000 to U+10FFFF). This allows characters in Supplementary Planes, such as emoji, to be easily interchanged between programming languages whose “character” types support them (Python, sort of Ruby and JavaScript). However, it would also require Transit readers in languages with 16-bit, BMP-only char types to throw errors or use other data types if they encounter any SMP characters. (Note that this is already a problem in general: for instance, if a Transit UUID is invalid, a runtime error may still occur in some readers.)
  • Option B1: A Transit c value is any single 16-bit, BMP-only code point (from U+0000 to U+FFFF). No Transit program can interchange SMP characters such as emoji using the core c type; however, 16-bit-char programming languages such as Java and C# are guaranteed to accept any c value. (Of note is that, in this case, people who also need SMP characters can define an extension type, but unfortunately such a type would not be universal.)
  • Option B2: The same as Option B1, except that another Transit core scalar type is added, “C” or “y” or something, which extends s and which represents a single, potentially Supplementary Unicode code point between U+0000 and U+10FFFF. Languages that map the already-existing c type to their 16-bit-char types would map this new type to their string types or something.
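
As an illustration only, the difference between Option A and Option B1 can be sketched as a validity check. The function name `is_valid_c` and the `bmp_only` flag are hypothetical and do not come from the Transit specification or any Transit library.

```python
def is_valid_c(value: str, bmp_only: bool) -> bool:
    """Check a candidate c value under Option A (bmp_only=False) or B1 (True)."""
    # Python strings are sequences of code points, so len() == 1 means
    # exactly one code point, whether BMP or supplementary.
    if len(value) != 1:
        return False
    # Option B1 additionally restricts the value to U+0000..U+FFFF.
    return (not bmp_only) or ord(value) <= 0xFFFF
```

For example, `is_valid_c("\U0001F600", bmp_only=False)` accepts the emoji under Option A, while `is_valid_c("\U0001F600", bmp_only=True)` rejects it under Option B1.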

I personally anticipate option B1 to be chosen, since it's what Clojure itself does and takes the least work, but I'm still throwing out options A and B2 in the hope that they too will be considered. I now prefer that chars be clearly equated to UTF-16 code units, after reading the Unicode FAQ's discussion of preferring UTF-16 code units for low-level indexing and strings for everything else. Any choice would create more work for someone, but the question is which one is most worth it, and the specification should probably clarify this matter in any case.

Question 2: If the answer for question 1 is “16-bit, BMP code points only / no SMP characters allowed in Transit c values”, then should Transit writers (in those languages that support SMP characters) ensure that no Supplementary code points are ever written into Transit data as Transit c values?
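
If the answer to question 2 were “yes”, a writer-side guard might look roughly like the sketch below. Everything here is hypothetical: `write_c_strict` is not a real Transit API, and the `"~c"` prefix is my assumption about how a c value appears as a tagged string in Transit's JSON encoding.

```python
def write_c_strict(value: str) -> str:
    """Hypothetical write handler: refuse to emit a non-BMP c value."""
    if len(value) != 1:
        raise ValueError("a c value must be exactly one code point")
    if ord(value) > 0xFFFF:
        # Under Option B1, supplementary code points are never written as c.
        raise ValueError(f"U+{ord(value):04X} is outside the BMP")
    # Assumed string form of a c scalar: "~c" followed by the character.
    return "~c" + value
```

The alternative would be for writers to stay permissive and leave rejection to readers, which is the situation question 2 is asking the specification to settle.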

Usage 2

For usage 2, there are multiple questions to be clarified:

Question 3: Are only 16-bit/BMP code points, or any Unicode code points, allowed to be used as scalar-type tags?

Question 4: If a single SMP character is used as a type tag, is it a scalar tag (because it is a single Unicode code point) or is it a composite tag (because it is two 16-bit surrogate code units)? (This is essentially equivalent to question 2.)
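
The ambiguity in question 4 can be shown with a small sketch in which the same tag is classified differently depending on the counting unit. The function `classify_tag` and its `unit` parameter are hypothetical names used only for illustration.

```python
def classify_tag(tag: str, unit: str) -> str:
    """Classify a tag as 'scalar' (length 1) or 'composite' (length > 1)."""
    if unit == "code_point":
        length = len(tag)  # Python counts code points
    elif unit == "utf16":
        length = len(tag.encode("utf-16-le")) // 2  # 16-bit code units
    else:
        raise ValueError(f"unknown unit: {unit!r}")
    if length == 0:
        raise ValueError("empty tag")
    return "scalar" if length == 1 else "composite"
```

With the tag 🐴 (U+1F434), counting by code point gives "scalar", while counting by UTF-16 code unit gives "composite", so two implementations could disagree about the same data.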

Question 5: Are whitespace characters allowed in type tags?

Question 6: Are control characters allowed in type tags?

Question 7: Are noncharacter code points (such as U+FDD0 and U+FFFE) allowed in type tags?
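
For reference, the three categories in questions 5 to 7 can be detected mechanically. The sketch below is one possible check, with the hypothetical name `tag_issues`. It treats “whitespace” as whatever Python's `str.isspace` accepts and “control” as Unicode general category Cc, both of which are my assumptions about what the specification might mean.

```python
import unicodedata


def tag_issues(tag: str) -> set:
    """Return which questionable categories appear in a tag."""
    issues = set()
    for ch in tag:
        cp = ord(ch)
        if ch.isspace():
            issues.add("whitespace")
        if unicodedata.category(ch) == "Cc":
            issues.add("control")
        # Noncharacters: U+FDD0..U+FDEF plus the last two code points
        # of every plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ...).
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            issues.add("noncharacter")
    return issues
```

Note that the categories overlap: a line feed, for instance, is reported as both whitespace and control.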

Question 8: If the answer to question 3 is “only 16-bit/BMP code points for scalar tags”, or the answer to any of questions 5–7 is “no”, then should writers ensure that the prohibited code points are never used?

These are fastidious, technical questions, but I think they're important for disambiguating Transit's behavior. Question 1 especially affects how people like me use Transit. Thanks!