This repository was archived by the owner on Aug 31, 2025. It is now read-only.

Please include the actual ranges mapped to, not just the bit-operator-code to get there #12

@ghost

Description

Currently I find the spec easy to read for implementing it, but hard to read for understanding what the actual effect is in code point space. In particular, I have trouble figuring out which ranges in the code point space the invalid surrogates are mapped to, and I find this suboptimal given that Unicode already has a lot of confusing ranges.

The problem IMHO is in particular this section, or the lack of concrete ranges given afterwards:

  4. Potentially ill-formed UTF-16

A sequence of 16-bit code units is potentially ill-formed UTF-16 if it is intended to be interpreted as UTF-16, but is not necessarily well-formed in UTF-16. It effectively encodes a sequence of code points that do not contain any surrogate code point pair.

Note: Like UTF-16, potentially ill-formed UTF-16 can not represent a surrogate code point pair since the corresponding surrogate 16-bit code unit pair would instead represent a supplementary code point. Unlike well-formed UTF-16, it might contain isolated surrogate code points.

Any sequence of 16-bit code units has an interpretation as potentially ill-formed UTF-16.

WTF-16 is sometimes used as a shorter name for potentially ill-formed UTF-16, especially in the context of systems that were originally designed for UCS-2 and later upgraded to UTF-16 but never enforced well-formedness, either by neglect or because of backward-compatibility constraints.
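To make the quoted definition concrete, here is a minimal Python sketch of how I read it: paired surrogate code units combine into a supplementary code point, and any unpaired surrogate code unit is kept as a surrogate code point with the same value. The function name `decode_potentially_ill_formed_utf16` is my own and does not come from the spec.

```python
def decode_potentially_ill_formed_utf16(units):
    """Interpret 16-bit code units as code points, keeping lone surrogates.

    Hypothetical helper for illustration only; not taken from the spec.
    """
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_lead = 0xD800 <= unit <= 0xDBFF
        has_trail = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_lead and has_trail:
            # A lead surrogate followed by a trail surrogate always forms
            # one supplementary code point (U+10000..U+10FFFF).
            lead_bits = (unit - 0xD800) << 10
            trail_bits = units[i + 1] - 0xDC00
            code_points.append(0x10000 + lead_bits + trail_bits)
            i += 2
        else:
            # BMP code unit or isolated surrogate: the code point value
            # equals the code unit value.
            code_points.append(unit)
            i += 1
    return code_points
```

Under this reading, the output can contain isolated surrogate code points but never a lead surrogate directly followed by a trail surrogate, which matches the note about surrogate code point pairs.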

Don't get me wrong, this formal definition is nice, but it is only followed up by the actual encoding steps, which sit at the other extreme: they are very practical and full of bit-transform operations that don't make it obvious which ranges are actually used.
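As an illustration of the kind of concrete ranges I mean, here is a small Python sketch. It assumes (my reading, not confirmed by the spec text quoted here) that an isolated surrogate keeps its own code point value in U+D800..U+DFFF and is written with the ordinary three-byte UTF-8 bit layout. The helper name `encode_three_byte` is hypothetical.

```python
def encode_three_byte(code_point):
    """Apply the generic three-byte UTF-8 bit layout to a BMP code point.

    Hypothetical helper for illustration only; it does not reject
    surrogates, which is exactly the point of the sketch.
    """
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("three-byte form covers U+0800..U+FFFF only")
    return bytes([
        0xE0 | (code_point >> 12),            # 1110xxxx
        0x80 | ((code_point >> 6) & 0x3F),    # 10xxxxxx
        0x80 | (code_point & 0x3F),           # 10xxxxxx
    ])


# Boundaries of the lead and trail surrogate ranges.
for cp in (0xD800, 0xDBFF, 0xDC00, 0xDFFF):
    print(f"U+{cp:04X} -> {encode_three_byte(cp).hex(' ').upper()}")
```

If that assumption holds, lead surrogates U+D800..U+DBFF land in the byte range ED A0 80 to ED AF BF, and trail surrogates U+DC00..U+DFFF land in ED B0 80 to ED BF BF. Having ranges like these stated directly in the spec is what I am asking for.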

What I would have expected in the "4. Potentially ill-formed UTF-16" section is an addition along these lines (possibly incorrect; this is my best guess at what I would have liked to see properly spelled out):

previous stuff removed -> see next comment for revised, better suggestion
