Handle U+2028 LINE SEPARATOR. #384

Open
@xorgy

Unicode has a specific codepoint intended to signal an unambiguous line separator: U+2028 LINE SEPARATOR, often abbreviated LS (alongside U+2029 PARAGRAPH SEPARATOR, often abbreviated PS). Parley currently does not render LS as a line separator.

Unicode's TR13 (Unicode Newline Guidelines) recommends this procedure for handling line and paragraph separators:

Converting from Other Character Code Sets

  R1. If the exact usage of any NLF is known, convert it to LS or PS.
    R1a. If the exact usage of any NLF is unknown, remap it to the platform NLF.

Rule R1a does not really help in interpreting Unicode text unless the implementer is the only source of that text, because another implementer may have left in LF, CR, CRLF, or NEL.
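As an illustration of rule R1 above, here is a minimal sketch of converting NLFs to LS or PS when the caller knows what they mean. The names `NlfMeaning` and `convert_nlf` are hypothetical and are not part of Parley or of TR13; the NLF set is assumed to be CRLF, CR, LF, and NEL, as listed above.

```rust
/// What an NLF in the source text is known to mean (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum NlfMeaning {
    Line,
    Paragraph,
}

/// Replace every NLF (CRLF, CR, LF, NEL) with LS or PS, per rule R1.
/// Existing LS and PS characters are passed through unchanged.
pub fn convert_nlf(input: &str, meaning: NlfMeaning) -> String {
    let sep = match meaning {
        NlfMeaning::Line => '\u{2028}',      // LS
        NlfMeaning::Paragraph => '\u{2029}', // PS
    };
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '\r' => {
                // Treat CRLF as a single NLF, not two.
                if chars.peek() == Some(&'\n') {
                    chars.next();
                }
                out.push(sep);
            }
            '\n' | '\u{0085}' => out.push(sep), // LF or NEL
            _ => out.push(c),
        }
    }
    out
}
```

This only covers the case where the usage is known; it does not attempt R1a's remapping to a platform NLF.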

Interpreting Characters in Text

  R2. Always interpret PS as paragraph separator and LS as line separator.
    R2a. In word processing, interpret any NLF the same as PS.
    R2b. In simple text editors, interpret any NLF the same as LS.
    R2c. In parsing, choose the safest interpretation.

For example, in rule R2c an implementer dealing with sentence break heuristics would reason in the following way that it is safer to interpret any NLF as a LS:

  • Suppose an NLF were interpreted as LS, when it was meant to be PS. Because
    most paragraphs are terminated with punctuation anyway, this would cause
    misidentification of sentence boundaries in only a few cases.
  • Suppose an NLF were interpreted as PS, when it was meant to be LS. In this
    case, line breaks would cause sentence breaks, which would result in significant
    problems with the sentence break heuristics.
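To make the interpretation rules above concrete, here is a rough sketch of how a layout engine might classify break characters, with the NLF interpretation left as a caller-chosen policy. Everything here (`BreakKind`, `NlfPolicy`, `find_breaks`) is a hypothetical name for illustration, not Parley's API, and the sketch assumes the NLF set is CRLF, CR, LF, and NEL.

```rust
use std::ops::Range;

/// The kind of break a character sequence stands for (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum BreakKind {
    Line,
    Paragraph,
}

/// How to interpret an NLF (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum NlfPolicy {
    /// Word-processor convention: NLF means PS.
    AsParagraph,
    /// Simple-text-editor convention: NLF means LS.
    AsLine,
}

/// Return the byte range and kind of every break in `text`.
/// LS and PS always keep their own meaning; only NLFs follow `policy`.
pub fn find_breaks(text: &str, policy: NlfPolicy) -> Vec<(Range<usize>, BreakKind)> {
    let nlf_kind = match policy {
        NlfPolicy::AsParagraph => BreakKind::Paragraph,
        NlfPolicy::AsLine => BreakKind::Line,
    };
    let mut breaks = Vec::new();
    let mut iter = text.char_indices().peekable();
    while let Some((start, c)) = iter.next() {
        let kind = match c {
            '\u{2028}' => BreakKind::Line,
            '\u{2029}' => BreakKind::Paragraph,
            '\r' => {
                // Consume the LF of a CRLF pair so it counts as one break.
                if matches!(iter.peek(), Some(&(_, '\n'))) {
                    iter.next();
                }
                nlf_kind
            }
            '\n' | '\u{0085}' => nlf_kind,
            _ => continue,
        };
        // The break ends where the next character starts, or at end of text.
        let end = iter.peek().map_or(text.len(), |&(i, _)| i);
        breaks.push((start..end, kind));
    }
    breaks
}
```

With `NlfPolicy::AsLine`, the input `"a\r\nb\u{2028}c\u{2029}d\n"` yields line breaks for the CRLF, the LS, and the trailing LF, and a paragraph break only for the PS. Switching to `NlfPolicy::AsParagraph` changes the two NLFs to paragraph breaks and leaves the LS alone.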

Currently, Parley does not handle LS or PS in any principled way. It does not ignore them entirely (when I use one, it changes the line metrics), but it does neither what TR13 recommends nor what web browsers do†.

I think there should be some reasonable default for these, and maybe some configuration.

We ran into a conceptually related problem in #381. @DJMcNab was treating NLFs as ‘paragraphs’, in the way that TR13 describes as being common in ‘word processors’ (though in reality, most modern word processors use explicit paragraph separators and support line breaking within paragraphs), whereas I was expecting the common definition of a paragraph in the ‘simple text editor’ case described by TR13. We were both right... about different conventions.

† Because whitespace collapsing was specified without consideration for LS, web browsers collapse it as whitespace and do not render it.
