Description
Unicode has a specific codepoint intended to signal an unambiguous line separator, U+2028 LINE SEPARATOR, often abbreviated LS (along with U+2029 PARAGRAPH SEPARATOR, often abbreviated PS), which is currently not rendered as such by Parley.
Unicode's TR13 recommends this procedure for handling line and paragraph separators:
Converting from Other Character Code Sets
- If the exact usage of any NLF is known, convert it to LS or PS.
a. If the exact usage of any NLF is unknown, remap it to the platform NLF.
Rule R1a does not really help in interpreting Unicode text unless the implementer is the only source of that text, because another implementer may have left in LF, CR, CRLF, or NEL.
Interpreting Characters in Text
- Always interpret PS as paragraph separator and LS as line separator.
a. In word processing, interpret any NLF the same as PS.
b. In simple text editors, interpret any NLF the same as LS.
c. In parsing, choose the safest interpretation.
For example, in rule R2c an implementer dealing with sentence break heuristics would reason in the following way that it is safer to interpret any NLF as a LS:
- Suppose an NLF were interpreted as LS, when it was meant to be PS. Because most paragraphs are terminated with punctuation anyway, this would cause misidentification of sentence boundaries in only a few cases.
- Suppose an NLF were interpreted as PS, when it was meant to be LS. In this case, line breaks would cause sentence breaks, which would result in significant problems with the sentence break heuristics.
Currently, Parley does not handle LS or PS in any principled way. It doesn't ignore them entirely (in my testing, their presence changes the line metrics), but it also doesn't do what TR13 recommends, nor what web browsers do†.
I think there should be some reasonable default for these, and maybe some configuration.
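To make the "default plus configuration" idea concrete, here is a minimal sketch of TR13's R2 rules in Rust. None of these names (`BreakKind`, `NlfPolicy`, `split_breaks`) exist in Parley; they are hypothetical and only illustrate the shape such an option could take. The choice of the ‘simple text editor’ convention as the default is likewise an assumption, not a proposal that has been agreed on.

```rust
/// How a break should be treated by layout.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BreakKind {
    Line,
    Paragraph,
}

/// Hypothetical option: how to interpret a generic newline function
/// (NLF: CR, LF, CRLF, or NEL). LS and PS are never affected by it.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum NlfPolicy {
    /// TR13 R2b, the "simple text editor" convention (assumed default).
    #[default]
    AsLineSeparator,
    /// TR13 R2a, the "word processing" convention.
    AsParagraphSeparator,
}

/// Splits `text` at LS, PS, and NLFs. Each segment is returned with the kind
/// of break that ended it; the final segment has `None`.
fn split_breaks(text: &str, policy: NlfPolicy) -> Vec<(&str, Option<BreakKind>)> {
    let nlf = match policy {
        NlfPolicy::AsLineSeparator => BreakKind::Line,
        NlfPolicy::AsParagraphSeparator => BreakKind::Paragraph,
    };
    let mut out = Vec::new();
    let mut start = 0;
    let mut chars = text.char_indices().peekable();
    while let Some((i, c)) = chars.next() {
        let kind = match c {
            // R2: LS and PS always keep their explicit meaning.
            '\u{2028}' => BreakKind::Line,
            '\u{2029}' => BreakKind::Paragraph,
            '\r' => {
                // Treat CRLF as a single NLF by consuming the LF, if present.
                chars.next_if(|&(_, next)| next == '\n');
                nlf
            }
            // LF and NEL.
            '\n' | '\u{0085}' => nlf,
            _ => continue,
        };
        out.push((&text[start..i], Some(kind)));
        // The next segment starts right after the break sequence.
        start = chars.peek().map_or(text.len(), |&(j, _)| j);
    }
    out.push((&text[start..], None));
    out
}
```

With this sketch, `split_breaks("a\u{2028}b\r\nc\u{2029}d", NlfPolicy::default())` ends `a` and `b` with a line break and `c` with a paragraph break. Switching to `NlfPolicy::AsParagraphSeparator` changes only the break after `b`, which is the disagreement described below.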
We ran into a conceptually related problem in #381: @DJMcNab was treating NLFs as ‘paragraphs’, the convention TR13 describes as common in ‘word processors’ (though in reality, most modern word processors use explicit paragraph separators and support line breaking within paragraphs), whereas I was expecting the definition of a paragraph that is common in the ‘simple text editor’ case described by TR13. We were both right... about different conventions.
† Web browsers, because whitespace collapsing was specified without consideration for LS, collapse it as whitespace and do not render it.