CLDR-15948 Clean up well-formedness and/or validity constraints #4179

Merged 1 commit on Nov 7, 2024
309 changes: 156 additions & 153 deletions docs/ldml/tr35-general.md
@@ -68,7 +68,7 @@ The LDML specification is divided into the following parts:
* [Unit Preference and Conversion Data](#Unit_Preference_and_Conversion)
* [Unit Identifiers](#Unit_Identifiers)
* [Nomenclature](#nomenclature)
* [Syntax](#syntax)
* [Unit Syntax](#unit-syntax)
* [Unit Identifier Uniqueness](#Unit_Identifier_Uniqueness)
* [Example Units](#Example_Units)
* [Compound Units](#compound-units)
@@ -902,157 +902,160 @@ As with other identifiers in CLDR, the American English spelling is used for uni

> In keeping with U.S. and International practice (see Sec. C.2), this Guide uses the dot on the line as the decimal marker. In addition this Guide utilizes the American spellings “meter,” “liter,” and “deka” rather than “metre,” “litre,” and “deca,” and the name “metric ton” rather than “tonne.”
#### Syntax

The formal syntax for identifiers is provided below.
Some of the constraints reference data from the unitIdComponents in [Unit_Conversion](tr35-info.md#Unit_Conversion).

<!-- HTML: no header -->

<table><tbody>
<tr><td><a name='unit_identifier' href='#unit_identifier'>unit_identifier</a></td><td>:=</td>
<td>core_unit_identifier<br/>
| mixed_unit_identifier<br/>
| long_unit_identifier</td></tr>

<tr><td><a name='core_unit_identifier' href='#core_unit_identifier'>core_unit_identifier</a></td><td>:=</td>
<td>product_unit ("-" per "-" product_unit)*<br/>
| per "-" product_unit ("-" per "-" product_unit)*
<ul><li><em>Examples:</em>
<ul><li>foot-per-second-per-second</li>
<li>per-second</li>
</ul></li>
<li><em>Note:</em> The normalized form will have only one "per"</li>
</ul></td></tr>

<tr><td>per</td><td>:=</td>
<td>"per"
<ul>
<li><em>Constraint:</em> The token 'per' is the single value in &lt;unitIdComponent type="per"&gt;</li>
</ul></td></tr>

<tr><td><a name='product_unit' href='#product_unit'>product_unit</a></td><td>:=</td>
<td>single_unit ("-" single_unit)* ("-" pu_single_unit)*<br/>
| pu_single_unit ("-" pu_single_unit)*
<ul><li><em>Example:</em> foot-pound-force</li>
<li><em>Constraint:</em> No pu_single_unit may precede a single unit</li>
</ul></td></tr>

<tr><td><a name='single_unit' href='#single_unit'>single_unit</a></td><td>:=</td>
<td>dimensionality_prefix? simple_unit | unit_constant
<ul><li><em>Examples: </em>square-kilometer, or 100</li></ul></td></tr>

<tr><td><a name='pu_single_unit' href='#pu_single_unit'>pu_single_unit</a></td><td>:=</td>
<td>"xxx-" single_unit | "x-" single_unit
<ul><li><em>Example:</em> xxx-square-knuts (a Harry Potter unit)</li>
<li><em>Note:</em> "x-" is only for backwards compatibility</li>
<li>See <a href="#Private_Use_Units">Private-Use Units</a></li>
</ul></td></tr>

<tr><td><a name='unit_constant' href='#unit_constant'>unit_constant</a></td><td>:=</td>
<td>[1-9][0-9]* ("e" [1-9][0-9]*)?
<ul><li><em>Examples:</em>
<ul><li>kilowatt-hour-per-100-kilometer</li>
<li>gallon-per-100-mile</li>
<li>per-200-pound</li>
<li>per-12</li>
</ul></li>
<li><em>Constraint:</em> The numeric value of the unit constant must be an integer greater than one.</li>
<li><em>Note:</em> The normal interpretation of <code>e</code> is used, where 2e6 = 2×10⁶.</li>
<li><em>Note:</em> The <code>e</code> notation is optional: per-100-kilometer and per-1e2-kilometer are equivalent unit_identifiers.</li>
<li><em>Note:</em> When constructing identifiers, exponents should be greater than 3 and multiples of 3, even though parsers must accept the wider range.</li>
</ul></td></tr>

<tr><td><a name='dimensionality_prefix' href='#dimensionality_prefix'>dimensionality_prefix</a></td><td>:=</td>
<td>"square-"<p>| "cubic-"<p>| "pow" ([2-9]|1[0-5]) "-"
<ul>
<li><em>Constraint:</em> must be value in: &lt;unitIdComponent type="power"&gt;.</li>
<li><em>Note:</em> "pow2-" and "pow3-" canonicalize to "square-" and "cubic-"</li>
<li><em>Note:</em> These are values in &lt;unitIdComponent type="power"&gt;</li>
</ul></td></tr>

<tr><td><a name='simple_unit' href='#simple_unit'>simple_unit</a></td><td>:=</td>
<td>(prefix_component "-")* (prefixed_unit | base_component) ("-" suffix_component)*<br/>
| currency_unit<br/>
| "em" | "g" | "us" | "hg" | "of"
<ul>
<li><em>Examples:</em> kilometer, meter, cup-metric, fluid-ounce, curr-chf, em</li>
<li><em>Note:</em> Three simple units are currently allowed as legacy usage, for tokens that wouldn’t otherwise be a base_component due to length (eg, "<strong>g</strong>-force").
We will likely deprecate those and add conformant aliases in the future: the "hg" and "of" are already only in deprecated simple_units.</li>
</ul></td></tr>

<tr><td><a name='prefixed_unit' href='#prefixed_unit'>prefixed_unit</a></td><td></td>
<td>prefix base_component<ul><li><em>Example: </em>kilometer</li></ul></td></tr>

<tr><td><a name='prefix' href='#prefix'>prefix</a></td><td></td>
<td>si_prefix | binary_prefix</td></tr>

<tr><td><a name='si_prefix' href='#si_prefix'>si_prefix</a></td><td>:=</td>
<td>"deka" | "hecto" | "kilo", …
<ul><li><em>Constraint:</em> Must be an attribute value of the <code>type</code> in: &lt;unitPrefix type='…' … power10='…'&gt;.
See also <a href="https://www.nist.gov/pml/special-publication-811">NIST special publication 811</a></li></ul></td></tr>

<tr><td><a name='binary_prefix' href='#binary_prefix'>binary_prefix</a></td><td>:=</td>
<td>"kibi", "mebi", …
<ul><li><em>Constraint:</em> Must be an attribute value of the <code>type</code> in: &lt;unitPrefix type='…' … power2='…'&gt;.
See also <a href="https://physics.nist.gov/cuu/Units/binary.html">Prefixes for binary multiples</a></li></ul></td></tr>

<tr><td><a name='prefix_component' href='#prefix_component'>prefix_component</a></td><td>:=</td>
<td>[a-z]{3,∞}
<ul><li><em>Constraint:</em> must be value in: &lt;unitIdComponent type="prefix"&gt;.</li></ul></td></tr>

<tr><td><a name='base_component' href='#base_component'>base_component</a></td><td>:=</td>
<td>[a-z]{3,∞}
<ul><li><em>Constraint:</em> must not be a value in any of the following:<br>
&lt;unitIdComponent type="prefix"&gt;<br>
or &lt;unitIdComponent type="suffix"&gt; <br>
or &lt;unitIdComponent type="power"&gt;<br>
or &lt;unitIdComponent type="and"&gt;<br>
or &lt;unitIdComponent type="per"&gt;.
</li>
<li><em>Constraint:</em> must not have a prefix as an initial segment.</li>
<li><em>Constraint:</em> no two different base_components will share the first 8 letters.
(<b>For more information, see <a href="#Unit_Identifier_Uniqueness">Unit Identifier Uniqueness</a>.)</b>
</li>
</ul>
</td></tr>

<tr><td><a name='suffix_component' href='#suffix_component'>suffix_component</a></td><td>:=</td>
<td>[a-z]{3,∞}
<ul>
<li><em>Constraint:</em> must be value in: &lt;unitIdComponent type="suffix"&gt;</li>
</ul></td></tr>

<tr><td><a name='mixed_unit_identifier' href='#mixed_unit_identifier'></a></td><td>:=</td>
<td>(single_unit | pu_single_unit) ("-" and "-" (single_unit | pu_single_unit ))*
<ul><li><em>Example: foot-and-inch</em></li>
</ul></td></tr>

<tr><td>and</td><td>:=</td>
<td>"and"
<ul>
<li><em>Constraint:</em> The token 'and' is the single value in &lt;unitIdComponent type="and"&gt;</li>
</ul></td></tr>

<tr><td><a name='long_unit_identifier' href='#long_unit_identifier'>long_unit_identifier</a></td><td>:=</td>
<td>grouping "-" core_unit_identifier</td></tr>

<tr><td>grouping</td><td>:=</td>
<td>[a-z]{3,∞}</td></tr>

<tr><td><a name='currency_unit' href='#currency_unit'>currency_unit</a></td><td>:=</td>
<td>"curr-" [a-z]{3}
<ul>
<li><em>Constraint:</em> The first part of the currency_unit is a standard prefix; the second part of the currency unit must be a valid <a href="tr35.md#UnicodeCurrencyIdentifier">Unicode currency identifier</a>.</li>
</ul>
<ul>
<li><em>Examples:</em> <b>curr-eur</b>-per-square-meter, or pound-per-<b>curr-usd</b></li>
<li><em>Note:</em> CLDR does not provide conversions for currencies; this is only intended for formatting.
The locale data for currencies is supplied in the <code>currencies</code> element, not in the <code>units</code> element.</li>
</ul>
</td></tr>

</tbody></table>
<a name="syntax"></a>
#### Unit Syntax

The formal syntax for identifiers is provided below, in [EBNF](tr35.md#ebnf).
Some of the constraints reference data from various elements in the unit conversion data [units.xml](https://github.com/unicode-org/cldr/blob/main/common/supplemental/units.xml).
These may be either element values or element attribute values.
See [Unit_Conversion](tr35-info.md#Unit_Conversion).

<a name='unit_identifier' href='#unit_identifier'>unit_identifier</a>
<br/>:= core_unit_identifier
<br/>   | mixed_unit_identifier
<br/>   | long_unit_identifier

<a name='core_unit_identifier' href='#core_unit_identifier'>core_unit_identifier</a>
<br/>:= product_unit ("-" per "-" product_unit)\*
<br/>   | per "-" product_unit ("-" per "-" product_unit)\*
* *Examples:*
* foot-per-second-per-second
* per-second
* *Notes:*
* The normalized form will have only one "per"
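
The grouping implied by the "per" token can be sketched in Python as follows. This is only an illustration: the function name `split_on_per` is hypothetical, and the sketch assumes that every product_unit after the first "per" belongs to the denominator, as the example foot-per-second-per-second suggests. It does not perform full normalization.

```python
def split_on_per(identifier: str) -> tuple[list[str], list[str]]:
    """Split a core_unit_identifier into numerator and denominator tokens.

    Assumption: everything after the first "per" token is part of the
    denominator, so repeated "per" tokens collapse into one division.
    """
    numerator: list[str] = []
    denominator: list[str] = []
    seen_per = False
    for token in identifier.split("-"):
        if token == "per":
            # "per" is reserved, so it can never be a base_component.
            seen_per = True
            continue
        (denominator if seen_per else numerator).append(token)
    return numerator, denominator


print(split_on_per("foot-per-second-per-second"))
print(split_on_per("per-second"))
```

A real implementation would go on to merge repeated denominator units (for example into a power prefix) before emitting the single-"per" form.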

per
<br/>:= "per"
* [ wfc: The token 'per' is the single value in \<unitIdComponent type="per"\> ]

<a name='product_unit' href='#product_unit'>product_unit</a>
<br/>:= single_unit ("-" single_unit)* ("-" pu_single_unit)*
<br/>   | pu_single_unit ("-" pu_single_unit)*
* [ wfc: No pu\_single\_unit may precede a single unit ]
* *Examples:*
* foot-pound-force

<a name='single_unit' href='#single_unit'>single_unit</a>
<br/>:= dimensionality_prefix? simple_unit
<br/>   | unit_constant
* *Examples:*
* square-kilometer
* 100

<a name='pu_single_unit' href='#pu_single_unit'>pu_single_unit</a>
<br/>:= "xxx-" single_unit
<br/>   | "x-" single_unit
* *Examples:*
* xxx-square-knuts (a Harry Potter unit)
* *Notes:*
* "x-" is only for backwards compatibility; it is deprecated and should not be generated
* See [Private-Use Units](#Private_Use_Units)

<a name='unit_constant' href='#unit_constant'>unit_constant</a>
<br/>:= [1-9][0-9]* ("e" [1-9][0-9]*)?
* *Examples:*
* kilowatt-hour-per-100-kilometer
* gallon-per-100-mile
* per-200-pound
* per-12
* [ wfc: The numeric value of the unit constant must be an integer greater than one. ]
* *Notes:*
* The normal interpretation of `e` is used, where 2e6 \= 2×10⁶.
* The `e` notation is optional: per-100-kilometer and per-1e2-kilometer are equivalent unit\_identifiers.
* When constructing identifiers, exponents should be greater than 3 and multiples of 3, even though parsers must accept the wider range.
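
As a rough illustration of the production and constraint above, the following Python sketch parses a unit_constant token and returns its integer value. The names `UNIT_CONSTANT` and `unit_constant_value` are hypothetical and not part of CLDR; the regular expression is a direct transcription of the production.

```python
import re

# Direct transcription of: [1-9][0-9]* ("e" [1-9][0-9]*)?
UNIT_CONSTANT = re.compile(r"([1-9][0-9]*)(?:e([1-9][0-9]*))?")


def unit_constant_value(token: str) -> int:
    """Return the integer value of a unit_constant token, or raise ValueError."""
    match = UNIT_CONSTANT.fullmatch(token)
    if match is None:
        raise ValueError(f"not a unit_constant: {token!r}")
    mantissa = int(match.group(1))
    exponent = int(match.group(2)) if match.group(2) else 0
    value = mantissa * 10 ** exponent
    # wfc: the numeric value must be an integer greater than one.
    if value <= 1:
        raise ValueError(f"unit_constant must be greater than one: {token!r}")
    return value


# per-100-kilometer and per-1e2-kilometer carry the same constant.
assert unit_constant_value("100") == unit_constant_value("1e2")
```

Comparing parsed values rather than strings is one way to treat the plain and `e` spellings as equivalent.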

<a name='dimensionality_prefix' href='#dimensionality_prefix'>dimensionality_prefix</a>
<br/>:= "square-"
<br/>   | "cubic-"
<br/>   | "pow" ([2-9]|1[0-5]) "-"
* [ wfc: Must be value in: \<unitIdComponent type="power"\>]
* *Notes:*
* "pow2-" and "pow3-" canonicalize to "square-" and "cubic-"
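
The canonicalization in the note above can be sketched as a small lookup. The names below are hypothetical, and the sketch checks only the shape given by the production; it does not consult the values in \<unitIdComponent type="power"\>.

```python
import re

# Shape of the generic form: "pow" ([2-9]|1[0-5]) "-"
POWER_PREFIX = re.compile(r"pow([2-9]|1[0-5])-")
CANONICAL = {"pow2-": "square-", "pow3-": "cubic-"}


def canonical_power_prefix(prefix: str) -> str:
    """Return the canonical spelling of a dimensionality_prefix."""
    if prefix in ("square-", "cubic-"):
        return prefix
    if POWER_PREFIX.fullmatch(prefix) is None:
        raise ValueError(f"not a dimensionality_prefix: {prefix!r}")
    # pow2- and pow3- have named forms; higher powers stay as written.
    return CANONICAL.get(prefix, prefix)


print(canonical_power_prefix("pow2-"))
print(canonical_power_prefix("pow4-"))
```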

<a name='simple_unit' href='#simple_unit'>simple_unit</a>
<br/>:= (prefix_component "-")* (prefixed_unit | base_component) ("-" suffix_component)*
<br/>   | currency_unit
<br/>   | ("em" | "g" | "us" | "hg" | "of")
* *Examples:*
* kilometer
* meter
* cup-metric
* fluid-ounce
* curr-chf
* em
* *Notes:*
* Five simple units are currently allowed as legacy usage, for tokens that wouldn’t otherwise be a base\_component due to length (eg, "g-force"). Those are likely to be deprecated in the future, with conformant aliases added; "hg" and "of" already occur only in deprecated simple\_units.

<a name='prefixed_unit' href='#prefixed_unit'>prefixed_unit</a>
<br/>:= prefix base_component
* *Examples:*
* kilometer

<a name='prefix' href='#prefix'>prefix</a>
<br/>:= si_prefix
<br/>   | binary_prefix

<a name='si_prefix' href='#si_prefix'>si_prefix</a>
<br/>:= "deka"
<br/>   | "hecto"
<br/>   | "kilo", …
* [ wfc: Must be an attribute value of the `type` in: \<unitPrefix type='…' … power10='…'\> ]
* *Notes:*
* See also [NIST special publication 811](https://www.nist.gov/pml/special-publication-811)

<a name='binary_prefix' href='#binary_prefix'>binary_prefix</a>
<br/>:= "kibi", "mebi", …
* [ wfc: Must be an attribute value of the `type` in: \<unitPrefix type='…' … power2='…'\>]
* *Notes:*
* See also [Prefixes for binary multiples](https://physics.nist.gov/cuu/Units/binary.html)

<a name='prefix_component' href='#prefix_component'>prefix_component</a>
<br/>:= [a-z]{3,}
* [ vc: must be value in: \<unitIdComponent type="prefix"\>]
* *Notes:*
* The set of prefix components often expands in new releases, so the requirement to be one of these attribute values is a validity constraint, not a well-formedness constraint.

<a name='base_component' href='#base_component'>base_component</a>
<br/>:= [a-z]{3,}
* [ wfc: must not have a prefix as an initial segment. ]
* [ wfc: must not be a value in \<unitIdComponent type="X"\> for X in \{prefix, suffix, power, and, per} ]
* [ vc: Must be an attribute value of the `source` in: \<convertUnit source='…' …\> or the `type` in \<unitAlias type="…" replacement="…" …\> ]
* *Notes:*
* The set of base components typically expands in new releases, so the requirement to be one of these attribute values is a validity constraint, not a well-formedness constraint.
* The base\_components in unitAlias `type` are deprecated and should be converted to their replacement values.
* No two different base\_components will share the first 8 letters; see [Unit Identifier Uniqueness](#Unit_Identifier_Uniqueness).
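
The two well-formedness constraints on base_component can be sketched as below. The sets `SAMPLE_PREFIXES` and `SAMPLE_RESERVED` are small placeholders loosely based on the examples in this section, not the actual contents of units.xml; a real checker would load the \<unitPrefix\> and \<unitIdComponent\> values from the data. The function name is likewise hypothetical.

```python
import re

BASE_SHAPE = re.compile(r"[a-z]{3,}")

# Placeholder data (assumption): stand-ins for <unitPrefix type="..."> values.
SAMPLE_PREFIXES = ("deka", "hecto", "kilo", "kibi", "mebi")
# Placeholder data (assumption): stand-ins for unitIdComponent values of
# type prefix, suffix, power, and, per.
SAMPLE_RESERVED = {"fluid", "metric", "square", "cubic", "and", "per"}


def is_well_formed_base_component(token: str) -> bool:
    """Check the shape and the two wfc rules; validity (vc) is not checked."""
    if BASE_SHAPE.fullmatch(token) is None:
        return False
    if token in SAMPLE_RESERVED:
        return False
    # wfc: must not have a prefix as an initial segment.
    return not token.startswith(SAMPLE_PREFIXES)


print(is_well_formed_base_component("meter"))      # True
print(is_well_formed_base_component("kilometer"))  # False
```

Under these rules "kilometer" is rejected as a base_component because it parses instead as a prefixed_unit (prefix "kilo" plus base_component "meter").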

<a name='suffix_component' href='#suffix_component'>suffix_component</a>
<br/>:= [a-z]{3,}
* [ vc: must be value in: \<unitIdComponent type="suffix"\> ]
* *Notes:*
* The set of suffix components often expands in new releases, so the requirement to be one of these attribute values is a validity constraint, not a well-formedness constraint.

<a name='mixed_unit_identifier' href='#mixed_unit_identifier'>mixed_unit_identifier</a>
<br/>:= (single_unit | pu_single_unit) ("-" and "-" (single_unit | pu_single_unit ))*
* *Examples:*
* foot-and-inch

and
<br/>:= "and"
* [ wfc: The token 'and' is the single value in \<unitIdComponent type="and"\> ]

<a name='long_unit_identifier' href='#long_unit_identifier'>long_unit_identifier</a>
<br/>:= grouping "-" core_unit_identifier

grouping
<br/>:= [a-z]{3,}

<a name='currency_unit' href='#currency_unit'>currency_unit</a>
<br/>:= "curr-" [a-z]{3}
* [ wfc: The first part of the currency\_unit is a standard prefix; the second part of the currency unit must be a valid [Unicode currency identifier](https://github.com/unicode-org/cldr/blob/main/docs/ldml/tr35.md#UnicodeCurrencyIdentifier)]
* *Examples:*
* curr-eur-per-square-meter
* pound-per-curr-usd
* *Notes:*
* CLDR does not provide conversions for currencies; this is only intended for formatting.
* The locale data for currency display names is supplied in the `currencies` element, not in the `units` element.
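
A minimal sketch of recognizing a currency_unit follows. It checks only the shape in the production; whether the three letters form a valid Unicode currency identifier would need a lookup against currency data, which is omitted here. The names `CURRENCY_UNIT` and `currency_code` are hypothetical.

```python
import re
from typing import Optional

# Shape only: "curr-" [a-z]{3}
CURRENCY_UNIT = re.compile(r"curr-([a-z]{3})")


def currency_code(simple_unit: str) -> Optional[str]:
    """Return the three-letter code of a currency_unit, or None if it is not one."""
    match = CURRENCY_UNIT.fullmatch(simple_unit)
    return match.group(1) if match else None


print(currency_code("curr-eur"))
print(currency_code("meter"))
```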

Note that while the syntax allows for unit_constants in multiple places, the typical use case is only one instance, after a "-per-".
The normalized form of a unit identifier has at most one unit_constant in the numerator and one in the denominator.
@@ -3143,4 +3146,4 @@ The authors, contributors, and publishers have taken care in the preparation of
but make no express or implied representation or warranty of any kind and assume no responsibility or liability for errors or omissions or for consequential or incidental damages that may arise therefrom.
This publication is provided “AS-IS” without charge as a convenience to users.

Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.
14 changes: 12 additions & 2 deletions docs/ldml/tr35-info.md
@@ -1208,9 +1208,19 @@ Instructions for use are supplied in the header of the file.

Different locales have different preferences for which unit or combination of units is used for a particular usage, such as measuring a person’s height. This is more fine-grained than merely a preference for metric versus US or UK measurement systems. For example, one locale may use meters alone, while another may use centimeters alone or a combination of meters and centimeters; a third may use inches alone, or (informally) a combination of feet and inches.

The determination of preferred units uses the user preference data in [units.xml](https://github.com/unicode-org/cldr/blob/main/common/supplemental/units.xml) together with the **input unit**, the **input unit usage**, and the **input locale identifier**.
* The _well-formed_ and _valid_ **units** are defined according to [Unit Syntax](tr35-general.html#unit-syntax).
* The _well-formed_ **unit usages** are of the form [a-z0-9]{3,8}("-" [a-z0-9]{3,8})*.
The _valid_ **unit usages** are the union of the sets of `NMTOKENS` in the `usage` attribute values of the `unitPreferences` elements in [units.xml](https://github.com/unicode-org/cldr/blob/main/common/supplemental/units.xml).
For example, the following `unitPreferences` elements produce the set {default, floor, geograph, land}.
* \<unitPreferences category="area" usage="default">
* \<unitPreferences category="area" usage="geograph land">
* \<unitPreferences category="area" usage="floor">
* There are currently no deprecated **unit usages**.
Should there be any in the future, for backwards compatibility the above definition would be expanded to include unitUsageAlias elements.
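
The derivation of the valid usage set can be sketched with Python's standard XML parser. The sample document below is a simplified stand-in for units.xml that contains only the three elements from the example above; the wrapper element name and the function name `valid_usages` are assumptions made for illustration. The shape check reads the length bound as three to eight characters per segment.

```python
import re
import xml.etree.ElementTree as ET

# Well-formedness shape, reading the bound as 3 to 8 characters per segment.
USAGE_SHAPE = re.compile(r"[a-z0-9]{3,8}(?:-[a-z0-9]{3,8})*")

# Simplified sample (assumption): not the real structure of units.xml.
SAMPLE_XML = """
<unitPreferenceData>
  <unitPreferences category="area" usage="default"/>
  <unitPreferences category="area" usage="geograph land"/>
  <unitPreferences category="area" usage="floor"/>
</unitPreferenceData>
"""


def valid_usages(xml_text: str) -> set[str]:
    """Union of the whitespace-separated tokens in every usage attribute."""
    root = ET.fromstring(xml_text)
    usages: set[str] = set()
    for element in root.iter("unitPreferences"):
        usages.update(element.get("usage", "").split())
    return usages


usages = valid_usages(SAMPLE_XML)
print(sorted(usages))
print(all(USAGE_SHAPE.fullmatch(usage) for usage in usages))
```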

### <a name="Unit_Preferences_Overrides" href="#Unit_Preferences_Overrides">Unit Preferences Overrides</a>

The determination of preferred units uses the user preference data together with the **input unit**, the **input usage**, and the **input locale identifier**.
Within the locale identifier, the subtags that can affect the result are:
* the value of the keys mu, ms, and rg
* the region in the locale identifier (if there is one)
@@ -1473,4 +1483,4 @@ The authors, contributors, and publishers have taken care in the preparation of
but make no express or implied representation or warranty of any kind and assume no responsibility or liability for errors or omissions or for consequential or incidental damages that may arise therefrom.
This publication is provided “AS-IS” without charge as a convenience to users.
Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.