From 1fa099e31fe28be1986a85d79d1754e72c9b4f2c Mon Sep 17 00:00:00 2001 From: Mark Davis Date: Thu, 7 Nov 2024 02:00:57 +0000 Subject: [PATCH] CLDR-15948 Clean up well-formedness and/or validity constraints See #4179 --- docs/ldml/tr35-general.md | 309 +++++++++++++++++++------------------- docs/ldml/tr35-info.md | 14 +- docs/ldml/tr35.md | 25 ++- 3 files changed, 185 insertions(+), 163 deletions(-) diff --git a/docs/ldml/tr35-general.md b/docs/ldml/tr35-general.md index c02953df8a4..cf744bcc0e0 100644 --- a/docs/ldml/tr35-general.md +++ b/docs/ldml/tr35-general.md @@ -68,7 +68,7 @@ The LDML specification is divided into the following parts: * [Unit Preference and Conversion Data](#Unit_Preference_and_Conversion) * [Unit Identifiers](#Unit_Identifiers) * [Nomenclature](#nomenclature) - * [Syntax](#syntax) + * [Unit Syntax](#unit-syntax) * [Unit Identifier Uniqueness](#Unit_Identifier_Uniqueness) * [Example Units](#Example_Units) * [Compound Units](#compound-units) @@ -902,157 +902,160 @@ As with other identifiers in CLDR, the American English spelling is used for uni > In keeping with U.S. and International practice (see Sec. C.2), this Guide uses the dot on the line as the decimal marker. In addition this Guide utilizes the American spellings “meter,” “liter,” and “deka” rather than “metre,” “litre,” and “deca,” and the name “metric ton” rather than “tonne.” -#### Syntax - -The formal syntax for identifiers is provided below. -Some of the constraints reference data from the unitIdComponents in [Unit_Conversion](tr35-info.md#Unit_Conversion). - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
unit_identifier:=core_unit_identifier
- | mixed_unit_identifier
- | long_unit_identifier
core_unit_identifier:=product_unit ("-" per "-" product_unit)*
- | per "-" product_unit ("-" per "-" product_unit)* -
  • Examples: -
    • foot-per-second-per-second
    • -
    • per-second
    • -
  • -
  • Note: The normalized form will have only one "per"
  • -
per:="per" -
    -
  • Constraint: The token 'per' is the single value in <unitIdComponent type="per">
  • -
product_unit:=single_unit ("-" single_unit)* ("-" pu_single_unit)*
- | pu_single_unit ("-" pu_single_unit)* -
  • Example: foot-pound-force
  • -
  • Constraint: No pu_single_unit may precede a single unit
  • -
single_unit:=dimensionality_prefix? simple_unit | unit_constant -
  • Examples: square-kilometer, or 100
pu_single_unit:="xxx-" single_unit | "x-" single_unit -
  • Example: xxx-square-knuts (a Harry Potter unit)
  • -
  • Note: "x-" is only for backwards compatibility
  • -
  • See Private-Use Units
  • -
unit_constant:=[1-9][0-9]* ("e" [1-9][0-9]*)? -
  • Examples: -
    • kilowatt-hour-per-100-kilometer
    • -
    • gallon-per-100-mile
    • -
    • per-200-pound
    • -
    • per-12
    • -
  • -
  • Constraint: The numeric value of the unit constant must be an integer greater than one.
  • -
  • Note: The normal interpretation of e is used, where 2e6 = 2×10⁶.
  • -
  • Note: The e notation is optional: per-100-kilometer and per-1e2-kilometer are equivalent unit_identifiers.
  • -
  • Note: When constructing identifiers, exponents should be greater than 3 and multiples of 3, even though parsers must accept the wider range.
  • -
dimensionality_prefix:="square-"

| "cubic-"

| "pow" ([2-9]|1[0-5]) "-" -

    -
  • Constraint: must be value in: <unitIdComponent type="power">.
  • -
  • Note: "pow2-" and "pow3-" canonicalize to "square-" and "cubic-"
  • -
  • Note: These are values in <unitIdComponent type="power">
  • -
simple_unit:=(prefix_component "-")* (prefixed_unit | base_component) ("-" suffix_component)*
- | currency_unit
- | "em" | "g" | "us" | "hg" | "of" -
    -
  • Examples: kilometer, meter, cup-metric, fluid-ounce, curr-chf, em
  • -
  • Note: Three simple units are currently allowed as legacy usage, for tokens that wouldn’t otherwise be a base_component due to length (eg, "g-force"). - We will likely deprecate those and add conformant aliases in the future: the "hg" and "of" are already only in deprecated simple_units.
  • -
prefixed_unitprefix base_component
  • Example: kilometer
prefixsi_prefix | binary_prefix
si_prefix:="deka" | "hecto" | "kilo", … -
binary_prefix:="kibi", "mebi", … -
prefix_component:=[a-z]{3,∞} -
  • Constraint: must be value in: <unitIdComponent type="prefix">.
base_component:=[a-z]{3,∞} -
  • Constraint: must not be a value in any of the following:
    - <unitIdComponent type="prefix">
    - or <unitIdComponent type="suffix">
    - or <unitIdComponent type="power">
    - or <unitIdComponent type="and">
    - or <unitIdComponent type="per">. -
  • -
  • Constraint: must not have a prefix as an initial segment.
  • -
  • Constraint: no two different base_components will share the first 8 letters. - (For more information, see Unit Identifier Uniqueness.) -
  • -
-
suffix_component:=[a-z]{3,∞} -
    -
  • Constraint: must be value in: <unitIdComponent type="suffix">
  • -
:=(single_unit | pu_single_unit) ("-" and "-" (single_unit | pu_single_unit ))* -
  • Example: foot-and-inch
  • -
and:="and" -
    -
  • Constraint: The token 'and' is the single value in <unitIdComponent type="and">
  • -
long_unit_identifier:=grouping "-" core_unit_identifier
grouping:=[a-z]{3,∞}
currency_unit:="curr-" [a-z]{3} -
    -
  • Constraint: The first part of the currency_unit is a standard prefix; the second part of the currency unit must be a valid Unicode currency identifier.
  • -
-
    -
  • Examples: curr-eur-per-square-meter, or pound-per-curr-usd
  • -
  • Note: CLDR does not provide conversions for currencies; this is only intended for formatting. - The locale data for currencies is supplied in the currencies element, not in the units element.
  • -
-
+ +#### Unit Syntax + +The formal syntax for identifiers is provided below, in [EBNF](tr35.md#ebnf). +Some of the constraints reference data from various elements in the unit conversion data [units.xml](https://github.com/unicode-org/cldr/blob/main/common/supplemental/units.xml). +These may be either element values or element attribute values. +See [Unit_Conversion](tr35-info.md#Unit_Conversion). + +unit_identifier +
:= core_unit_identifier +
   | mixed_unit_identifier +
   | long_unit_identifier + +core_unit_identifier +
:= product_unit ("-" per "-" product_unit)\* +
   | per "-" product_unit ("-" per "-" product_unit)\* +* *Examples:* + * foot-per-second-per-second + * per-second +* *Notes:* + * The normalized form will have only one "per" + +per +
:= "per" +* [ wfc: The token 'per' is the single value in \ ] + +product_unit +
:= single_unit ("-" single_unit)* ("-" pu_single_unit)* +
   | pu_single_unit ("-" pu_single_unit)* +* [ wfc: No pu\_single\_unit may precede a single unit ] +* *Examples:* + * foot-pound-force + +single_unit +
:= dimensionality_prefix? simple_unit +
   | unit_constant +* *Examples:* + * square-kilometer + * 100 + +pu_single_unit +
:= "xxx-" single_unit +
   | "x-" single_unit +* *Examples:* + * xxx-square-knuts (a Harry Potter unit) +* *Notes:* + * "x-" is only for backwards compatibility; it is deprecated and should not be generated + * See [Private-Use Units](#Private_Use_Units) + +unit_constant +
:= [1-9][0-9]* ("e" [1-9][0-9]*)? +* *Examples:* + * kilowatt-hour-per-100-kilometer + * gallon-per-100-mile + * per-200-pound + * per-12 +* [ wfc: The numeric value of the unit constant must be an integer greater than one. ] +* *Notes:* + * The normal interpretation of `e` is used, where 2e6 \= 2×10⁶. + * The `e` notation is optional: per-100-kilometer and per-1e2-kilometer are equivalent unit\_identifiers. + * When constructing identifiers, exponents should be greater than 3 and multiples of 3, even though parsers must accept the wider range. + +dimensionality_prefix +
:= "square-" +
   | "cubic-" +
   | "pow" ([2-9]|1[0-5]) "-" +* [ wfc: Must be value in: \. ] +* *Notes:* + * "pow2-" and "pow3-" canonicalize to "square-" and "cubic-" + +simple_unit +
:= (prefix_component "-")* (prefixed_unit | base_component) ("-" suffix_component)* +
   | currency_unit +
   | ("em" | "g" | "us" | "hg" | "of") +* *Examples:* + * kilometer + * meter + * cup-metric + * fluid-ounce + * curr-chf + * em +* *Notes:* + * Five simple units are currently allowed as legacy usage, for tokens that wouldn’t otherwise be a base\_component due to length (eg, "g-force"). Those are likely to be deprecated in the future, with conformant aliases added: the "hg" and "of" are already only in deprecated simple\_units. + +prefixed_unit +:= prefix base_component +* *Examples:* + * kilometer + +prefix +
:= si_prefix +
   | binary_prefix + +si_prefix +
:= "deka" +
   | "hecto" +
   | "kilo", … +* [ wfc: Must be an attribute value of the `type` in: \ ] +* *Notes:* + * See also [NIST special publication 811](https://www.nist.gov/pml/special-publication-811) + +binary_prefix +
:= "kibi", "mebi", … +* [ wfc: Must be an attribute value of the `type` in: \. ] +* *Notes:* + * See also [Prefixes for binary multiples](https://physics.nist.gov/cuu/Units/binary.html) + +prefix_component +
:= [a-z]{3,} +* [ vc: must be value in: \. ] +* *Notes:* + * The set of prefix components often expands in new releases, so the requirement to be one of these attribute values is a validity constraint, not a well-formedness constraint. + +base_component +
:= [a-z]{3,} +* [ wfc: must not have a prefix as an initial segment. ] +* [ wfc: must not be a value in \ for X in \{prefix, suffix, power, and, per} ] +* [ vc: Must be an attribute value of the `source` in: \ or the `type` in \ ] +* *Notes:* + * The set of base components typically expands in new releases, so the requirement to be one of these attribute values is a validity constraint, not a well-formedness constraint. + * The base-components in unitAlias `type` are deprecated, and should be converted to their replacement values. + * No two different base\_components will share the first 8 letters; see [Unit Identifier Uniqueness](#Unit_Identifier_Uniqueness). + +suffix_component +
:= [a-z]{3,} +* [ vc: must be value in: \ ] +* *Notes:* + * The set of suffix components often expands in new releases, so the requirement to be one of these attribute values is a validity constraint, not a well-formedness constraint. + +mixed_unit_identifier +
:= (single_unit | pu_single_unit) ("-" and "-" (single_unit | pu_single_unit ))* +* *Examples:* + * foot-and-inch + +and +
:= "and" +* [ wfc: The token 'and' is the single value in \ ] + +long_unit_identifier +
:= grouping "-" core_unit_identifier + +grouping +
:= [a-z]{3,} + +currency_unit +
:= "curr-" [a-z]{3} +* [ wfc: The first part of the currency\_unit is a standard prefix; the second part of the currency unit must be a valid [Unicode currency identifier](https://github.com/unicode-org/cldr/blob/main/docs/ldml/tr35.md#UnicodeCurrencyIdentifier). ] +* *Examples:* + * curr-eur-per-square-meter + * pound-per-curr-usd +* *Notes:* + * CLDR does not provide conversions for currencies; this is only intended for formatting. + * The locale data for currency display names is supplied in the `currencies` element, not in the `units` element. Note that while the syntax allows for unit_constants in multiple places, the typical use case is only one instance, after a "-per-". The normalized form of a unit identifier has at most one unit_constant in the numerator and one in the denominator. @@ -3143,4 +3146,4 @@ The authors, contributors, and publishers have taken care in the preparation of but make no express or implied representation or warranty of any kind and assume no responsibility or liability for errors or omissions or for consequential or incidental damages that may arise therefrom. This publication is provided “AS-IS” without charge as a convenience to users. -Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries. \ No newline at end of file +Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries. diff --git a/docs/ldml/tr35-info.md b/docs/ldml/tr35-info.md index 1dfc6b40297..8dc53c0e2a3 100644 --- a/docs/ldml/tr35-info.md +++ b/docs/ldml/tr35-info.md @@ -1208,9 +1208,19 @@ Instructions for use are supplied in the header of the file. Different locales have different preferences for which unit or combination of units is used for a particular usage, such as measuring a person’s height. This is more fine-grained than merely a preference for metric versus US or UK measurement systems. 
For example, one locale may use meters alone, while another may use centimeters alone or a combination of meters and centimeters; a third may use inches alone, or (informally) a combination of feet and inches. +The determination of preferred units uses the user preference data in [units.xml](https://github.com/unicode-org/cldr/blob/main/common/supplemental/units.xml) together with the **input unit**, the **input unit usage**, and the **input locale identifier**. + * The _well-formed_ and _valid_ **units** are defined according to [Unit Syntax](tr35-general.html#unit-syntax). + * The _well-formed_ **unit usages** are of the form [a-z0-9]{3,8}("-" [a-z0-9]{3,8})*. +The _valid_ **unit usages** are the union of the set of `NMTOKENS` in the `usage` attribute value for the `unitPreferences` element in [units.xml](https://github.com/unicode-org/cldr/blob/main/common/supplemental/units.xml). +For example, the following `unitPreferences` elements produce the set {default, floor, geograph, land}. + * \ + * \ + * \ + * There are currently no deprecated **unit usages**. +Should there be any in the future, for backwards compatibility the above definition would be expanded to include unitUsageAlias elements. + ### Unit Preferences Overrides -The determination of preferred units uses the user preference data together with **input unit**, the **input usage**, and the **input locale identifer**. Within the locale identifier, the subtags that can affect the result are: * the value of the keys mu, ms, and rg * the region in the locale identifier (if there is one) @@ -1473,4 +1483,4 @@ The authors, contributors, and publishers have taken care in the preparation of but make no express or implied representation or warranty of any kind and assume no responsibility or liability for errors or omissions or for consequential or incidental damages that may arise therefrom. This publication is provided “AS-IS” without charge as a convenience to users. 
-Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries. \ No newline at end of file +Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries. diff --git a/docs/ldml/tr35.md b/docs/ldml/tr35.md index 506b70994d5..faa20392dcb 100644 --- a/docs/ldml/tr35.md +++ b/docs/ldml/tr35.md @@ -419,9 +419,9 @@ For example, the following is from a sample header: If an implementation overrides CLDR data, then various lines in the relevant test files may need to be modified correspondingly, or skipped. ### EBNF -The BNF syntax used in LDML is a variant of the Extended Backus-Naur Form (EBNF) notation used in [W3C XML Notation](https://www.w3.org/TR/REC-xml/#sec-notation). The main differences are: +The EBNF syntax used in LDML is a variant of the Extended Backus-Naur Form (EBNF) notation used in [W3C XML Notation](https://www.w3.org/TR/REC-xml/#sec-notation). The main differences are: -1. Bounded repetition following Perl regex syntax is allowed, such as `alphanum{3,8}`. +1. Bounded repetition following Perl regex syntax is allowed, such as `digit{3}` for 3 digits, `digit{3,5}` for 3 to 5 digits, and `digit{3,}` for 3 or more digits. 2. Whitespace inside bracketed enumerations and ranges is ignored. * eg., `[A-Z a-z]` is the same as `[A-Za-z]` 3. A backslash may be used to escape a following "x"-prefixed hexadecimal code point or the immediately following character. @@ -436,7 +436,7 @@ In the text, this is sometimes referred to as "EBNF (Perl-based)". Before diving into the XML structure, it is helpful to describe the model behind the structure. People do not have to subscribe to this model to use data in LDML, but they do need to understand it so that the data can be correctly translated into whatever model their implementation uses. 
-The first issue is basic: _what is a locale?_ In this model, a locale is an identifier (id) that refers to a set of user preferences that tend to be shared across significant swaths of the world. Traditionally, the data associated with this id provides support for formatting and parsing of dates, times, numbers, and currencies; for measurement units, for sort-order (collation), plus translated names for time zones, languages, countries, and scripts. The data can also include support for text boundaries (character, word, line, and sentence), text transformations (including transliterations), and other services. +The first issue is basic: _what is a locale?_ In this model, a locale is an identifier (id) that refers to a set of user preferences that tend to be shared across significant swaths of the world. Traditionally, the data associated with this id provides support for formatting and parsing of dates, times, numbers, and currencies; for measurement units, for sort-order (collation), plus translated names for time zones, languages, countries (regions), and scripts. The data can also include support for text boundaries (character, word, line, and sentence), text transformations (including transliterations), and other services. Locale data is not cast in stone: the data used on someone's machine generally may reflect the US format, for example, but preferences can typically set to override particular items, such as setting the date format for 2002.03.15, or using metric or Imperial measurement units. In the abstract, locales are simply one of many sets of preferences that, say, a website may want to remember for a particular user. Depending on the application, it may want to also remember the user's time zone, preferred currency, preferred character set, smoker/non-smoker preference, meal preference (vegetarian, kosher, and so on), music preference, religion, party affiliation, favorite charity, and so on. 
@@ -1030,8 +1030,12 @@ The BCP 47 form for keys and types is the canonical form, and recommended. Other "cu"
(currency) Currency type ISO 4217 code,

plus others in common use

-

Codes consisting of 3 ASCII letters that are or have been valid in ISO 4217, plus certain additional codes that are or have been in common use. The list of countries and time periods associated with each currency value is available in Supplemental Currency Data, plus the default number of decimals.

The XXX code is given a broader interpretation as Unknown or Invalid Currency.

- +

Well-formed codes are of the form [A-Za-z]{3}, with the canonical format being [A-Z]{3}. + The valid codes are ones that are or have been valid in ISO 4217, plus certain additional codes that are or have been in common use. + Supplemental Currency Data provides the list of countries (regions) and time periods associated with each currency code. + It also supplies the default number of decimals.

+

The XXX code is given a broader interpretation than in ISO 4217, as Unknown or Invalid Currency.

+ A Unicode Dictionary Break Exclusion Identifier specifies scripts to be excluded from dictionary-based text break (for words and lines). The valid values are of one or more items of type SCRIPT_CODE as specified in the name attribute value in the type element of @@ -1261,7 +1265,7 @@ Not all TZDB links are in CLDR aliases. CLDR purposefully does not exactly match the Link structure in the TZDB. 1. The links are maintained in the TZDB, and it would duplicate information that could fall out of sync (especially because the TZDB can be updated many times in a single month). -2. The TZDB went though a change a few years ago where it dropped the mappings to countries, whereas CLDR still maintains that distinction. +2. The TZDB went through a change a few years ago where it dropped the mappings to countries (regions), whereas CLDR still maintains that distinction. 3. Because there are several different timezones that all link together, that would make for a single long alias being an alias for several different short aliases. CLDR doesn't alias across country boundaries because countries are useful for timezone selection. @@ -4308,7 +4312,12 @@ Other contributors to CLDR are listed on the [CLDR Project Page](https://www.uni **Differences from LDML Version 46** -TBD +### Unit Modifications +- Updated the EBNF in [Unit Syntax](tr35-general.html#unit-syntax) to: + - Change the constraints into either well-formedness constraints or validity constraints. + - Add validity constraints for base-component. + - Reformat the EBNF to avoid using HTML tables. +- Updated the [Unit_Preferences](tr35-info.html#Unit_Preferences) to provide well-formedness and validity definitions. 
**Differences in LDML Version 45 (temporary reference while editing the above)** @@ -4375,4 +4384,4 @@ The authors, contributors, and publishers have taken care in the preparation of but make no express or implied representation or warranty of any kind and assume no responsibility or liability for errors or omissions or for consequential or incidental damages that may arise therefrom. This publication is provided “AS-IS” without charge as a convenience to users. -Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries. \ No newline at end of file +Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.
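
The two purely lexical well-formedness rules touched by this patch, `unit_constant` in tr35-general.md and the unit usage form in tr35-info.md, can be illustrated with a minimal Python sketch. The function names below are hypothetical and not part of CLDR or of this change. The usage pattern assumes the bounded repetition is meant in the Perl-style `{3,8}` form described in the EBNF section of tr35.md. Validity checks against units.xml are out of scope here.

```python
import re
from typing import Optional

# unit_constant := [1-9][0-9]* ("e" [1-9][0-9]*)?
UNIT_CONSTANT = re.compile(r"([1-9][0-9]*)(?:e([1-9][0-9]*))?")

# Unit usage, assumed reading: [a-z0-9]{3,8} ("-" [a-z0-9]{3,8})*
UNIT_USAGE = re.compile(r"[a-z0-9]{3,8}(?:-[a-z0-9]{3,8})*")


def unit_constant_value(token: str) -> Optional[int]:
    """Return the integer value of a well-formed unit_constant, else None.

    Hypothetical helper: applies the grammar plus the wfc that the
    numeric value must be an integer greater than one.
    """
    match = UNIT_CONSTANT.fullmatch(token)
    if match is None:
        return None
    mantissa = int(match.group(1))
    exponent = int(match.group(2)) if match.group(2) else 0
    value = mantissa * 10 ** exponent  # normal reading of "e": 2e6 = 2 * 10**6
    return value if value > 1 else None


def is_well_formed_usage(usage: str) -> bool:
    """Check only the well-formedness of a unit usage, not its validity."""
    return UNIT_USAGE.fullmatch(usage) is not None
```

Under these assumptions, `unit_constant_value("100")` and `unit_constant_value("1e2")` return the same integer, matching the note that `per-100-kilometer` and `per-1e2-kilometer` are equivalent, while `"1"` is rejected by the greater-than-one constraint.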