CLDR-17830 BRS Added more info about likely subtags, emoji keywords
macchiati authored Sep 3, 2024
1 parent e190810 commit 48a4351
Showing 1 changed file with 46 additions and 8 deletions.
54 changes: 46 additions & 8 deletions docs/site/downloads/cldr-46.md
@@ -11,7 +11,9 @@ title: CLDR 46 Release Note
## Overview

Unicode CLDR provides key building blocks for software supporting the world's languages.
CLDR data is used by all [major software systems](https://cldr.unicode.org/index#TOC-Who-uses-CLDR-)
(including all mobile phones) for their software internationalization and localization,
adapting software to the conventions of different languages.

The largest changes in this release were the updates to Unicode 16.0, substantial additions of Emoji search keyword data, and ‘upleveling’ the locale coverage.

@@ -60,9 +62,17 @@ For a full listing, see [Delta DTDs](https://unicode.org/cldr/charts/46/suppleme
2. Deprecated timezone ids. Altered the handling of: CST6CDT, EST, EST5EDT, MST7MDT, PST8PDT
3. Units
1. Added units: portion-per-1e9 (aka per-billion), night (for hotel stays), light (as a prefix for light-second, light-minute, etc.)
2. Changed the preferred wind speed unit for some locales to meter-per-second
4. Updated: language IDs, likelySubtags, region gdp and language populations, etc.
1. Minimization for likelySubtags removes some additional redundant mappings
1. Minimization for likelySubtags removes many additional redundant mappings.
1. For example, the mapping acy_Grek → acy_Grek_CY is unnecessary, because the mapping acy → acy_Latn_CY is sufficient.
For the reason why, see the algorithm in [Likely Subtags](https://cldr-smoke.unicode.org/spec/main/ldml/tr35.html#likely-subtags).
4. The ordering in the file is more consistent: first the main mappings, then the mappings from region and/or script to likely language, then the data contributed by SIL.
5. The territories have been cleaned up: there are no ZZ entries, and 001 is limited to artificial languages such as Interlingua.
6. Language matching dropped Russian (ru) as a fallback language for Ukrainian.
1. A fallback language is used when the user's primary language is unavailable,
and either the user doesn't have any secondary languages in their settings (as on Android or iOS) or those secondary languages are also not available.
As a result of this change, when the primary and secondary languages are not available, the fallback language is the system default instead of Russian.
5. Transforms.
1. Major update to Han → Latn, reflecting new data in Unicode 16.0
2. Fixes for Arabic numbers, a Farsi vowel
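
The likelySubtags minimization noted in this hunk (acy_Grek → acy_Grek_CY being redundant given acy → acy_Latn_CY) can be illustrated with a rough sketch of the "add likely subtags" lookup. This is a simplified reading of the algorithm in the linked Likely Subtags section, not the reference implementation: the function name, the tuple-based table, and the lookup order are assumptions made for illustration, and canonicalization, `und`, and variants are ignored.

```python
# Tiny illustrative table; only the acy entry is taken from the release note.
LIKELY_SUBTAGS = {"acy": ("acy", "Latn", "CY")}


def add_likely_subtags(language, script="", region="", table=LIKELY_SUBTAGS):
    """Simplified lookup: try the most specific key first, then drop fields."""
    keys = [
        "_".join(p for p in (language, script, region) if p),
        "_".join(p for p in (language, script) if p),
        "_".join(p for p in (language, region) if p),
        language,
    ]
    for key in keys:
        match = table.get(key)
        if match is not None:
            m_lang, m_script, m_region = match
            # Fields given in the input are kept; only empty ones are filled in.
            return (language or m_lang, script or m_script, region or m_region)
    return None  # no mapping found in this (hypothetical) table


# The explicit Grek script is kept and only the region is filled from "acy".
print(add_likely_subtags("acy", "Grek"))
print(add_likely_subtags("acy"))
```

Because the input script is preserved and only the missing region comes from the `acy` entry, a separate `acy_Grek` entry would yield the same result, which is why it can be dropped.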
@@ -77,22 +87,50 @@ For a full listing, see [¤¤BCP47 Delta](https://unicode.org/cldr/charts/46/del
3. Revision of many search keywords to break up phrases
2. Major changes to Chinese collation, reflecting new data in Unicode 16.0
3. Other changes
1. Locales also had smaller improvements agreed to by translators.
1. Various locales also had smaller improvements agreed to by translators.
**TBD**

For a full listing, see [Delta Data](https://unicode.org/cldr/charts/46/delta/index.html)
### Emoji Search Keywords
The usage model for emoji search keywords is as follows:
- The user types one or more words in an emoji search field. The order of the words doesn't matter, nor does upper- versus lowercase.
- Each word successively narrows the set of emoji shown in a results box.
- heart → 🥰 😘 😻 💌 💘 💝 💖 💗 💓 💞 💕 💟 ❣️ 💔 ❤️‍🔥 ❤️‍🩹 ❤️ 🩷 🧡 💛 💚 💙 🩵 💜 🤎 🖤 🩶 🤍 💋 🫰 🫶 🫀 💏 💑 🏠 🏡 ♥️ 🩺
- blue → 🥶 😰 💙 🩵 🫐 👕 👖 📘 🧿 🔵 🟦 🔷 🔹 🏳️‍⚧️
- therefore, [heart blue] → 💙 🩵
- A word that has no hits matches all the keywords that begin with it; if there are no such keywords, it is ignored.
- [heart blue confabulation] is equivalent to [heart blue]
- Whenever the list is short enough to scan, the user will click on the right emoji, so the list doesn't have to be narrowed too far.
Thus in the following, the user would just click on 🎉 if that works for them.
- celebrate → 🥳 🥂 🎈 🎉 🎊 🪅
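
The usage model above could be sketched roughly as follows. This is a minimal illustration, not how any particular emoji picker is implemented: the function names are made up, the keyword table is a small subset of the examples above, and the `blueberry` keyword is a hypothetical entry added only to show prefix matching.

```python
# Small illustrative keyword table (subset of the examples; "blueberry" is hypothetical).
KEYWORDS = {
    "heart": ["💘", "💙", "🩵", "💛"],
    "blue": ["🥶", "💙", "🩵", "🫐"],
    "blueberry": ["🫐"],
}


def hits_for(word, keywords):
    """Return the emoji matching one word, or None if the word should be ignored."""
    word = word.lower()  # case doesn't matter
    if word in keywords:
        return set(keywords[word])
    # No exact hit: fall back to every keyword that begins with the word.
    prefixed = [k for k in keywords if k.startswith(word)]
    if prefixed:
        return set().union(*(keywords[k] for k in prefixed))
    return None


def search(query, keywords=KEYWORDS):
    """Each word narrows the result set; word order doesn't matter."""
    result = None
    for word in query.split():
        hits = hits_for(word, keywords)
        if hits is None:
            continue  # a word with no hits at all is ignored
        result = hits if result is None else result & hits
    return result or set()


print(search("heart blue"))
print(search("Blue HEART confabulation"))
```

Intersecting sets makes the result independent of word order, and skipping words with no hits keeps a mistyped or unknown word from emptying the results box.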

In this release, WhatsApp data has been incorporated, and the keywords have been simplified in most locales by breaking up multi-word keywords.
For example, white flag (🏳️) formerly had 3 keyword phrases, [white waving flag | white flag | waving flag],
which are now replaced by the 3 simpler single keywords [white | waving | flag].
The simpler version typically works as well or better in practice.

### Collation Data Changes
There are two significant changes to the CLDR root collation (CLDR default sort order).
#### Realigned With DUCET
The [DUCET](https://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table) is the Unicode Collation Algorithm default sort order.
The [CLDR root collation](https://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation) is a tailoring of the DUCET.
These sort orders have differed in the relative order of groups of characters including extenders, currency symbols, and non-decimal-digit numeric characters.

Starting with CLDR 46 and Unicode 16.0, the order of these groups is the same.
In both sort orders, non-decimal-digit numeric characters now sort after decimal digits,
and the CLDR root collation no longer tailors any currency symbols (making some of them sort like letter sequences, as in the DUCET).

These changes eliminate sort order differences among almost all regular characters between the CLDR root collation and the DUCET.
See the [CLDR root collation](https://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation) documentation for details.

#### Improved Han Radical-Stroke Order
CLDR includes [data for sorting Han (CJK) characters in radical-stroke order](tr35-collation.md#File_Format_FractionalUCA_txt).
It used to distinguish traditional and simplified forms of radicals on a higher level than sorting by the number of residual strokes.
Starting with CLDR 46, the CLDR radical-stroke order matches that of the [Unicode Radical-Stroke Index (large PDF)](https://www.unicode.org/Public/UCD/latest/charts/RSIndex.pdf).
[Its sorting algorithm is defined in UAX #38](https://www.unicode.org/reports/tr38/#SortingAlgorithm).
Traditional vs. simplified forms of radicals are distinguished on a lower level than the number of residual strokes.
This also has an effect on [alphabetic indexes](tr35-collation.md#Collation_Indexes) for radical-stroke sort orders,
where only the traditional forms of radicals are now available as index characters.

### JSON Data Changes
