From 8a22f670acc52c76eaf8ff2ed2720c001c13b3ae Mon Sep 17 00:00:00 2001 From: "Steven R. Loomis" Date: Sat, 21 Sep 2024 14:05:21 -0700 Subject: [PATCH] CLDR-16720 json: add transforms (#4036) --- docs/site/downloads/cldr-46.md | 48 +++---- .../java/org/unicode/cldr/json/CldrNode.java | 10 +- .../unicode/cldr/json/Ldml2JsonConverter.java | 127 +++++++++++++++++- .../unicode/cldr/json/LdmlConvertRules.java | 9 +- .../org/unicode/cldr/util/CLDRTransforms.java | 16 +++ .../cldr/json/JSON_config_transforms.txt | 2 + .../org/unicode/cldr/json/pathTransforms.txt | 8 +- 7 files changed, 184 insertions(+), 36 deletions(-) create mode 100644 tools/cldr-code/src/main/resources/org/unicode/cldr/json/JSON_config_transforms.txt diff --git a/docs/site/downloads/cldr-46.md b/docs/site/downloads/cldr-46.md index 97f8702ecdd..ff84a921188 100644 --- a/docs/site/downloads/cldr-46.md +++ b/docs/site/downloads/cldr-46.md @@ -15,15 +15,15 @@ It only covers the data, which is available at [release-46-alpha3](https://githu ## Overview Unicode CLDR provides key building blocks for software supporting the world's languages. -CLDR data is used by all [major software systems](https://cldr.unicode.org/index#TOC-Who-uses-CLDR-) -(including all mobile phones) for their software internationalization and localization, +CLDR data is used by all [major software systems](https://cldr.unicode.org/index#TOC-Who-uses-CLDR-) +(including all mobile phones) for their software internationalization and localization, adapting software to the conventions of different languages. 
The most significant changes in this release were: -- Updates to Unicode 16.0 (including major changes to collation), -- Further revisions to the Message Format 2.0 tech preview, -- Substantial additions and modifications of Emoji search keyword data, +- Updates to Unicode 16.0 (including major changes to collation), +- Further revisions to the Message Format 2.0 tech preview, +- Substantial additions and modifications of Emoji search keyword data, - ‘Upleveling’ the locale coverage. ### Locale Coverage Status @@ -127,7 +127,7 @@ Full localization will await the next submission phase for CLDR. For a full listing, see [Delta Data](https://unicode.org/cldr/charts/46/delta/index.html) ### Emoji Search Keywords -The usage model for emoji search keywords is that +The usage model for emoji search keywords is that - The user types one or more words in an emoji search field. The order of words doesn't matter; nor does upper- versus lowercase. - Each word successively narrows a number of emoji in a results box - heart → 🥰 😘 😻 💌 💘 💝 💖 💗 💓 💞 💕 💟 ❣️ 💔 ❤️‍🔥 ❤️‍🩹 ❤️ 🩷 🧡 💛 💚 💙 🩵 💜 🤎 🖤 🩶 🤍 💋 🫰 🫶 🫀 💏 💑 🏠 🏡 ♥️ 🩺 @@ -139,11 +139,11 @@ The usage model for emoji search keywords is that Thus in the following, the user would just click on 🎉 if that works for them. - celebrate → 🥳 🥂 🎈 🎉 🎊 🪅 -In this release WhatsApp emoji search keyword data has been incorporated. +In this release WhatsApp emoji search keyword data has been incorporated. In the process of doing that, the maximum number of search keywords per emoji has been increased, -and the keywords have been simplified in most locales by breaking up multi-word keywords. -An example would be white flag (🏳️), formerly having 3 keyword phrases of [white waving flag | white flag | waving flag], -now being replaced by the simpler 3 single keywords [white | waving | flag]. 
+and the keywords have been simplified in most locales by breaking up multi-word keywords. +An example would be white flag (🏳️), formerly having 3 keyword phrases of [white waving flag | white flag | waving flag], +now being replaced by the simpler 3 single keywords [white | waving | flag]. The simpler version typically works as well or better in practice. ### Collation Data Changes @@ -151,29 +151,29 @@ There are two significant changes to the CLDR root collation (CLDR default sort #### Realigned With DUCET The [DUCET](https://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table) is the Unicode Collation Algorithm default sort order. -The [CLDR root collation](https://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation) is a tailoring of the DUCET. +The [CLDR root collation](https://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation) is a tailoring of the DUCET. These sort orders have differed in the relative order of groups of characters including extenders, currency symbols, and non-decimal-digit numeric characters. -Starting with CLDR 46 and Unicode 16.0, the order of these groups is the same. -In both sort orders, non-decimal-digit numeric characters now sort after decimal digits, +Starting with CLDR 46 and Unicode 16.0, the order of these groups is the same. +In both sort orders, non-decimal-digit numeric characters now sort after decimal digits, and the CLDR root collation no longer tailors any currency symbols (making some of them sort like letter sequences, as in the DUCET). -These changes eliminate sort order differences among almost all regular characters between the CLDR root collation and the DUCET. +These changes eliminate sort order differences among almost all regular characters between the CLDR root collation and the DUCET. See the [CLDR root collation](https://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation) documentation for details. 
#### Improved Han Radical-Stroke Order -CLDR includes [data for sorting Han (CJK) characters in radical-stroke order](https://cldr-smoke.unicode.org/spec/main/ldml/tr35-collation.md#File_Format_FractionalUCA_txt). -It used to distinguish traditional and simplified forms of radicals on a higher level than sorting by the number of residual strokes. -Starting with CLDR 46, the CLDR radical-stroke order matches that of the [Unicode Radical-Stroke Index (large PDF)](https://www.unicode.org/Public/UCD/latest/charts/RSIndex.pdf). -[Its sorting algorithm is defined in UAX #38](https://www.unicode.org/reports/tr38/#SortingAlgorithm). -Traditional vs. simplified forms of radicals are distinguished on a lower level than the number of residual strokes. -This also has an effect on [alphabetic indexes](tr35-collation.md#Collation_Indexes) for radical-stroke sort orders, +CLDR includes [data for sorting Han (CJK) characters in radical-stroke order](https://cldr-smoke.unicode.org/spec/main/ldml/tr35-collation.md#File_Format_FractionalUCA_txt). +It used to distinguish traditional and simplified forms of radicals on a higher level than sorting by the number of residual strokes. +Starting with CLDR 46, the CLDR radical-stroke order matches that of the [Unicode Radical-Stroke Index (large PDF)](https://www.unicode.org/Public/UCD/latest/charts/RSIndex.pdf). +[Its sorting algorithm is defined in UAX #38](https://www.unicode.org/reports/tr38/#SortingAlgorithm). +Traditional vs. simplified forms of radicals are distinguished on a lower level than the number of residual strokes. +This also has an effect on [alphabetic indexes](tr35-collation.md#Collation_Indexes) for radical-stroke sort orders, where only the traditional forms of radicals are now available as index characters. ### JSON Data Changes -1. Separate modern packages were dropped [CLDR-16465] -2. Adding transliteration rules [CLDR-16720] (In progress) +1. Separate modern packages were dropped [CLDR-16465] +2. 
Transliteration (transform) data is now available in the `cldr-transforms` package. The JSON file contains transform metadata, and the `_rulesFile` key indicates an external (`.txt`) file containing the actual rules. [CLDR-16720][]. ### Markdown ### @@ -185,7 +185,7 @@ This process should be completed before release. ### File Changes Most files added in this release were for new locales. -There were the following new test files: +There were the following new test files: **TBD*** @@ -215,3 +215,5 @@ Many people have made significant contributions to CLDR and LDML; see the [Ackno The Unicode [Terms of Use](https://unicode.org/copyright.html) apply to CLDR data; in particular, see [Exhibit 1](https://unicode.org/copyright.html#Exhibit1). For web pages with different views of CLDR data, see [http://cldr.unicode.org/index/charts](https://cldr.unicode.org/index/charts). + +[CLDR-16720]: https://unicode-org.atlassian.net/issues/CLDR-16720 diff --git a/tools/cldr-code/src/main/java/org/unicode/cldr/json/CldrNode.java b/tools/cldr-code/src/main/java/org/unicode/cldr/json/CldrNode.java index a6559730bcc..d272cee893d 100644 --- a/tools/cldr-code/src/main/java/org/unicode/cldr/json/CldrNode.java +++ b/tools/cldr-code/src/main/java/org/unicode/cldr/json/CldrNode.java @@ -23,7 +23,15 @@ public static CldrNode createNode( String fullTrunk = extractAttrs(fullPathSegment, node.nondistinguishingAttributes); if (!node.name.equals(fullTrunk)) { throw new ParseException( - "Error in parsing \"" + pathSegment + " \":\"" + fullPathSegment, 0); + "Error in parsing \"" + + pathSegment + + "\":\"" + + fullPathSegment + + " - " + + node.name + + " != " + + fullTrunk, + 0); } for (String key : node.distinguishingAttributes.keySet()) { diff --git a/tools/cldr-code/src/main/java/org/unicode/cldr/json/Ldml2JsonConverter.java b/tools/cldr-code/src/main/java/org/unicode/cldr/json/Ldml2JsonConverter.java index 35c413c1019..50924b0cfb2 100644 --- 
a/tools/cldr-code/src/main/java/org/unicode/cldr/json/Ldml2JsonConverter.java +++ b/tools/cldr-code/src/main/java/org/unicode/cldr/json/Ldml2JsonConverter.java @@ -23,6 +23,7 @@ import java.util.ArrayList; import java.util.Arrays; import java.util.Collections; +import java.util.HashSet; import java.util.Iterator; import java.util.LinkedList; import java.util.List; @@ -49,6 +50,7 @@ import org.unicode.cldr.util.CLDRLocale; import org.unicode.cldr.util.CLDRPaths; import org.unicode.cldr.util.CLDRTool; +import org.unicode.cldr.util.CLDRTransforms; import org.unicode.cldr.util.CLDRURLS; import org.unicode.cldr.util.CalculatedCoverageLevels; import org.unicode.cldr.util.CldrUtility; @@ -88,6 +90,7 @@ public class Ldml2JsonConverter { private static final String CLDR_PKG_PREFIX = "cldr-"; private static final String FULL_TIER_SUFFIX = "-full"; private static final String MODERN_TIER_SUFFIX = "-modern"; + private static final String TRANSFORM_RAW_SUFFIX = ".txt"; private static Logger logger = Logger.getLogger(Ldml2JsonConverter.class.getName()); enum RunType { @@ -98,7 +101,8 @@ enum RunType { rbnf(false, true), annotations, annotationsDerived, - bcp47(false, false); + bcp47(false, false), + transforms(false, false); private final boolean isTiered; private final boolean hasLocales; @@ -739,6 +743,8 @@ private int convertCldrItems( outFilename = filenameAsLangTag + ".json"; } else if (type == RunType.bcp47) { outFilename = filename + ".json"; + } else if (type == RunType.transforms) { + outFilename = filename + ".json"; } else if (js.section.equals("other")) { // If you see other-___.json, it means items that were missing from // JSON_config_*.txt @@ -775,11 +781,11 @@ private int convertCldrItems( if (type == RunType.main) { avl.full.add(filenameAsLangTag); } - } else if (type == RunType.rbnf) { - js.packageName = "rbnf"; - tier = ""; - } else if (type == RunType.bcp47) { - js.packageName = "bcp47"; + } else if (type == RunType.rbnf + || type == RunType.bcp47 + || type 
== RunType.transforms) { + // untiered, just use the name + js.packageName = type.name(); tier = ""; } if (js.packageName != null) { @@ -884,6 +890,24 @@ private int convertCldrItems( } } + if (item.getUntransformedPath() + .startsWith("//supplementalData/transforms")) { + // here, write the raw data + final String rawTransformFile = filename + TRANSFORM_RAW_SUFFIX; + try (PrintWriter outf = + FileUtilities.openUTF8Writer(outputDir, rawTransformFile)) { + outf.println(item.getValue()); + // note: not logging the write here- it will be logged when the + // .json file is written. + } + final String path = item.getPath(); + item.setPath(fixTransformPath(path)); + final String fullPath = item.getFullPath(); + item.setFullPath(fixTransformPath(fullPath)); + // the value is now the raw filename + item.setValue(rawTransformFile); + } + // some items need to be split to multiple item before processing. None // of those items need to be sorted. // Applies to SPLITTABLE_ATTRS attributes. @@ -943,7 +967,31 @@ private int convertCldrItems( outputUnitPreferenceData(js, theItems, out, nodesForLastItem); } - // closeNodes(out, nodesForLastItem.size() - 2, 0); + // Special processing for transforms. + if (type == RunType.transforms) { + final JsonObject jo = out.getAsJsonObject("transforms"); + if (jo == null || jo.isEmpty()) { + throw new RuntimeException( + "Could not get transforms object in " + filename); + } + @SuppressWarnings("unchecked") + final Entry[] s = jo.entrySet().toArray(new Entry[0]); + if (s == null || s.length != 1) { + throw new RuntimeException( + "Could not get 1 subelement of transforms in " + filename); + } + // key doesn't matter. + // move subitem up + out = s[0].getValue().getAsJsonObject(); + final Entry[] s2 = + out.entrySet().toArray(new Entry[0]); + if (s2 == null || s2.length != 1) { + throw new RuntimeException( + "Could not get 1 sub-subelement of transforms in " + filename); + } + // move sub-subitem up. 
+ out = s2[0].getValue().getAsJsonObject(); + } // write JSON try (PrintWriter outf = FileUtilities.openUTF8Writer(outputDir, outFilename)) { @@ -990,6 +1038,51 @@ private int convertCldrItems( return totalItemsInFile; } + /** + * Fixup an XPathParts with a specific transform element + * + * @param xpp the XPathParts to modify + * @param attribute the attribute name, such as "alias" + */ + private static final void fixTransformPath(final XPathParts xpp, final String attribute) { + final String v = xpp.getAttributeValue(-2, attribute); // on penultimate element + if (v == null) return; + final Set aliases = new HashSet<>(); + final Set bcpAliases = new HashSet<>(); + for (final String s : v.split(" ")) { + final String q = Locale.forLanguageTag(s).toLanguageTag(); + if (s.equals(q)) { + // bcp47 round trips- add to bcp list + bcpAliases.add(s); + } else { + // different - add to other aliases. + aliases.add(s); + } + } + if (aliases.isEmpty()) { + xpp.removeAttribute(-2, attribute); + } else { + xpp.setAttribute(-2, attribute, String.join(" ", aliases.toArray(new String[0]))); + } + if (bcpAliases.isEmpty()) { + xpp.removeAttribute(-2, attribute + "Bcp47"); + } else { + xpp.setAttribute( + -2, attribute + "Bcp47", String.join(" ", bcpAliases.toArray(new String[0]))); + } + } + + /** + * Fixup a transform path, expanding the alias and backwardAlias into bcp47 and non-bcp47 + * attributes. 
+ */ + private static final String fixTransformPath(final String path) { + final XPathParts xpp = XPathParts.getFrozenInstance(path).cloneAsThawed(); + fixTransformPath(xpp, "alias"); + fixTransformPath(xpp, "backwardAlias"); + return xpp.toString(); + } + private static String valueSectionsFormat(int values, int sections) { return MessageFormat.format( "({0, plural, one {# value} other {# values}} in {1, plural, one {# section} other {# sections}})", @@ -1453,6 +1546,24 @@ public void writeDefaultContent(String outputDir) throws IOException { outf.close(); } + public void writeTransformMetadata(String outputDir) throws IOException { + final String dirName = outputDir + "/cldr-" + RunType.transforms.name(); + final String fileName = RunType.transforms.name() + ".json"; + PrintWriter outf = FileUtilities.openUTF8Writer(dirName, fileName); + System.out.println( + PACKAGE_ICON + + " Creating packaging file => " + + dirName + + File.separator + + fileName); + JsonObject obj = new JsonObject(); + obj.add( + RunType.transforms.name(), + gson.toJsonTree(CLDRTransforms.getInstance().getJsonIndex())); + outf.println(gson.toJson(obj)); + outf.close(); + } + public void writeCoverageLevels(String outputDir) throws IOException { try (PrintWriter outf = FileUtilities.openUTF8Writer(outputDir + "/cldr-core", "coverageLevels.json"); ) { @@ -2225,6 +2336,8 @@ public void processDirectory(String dirName, DraftStatus minimalDraftStatus) if (Boolean.parseBoolean(options.get("packagelist").getValue())) { writePackageList(outputDir); } + } else if (type == RunType.transforms) { + writeTransformMetadata(outputDir); } } } diff --git a/tools/cldr-code/src/main/java/org/unicode/cldr/json/LdmlConvertRules.java b/tools/cldr-code/src/main/java/org/unicode/cldr/json/LdmlConvertRules.java index 7e890aa5052..d15e233e861 100644 --- a/tools/cldr-code/src/main/java/org/unicode/cldr/json/LdmlConvertRules.java +++ b/tools/cldr-code/src/main/java/org/unicode/cldr/json/LdmlConvertRules.java @@ -154,7 
+154,14 @@ class LdmlConvertRules { "identity:variant:type", // in common/bcp47/*.xml - "keyword:key:name"); + "keyword:key:name", + + // transforms + + // transforms + "transforms:transform:source", + "transforms:transform:target", + "transforms:transform:direction"); /** * The set of element:attribute pair in which the attribute should be treated as value. All the diff --git a/tools/cldr-code/src/main/java/org/unicode/cldr/util/CLDRTransforms.java b/tools/cldr-code/src/main/java/org/unicode/cldr/util/CLDRTransforms.java index 2bcee0f7dd9..7f41cf3e577 100644 --- a/tools/cldr-code/src/main/java/org/unicode/cldr/util/CLDRTransforms.java +++ b/tools/cldr-code/src/main/java/org/unicode/cldr/util/CLDRTransforms.java @@ -1128,4 +1128,20 @@ static String parseDoubleColon(String x, Set others) { } return ""; } + + public class CLDRTransformsJsonIndex { + /** raw list of available IDs */ + public String[] available = + getAvailableIds().stream() + .map((String id) -> id.replace(".xml", "")) + .sorted() + .collect(Collectors.toList()) + .toArray(new String[0]); + } + + /** This gets the metadata (index file) exposed as cldr-json/cldr-transforms/transforms.json */ + public CLDRTransformsJsonIndex getJsonIndex() { + final CLDRTransformsJsonIndex index = new CLDRTransformsJsonIndex(); + return index; + } } diff --git a/tools/cldr-code/src/main/resources/org/unicode/cldr/json/JSON_config_transforms.txt b/tools/cldr-code/src/main/resources/org/unicode/cldr/json/JSON_config_transforms.txt new file mode 100644 index 00000000000..9734f36fe6a --- /dev/null +++ b/tools/cldr-code/src/main/resources/org/unicode/cldr/json/JSON_config_transforms.txt @@ -0,0 +1,2 @@ +section=transforms ; path=//cldr/supplemental/transforms/.* ; package=transforms ; packageDesc=Transform data +dependency=core ; package=transforms diff --git a/tools/cldr-code/src/main/resources/org/unicode/cldr/json/pathTransforms.txt b/tools/cldr-code/src/main/resources/org/unicode/cldr/json/pathTransforms.txt index 
8457c44e3ac..6f97b92ce3f 100644 --- a/tools/cldr-code/src/main/resources/org/unicode/cldr/json/pathTransforms.txt +++ b/tools/cldr-code/src/main/resources/org/unicode/cldr/json/pathTransforms.txt @@ -130,10 +130,6 @@ < (.*(GMT|UTC).*/exemplarCity)(.*) > -# -< (.*/transforms/transform[^/]*)/(.*) -> $1/tRules/$2 - # < (.*)\[@territories="([^"]*)"\](.*)\[@alt="variant"\](.*) > $1\[@territories="$2-alt-variant"\] @@ -173,3 +169,7 @@ # ParentLocales < (.*/parentLocales)\[@component="([^"]*)"\]/(parentLocale)(.*)$ > $1/$2$4 + +# Transform - drop terminal tRule element +< //supplementalData/transforms/transform(.*)/tRule.*$ +> //supplementalData/transforms/transform$1/_rulesFile
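
Note on the alias handling in `fixTransformPath` in the `Ldml2JsonConverter.java` hunk: each space-separated value of `alias` or `backwardAlias` is moved to the matching `aliasBcp47` or `backwardAliasBcp47` attribute only if it survives a round trip through `java.util.Locale` unchanged. The following is a minimal, self-contained sketch of that check, not code from the patch. The class name `AliasSplitSketch`, the nested `Result` type, and the `split` method are hypothetical, and sorted sets are assumed here only to make the printed order predictable (the patch itself uses `HashSet`).

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration of the round-trip test; not part of the patch. */
public class AliasSplitSketch {
    /** Holds the two groups that the patch writes to separate attributes. */
    public static final class Result {
        /** Values that are unchanged after a Locale round trip. */
        public final Set<String> bcp47 = new TreeSet<>();
        /** Values that the round trip alters, kept as legacy aliases. */
        public final Set<String> other = new TreeSet<>();
    }

    /** Splits a space-separated attribute value into the two groups. */
    public static Result split(String attributeValue) {
        Result r = new Result();
        for (String s : attributeValue.split(" ")) {
            // Same test as in the patch: parse as a language tag, then re-serialize.
            String q = Locale.forLanguageTag(s).toLanguageTag();
            if (s.equals(q)) {
                r.bcp47.add(s);
            } else {
                r.other.add(s);
            }
        }
        return r;
    }

    public static void main(String[] args) {
        Result r = split("en-US Latin-ASCII");
        System.out.println("bcp47: " + r.bcp47); // prints "bcp47: [en-US]"
        System.out.println("other: " + r.other); // prints "other: [Latin-ASCII]"
    }
}
```

A value such as `Latin-ASCII` lands in the second group because the re-serialized tag does not match the input character for character, so the comparison is sensitive to case as well as to syntax.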