CLDR-16720 json: add transforms #4036

srl295 · 2024-09-10T21:36:07Z

- new package cldr-transforms
- add manifest file transforms.json at the top level
- each transform has a metadata file (transforms/ID.json) and a raw text file (transforms/ID.txt).
- metadata has all of the keys from the transform rule
- the _rulesFile key formally indicates the textfile's name (in case we need to massage the id for some reason in the future).

Sample data available at this branch: https://github.com/unicode-org/cldr-json/tree/cldr-16720/transforms/cldr-json/cldr-transforms

CLDR-16720

This PR completes the ticket.

ALLOW_MANY_COMMITS=true

jira-pull-request-webhook · 2024-09-10T21:37:56Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

srl295 · 2024-09-10T22:18:31Z

Deploy will fail because I used my personal fork, so no preview URL.

Please review the Markdown change carefully.

tools/cldr-code/src/main/java/org/unicode/cldr/json/Ldml2JsonConverter.java

sffc · 2024-09-11T04:17:45Z

The mix of JSON and TXT files in the same directory might make it hard to parse. The file extension is the only way to tell the difference, which usually isn't ideal. Can you split them into separate directories?

Also, since the JSON files are basically metadata on the TXT files, I have a mild expectation that there would be a single index.json with all the transform metadata in one place.

robertbastian · 2024-09-11T08:17:07Z

> Also, since the JSON files are basically metadata on the TXT files, I have a mild expectation that there would be a single index.json with all the transform metadata in one place.

This is per-locale data. Nowhere in CLDR-JSON are multiple locales merged into an index file; they always use the file system structure.

srl295 · 2024-09-11T17:03:41Z

> The mix of JSON and TXT files in the same directory might make it hard to parse. The file extension is the only way to tell the difference, which usually isn't ideal. Can you split them into separate directories?

It's designed so clients don't need to crawl the directory or parse any filenames.

Look at the transforms.json file. It has a list of ids, with no extension:

{
  "transforms": {
    "available": [
      "InterIndic-Bengali",
      "Oriya-Arabic",
      "my-t-my-d0-zawgyi",
      "tlh-am",

For each id, there is a transforms/<id>.json file with the metadata:

{
  "transforms": {
    "BGN": {
      "_visibility": "external",
      "_alias": "Amharic-Latin/BGN am-Latn-t-am-m0-bgn",
      "_source": "am",
      "_target": "am_Latn",
      "_direction": "forward",
      "_rulesFile": "Amharic-Latin-BGN.txt"
    }
  }
}

One of the metadata items is _rulesFile, which holds the path to the .txt file:

# Originally prepared by Michael Everson <[email protected]>
########################################################################
# MINIMAL FILTER: Amharic-Latin
:: [ሀ-᎙] ;
:: NFD (NFC) ;
$ejective = ’;
$glottal  = ’;
…

> Also, since the JSON files are basically metadata on the TXT files, I have a mild expectation that there would be a single index.json with all the transform metadata in one place.

There is: it's transforms.json.
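
The lookup described in this comment (manifest, then per-id metadata, then rules file) can be sketched in Python as follows. This is an illustration under stated assumptions, not code from the PR: the helper names are hypothetical, the rules file is assumed to sit next to the metadata file in transforms/, and `find_metadata` tolerates wrapper objects because the nesting of the metadata is still under discussion in this thread.

```python
import json
from pathlib import Path


def find_metadata(node):
    """Descend through single-key wrapper objects until a dict with _rulesFile is found."""
    while isinstance(node, dict) and "_rulesFile" not in node:
        if len(node) != 1:
            raise ValueError("expected a single-key wrapper or a metadata object")
        node = next(iter(node.values()))
    if not isinstance(node, dict):
        raise ValueError("no _rulesFile key found")
    return node


def available_transforms(package_dir):
    """Read the list of ids from the top-level manifest (transforms.json)."""
    with open(Path(package_dir) / "transforms.json", encoding="utf-8") as f:
        return json.load(f)["transforms"]["available"]


def load_transform(package_dir, transform_id):
    """Return (metadata, rules_text) for one id listed in the manifest."""
    base = Path(package_dir) / "transforms"
    with open(base / f"{transform_id}.json", encoding="utf-8") as f:
        metadata = find_metadata(json.load(f))
    # Assumption: _rulesFile is a file name relative to the transforms/ directory.
    rules_text = (base / metadata["_rulesFile"]).read_text(encoding="utf-8")
    return metadata, rules_text
```

A client would call `available_transforms(dir)` once and then `load_transform(dir, id)` per id, so it never has to list the directory or inspect file extensions.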

srl295 · 2024-09-11T17:10:31Z

I see a couple of bugs:

- _source and _target need to be bcp47, not old IDs.
- There's a bug in _alias that has some corruption.
- The 2nd level key in the metadata .json has a problem (because that ID might have slashes in it).

sffc · 2024-09-11T17:33:17Z

> > Also, since the JSON files are basically metadata on the TXT files, I have a mild expectation that there would be a single index.json with all the transform metadata in one place.
>
> This is per-locale data. Nowhere in CLDR-JSON are multiple locales merged into an index file; they always use the file system structure.

https://github.com/unicode-org/cldr-json/blob/main/cldr-json/cldr-core/supplemental/plurals.json

> Look at the transforms.json file. It has a list of ids, with no extension

OK, that looks cool, thanks! I didn't see it the first time.

I still think I mildly favor not putting the JSON and TXT in the same directory, but I'll leave that to @robertbastian to weigh in on.

srl295 · 2024-09-11T17:34:54Z

I don't see why being in the same directory would be a problem.

As Mark suggested, we need docs on the formats. That will be a separate effort.

macchiati

I had approved this, but Steven mentioned 3 outstanding bugs:

- _source and _target need to be bcp47, not old IDs.
- There's a bug in _alias that has some corruption.
- The 2nd level key in the metadata .json has a problem (because that ID might have slashes in it).

macchiati · 2024-09-17T17:04:34Z

Shane, are there any other blockers?

- properly use BCP47 for source/target - fix corruption in alias and slashes in output

- back out bcp47 - broke some source/target ids

srl295 · 2024-09-17T23:01:44Z

Please review sample data in https://github.com/unicode-org/cldr-json/tree/cldr-16720/transforms/cldr-json/cldr-transforms

I now think source and target should not be bcp47 as they aren't always locale IDs. The alias field contains a bcp47 alias.

macchiati · 2024-09-18T04:33:10Z

Right, as long as the alias data is preserved. E.g.:

<transform source="Any" target="Accents" direction="both" alias="und-t-d0-accents" backwardAlias="und-t-s0-accents">

However, it would probably be better to pull the bcp47 items out into separate fields. That could be done later, though.
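
To illustrate what "preserving the alias data" means for the JSON side, here is a minimal Python sketch that copies every attribute of a `<transform>` element into a `_`-prefixed key, as in the sample metadata shown in this thread. It is not the converter's actual code: `transform_metadata` and the rules file name `Any-Accents.txt` are hypothetical, and trimming whitespace and skipping empty attributes are assumptions.

```python
import xml.etree.ElementTree as ET


def transform_metadata(xml_text, rules_file):
    """Map XML attributes of a <transform> element to _-prefixed JSON keys."""
    element = ET.fromstring(xml_text)
    metadata = {
        f"_{name}": value.strip()
        for name, value in element.attrib.items()
        if value.strip()  # assumption: empty attributes are omitted
    }
    # _rulesFile does not come from an XML attribute; it names the raw rules text.
    metadata["_rulesFile"] = rules_file
    return metadata


example = (
    '<transform source="Any" target="Accents" direction="both" '
    'alias="und-t-d0-accents" backwardAlias="und-t-s0-accents"/>'
)
print(transform_metadata(example, "Any-Accents.txt"))
```

Because the attributes are copied through unchanged, both `_alias` and `_backwardAlias` survive in the output.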

sffc · 2024-09-18T06:15:24Z

What is the name of the second-level key? In Amharic-Latin-BGN.json it is BGN:

{
  "transforms": {
    "BGN": {
      "_value": "Amharic-Latin-BGN.txt",
      "_visibility": "external",
      "_alias": "Amharic-Latin/BGN am-Latn-t-am-m0-bgn",
      "_source": "am",
      "_target": "am_Latn",
      "_direction": "forward"
    }
  }
}

But in most other files it is just "transform".

I agree that it would be nice to name the files consistently with either the alias name or the BCP-47 name but not a mix as is currently in the branch.

robertbastian · 2024-09-18T08:14:17Z

I'm not a fan of the nesting in the JSON files.

{
  "transforms": {
    "BGN": {
      "_value": "Arabic-Latin-BGN.txt",
      "_visibility": "external",
      "_alias": "Arabic-Latin/BGN ar-Latn-t-ar-m0-bgn",
      "_source": "ar",
      "_target": "ar_Latn",
      "_direction": "forward"
    }
  }
}

could be represented as

{
  "_value": "Arabic-Latin-BGN.txt",
  "_visibility": "external",
  "_alias": "Arabic-Latin/BGN ar-Latn-t-ar-m0-bgn",
  "_source": "ar",
  "_target": "ar_Latn",
  "_variant": "BGN",
  "_direction": "forward"
}

As far as I can tell no JSON file will have multiple values in the transforms map.
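
The flattening proposed here can be sketched in Python as below. This is only an illustration: `flatten_metadata` is a hypothetical name, and the sketch relies on the observation above that each file has exactly one entry under "transforms" (it raises an error otherwise).

```python
def flatten_metadata(nested):
    """Hoist the single entry under "transforms" to the top level.

    The second-level key (for example "BGN") is kept as "_variant",
    following the proposal in this comment.
    """
    # Tuple unpacking raises ValueError unless there is exactly one entry.
    (variant, fields), = nested["transforms"].items()
    flat = dict(fields)
    flat["_variant"] = variant
    return flat


nested = {
    "transforms": {
        "BGN": {
            "_value": "Arabic-Latin-BGN.txt",
            "_visibility": "external",
            "_alias": "Arabic-Latin/BGN ar-Latn-t-ar-m0-bgn",
            "_source": "ar",
            "_target": "ar_Latn",
            "_direction": "forward",
        }
    }
}
print(flatten_metadata(nested))
```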

macchiati · 2024-09-18T14:09:30Z

Sounds reasonable. The _value should be more meaningful, like: rules. It doesn't need a _ since it doesn't correspond to an attribute in XML.

srl295 · 2024-09-19T05:04:27Z

> What is the name of the second level key? In Amharic-Latin-BGN.json it is BGN

That's a bug. Will fix.

> But in most other files it is just "transform".
>
> I agree that it would be nice to name the files consistently with either the alias name or the BCP-47 name but not a mix as is currently in the branch.

The files are named by the id, which is neither the bcp47 nor the alias name, as explained.

srl295 · 2024-09-19T05:05:17Z

Agh, another bug: _value is supposed to be _rulesFile.

srl295 · 2024-09-19T05:08:14Z

I'll change it to:

{
  "transform": {
    "_source": "am",
    ...
  }
}

All of the JSON files have similarly duck-typed content.

- hoist json content up 2 levels - fix 'BGN' in path

srl295 · 2024-09-19T20:54:06Z

please recheck sample data in https://github.com/unicode-org/cldr-json/tree/cldr-16720/transforms/cldr-json/cldr-transforms

addressed issues

sffc · 2024-09-19T21:07:55Z

I still don't understand why half of these are identified by BCP-47 and half are identified by their alias.

Armenian-Latin-BGN.json
{
  "_visibility": "external",
  "_alias": "Armenian-Latin/BGN hy-Latn-t-hy-m0-bgn",
  "_source": "hy",
  "_target": "hy_Latn",
  "_direction": "forward",
  "_rulesFile": "Armenian-Latin-BGN.txt"
}

am-Ethi-t-am-brai.json
{
  "_backwardAlias": "Braille-Ethiopic/Amharic am-Ethi-t-am-brai",
  "_visibility": "external",
  "_alias": "Ethiopic-Braille/Amharic am-Brai-t-am-ethi",
  "_source": "am_Ethi",
  "_target": "am_Brai",
  "_direction": "both",
  "_rulesFile": "am-Ethi-t-am-brai.txt"
}

srl295 · 2024-09-19T22:29:23Z

> I still don't understand why half of these are identified by BCP-47 and half are identified by their alias.

That's how they are identified in the source data.

macchiati · 2024-09-19T23:25:16Z

In CLDR, the BCP47 version of the ID is in the _alias list (or, for the reverse direction, the _backwardAlias list). Those alias lists also contain the non-BCP47 alias.

In JSON, we could filter these to break:

  "_backwardAlias": "Braille-Ethiopic/Amharic am-Ethi-t-am-brai",

  "_alias": "Ethiopic-Braille/Amharic am-Brai-t-am-ethi",

into

  "_backwardAlias": "Braille-Ethiopic/Amharic",
  "_backwardAliasBcp47": "am-Ethi-t-am-brai",

  "_alias": "Ethiopic-Braille/Amharic",
  "_aliasBcp47": "am-Brai-t-am-ethi",

Even better would be for us to do this in the XML source, but that's not something we could do in v46.

srl295 · 2024-09-19T23:27:36Z

@macchiati filter how? How do I know which alias is which?

Maybe we should go with this format, and we can add bcp47?

macchiati · 2024-09-19T23:47:45Z

I think it is sufficient to parse with a strict BCP47 parser. If it succeeds without error, it is BCP47, otherwise legacy.
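
A rough Python sketch of this split follows. Note the assumptions: the Python standard library has no BCP 47 parser, so `looks_like_bcp47_transform_id` is a simplified stand-in for a strict parser, not an equivalent. It accepts an alias only if every subtag is 1 to 8 alphanumerics, the first subtag is a 2 or 3 letter language code, and a `t` extension singleton is present (as in the BCP 47 aliases quoted in this thread). All function names are hypothetical.

```python
import re

_SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")
_LANGUAGE = re.compile(r"[A-Za-z]{2,3}")


def looks_like_bcp47_transform_id(alias):
    """Heuristic stand-in for a strict BCP 47 parse (see assumptions above)."""
    subtags = alias.split("-")
    if not all(_SUBTAG.fullmatch(s) for s in subtags):
        return False  # e.g. "Latin/BGN" contains "/" and is too long
    if not _LANGUAGE.fullmatch(subtags[0]):
        return False  # e.g. "Latin" is not a 2-3 letter language code
    return "t" in (s.lower() for s in subtags[1:])


def split_aliases(value):
    """Split a space-separated alias list into (legacy, bcp47) lists."""
    legacy, bcp47 = [], []
    for alias in value.split():
        (bcp47 if looks_like_bcp47_transform_id(alias) else legacy).append(alias)
    return legacy, bcp47


print(split_aliases("Ethiopic-Braille/Amharic am-Brai-t-am-ethi"))
```

A real implementation would swap the heuristic for a validating parser; the 2 to 3 letter language restriction is there because names such as `Latin-Ethiopic` would otherwise pass a purely syntactic check.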

srl295 · 2024-09-19T23:55:22Z

> I think it is sufficient to parse with a strict BCP47 parser. If it succeeds without error, it is BCP47, otherwise legacy.

Seems a bit imprecise/mixed, but OK.

macchiati · 2024-09-20T17:24:24Z

It will work, but as I wrote earlier we should add better structure

srl295 · 2024-09-20T19:58:54Z

Can you recommend a parser and options?

- split bcp47 and non-bcp47 aliases.

srl295 · 2024-09-21T04:51:04Z

OK, try the latest, same review link.

Attributes are only present if non-empty.

{
  "_backwardAlias": "Latin-Ethiopic/Tekie_Alibekit",
  "_visibility": "external",
  "_backwardAliasBcp47": "byn-Ethi-t-byn-latn-m0-tekieali",
  "_alias": "Ethiopic-Latin/Tekie_Alibekit",
  "_aliasBcp47": "byn-Latn-t-byn-ethi-m0-tekieali",
  "_source": "byn_Ethi",
  "_direction": "both",
  "_target": "byn_Latn",
  "_rulesFile": "byn-Ethi-t-byn-latn-m0-tekie-alibekit.txt"
}
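
For readers consuming this format, a minimal Python sketch of parsing one metadata object follows. It is an assumption-laden illustration: the class and function names are hypothetical, and treating `_source`, `_target`, `_direction` and `_rulesFile` as always present is inferred only from the samples in this thread. Every other key is read as optional, since attributes are omitted when empty.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TransformMetadata:
    source: str
    target: str
    direction: str
    rules_file: str
    visibility: Optional[str] = None
    alias: Optional[str] = None
    alias_bcp47: Optional[str] = None
    backward_alias: Optional[str] = None
    backward_alias_bcp47: Optional[str] = None


def parse_metadata(obj):
    """Build a TransformMetadata from a flat metadata dict."""
    return TransformMetadata(
        # Assumed required: a missing key raises KeyError.
        source=obj["_source"],
        target=obj["_target"],
        direction=obj["_direction"],
        rules_file=obj["_rulesFile"],
        # Optional: absent keys become None.
        visibility=obj.get("_visibility"),
        alias=obj.get("_alias"),
        alias_bcp47=obj.get("_aliasBcp47"),
        backward_alias=obj.get("_backwardAlias"),
        backward_alias_bcp47=obj.get("_backwardAliasBcp47"),
    )
```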

macchiati

Spot-checked the data again; looks good to me.

srl295 requested review from sffc, macchiati and btangmu September 10, 2024 21:36

srl295 self-assigned this Sep 10, 2024

srl295 changed the title ~~CLDR-17620 json: add transforms~~ CLDR-16720 json: add transforms Sep 10, 2024

srl295 force-pushed the cldr-16720/json-xlit branch from 7bdacce to 6d40472 Compare September 10, 2024 21:37

CLDR-16720 json: Update the release note

4750f88

srl295 force-pushed the cldr-16720/json-xlit branch from f19ce43 to 4750f88 Compare September 10, 2024 21:43

CLDR-16720 merge from main

3c104e4

srl295 had a problem deploying to cloudflare September 10, 2024 22:03 — with GitHub Actions Failure

github-advanced-security bot found potential problems Sep 10, 2024

tools/cldr-code/src/main/java/org/unicode/cldr/json/Ldml2JsonConverter.java (fixed)

macchiati approved these changes Sep 11, 2024

macchiati approved these changes Sep 14, 2024

macchiati previously requested changes Sep 17, 2024

CLDR-16720 json transliterator update

6ea2b87

- properly use BCP47 for source/target - fix corruption in alias and slashes in output

srl295 had a problem deploying to cloudflare September 17, 2024 22:34 — with GitHub Actions Failure

CLDR-16720 spotless

f93472d

srl295 had a problem deploying to cloudflare September 17, 2024 22:35 — with GitHub Actions Failure

CLDR-16720 json transliterator update

5fc8be5

- back out bcp47 - broke some source/target ids

srl295 requested a review from macchiati September 17, 2024 23:00

srl295 marked this pull request as draft September 19, 2024 05:10

CLDR-16720 json transliterator- improve format

054ab28

- hoist json content up 2 levels - fix 'BGN' in path

srl295 had a problem deploying to cloudflare September 19, 2024 20:53 — with GitHub Actions Failure

srl295 marked this pull request as ready for review September 19, 2024 20:53

CLDR-16720 json transliterator- split out bcp47 aliases

fcbacb9

- split bcp47 and non-bcp47 aliases.

srl295 had a problem deploying to cloudflare September 21, 2024 04:51 — with GitHub Actions Failure

macchiati approved these changes Sep 21, 2024

srl295 merged commit 8a22f67 into unicode-org:main Sep 21, 2024
12 of 13 checks passed

srl295 deleted the cldr-16720/json-xlit branch September 21, 2024 21:05

conradarcturus pushed a commit that referenced this pull request Sep 25, 2024

CLDR-16720 json: add transforms (#4036)

12d4847
