CLDR-18002 Update population and likely subtags for MU, TK, ZM and SL #4104

Open
wants to merge 1 commit into
base: main

conradarcturus
Contributor

There are 4 manual overrides in GenerateLikelySubtags.java that conflict with other data: for MU, SL, TK, and ZM.

For each country, the local language (mfe, kri, tkl, and bem, respectively) is spoken by far more people than English, even if English is the main language of instruction. Education and literacy in each country is low enough that the local languages should be considered the dominant ones.

I was able to find censuses listing language characteristics for MU, TK, and ZM. I wasn't able to find data for SL, but I removed its override anyway.

To regenerate the data, use this command: `mvn package -DskipTests=true && java -jar tools/cldr-code/target/cldr-code.jar ConvertLanguageData && java -jar tools/cldr-code/target/cldr-code.jar GenerateLikelySubtags && java -jar tools/cldr-code/target/cldr-code.jar GenerateTestData`

CLDR-18002

  • This PR completes the ticket.

ALLOW_MANY_COMMITS=true

@srl295 deleted the branch main October 25, 2024 16:35
@srl295 closed this Oct 25, 2024
@srl295 reopened this Oct 25, 2024
@srl295 added the ddl DDL-SC specific work label Oct 25, 2024
@srl295 changed the base branch from _ddl/v47 to main October 25, 2024 17:40
@srl295 force-pushed the CLDR-18002-Update-pop-MU-TK-ZM branch from 1ee81e8 to 89ade9a October 25, 2024 17:40
@jira-pull-request-webhook

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • common/supplemental/supplementalData.xml is different
  • common/testData/localeIdentifiers/likelySubtags.txt is different
  • common/testData/localeIdentifiers/localeDisplayName.txt is now changed in the branch
  • tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateLikelySubtags.java is different
  • tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/country_language_population.tsv is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@macchiati
Member

I'm generally ok with this, but for the following:

> Education and literacy in each country is low enough that the local languages should be considered the dominant ones.

I think that would clearly be the case for voice UIs. Not so clear for text UIs. We should discuss more about how to cleanly segment those. For example, we might have a separate set of likely subtags / locale matching for voice than for text.

CLDR-18002 Actually make local languages the default matches

The prior change didn't quite work because und_MU was still defaulting to en_Latn_MU. This fixes it to go to mfe, and does the same for the other languages.

The problem is that English is official in these countries, so there's a mismatch.
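
To make the mismatch described in this commit message concrete, here is a minimal Java sketch. It is not the logic of GenerateLikelySubtags.java: the class name, both maps, and the `resolve` method are hypothetical, and the table entries are hand-written illustrations (the `Latn` script on the local-language entries is an assumption) rather than generated CLDR data. It only shows how an official-language override can win over the most widely spoken language for an `und_XX` lookup.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the actual CLDR tool code.
final class LikelySubtagsSketch {
    // Assumed result of ranking languages by speaker population.
    static final Map<String, String> POPULATION_DERIVED = new HashMap<>();
    // Assumed manual overrides that prefer the official language.
    static final Map<String, String> MANUAL_OVERRIDES = new HashMap<>();

    static {
        POPULATION_DERIVED.put("und_MU", "mfe_Latn_MU");
        POPULATION_DERIVED.put("und_SL", "kri_Latn_SL");
        POPULATION_DERIVED.put("und_TK", "tkl_Latn_TK");
        POPULATION_DERIVED.put("und_ZM", "bem_Latn_ZM");

        MANUAL_OVERRIDES.put("und_MU", "en_Latn_MU");
        MANUAL_OVERRIDES.put("und_SL", "en_Latn_SL");
        MANUAL_OVERRIDES.put("und_TK", "en_Latn_TK");
        MANUAL_OVERRIDES.put("und_ZM", "en_Latn_ZM");
    }

    // Returns the maximized tag; an override, when enabled, takes precedence.
    // Unknown tags are returned unchanged.
    static String resolve(String tag, boolean useOverrides) {
        if (useOverrides && MANUAL_OVERRIDES.containsKey(tag)) {
            return MANUAL_OVERRIDES.get(tag);
        }
        return POPULATION_DERIVED.getOrDefault(tag, tag);
    }

    public static void main(String[] args) {
        System.out.println(resolve("und_MU", true));   // override wins
        System.out.println(resolve("und_MU", false));  // population-derived entry
    }
}
```

With the overrides enabled, `und_MU` resolves to the English entry; with them disabled, it falls through to the population-derived one, which is the behavior this commit was aiming for.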

CLDR-18002 Style fix

`mvn --file=tools/pom.xml spotless:apply`

CLDR-18002 Default to English since its official
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • common/supplemental/likelySubtags.xml is different
  • common/supplemental/supplementalData.xml is different
  • common/testData/localeIdentifiers/likelySubtags.txt is different
  • common/testData/localeIdentifiers/localeDisplayName.txt is no longer changed in the branch
  • tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/country_language_population.tsv is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@conradarcturus
Contributor Author

conradarcturus commented Oct 30, 2024

> we might have a separate set of likely subtags / locale matching for voice than for text.

@macchiati I think that's a great idea to differentiate likely subtags for voice & text content. We could perhaps also make a policy for macrolanguages, e.g. the Arabic, Chinese, and Fulah dialects. For the most part, text would be best classified as just zh/ar/ff. However, spoken content will differ significantly across the constituent dialects (yue, cmn, apc, ary, fuv, ...). Do we have any initiatives to get CLDR/Unicode to work better for spoken content?
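
As a rough sketch of what such a policy might look like (purely hypothetical: the class, the `Modality` enum, and `tagFor` are invented for illustration and are not part of CLDR or ICU), text content could collapse a dialect code to its macrolanguage while voice content keeps the specific code. The mapping covers only the codes mentioned above.

```java
import java.util.Map;

// Hypothetical illustration of a text-vs-voice tagging policy; not a CLDR API.
final class ModalityTagSketch {
    enum Modality { TEXT, VOICE }

    // Dialect code -> macrolanguage code, limited to the examples in this thread.
    static final Map<String, String> MACROLANGUAGE_OF = Map.of(
            "yue", "zh",
            "cmn", "zh",
            "apc", "ar",
            "ary", "ar",
            "fuv", "ff");

    // TEXT collapses to the macrolanguage; VOICE keeps the specific code.
    static String tagFor(String languageCode, Modality modality) {
        if (modality == Modality.TEXT) {
            return MACROLANGUAGE_OF.getOrDefault(languageCode, languageCode);
        }
        return languageCode;
    }

    public static void main(String[] args) {
        System.out.println(tagFor("yue", Modality.TEXT));
        System.out.println(tagFor("yue", Modality.VOICE));
    }
}
```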

@macchiati
Member

Inflections and RBNF play a role, but no organized initiatives yet. We have made room for separate tagging, eg

(intending to allow for a 'voice' in the future.)
