CLDR-18002 Update population and likely subtags for MU, TK, ZM and SL #4104
base: main
1ee81e8 to 89ade9a
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot
89ade9a to 91049e8
Notice: the branch changed across the force-push! ~ Your Friendly Jira-GitHub PR Checker Bot
I'm generally ok with this, but for the following:
I think that would clearly be the case for voice UIs. It is not so clear for text UIs. We should discuss further how to cleanly segment those. For example, we might have a separate set of likely subtags / locale matching for voice than for text.
There are 4 manual overrides in GenerateLikelySubtags.java that conflict with other data: for MU, SL, TK, and ZM.
For each country, the local language (mfe, kri, tkl, and bem) is spoken by far more than English, even if English is the main language of instruction. Education and literacy in each country is low enough that the local languages should be considered the dominant ones.
I was able to find censuses listing language characteristics for MU, TK and ZM. SL I wasn't able to find data, but I removed the override.
To regenerate data use this command
`mvn package -DskipTests=true && java -jar tools/cldr-code/target/cldr-code.jar ConvertLanguageData && java -jar tools/cldr-code/target/cldr-code.jar GenerateLikelySubtags && java -jar tools/cldr-code/target/cldr-code.jar GenerateTestData`

CLDR-18002 Actually make local languages the default matches
The prior change didn't exactly work because und_MU was defaulting to en_Latn_MU -- this fixes it to go to mfe -- also for the other languages. The problem is that English is official in these countries so there's a mis-match

CLDR-18002 Style fix
`mvn --file=tools/pom.xml spotless:apply`

CLDR-18002 Default to English since its official
91049e8 to 32878b6
Notice: the branch changed across the force-push! ~ Your Friendly Jira-GitHub PR Checker Bot
@macchiati I think that's a great idea to differentiate likely subtags for voice & text content. We could perhaps also make a policy for macrolanguages, e.g. Arabic, Chinese, and Fulah dialects. For the most part, text would be best classified as just zh/ar/ff. However, spoken content will differ significantly across the constituent dialects (yue, cmn, apc, ary, fuv, ...). Do we have any initiatives to get CLDR/Unicode to work better for spoken content?
Inflections and RBNF play a role, but there are no organized initiatives yet. We have made room for separate tagging, e.g.
(intending to allow for a 'voice' in the future.)
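To make the macrolanguage suggestion in this thread concrete, here is a minimal sketch of what such a policy might look like. It is not an existing CLDR API or an agreed policy: the `tag_for` function and the `modality` parameter are hypothetical names, and the table only covers the codes mentioned above (using their ISO 639-3 macrolanguage assignments).

```python
# Individual language -> ISO 639 macrolanguage, limited to the codes
# mentioned in this thread. Not a complete table.
MACROLANGUAGE = {
    "yue": "zh",
    "cmn": "zh",
    "apc": "ar",
    "ary": "ar",
    "fuv": "ff",
}


def tag_for(language: str, modality: str) -> str:
    """Hypothetical policy: collapse to the macrolanguage for text only.

    `modality` is an assumed parameter ("text" or "voice"); nothing like
    it is defined by CLDR today.
    """
    if modality not in ("text", "voice"):
        raise ValueError(f"unknown modality: {modality!r}")
    if modality == "text":
        # Text is classified under the macrolanguage when one is known.
        return MACROLANGUAGE.get(language, language)
    # Voice keeps the individual language, since spoken forms differ.
    return language
```

Under this sketch, `tag_for("yue", "text")` gives `zh` while `tag_for("yue", "voice")` stays `yue`; languages outside the table pass through unchanged.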
There are 4 manual overrides in GenerateLikelySubtags.java that conflict with other data: for MU, SL, TK, and ZM.
For each country, the local language (mfe, kri, tkl, and bem) is spoken by far more people than English, even if English is the main language of instruction. Education and literacy levels in each country are low enough that the local languages should be considered the dominant ones.
I was able to find censuses listing language characteristics for MU, TK and ZM. For SL I wasn't able to find data, but I removed the override.
To regenerate the data, use this command:
`mvn package -DskipTests=true && java -jar tools/cldr-code/target/cldr-code.jar ConvertLanguageData && java -jar tools/cldr-code/target/cldr-code.jar GenerateLikelySubtags && java -jar tools/cldr-code/target/cldr-code.jar GenerateTestData`
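As a rough illustration of the conflict described above, the sketch below shows how a manual override can mask the language that population data would otherwise select for a region. This is not the logic in GenerateLikelySubtags.java: the function names are hypothetical, the speaker counts are made-up placeholders rather than census figures, and Latin script is simply assumed.

```python
# Illustration only. The numbers are placeholders, NOT census or CLDR data.
# region -> {language: assumed number of speakers}
SPEAKERS = {
    "MU": {"mfe": 900, "en": 100},
    "ZM": {"bem": 600, "en": 300},
}

# region -> language forced by hand, regardless of speaker counts
MANUAL_OVERRIDES = {"MU": "en"}


def likely_language(region, overrides):
    """Return the override if present, else the most widely spoken language."""
    if region in overrides:
        return overrides[region]
    counts = SPEAKERS[region]
    return max(counts, key=counts.get)


def likely_locale(region, overrides):
    """Expand und_<region> to a full locale in this toy model."""
    # Latin script is assumed for every language in the toy table.
    return f"{likely_language(region, overrides)}_Latn_{region}"
```

With the override in place, `likely_locale("MU", MANUAL_OVERRIDES)` yields `en_Latn_MU`; with an empty override table, the same call yields `mfe_Latn_MU`, which is the behavior the population data alone would suggest.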
CLDR-18002
ALLOW_MANY_COMMITS=true