Unassigned/non-standard (compound) language and dialect codes #432

agutkin · 2021-06-24T11:03:56Z

Wiktionary has entries for several languages and dialects with unofficial codes we can't scrape. Some examples of these include

Central Franconian: gmw-cfr
Old Galician/Portuguese: roa-opt
Westrobothnian: gmq-bot

possibly among others. The first part of the code denotes a valid ISO 639-5 language group, while the second part looks like a temporary assignment.

This issue is not a bug. It is simply intended for bookkeeping purposes. I suppose this is not related to #329.


lfashby · 2021-06-24T23:06:06Z

codes.py produces a JSON of these when it's run. (I haven't looked at that functionality in a while but I assume it still appropriately collects all the codes we can't match.)

We've never done anything with these unmatched languages but the ones you've linked to definitely have enough entries to warrant supporting them.

agutkin · 2021-06-24T23:23:23Z

Aye, precisely. I should've opened this a long time ago - I noticed Westrobothnian has had a decent-sized lexicon for a long while.

agutkin · 2021-06-25T19:42:44Z

Looking at unmatched_languages.json it turns out that the Wiktionary language codes are rather systematically constructed.

The ones which are probably most problematic (in terms of work involved to support them) are the *-proto languages, but the remaining five or six are probably reasonably easy to support. I guess what we have here is an edge case where the Wiktionary code maps to a non-existent compound ISO code: the first part has to be a valid ISO language group name and should be verifiable, while the second can come from the configuration file.
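To make the pattern concrete, here is a minimal sketch that splits a code at its first hyphen and flags proto-languages. The name `split_compound_code` is hypothetical, and the rule that proto-language codes end in `pro` is an assumption based on codes such as `poz-mly-pro`, not something checked against the full list.

```python
from typing import NamedTuple


class CompoundCode(NamedTuple):
    prefix: str     # candidate language-group code, e.g. "gmw"
    suffix: str     # Wiktionary-specific remainder, e.g. "cfr"
    is_proto: bool  # True when the suffix marks a proto-language


def split_compound_code(code: str) -> CompoundCode:
    """Splits a code such as "gmw-cfr" at the first hyphen."""
    prefix, sep, suffix = code.partition("-")
    if not sep or not prefix or not suffix:
        raise ValueError(f"Not a compound code: {code!r}")
    # Assumption: proto-language codes end in "pro" (e.g. "poz-mly-pro").
    is_proto = suffix.split("-")[-1] == "pro"
    return CompoundCode(prefix, suffix, is_proto)


for code in ("gmw-cfr", "roa-opt", "gmq-bot", "poz-mly-pro"):
    print(split_compound_code(code))
```

Splitting at the first hyphen only keeps multi-part suffixes such as `mly-pro` intact, so the prefix can be verified on its own.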

kylebgorman · 2021-06-28T14:03:37Z

@agutkin can you clarify: what's the action item you imagine here? We add config information for these languages so that you can do wikipron gmw-cfr etc.?

agutkin · 2021-06-28T14:52:12Z

Yes, precisely.

aryamanarora · 2021-07-28T13:38:02Z

BTW all of Wiktionary's invented non-ISO-compliant codes are listed here: https://en.wiktionary.org/wiki/Module:languages/datax

sonofthomp · 2023-08-13T21:49:11Z

> codes.py produces a JSON of these when it's run.

I tested codes.py and it actually isn't outputting anything for the unmatched_languages.json file: the file just reads {} after the program is run.

I think this is the result of a commit I made a few weeks ago. codes.py used to error when it encountered something like gmw-cfr. This was changed so that it throws a warning instead. However, I think this change unintentionally made it so that if a language isn't matched, it doesn't get added to unmatched_languages but instead just continues past it.

If we change the try statement starting at line 177 to:

try:
    iso639_lang = iso639.Language.match(wiktionary_code)
except iso639.language.LanguageNotFoundError:
    unmatched_languages[wiktionary_code] = {
        "wiktionary_name": wiktionary_name
    }
    logging.warning(
        "Could not find language with code %s", wiktionary_code
    )
    continue

then that should be back to normal.
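For illustration, here is the same control flow in a self-contained form. `match_language`, `LanguageNotFound`, `KNOWN_CODES`, and `collect` are stand-ins invented for this sketch (the real script calls `iso639.Language.match`), and the sample data is made up. Only the pattern of recording the code before `continue` mirrors the change above.

```python
import logging


class LanguageNotFound(Exception):
    """Stand-in for iso639.language.LanguageNotFoundError."""


# Stand-in lookup table; the real script queries the iso639 module.
KNOWN_CODES = {"en": "English", "sv": "Swedish"}


def match_language(code: str) -> str:
    try:
        return KNOWN_CODES[code]
    except KeyError:
        raise LanguageNotFound(code) from None


def collect(wiktionary_languages: dict) -> tuple:
    """Returns (matched, unmatched_languages) for the given codes."""
    matched = {}
    unmatched_languages = {}
    for wiktionary_code, wiktionary_name in wiktionary_languages.items():
        try:
            iso639_name = match_language(wiktionary_code)
        except LanguageNotFound:
            # Record the code before skipping it, so it is not silently lost.
            unmatched_languages[wiktionary_code] = {
                "wiktionary_name": wiktionary_name
            }
            logging.warning(
                "Could not find language with code %s", wiktionary_code
            )
            continue
        matched[wiktionary_code] = iso639_name
    return matched, unmatched_languages


matched, unmatched = collect(
    {"en": "English", "gmw-cfr": "Central Franconian"}
)
print(unmatched)
```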

> I guess what we have here is an edge case where the Wiktionary code maps to a non-existent compound ISO code: the first part has to be a valid ISO language group name and should be verifiable, while the second can come from the configuration file.

I'm confused about the format this file would have. Would it look something like this?

{
    "poz": {
        "mly-pro": "Proto-Malayic",
        "pro": "Proto-Malayo-Polynesian"
    },
    "gmq": {
        "scy": "Scanian",
        "bot": "Westrobothnian"
    },
    "gmw": {
        "cfr": "Central Franconian"
    }
}

That way, given a language code like gmw-cfr, Wikipron could identify the gmw as valid (using the iso639 module), and then use the config file to verify that cfr is a valid suffix and get the name Wikipron uses for gmw-cfr. Sorry if I'm misunderstanding.
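A rough sketch of that lookup, assuming the nested format above. `COMPOUND_CODES`, `resolve_compound_code`, and `is_valid_group` are hypothetical names, and the prefix check is a hard-coded stand-in rather than a real call into the iso639 module.

```python
from typing import Callable, Optional

# Hypothetical config in the nested shape proposed above.
COMPOUND_CODES = {
    "poz": {"mly-pro": "Proto-Malayic", "pro": "Proto-Malayo-Polynesian"},
    "gmq": {"scy": "Scanian", "bot": "Westrobothnian"},
    "gmw": {"cfr": "Central Franconian"},
}


def is_valid_group(prefix: str) -> bool:
    # Stand-in check; the real one would consult the iso639 module.
    return prefix in {"poz", "gmq", "gmw", "roa"}


def resolve_compound_code(
    code: str, group_check: Callable[[str], bool] = is_valid_group
) -> Optional[str]:
    """Returns the configured name for a compound code, or None."""
    prefix, _, suffix = code.partition("-")
    # Reject codes with no suffix or with an unverifiable prefix.
    if not suffix or not group_check(prefix):
        return None
    return COMPOUND_CODES.get(prefix, {}).get(suffix)


print(resolve_compound_code("gmw-cfr"))
print(resolve_compound_code("roa-opt"))
```

Here `gmw-cfr` resolves to its configured name, while `roa-opt` has a valid prefix but no config entry and so returns `None`.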

kylebgorman · 2023-08-13T21:58:53Z

> I tested codes.py and it actually isn't outputting anything for the unmatched_languages.json file: the file just reads {} after the program is run.
>
> I think this is the result of a commit I made a few weeks ago. codes.py used to error when it encountered something like gmw-cfr. This was changed so that it throws a warning instead. However, I think this change unintentionally made it so that if a language isn't matched, it doesn't get added to unmatched_languages but instead just continues past it.
>
> If we change the try statement starting at line 177 to:
>
> try:
>     iso639_lang = iso639.Language.match(wiktionary_code)
> except iso639.language.LanguageNotFoundError:
>     unmatched_languages[wiktionary_code] = {
>         "wiktionary_name": wiktionary_name
>     }
>     logging.warning(
>         "Could not find language with code %s", wiktionary_code
>     )
>     continue
>
> then that should be back to normal.

I see what you did there. Go for it.

> I'm confused about the format this file would have. Would it look something like this?

I'll defer to @agutkin; Sasha, what'd you have in mind? What @sonofthomp proposes makes sense to me though.

agutkin · 2023-08-14T08:10:03Z

Yes, Kyle, @sonofthomp's suggestion looks good to me as well.

kylebgorman added the language support Language-specific issues label Jun 24, 2021

kylebgorman added the enhancement New feature or request label Jun 28, 2021

sonofthomp mentioned this issue Aug 13, 2023: Added casefold attribute in languages.json #503 (merged)

kylebgorman closed this as completed in #503 Aug 13, 2023
