Unassigned/non-standard (compound) language and dialect codes #432
Comments
We've never done anything with these unmatched languages but the ones you've linked to definitely have enough entries to warrant supporting them.
Aye, precisely. I should've opened this a long time ago - I noticed Westrobothnian has had a decent-sized lexicon for a long while.
Looking at unmatched_languages.json it turns out that the Wiktionary language codes are rather systematically constructed. The ones which are probably most problematic (in terms of work involved to support them) are the |
@agutkin can you clarify: what's the action item you imagine here? We add config information for these language so that you can do |
Yes, precisely.
BTW all of Wiktionary's invented non-ISO-compliant codes are listed here: https://en.wiktionary.org/wiki/Module:languages/datax
I tested I think this is the result of a commit I made a few weeks ago. If we change the
try:
    iso639_lang = iso639.Language.match(wiktionary_code)
except iso639.language.LanguageNotFoundError:
    unmatched_languages[wiktionary_code] = {
        "wiktionary_name": wiktionary_name
    }
    logging.warning(
        "Could not find language with code %s", wiktionary_code
    )
    continue
then that should be back to normal.
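For readers without the surrounding loop, here is a minimal self-contained sketch of the pattern that snippet relies on: try to match each code, and on failure record it as unmatched, log a warning, and move on. `KNOWN_CODES`, `match_language`, and `partition` are hypothetical stand-ins invented for illustration; they are not the project's code or the python-iso639 API.

```python
import logging

# Hypothetical stand-in for an ISO 639 lookup table (illustration only).
KNOWN_CODES = {"en": "English", "sv": "Swedish"}


def match_language(code):
    """Stand-in matcher: raise LookupError when the code is unknown."""
    try:
        return KNOWN_CODES[code]
    except KeyError:
        raise LookupError(code) from None


def partition(wiktionary_languages):
    """Split a {code: name} mapping into matched and unmatched entries."""
    matched, unmatched = {}, {}
    for code, name in wiktionary_languages.items():
        try:
            iso_name = match_language(code)
        except LookupError:
            # Record the code instead of failing, then skip to the next one.
            unmatched[code] = {"wiktionary_name": name}
            logging.warning("Could not find language with code %s", code)
            continue
        matched[code] = {"iso639_name": iso_name, "wiktionary_name": name}
    return matched, unmatched


matched, unmatched = partition({"sv": "Swedish", "gmq-bot": "Westrobothnian"})
```

With this input, the compound code ends up in `unmatched` and the ordinary one in `matched`.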
I'm confused about the format this file would have. Would it look something like this?
That way, given a language code like |
I see what you did there. Go for it.
I'll defer to @agutkin; Sasha, what'd you have in mind? What @sonofthomp proposes makes sense to me though.
Yes, Kyle, @sonofthomp's suggestion looks good to me as well.
Wiktionary has entries for several languages and dialects whose unofficial codes we can't scrape. Examples include
gmw-cfr
roa-opt
gmq-bot
possibly among others. The first part of each code is a valid ISO 639-5 language-family code (e.g. gmw for West Germanic, roa for Romance, gmq for North Germanic), while the second part looks like a temporary assignment.
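As a rough illustration of that two-part structure, the sketch below splits a compound code into its family prefix and its suffix. The pattern (three lowercase letters, a hyphen, three lowercase letters) is an assumption drawn only from the three examples above, and `split_compound_code` is a hypothetical helper, not something from the project.

```python
import re

# Assumed shape, based only on the three examples in this issue:
# a three-letter family prefix, a hyphen, and a three-letter suffix.
COMPOUND_CODE = re.compile(r"^(?P<family>[a-z]{3})-(?P<suffix>[a-z]{3})$")


def split_compound_code(code):
    """Return (family, suffix) for codes like 'gmq-bot', else None."""
    match = COMPOUND_CODE.match(code)
    if match is None:
        return None
    return match.group("family"), match.group("suffix")


for code in ("gmw-cfr", "roa-opt", "gmq-bot", "eng"):
    print(code, split_compound_code(code))
```

Codes that do not fit the assumed shape, such as a plain three-letter code, return `None`, so a caller can fall back to ordinary lookup for them.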
This issue is not a bug; it is intended simply for book-keeping purposes. I suppose it is not related to #329.