multiple languages #11

aleksejrs · 2023-10-12T03:19:11Z

Store morphs in separate containers based on language
Implement a version of frequency.txt that supports multiple languages

What is that about? How is it going to determine the language?

In my morphemizer, the morphs are both single words and two-word combinations. That requires more RAM (I guess twice as much) and time.
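A minimal sketch of that kind of morph extraction, to show where the extra memory comes from. The function name and the regex-based word splitting are assumptions for illustration, not the actual morphemizer:

```python
import re


def extract_morphs(text: str) -> list[str]:
    """Return single-word morphs followed by adjacent two-word morphs."""
    # \w+ is Unicode-aware for str patterns, so non-Latin words are kept.
    words = re.findall(r"\w+", text.lower())
    # Pair each word with its right-hand neighbour to form two-word morphs.
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + pairs


morphs = extract_morphs("The quick brown fox")
```

A text of n words yields n one-word morphs plus n - 1 two-word morphs, which fits the "roughly twice as much" estimate.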

IIRC, I tried something like spaCy, and it was too slow if it worked at all (I had incremental reading cards). Eventually I replaced the language-specific translation tables I had tried making with a simple block of regexps that join some frequent combinations with "_" in the string before splitting it, so that those combinations become single morphs. That way, I can see some very frequent three-word combinations I haven't learned in the list.
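The regexp approach could look roughly like this. The patterns and function names below are hypothetical examples, not the real list:

```python
import re

# Hypothetical patterns for frequent combinations (assumed examples only).
JOIN_PATTERNS = [
    re.compile(r"\b(in)\s+(order)\s+(to)\b", re.IGNORECASE),
    re.compile(r"\b(as)\s+(well)\s+(as)\b", re.IGNORECASE),
    re.compile(r"\b(of)\s+(course)\b", re.IGNORECASE),
]


def join_combinations(text: str) -> str:
    """Replace the whitespace inside each known combination with "_"."""
    for pattern in JOIN_PATTERNS:
        text = pattern.sub(lambda m: "_".join(m.groups()), text)
    return text


def split_morphs(text: str) -> list[str]:
    """Join the combinations first, then split the string into morphs."""
    # "_" is a word character for \w, so a joined combination stays whole.
    return re.findall(r"\w+", join_combinations(text).lower())
```

For example, `split_morphs("In order to win, of course")` returns `["in_order_to", "win", "of_course"]`, so a three-word combination shows up in the morph list as one entry.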

There are tags that specify what language the note is for, and tags that specify the language the note is in. The former affect priority directly, and both affect how many morphs a morph in that language counts as (to penalize cards with many unknown/overall morphs). The morphemizer also guesses the possible language of a two-word morph from the letters in it, to choose how many morphs it will be counted as.
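A rough sketch of letter-based guessing and per-language weighting. The language codes, the weights, and the mapping from script to language are all assumptions made up for this example. A script only narrows the language down, it does not identify it:

```python
import unicodedata

# Hypothetical weights: how many morphs one morph in each language counts as.
LANGUAGE_WEIGHTS = {"ru": 1.0, "en": 0.5, "unknown": 1.0}


def guess_language(morph: str) -> str:
    """Guess a language from the scripts of the letters in a morph."""
    guesses = set()
    for ch in morph:
        if not ch.isalpha():
            continue  # skip spaces, digits, and punctuation
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            guesses.add("ru")  # assumed: Cyrillic letters mean Russian
        elif name.startswith("LATIN"):
            guesses.add("en")  # assumed: Latin letters mean English
        else:
            guesses.add("unknown")
    # Mixed-script or letterless morphs stay ambiguous.
    return guesses.pop() if len(guesses) == 1 else "unknown"


def weighted_count(morphs: list[str]) -> float:
    """Sum the morphs, counting each one by its language weight."""
    return sum(LANGUAGE_WEIGHTS[guess_language(m)] for m in morphs)
```

With these made-up weights, `weighted_count(["привет", "hello", "hello"])` gives `2.0`, and a mixed morph such as `"hello мир"` falls back to `"unknown"`.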

In a multi-language card, there is no way to determine the language of a word exactly. What are you going to do about that?

Unrelated to the language, now my main.py also chooses a focus morph for the non-k+1 cards. IIRC, it is the least frequent one-word morph or the most frequent two-word morph.
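One way that choice could be sketched is below. The function name, the frequency dictionary, and the precedence (one-word morphs first, two-word morphs as a fallback) are assumptions, since the description above does not say which rule applies when:

```python
from typing import Optional


def choose_focus_morph(unknown: list[str], frequency: dict[str, int]) -> Optional[str]:
    """Pick a focus morph from the unknown morphs of a card."""
    one_word = [m for m in unknown if " " not in m]
    two_word = [m for m in unknown if " " in m]
    if one_word:
        # Assumed precedence: the least frequent one-word morph wins.
        return min(one_word, key=lambda m: frequency.get(m, 0))
    if two_word:
        # Otherwise fall back to the most frequent two-word morph.
        return max(two_word, key=lambda m: frequency.get(m, 0))
    return None


frequency = {"cat": 50, "axolotl": 2, "of course": 30, "by far": 10}
focus = choose_focus_morph(["cat", "axolotl", "of course"], frequency)
```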

mortii · 2023-10-12T11:22:25Z

mortii
Oct 12, 2023
Maintainer

I'll make the roadmap items more descriptive; right now they are just a list written down quickly in a way that makes sense to me.

The user decides the language of cards (note types) in the ankimorphs settings dialog, i.e. for every note type you select a language-specific morphemizer.

As for storing morphs in separate containers (SQLite tables), it's just a technical thing to make the card sorting algorithm less lossy. I'm still working out the details, so it might change.

MorphMan only uses one frequency.txt file for providing user-defined morph prioritization, but this is obviously a problem if you want to do that for multiple languages. I'm considering letting users provide txt files for specific languages in the format <language_code>-morph-priority.txt, e.g. jp-morph-priority.txt
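A minimal sketch of how such files might be discovered and read. The two- or three-letter lowercase language code and the one-morph-per-line layout (highest priority first) are assumptions. Only the file name pattern comes from the proposal above:

```python
import re
from pathlib import Path

# Assumed: the language code is two or three lowercase letters.
PRIORITY_FILE = re.compile(r"^(?P<code>[a-z]{2,3})-morph-priority\.txt$")


def find_priority_files(directory: Path) -> dict[str, Path]:
    """Map each language code to its priority file in the directory."""
    found = {}
    for path in sorted(directory.iterdir()):
        match = PRIORITY_FILE.match(path.name)
        if match and path.is_file():
            found[match.group("code")] = path
    return found


def load_priorities(path: Path) -> dict[str, int]:
    """Map each morph to its rank; earlier lines get lower ranks."""
    priorities: dict[str, int] = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        morph = line.strip()
        if morph:
            # Keep the first occurrence if a morph is listed twice.
            priorities.setdefault(morph, len(priorities))
    return priorities
```

Each language then gets its own lookup table, so a morph listed for one language never affects the priority of cards in another.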

Hopefully things become more clear once I release the alpha version and people can see and play around with it themselves.
