-
You are a gold mine of information @ashprice, thank you so much!
That is a polite way of putting it, haha. It is absolutely terrible; I only left it in for people who either don't have the hardware to run spacy, or don't want to go through the effort of setting it up.
I agree, at some point it just boils down to preference.
That is an interesting point. One of the things that attracted me to morphman was the fact that you could use a learn-through-volume approach rather than memorizing grammar. I think the example you use is perfect to illustrate this: when I think of the words 'real' and 'really' they are completely distinct in my mind, and I would also recommend that beginners learn them as different words rather than adverbializing 'real', but that is also just preference, I guess.
That is really interesting. Do you know of any established metrics for the difficulty of grammar? I assume it could be completely different between language families--applying the same metric to Chinese and English might not be very fruitful/possible. I do agree with your last point: I think known morphs + sentence length account for the majority of how difficult a text is, but optimizing for grammar as well would be awesome if it's feasible! Happy new year! 🎉
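For what it's worth, that last heuristic is easy to express in code. Here is a minimal sketch in Python; the function name, the weight, and the toy data are all hypothetical, not anything from AnkiMorphs:

```python
# Hypothetical difficulty heuristic: unknown morph count plus a length penalty.
def difficulty(sentence_morphs: list[str], known_morphs: set[str],
               length_weight: float = 0.1) -> float:
    # One point per unknown morph, plus a small penalty for raw length.
    unknown = sum(1 for m in sentence_morphs if m not in known_morphs)
    return unknown + length_weight * len(sentence_morphs)

known = {"i", "really", "like", "tea"}
sentences = [
    ["i", "really", "like", "tea", "ceremonies"],  # one unknown morph (i+1)
    ["ontogeny", "recapitulates", "phylogeny"],    # three unknown morphs
]
sentences.sort(key=lambda s: difficulty(s, known))  # easiest first
print(sentences[0])
```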
-
Oh, I wouldn't agree with stripping it out at all - it is for sure better than nothing for languages without models. You still get at least some ordering and difficulty analysis. I mean, that is basically what morphman did for all languages except Japanese/Chinese anyway.
Still undecided on this to be honest, after some time of use.
Sure, I don't mean that I never think of them that way. I'm aware of both ways, and I would introduce 'really' as a whole unit to an English learner. But I also think it would be useful for the English learner to recognize this pattern. I was thinking more along the lines of building up intuitive knowledge of structures vs. declarative knowledge. I could well imagine that, yeah, I'm going to look up 'really' as a whole word, but if I am trying to teach someone how to use it, I'm probably going to introduce it with a bunch of other adverbs (those that end with -ly and otherwise). I might not even mention the word adverb to them; I'm just going to use them, get them to use them, etc. I think that's why I am not finding the spacy model as bad as I feel I should: it seems to be doing surprisingly well at clustering structures in that way, maybe because of Zipf's law or some other frequency effect, though there are also times this is decidedly not the case. Maybe it is just my card creation or mining process, even. As a final note on that, of course it could be that what we think of as useful chunks to refer to and define aren't actually the useful chunks to order learning by. But I've no strong convictions on the matter.

Moving on: I think I would honestly prefer getting stuff in the style 本当 + に, but only to the extent that the pairing at hand is non-idiosyncratic (or grammatically regular). What you don't want is a non-transparent word being treated as known because you know the noun and you know (say) に, or worse, a completely different and new construction being treated as known just because you know the other lexical units and a particle.

I guess the question is: how often is that going to happen? And: does the number of idiosyncratic items outweigh the utility of otherwise ordering with the assumption that the structure is compositional? I don't think this is an issue that finer-grained grammatical analysis will solve, and it's still going to be a personal preference. To be honest, if the grammar stuff works half well, it's probably going to be an either-or, unless you're going to make a list somehow.
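To illustrate the 'unless you're going to make a list' option: a hand-curated list of idiosyncratic pairings could gate when a compositional pairing counts as known. This is a hypothetical sketch; the names and logic are mine, not anything from AnkiMorphs:

```python
# Pairings that must be learned as whole units, however their parts score.
# This set would be curated by hand, per language.
IDIOSYNCRATIC: set[tuple[str, ...]] = set()

def pairing_is_known(parts: tuple[str, ...], known_morphs: set[str]) -> bool:
    if parts in IDIOSYNCRATIC:
        return False  # treat as a new unit even if all of its parts are known
    return all(part in known_morphs for part in parts)

# Compositional case: 本当 + に counts as known once both morphs are known.
print(pairing_is_known(("本当", "に"), {"本当", "に"}))  # True
```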
I'm aware of language-specific acquisition studies, as well as approaches that focus on examining the grammatical complexity of languages (relative to each other) and of constructions within languages (which mostly relates to research in either of those two domains, the former usually being acquisition research and the latter usually being theoretical morphology). Complexity isn't the same as 'difficulty'; it's actually quite a different concept, though I think they are likely correlated. However, I'm not sure I can point to anything specific that is relevant here, because:
All that aside, to the extent that I am thinking of difficulty/complexity, the dependency structure sort of does it for us - that is, as it were, our model of constructions, so it should suffice, for most languages at least, to treat them the same way. My basic idea is that you treat specific dependency structures as 'vocabulary' items that you build up in an i+1 fashion. I think this crude approach is going to be intuitive enough vs. trying to find relevant stuff in the theoretical literature of the third kind above, which is more concerned with, for example, allomorphy, or complicated morphophonological processes, or complicated argument structure.

So, the broad idea is: we have a possessive phrase like John's car, and it'll be represented as something like [POSS-PROPN [NOM]] or whatever - that is a 'unit' for our purposes. The idea is to make an inventory of such units that you have in sentences, maybe rank by frequency (presumably [possessive-noun [adjective [noun]]] is less common than [possessive-noun [noun]]), and then show the simplest ones first and the ones with more recursive depth later. So John's car is simpler than John's green car. There is a rough sketch of this after this comment.

As to the actual issue: yes, maybe because of how spacy represents those 'dependencies,' some languages may be 'black sheep' and will need more work. However, having thrown some Chinese sentences into spacy, it looks like the broad idea should work. It's when you get to specifics, as I hinted at, that I think any real problems are going to come out of the woodwork. For example, maybe proper nouns are more common in possessive constructions than regular nouns, but you'd rather be shown some number of regular nouns before some less frequent proper nouns, etc. - it's probably not going to be as simple as 'order everything by raw frequency of the grammatical construction.' And I think people may have very different expectations of such a thing, of how they'd want it to work.

I honestly do not know if this kind of thing is going to yield good results. If it requires a lot of manual fixing, it might not be 'worth it' for those of us further along in learning the language, for our own benefit. I can imagine it being very useful to a beginner, though. And maybe a better and simpler approach is the one massif.la takes; I find it to work fairly well. 'Idiomaticity' is not strictly the same thing as 'difficulty,' but I'd expect a lot of overlap. This all said, I've been pretty busy and I am likely to be so for a few months... so this is kind of a backburner idea for me for now.
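To make the broad idea concrete, here is a minimal sketch using spaCy (assuming the en_core_web_sm model is installed). The [DEP-POS ...] signature format and the 'fewer nodes first, then higher frequency first' ordering are illustrative choices, not a worked-out spec:

```python
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def signature(token, depth: int = 0, max_depth: int = 2) -> str:
    # Render a token's subtree as a nested [DEP-POS ...] string, capping recursion.
    if depth >= max_depth:
        return f"[{token.dep_}-{token.pos_}]"
    kids = " ".join(signature(c, depth + 1, max_depth) for c in token.children)
    return f"[{token.dep_}-{token.pos_} {kids}]" if kids else f"[{token.dep_}-{token.pos_}]"

# Build an inventory of dependency 'units' across sentences.
counts: Counter[str] = Counter()
doc = nlp("John's car is fast. John's green car is faster.")
for sent in doc.sents:
    for tok in sent:
        counts[signature(tok)] += 1

# Order patterns with fewer nodes (brackets) first, then by frequency,
# so John's car style units surface before John's green car style ones.
for pattern, freq in sorted(counts.items(), key=lambda kv: (kv[0].count("["), -kv[1])):
    print(freq, pattern)
```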
-
We should also disambiguate what we are trying to achieve, since frequent != easy. MorphMan called its algorithm the 'MorphMan Index', and one of the terms in it was called 'usefulness', which was basically the sum of the morph priorities, IIRC. At the time, I thought it was needlessly verbose and not very intuitive. However, looking back, I see that it effectively prevents confusing easy with frequent. In a potential AnkiMorphs v2 version, we should probably go back to that terminology instead of calling it the difficulty algorithm.
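To pin down the distinction, a toy version of such a 'usefulness' term as recalled above (roughly a sum of per-morph priorities) might look like this; the names and weights are hypothetical, and MorphMan's source is the authority on the real MorphMan Index formula:

```python
def usefulness(sentence_morphs: list[str], priority: dict[str, float]) -> float:
    # Measures how much a sentence is worth studying (frequent/valuable morphs),
    # which is a different axis from how easy the sentence is.
    return sum(priority.get(m, 0.0) for m in sentence_morphs)

priority = {"の": 10.0, "は": 9.5, "本当": 3.2}  # hypothetical frequency-derived weights
print(usefulness(["本当", "は"], priority))       # 12.7
```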
-
Not adding any new features, apologies.
-
Originally posted by @ashprice in #110 (comment)