-
You are a gold mine of information @ashprice, thank you so much!
That is a polite way of putting it, haha. It is absolutely terrible; I only left it in for people who either don't have the hardware to run spacy, or don't want to go through the effort of setting it up.
I agree, at some point it just boils down to preference.
That is an interesting point. One of the things that attracted me to morphman was the fact that you could use a learn-through-volume approach rather than memorizing grammar. I think the example you use is perfect to illustrate this: when I think of the words 'real' and 'really' they are completely distinct in my mind, and I would also recommend that beginners learn them as different words rather than adverbializing 'real', but that is also just preference, I guess.
That is really interesting. Do you know of any established metrics for the difficulty of grammar? I assume it could be completely different between language families--applying the same metric to Chinese and English might not be very fruitful/possible. I do agree with your last point: I think known morphs + sentence length account for the majority of how difficult a text is, but optimizing for grammar as well would be awesome if it's feasible! Happy new year! 🎉
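For what it's worth, that last heuristic is easy to express in code. Here is a minimal sketch in Python; the function name, the weight, and the toy data are all hypothetical, not anything from AnkiMorphs:

```python
# Hypothetical difficulty heuristic: unknown morph count plus a length penalty.
def difficulty(sentence_morphs: list[str], known_morphs: set[str],
               length_weight: float = 0.1) -> float:
    # One point per unknown morph, plus a small penalty for raw length.
    unknown = sum(1 for m in sentence_morphs if m not in known_morphs)
    return unknown + length_weight * len(sentence_morphs)

known = {"i", "really", "like", "tea"}
sentences = [
    ["i", "really", "like", "tea", "ceremonies"],  # one unknown morph (i+1)
    ["ontogeny", "recapitulates", "phylogeny"],    # three unknown morphs
]
sentences.sort(key=lambda s: difficulty(s, known))  # easiest first
print(sentences[0])
```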
-
Oh, I wouldn't agree with stripping it out at all - it is for sure better than nothing for languages without models. You still get at least some ordering and difficulty analysis. I mean, that is basically what morphman did for all languages except Japanese/Chinese anyway.
Still undecided on this to be honest, after some time of use.
Sure, I don't mean that I never think of them that way. I'm aware of both ways, and I would introduce 'really' as a whole unit to an English learner. But I also think it would be useful for the English learner to recognize this pattern. I was thinking more along the lines of building up intuitive knowledge of structures vs. declarative knowledge. I could well imagine that, yeah, I'm going to look up 'really' as a whole word, but if I am trying to teach someone how to use it, I'm probably going to introduce it with a bunch of other adverbs (those that end with -ly and otherwise). I might not even mention the word adverb to them; I'm just going to use them, get them to use them, etc. I think that's why I am not finding the spacy model as bad as I feel I should: it seems to be doing surprisingly well at clustering structures in that way, maybe because of Zipf's law or some other frequency effect, though there are also times this is decidedly not the case. Maybe it is just my card creation or mining process, even. As a final note on that, of course it could be that what we think of as useful chunks to refer to and define aren't actually the useful chunks to order learning by. But I've no strong convictions on the matter.

Moving on: I think I would honestly prefer getting stuff in the style 本当 + に, but only to the extent that the pairing at hand is non-idiosyncratic (or grammatically regular). What you don't want is a non-transparent word being treated as known because you know the noun and you know (say) に, or worse, a completely different and new construction being treated as known just because you know the other lexical units and a particle.

I guess the question is: how often is that going to happen? And: does the number of idiosyncratic items outweigh the utility of otherwise ordering with the assumption that the structure is compositional? I don't think this is an issue that finer-grained grammatical analysis will solve, and it's still going to be a personal preference. To be honest, if the grammar stuff works half well, it's probably going to be an either-or, unless you're going to make a list somehow.
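To illustrate the 'unless you're going to make a list' option: a hand-curated list of idiosyncratic pairings could gate when a compositional pairing counts as known. This is a hypothetical sketch; the names and logic are mine, not anything from AnkiMorphs:

```python
# Pairings that must be learned as whole units, however their parts score.
# This set would be curated by hand, per language.
IDIOSYNCRATIC: set[tuple[str, ...]] = set()

def pairing_is_known(parts: tuple[str, ...], known_morphs: set[str]) -> bool:
    if parts in IDIOSYNCRATIC:
        return False  # treat as a new unit even if all of its parts are known
    return all(part in known_morphs for part in parts)

# Compositional case: 本当 + に counts as known once both morphs are known.
print(pairing_is_known(("本当", "に"), {"本当", "に"}))  # True
```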
I'm aware of language-specific acquisition studies, as well as approaches that focus on examining the grammatical complexity of languages (relative to each other) and of constructions within languages (which mostly relates to research in either of those two domains, the former usually being acquisition research and the latter usually being theoretical morphology). Complexity isn't the same as 'difficulty'; it's actually quite a different concept, though I think they are likely correlated. However, I'm not sure I can point to anything specific that is relevant here, because:
All that aside, to the extent that I am thinking of difficulty/complexity, the dependency structure sort of does it for us - that is, as it were, our model of constructions, so it should suffice, for most languages at least, to treat them the same way. My basic idea is that you treat specific dependency structures as 'vocabulary' items that you build up in an i+1 fashion. I think this crude approach is going to be intuitive enough vs. trying to find relevant stuff in the theoretical literature of the third kind above, which is more concerned with, for example, allomorphy, or complicated morphophonological processes, or complicated argument structure.

So, the broad idea is: we have a possessive phrase like John's car, and it'll be represented as something like [POSS-PROPN [NOM]] or whatever - that is a 'unit' for our purposes. The idea is to make an inventory of such units that you have in sentences, maybe rank by frequency (presumably [possessive-noun [adjective [noun]]] is less common than [possessive-noun [noun]]), and then show the simplest ones first and the ones with more recursive depth later. So John's car is simpler than John's green car. There is a rough sketch of this after this comment.

As to the actual issue: yes, maybe because of how spacy represents those 'dependencies,' some languages may be 'black sheep' and will need more work. However, having thrown some Chinese sentences into spacy, it looks like the broad idea should work. It's when you get to specifics, as I hinted at, that I think any real problems are going to come out of the woodwork. For example, maybe proper nouns are more common in possessive constructions than regular nouns, but you'd rather be shown some number of regular nouns before some less frequent proper nouns, etc. - it's probably not going to be as simple as 'order everything by raw frequency of the grammatical construction.' And I think people may have very different expectations of such a thing, of how they'd want it to work.

I honestly do not know if this kind of thing is going to yield good results. If it requires a lot of manual fixing, it might not be 'worth it' for those of us further along in learning the language, for our own benefit. I can imagine it being very useful to a beginner, though. And maybe a better and simpler approach is the one massif.la takes; I find it to work fairly well. 'Idiomaticity' is not strictly the same thing as 'difficulty,' but I'd expect a lot of overlap. This all said, I've been pretty busy and I am likely to be so for a few months... so this is kind of a backburner idea for me for now.
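To make the broad idea concrete, here is a minimal sketch using spaCy (assuming the en_core_web_sm model is installed). The [DEP-POS ...] signature format and the 'fewer nodes first, then higher frequency first' ordering are illustrative choices, not a worked-out spec:

```python
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def signature(token, depth: int = 0, max_depth: int = 2) -> str:
    # Render a token's subtree as a nested [DEP-POS ...] string, capping recursion.
    if depth >= max_depth:
        return f"[{token.dep_}-{token.pos_}]"
    kids = " ".join(signature(c, depth + 1, max_depth) for c in token.children)
    return f"[{token.dep_}-{token.pos_} {kids}]" if kids else f"[{token.dep_}-{token.pos_}]"

# Build an inventory of dependency 'units' across sentences.
counts: Counter[str] = Counter()
doc = nlp("John's car is fast. John's green car is faster.")
for sent in doc.sents:
    for tok in sent:
        counts[signature(tok)] += 1

# Order patterns with fewer nodes (brackets) first, then by frequency,
# so John's car style units surface before John's green car style ones.
for pattern, freq in sorted(counts.items(), key=lambda kv: (kv[0].count("["), -kv[1])):
    print(freq, pattern)
```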
-
We should also disambiguate what we are trying to achieve, since frequent != easy. MorphMan called its algorithm the 'MorphMan Index', and one of the terms in it was called 'usefulness', which was basically the sum of the morph priorities, IIRC. At the time, I thought it was needlessly verbose and not very intuitive. However, looking back, I see that it effectively prevents confusing easy with frequent. In a potential AnkiMorphs v2 version, we should probably go back to that terminology instead of calling it the difficulty algorithm.
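To pin down the distinction, a toy version of such a 'usefulness' term as recalled above (roughly a sum of per-morph priorities) might look like this; the names and weights are hypothetical, and MorphMan's source is the authority on the real MorphMan Index formula:

```python
def usefulness(sentence_morphs: list[str], priority: dict[str, float]) -> float:
    # Measures how much a sentence is worth studying (frequent/valuable morphs),
    # which is a different axis from how easy the sentence is.
    return sum(priority.get(m, 0.0) for m in sentence_morphs)

priority = {"の": 10.0, "は": 9.5, "本当": 3.2}  # hypothetical frequency-derived weights
print(usefulness(["本当", "は"], priority))       # 12.7
```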
-
Not adding any new features, apologies.
-
Originally posted by @ashprice in #110 (comment)