Limits of hiragana-based romanisation #4
Comments
Yeah, this sounds like a good idea. One problem though: in the JMdict database, hiragana readings are not separated by kanji, so this wasn't possible to implement at the time I wrote the romanization algorithm. More recently I have implemented a kanji module (kanji.lisp) that has a function |
Oh. I didn't know that. I guess that makes what I had in mind quite difficult. Special readings could be a problem. And then, what if (this is entirely fictional. I don't know if a real-world example exists) you have a word made up of two kanji, the first one could be read あ or あお and the latter can be read お or おう? If you only know that the entire word reads あおう, then that could be split into あ-おう or あお-う... |
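To make the splitting problem concrete, here is a rough Python sketch (not the project's Lisp code). `split_reading` is a hypothetical helper, and the readings are the fictional ones from the comment above. Note that the あお-う split only becomes possible if the second kanji can also be read う, so that extra reading is added as an assumption in the second call.

```python
def split_reading(reading, candidates):
    """Return every way to cut `reading` into pieces so that piece i
    is one of the candidate readings of kanji i."""
    if not candidates:
        # All kanji consumed: valid only if no kana are left over.
        return [[]] if reading == "" else []
    results = []
    for r in candidates[0]:
        if reading.startswith(r):
            for rest in split_reading(reading[len(r):], candidates[1:]):
                results.append([r] + rest)
    return results


# Fictional readings exactly as described: only one split fits.
unique = split_reading("あおう", [["あ", "あお"], ["お", "おう"]])

# Hypothetical extra reading う for the second kanji: two splits fit,
# and the whole-word reading alone cannot tell them apart.
ambiguous = split_reading("あおう", [["あ", "あお"], ["お", "おう", "う"]])
```

When more than one split survives, the kanji boundary (and with it the decision whether お and う may merge) cannot be recovered from the reading alone.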
I noticed that the traditional basic option on the site doesn't create the |
@tslater I think if you do |
Looks like it is working. Thanks! |
Hi,
(this doesn't really belong in a bug report but I'd still like to take a second to say that what you've done here is fabulous, amazing, and incredibly helpful. Thank you!).
I'm not sure I completely understand what goes on in romanize.lisp, but under certain circumstances it ends up merging an "o" and a "u" that it shouldn't. This issue is mentioned here, and 追う is given as an example. The correct reading of 追 is お, so in hiragana the word comes out as おう. This transformation is lossy and ambiguous, however: here, お and う are pronounced separately, in contrast to 王, which is also written おう in hiragana but is pronounced as a long お. Romanising 追う as ō is misleading, I think.
I believe that the general rule (and this might make for an easy fix) is: merging of お and う cannot occur across kanji boundaries. In the presence of kanji, the word needs to be broken up into per-kanji hiragana segments, and お and う merged within each segment, before those segments are joined together.
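A minimal sketch of that rule in Python (not the project's Lisp, and not how romanize.lisp actually works). It assumes the word has already been split into per-kanji and okurigana segments, which is the hard part. The `KANA` table, `romanise_segment`, and `romanise_word` are made up for illustration and cover only the examples in this issue.

```python
# Tiny kana table covering only the examples discussed here (an assumption,
# not the project's real table).
KANA = {"お": "o", "う": "u", "こ": "ko", "し": "shi", "も": "mo", "ど": "do"}


def romanise_segment(kana):
    """Romanise one segment, merging o + u into ō only inside the segment."""
    out = []
    for ch in kana:
        syllable = KANA[ch]
        if syllable == "u" and out and out[-1].endswith("o"):
            # Long vowel within a single reading: replace the final o with ō.
            out[-1] = out[-1][:-1] + "ō"
        else:
            out.append(syllable)
    return "".join(out)


def romanise_word(segments):
    """Romanise each segment on its own, then join; no merging across boundaries."""
    return "".join(romanise_segment(s) for s in segments)


ou = romanise_word(["お", "う"])        # 追う: 追 = お, okurigana う
o_long = romanise_word(["おう"])        # 王: one kanji read おう
koushi = romanise_word(["こ", "うし"])  # 子牛: 子 = こ, 牛 = うし
mo_long = romanise_word(["もう"])       # もう: plain kana, a single segment
```

Because merging happens before the segments are concatenated, the お at the end of one segment never sees the う at the start of the next.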
Since I'm not a native speaker (quite the opposite), I checked forvo.com and found a recording that supports the claim that お and う are not joined in 追う: In the recording by the user strawberrybrown, the お and the う can be made out quite distinctly. In contrast, I found a few examples of もう, ぽう, ちょう, and どう that she pronounces as mō, pō, chō, and dō, respectively, just as expected. Which is to say, this user does not generally pronounce お and う sounds separately (as could be the case in a dialect, maybe?) but only when they're really meant to be separate.
There is another recording by the user smime, in the same place as linked to earlier, where the pronunciation of 追う is more difficult to make out, which is consistent with casual speech.
Finally, please see also wiktionary for romaji of 追う and 王.
Update: 子牛 (こ + うし) is another example that showcases this problem. The romanisation is currently given, incorrectly, as kōshi.