consider Mongolian text (NNBSP) in wordcount #3
Comments
The string extracted from the PDF is not correct. We need help from the authors of L2/17-036 to get the original string. Alternatively, I can construct a similar test case. The other aspect of the problem is that, as I mentioned in L2/17-052, section 1.3, treating suffixes as not being separate words is not what the majority of Mongolian users prefer. |
Are they counted differently by region and/or language?
Any help with test cases is appreciated!
…On Jan 25, 2017, 5:26 PM -0800, 梁海 Liang Hai ***@***.***>, wrote:
The string extracted from the PDF is not correct. We need help from the authors of L2/17-036 (http://www.unicode.org/L2/L2017/17036-mongolian-suffix.pdf) to get the original string. Alternatively, I can construct a similar test case.
The other aspect of the problem is that, as I mentioned in L2/17-052 (http://www.unicode.org/L2/L2017/17052-mongolian-cmt.pdf), section 1.3, treating suffixes as not being separate words is not what the majority of Mongolian users prefer.
|
Counted differently by:
Minimum test cases for word counting (string, escaped, transliteration and reference image, notes):
|
The reply above originally contained some mistakes and was missing some information. Please view it on the web page if GitHub didn't push notifications of the later changes to your email inbox. |
Thank you. Mongolian will become the best supported language here! |
Hi lianghai and srl295, |
It's not a consensus among average users. There's a messy dislocation between linguistic scholars' definition of "word", encoding experts' requirement of NNBSP for correct shaping, common users' understanding of "word", and the intention of word counting (are we counting words for linguistic scholars?) for various user groups. |
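To make the two counting policies under discussion concrete, here is a minimal Python sketch. It is an illustration only, not the behaviour of any existing word-count implementation: the function name `count_words`, the flag `enclitics_are_words`, and the placeholder sample string are all hypothetical, and it assumes that detached enclitics are always preceded by NNBSP (U+202F) and that ordinary words are separated by regular whitespace.

```python
NNBSP = "\u202F"  # NARROW NO-BREAK SPACE, used before detached enclitics


def count_words(text: str, enclitics_are_words: bool) -> int:
    """Count words under one of the two policies debated in this thread.

    Hypothetical helper: assumes NNBSP always precedes a detached enclitic.
    """
    if enclitics_are_words:
        # Policy A: every NNBSP-separated enclitic is its own word,
        # so treat NNBSP like an ordinary space before splitting.
        return len(text.replace(NNBSP, " ").split())
    # Policy B: enclitics belong to the preceding stem,
    # so drop NNBSP and let stem + enclitics form a single token.
    return len(text.replace(NNBSP, "").split())


# Placeholder tokens, not real Mongolian: two stems carrying 1 and 2 enclitics.
sample = f"stem{NNBSP}enc stem{NNBSP}enc{NNBSP}enc"

count_separate = count_words(sample, enclitics_are_words=True)   # 5
count_attached = count_words(sample, enclitics_are_words=False)  # 2
```

The same text yields 5 or 2 depending only on the policy, so the test cases requested above would need to state which user group's notion of "word" each expected count reflects.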
@lianghai |
You have probably misunderstood here. There is one situation in which non-breaking behaviour is not preferred: writing inflected words like "ᠠᠵᠢᠯ ᠤᠨ ᠬᠢᠨ ᠲᠠᠢ ᠪᠠᠨ" vertically in a narrow horizontal space. That is actually a design or space issue. Even when the suffixes start in a new column, they are part of the word, and thus they start with a medial form. I would say the suffixes can be breakable as long as they start in medial form, but they are not separable from the word even when they start on a new line. |
Then I need to consider the possibility that users in Mongolia tend to consider enclitics as parts of the preceding word because the Cyrillic writing system tends to write enclitics actually as suffixes. My impression from Inner Mongolian users doesn't suggest so. Note "suffix" refers to the structures that are actually attached, thus I try to call the detached structures "enclitics" to distinguish them.
I ask non-scholar users of the Mongolian script every now and then how they count words and why. My impression is that they don't have a strong tendency to omit enclitics. I don't know your qualifications, but most of them did attend Mongolian schools. Also, I might need to clarify that I consider scholars' opinions to be especially irrelevant to this topic, because I don't follow prescriptivism. No matter what scholars and teachers teach, what average users have actually learned is what matters; note that there's a difference.
I agree that such cases involve more typographical considerations. However, I don't observe a strong preference for not breaking at the space preceding enclitics, even in the running text of newspapers and books.
I understand that breakability and word counting are not necessarily related, so you can be assured that I won't analyze word counting according to breakability.
Noted. Do you have a list of the detached structures that you consider "suffixes" (which I call "enclitics", and which I assume you think should all be counted as part of the preceding word), marked with their breakability? |
Actually, it is the direct opposite. Teachers teach how words should be written in this script and what is correct or incorrect, no matter what users have learned. We cannot demand that they all learn it correctly, but that is always our wish.
I will bring the list of all detached structures at Mongolian Meeting in April. |
Then it's clear that you tend to be more prescriptive, while I work in a more descriptive way. I hope this difference between our methodologies helps you understand our disagreement on certain issues.
Thank you. |
Of course, suffixes are part of their stem word, and the concept is the same as in the Cyrillic Mongolian script. In academia, in essay-writing classes, students count each word together with its suffix, exactly as word-processing applications do. It is obvious. That is why we need this proposal, http://www.unicode.org/L2/L2017/17036-mongolian-suffix.pdf, put into practice as soon as possible. I hope 180F's behavior will include a breakable attribute (like a non-breaking hyphen) while still being counted within its stem. When is this proposal going to be implemented in the Unicode Standard? I have many font projects that have been set back and are waiting for this problem to be solved. |
@mongoltolbo Can you elaborate why your font projects are waiting for the proposed MSC model? And may I ask, do you consider it obvious whether the structures uu/üü and ügei should be counted or ignored (counted as part of the stem word)? (cc @badaa for the second question.) |
@lianghai |
See: http://www.unicode.org/L2/L2017/17036-mongolian-suffix.pdf