consider Mongolian text (NNBSP) in wordcount #3

srl295 · 2017-01-26T00:34:39Z

See: http://www.unicode.org/L2/L2017/17036-mongolian-suffix.pdf

ᠡᠷᠲᠡ ᠲᠣᠭ ᠠ ᠲᠣᠮᠰᠢ ᠦ ᠢ ᠨᠥᠭᠴᠢᠭᠰᠡᠨ ᠠᠯᠠᠪ ᠤᠨ ᠤᠷᠢᠳᠠ ᠠᠨᠤ
The word count is correct at 8 words

lianghai · 2017-01-26T01:26:19Z

The string extracted from the PDF is not correct. We need help from the authors of L2/17-036 to get the original string. Alternatively, I can construct a similar test case.

The other aspect of the problem is, as I mentioned in L2/17-052, section 1.3, that most Mongolian users do not favor treating suffixes as part of the preceding word rather than as separate words.

srl295 · 2017-01-26T06:50:39Z

Are they counted differently by region and/or language? Any help with test cases is appreciated!


lianghai · 2017-01-27T12:00:38Z

Counted differently by:

- Native users' common understanding (which tends to treat a suffix as simply another word) vs. certain scholarly grammatical analyses (which treat a suffix as part of the word).
  - It's like a parallel universe in which the preposition "as" has a special, required spelling with a form of "s" that doesn't occur in normal words. Say it must be written "aſ something", while normal words like "gas" always have a plain "s". Many English users might still consider "aſ" a separate word, but many scholars would hold that the space in "aſ something" must be a special character (say, NNBSP) that makes the string count as a single word and be non-breaking.
- Languages, which might have different preferences too.
  - According to L2/17-008, "Proposal to encode one Manchu format character", among the 4 major languages that use the Mongolian script, the mainstream grammars of Mongolian and Todo consider a suffix to be part of the word, while the mainstream grammars of Manchu and Sibe consider a suffix to be a separate word.

Minimum test cases for word counting (string, escaped, transliteration and reference image, notes):

Mongolian language
- ᠮᠣᠩᠭᠣᠯ ᠤᠨ
- <182E 1823 1829 182D 1823 182F 202F 1824 1828>
- monggol-un
- TUS 9.0 expects this to be counted as 1 word. Users might prefer 2.
- The space here is NNBSP (U+202F) as required by TUS 9.0.
- NNBSP's width (including stretchability), word boundary extending, and non-breaking behaviors are not always preferred by general users.
Manchu language
- ᠮᠠᠨᠵᡠ ᡳ
- <182E 1820 1828 1835 1860 202F 1873>
- manju-i
- TUS 9.0 expects this to be counted as 1 word. Users prefer 2.
- The space here is NNBSP (U+202F) as required by TUS 9.0.
- NNBSP's width (including stretchability) and word boundary extending behaviors are not expected by Manchu and Sibe. See L2/17-008 Proposal to encode one Manchu format character.
- NNBSP's non-breaking behavior is not always preferred by general users.
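
The two test cases above can be checked with a minimal sketch like the following. It is only an illustration, not how any particular word counter works: the function names `from_code_points` and `count_words` and the `nnbsp_joins` flag are hypothetical, and the code simply compares the two counting policies discussed here (NNBSP keeps the suffix inside the word vs. NNBSP separates words like an ordinary space).

```python
NNBSP = "\u202f"  # NARROW NO-BREAK SPACE


def from_code_points(hex_sequence: str) -> str:
    """Build a string from space-separated hex code points, e.g. "182E 1823"."""
    return "".join(chr(int(cp, 16)) for cp in hex_sequence.split())


def count_words(text: str, nnbsp_joins: bool) -> int:
    """Count whitespace-separated words under one of two NNBSP policies (hypothetical helper)."""
    if nnbsp_joins:
        # Policy A: NNBSP does not end a word, so the suffix stays with its stem.
        text = text.replace(NNBSP, "")
    else:
        # Policy B: NNBSP separates words exactly like an ordinary space.
        text = text.replace(NNBSP, " ")
    return len(text.split())


# The escaped sequences from the test cases above.
MONGGOL_UN = from_code_points("182E 1823 1829 182D 1823 182F 202F 1824 1828")
MANJU_I = from_code_points("182E 1820 1828 1835 1860 202F 1873")

for name, sample in (("monggol-un", MONGGOL_UN), ("manju-i", MANJU_I)):
    joined = count_words(sample, nnbsp_joins=True)
    separate = count_words(sample, nnbsp_joins=False)
    print(name, joined, separate)
```

Under these assumptions each sample counts as 1 word when NNBSP joins and as 2 words when it separates, which matches the "TUS 9.0 expects 1, users might prefer 2" notes in the test cases.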

lianghai · 2017-01-27T12:10:47Z

The reply above originally contained some mistakes and was missing some information. Please read it on the web page if GitHub didn't push notifications of the later changes to your email inbox.

srl295 · 2017-02-14T03:12:57Z

Thank you. Mongolian will become the best supported language here!

badaa · 2018-02-12T21:57:58Z

Hi lianghai and srl295,
All the suffixes are considered as part of the word, thus they should not be counted separately.
They are not counted differently. If you count Mongolian suffixes separately, the word count increases massively, because Mongolian is a very, very agglutinative language. A word such as "ᠠᠵᠢᠯ ᠤᠨ ᠬᠢᠨ ᠲᠠᠢ ᠪᠠᠨ" is common. It is just one word, meaning "with colleagues" in English.

lianghai · 2018-02-13T16:19:22Z

@badaa

> All the suffixes are considered as part of the word, thus they should not be counted separately.

It's not a consensus among average users. There's a messy dislocation between linguistic scholars' definition of "word", encoding experts' requirement of NNBSP for correct shaping, common users' understanding of "word", and the intention of word counting (are we counting words for linguistic scholars?) for various user groups.

badaa · 2018-02-13T20:36:07Z

@lianghai
It's a distinct term. Nobody could imagine in Mongolia that the suffixes are counted as words. What do you mean "average users"?
Every scholar teaches this term in school. Thus, I tend to think that by "average users" you mean anyone who is either "unqualified" or didn't attend a Mongolian school.

badaa · 2018-02-13T21:09:55Z

@lianghai

> NNBSP's width (including stretchability), word boundary extending, and non-breaking behaviors are not always preferred by general users.

You have probably misunderstood here. Because there is one situation the non-breakable behaviour is not preferred to write inflected words like "ᠠᠵᠢᠯ ᠤᠨ ᠬᠢᠨ ᠲᠠᠢ ᠪᠠᠨ" vertically in horizontal narrow space. It is actually design or space issue. Even the suffixes are started from new column they are part of the word thus they start with medial form. I would say the suffixes can be breakable until they start as medial form, but not separable from the word even they start from new line.

lianghai · 2018-02-25T07:16:21Z

@badaa

> Nobody could imagine in Mongolia that the suffixes are counted as words.

Then I need to consider the possibility that users in Mongolia tend to consider enclitics as parts of the preceding word because the Cyrillic writing system tends to write enclitics actually as suffixes. My impression from Inner Mongolian users doesn't suggest so.

Note "suffix" refers to the structures that are actually attached, thus I try to call the detached structures "enclitics" to distinguish them.

> What do you mean "average users"?

I ask non-scholar users of the Mongolian script every now and then about how they count words and why. My impression is that they don't have a strong tendency to omit enclitics. I don't know your qualifications, but most of them did attend Mongolian schools.

Also, I might need to clarify that I consider scholars' opinion to be especially irrelevant to this topic, because I don't follow prescriptivism. No matter what scholars/teachers teach, what average users have actually learned is what actually matters. Note there's a difference.

> Because there is one situation the non-breakable behaviour is not preferred to write inflected words like "ᠠᠵᠢᠯ ᠤᠨ ᠬᠢᠨ ᠲᠠᠢ ᠪᠠᠨ" vertically in horizontal narrow space. It is actually design or space issue.

I agree such cases involve more typographical considerations. However, I don't observe a strong preference for keeping the space before enclitics non-breaking, even in the running text of newspapers and books.

> Even the suffixes are started from new column they are part of the word thus they start with medial form.

I understand that breakability and word counting are not necessarily related, so you can be assured that I won't analyze word counting according to breakability.

> I would say the suffixes can be breakable until they start as medial form, but not separable from the word even they start from new line.

Noted. Do you have a list of the detached structures that you consider "suffixes" (which I call "enclitics", and which I assume you think should all be counted as part of the preceding word), marked with their breakability?

badaa · 2018-02-25T19:45:37Z

@lianghai

> No matter what scholars/teachers teach, what average users have actually learned is what actually matters. Note there's a difference.

Actually, that is the direct opposite. Teachers teach how words should be written in this script and what is correct or incorrect, no matter what users have learned. We could not order that all they should learn correct but we wish always.

> Noted. Do you have a list of the detached structures that you consider "suffixes" (which I call "enclitics", and which I assume you think should all be counted as part of the preceding word), marked with their breakability?

I will bring the list of all detached structures at Mongolian Meeting in April.

lianghai · 2018-02-26T08:31:45Z

@badaa

> Actually, that is the direct opposite. Teachers teach how words should be written in this script and what is correct or incorrect, no matter what users have learned. We could not order that all they should learn correct but we wish always.

Then it's clear that you tend to be more prescriptive, while I work in a more descriptive way. I hope this difference between our methodologies helps you understand our disagreement on certain issues.

> I will bring the list of all detached structures at Mongolian Meeting in April.

Thank you.

mongoltolbo · 2018-03-02T07:47:13Z

Of course, suffixes are part of their stem word, and the concept is the same as in the Cyrillic Mongolian script. In academia, in essay writing classes, students count each word together with its suffixes, just as word processing applications do. It is obvious. That is why we need this proposal, http://www.unicode.org/L2/L2017/17036-mongolian-suffix.pdf, put into practice as soon as possible. I hope 180F's behavior will include a breakable attribute (like a non-breaking hyphen) while it is still counted within its stem. When is this proposal going to be implemented in the Unicode standard? I have many font projects that are set back and waiting on this unsolved problem.
http://mongoltolbo.com

lianghai · 2018-03-02T13:54:08Z

@mongoltolbo Can you elaborate on why your font projects are waiting for the proposed MSC model? And may I ask, do you consider it obvious whether the structures uu/üü and ügei should be counted or ignored (counted as part of the stem word)? (cc @badaa for the second question.)

fromUB · 2018-03-14T14:53:17Z

@lianghai
Inflections for grammatical cases are parts of the nouns which they are modifying. If they were separate words, they would be written with initial letters, but they start with mid characters indicating that they are part of the noun.

aronsoyol · 2019-02-21T08:12:28Z

Whether it is U+202F or U+180F, and whether or not a suffix is counted as a word of its own, the Mongolian Suffix Connector should stretch its width in justified line alignment.
