Mandarin transliteration in HK data #200

Open
ronaldtse opened this issue Jan 10, 2020 · 1 comment
Labels
enhancement New feature or request

Comments

@ronaldtse
Contributor

Mandarin transliteration in HK data

(Originally in #39)

Extracted a list of Mandarin transliterations from the Hong Kong dataset and created a toneless pinyin map for testing.

  1. Spacing
    Hanyu Pinyin (zho_Hani2Latn_GCH_1979) has detailed rules on word segmentation, which have not yet been implemented. Whether a space is needed depends on a number of factors and cannot be handled by mapping rules alone. For example, the place names below all contain the character "灣", but only the first and third rows are transliterated as one word.

[Image: rows of place names containing "灣" with their transliterations]

A separate parsing layer may be needed to handle space insertion (related to #44).

  2. Syllable separator for zero-onset syllables
    Syllables beginning with a, o, or e should be preceded by a syllable separator (an apostrophe) unless they are the first syllable of a word, e.g. 西安 Xi’an.

  3. Hong Kong-specific readings
    涌: Chong
    仔: Zai
    咀: Zui (<嘴)
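
The syllable separator rule could be sketched roughly as follows. This is only an illustration, not code from the project: `join_syllables`, `ZERO_ONSET_INITIALS`, and `SEPARATOR` are hypothetical names, and the sketch assumes word segmentation has already been done upstream, so the input is the toneless syllables of a single word.

```python
# Hypothetical sketch: join the toneless pinyin syllables of ONE word,
# inserting a separator before any non-initial syllable that starts
# with a, o, or e. Word segmentation is assumed to happen elsewhere.

ZERO_ONSET_INITIALS = ("a", "o", "e")
SEPARATOR = "\u2019"  # right single quotation mark, as in the Xi’an example


def join_syllables(syllables):
    """Return one capitalised word built from a list of syllables."""
    parts = []
    for index, syllable in enumerate(syllables):
        syllable = syllable.lower()
        # The first syllable of a word never takes a separator.
        if index > 0 and syllable.startswith(ZERO_ONSET_INITIALS):
            parts.append(SEPARATOR)
        parts.append(syllable)
    word = "".join(parts)
    return word[:1].upper() + word[1:]


if __name__ == "__main__":
    print(join_syllables(["xi", "an"]))     # 西安
    print(join_syllables(["bei", "jing"]))  # 北京
```

Because the rule depends on the position of a syllable within a word, it sits naturally after segmentation rather than inside a character-to-syllable mapping table.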

Toneless Pinyin Map with HK place names
cn-chn-Hans-Latn-pinyin_toneless.yaml.zip
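
One possible way to layer the Hong Kong-specific readings over a general map is an override table, sketched below. The names `DEFAULT_READINGS`, `HK_READINGS`, and `transliterate_chars` are hypothetical, the default readings are illustrative assumptions, and none of this reflects the actual contents or structure of the attached YAML map.

```python
# Hypothetical sketch: HK-specific readings override a general
# character-to-syllable map. The defaults here are illustrative
# assumptions, not data taken from the attached map.

DEFAULT_READINGS = {
    "東": "dong",
    "涌": "yong",  # assumed general reading
    "仔": "zi",    # assumed general reading
    "咀": "ju",    # assumed general reading
}

# Readings listed in this issue for Hong Kong place names.
HK_READINGS = {
    "涌": "chong",
    "仔": "zai",
    "咀": "zui",
}


def transliterate_chars(text, overrides=None):
    """Map each character to a toneless syllable; unknown characters pass through."""
    readings = {**DEFAULT_READINGS, **(overrides or {})}
    return [readings.get(char, char) for char in text]


if __name__ == "__main__":
    print(transliterate_chars("東涌", overrides=HK_READINGS))
    print(transliterate_chars("東涌"))
```

Keeping the overrides in a separate table would let the same base map serve other datasets, at the cost of an extra merge step.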

Originally posted by @chaaklau in #39 (comment)

@ronaldtse
Contributor Author

Also see #39 (comment)

@ronaldtse ronaldtse added the enhancement New feature or request label Jan 10, 2020