FreeDict HOWTO – Unicode
“People in different countries use different characters to represent the words of their native languages. Nowadays most applications, including email systems and web browsers, are 8-bit clean, i.e. they can operate on and display text correctly provided that it is represented in an 8-bit character set, like ISO-8859-1.” [Bruno Haible: Linux Unicode Howto]
From: http://www.unicode.org/standard/WhatIsUnicode.html
“Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.”
Essentially, Unicode assigns a unique number (a code point) to each character, and that number is what is stored each time you type a letter on your keyboard. A font then renders the character on the screen so that you can read it. It is important to understand that Unicode is not a font: it is the standard mapping between characters and numbers that fonts rely on to render text on your screen or on a printed page.
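As a small illustration of the difference between a character and its number, the following Python sketch looks up the code point and the official Unicode name of a few characters. The helper name `describe` is made up for this example; `ord` and `unicodedata.name` are part of the Python standard library.

```python
import unicodedata


def describe(char):
    """Return the code point (as U+XXXX) and the Unicode name of one character."""
    return f"U+{ord(char):04X}", unicodedata.name(char)


# "a", e with acute accent, and German sharp s, written as escapes so the
# example does not depend on the encoding of this file.
for ch in ("a", "\u00e9", "\u00df"):
    print(*describe(ch))
```

No font is involved at any point: the number and the name belong to the character itself, whatever it looks like on screen.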
Tip:
All new dictionaries should be written in Unicode, ideally using an encoding such as UTF-8.
UTF-8 is a portable and efficient way of encoding all real-world characters. It covers the characters of most 8-bit and many 16-bit (2-byte) character sets. Your current character sets are probably included, so it may be as simple as converting your files to UTF-8 and putting
<?xml version="1.0" encoding="UTF-8"?>
as the XML declaration of your final document.
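If you generate your XML with a script, the declaration can be written for you. The sketch below uses Python's standard `xml.etree.ElementTree` module (the `xml_declaration` argument needs Python 3.8 or later). The element names `entry` and `orth` are placeholders chosen for this example, not a statement about what your dictionary format requires.

```python
import xml.etree.ElementTree as ET

# "entry" and "orth" are placeholder element names for this sketch.
entry = ET.Element("entry")
ET.SubElement(entry, "orth").text = "Stra\u00dfe"  # German word containing a sharp s

# Serialise to bytes, with an explicit XML declaration naming UTF-8.
xml_bytes = ET.tostring(entry, encoding="UTF-8", xml_declaration=True)
print(xml_bytes)

# Reading it back: the parser honours the encoding named in the declaration.
parsed = ET.fromstring(xml_bytes)
print(parsed.find("orth").text == "Stra\u00dfe")
```

The declaration only describes the bytes; the file itself must really be saved as UTF-8 for the two to agree.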
You should ensure that your dictionary does not rely on any particular font and is equally functional when rendered as plain text. Remember, fonts are just "pretty renderings" of real characters. Most modern text editors (e.g. XEmacs, Emacs, Vim, gedit, KEdit, Notepad++) can edit UTF-8 files without trouble.
This is not the place for a full explanation of Unicode. Please see Markus Kuhn's excellent summary at http://www.cl.cam.ac.uk/~mgk25/unicode.html. The Linux Unicode HOWTO is well worth visiting: ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html
- The Universal Character Set (UCS) is defined by the standard ISO/IEC 10646.
- Unicode is a parallel standard that additionally defines a number of practical ("coal face") rules for working with these characters in real life.
- Unicode is kept in synchronisation with ISO/IEC 10646.
- UTF-8 is one of several ways to store Unicode text as bytes. Another common one is UTF-16.
- UTF-8 is the default encoding for XML documents and is an excellent choice: it can represent the characters of (almost) all known languages while remaining backward compatible with ASCII. Note that most web browsers and protocols assume a UTF-8 compatible encoding.
- The characters of all ISO character sets in general use (including ASCII) can be represented in UTF-8, as can the CJK family (Korean, Japanese and some of the Chinese family of characters). If your editor has a UTF-8 option, please turn it on.
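The points above about byte encodings can be checked with a few lines of Python. This is only a sketch; the sample strings are arbitrary and written as escapes so the example does not depend on how this page is encoded.

```python
# 8 characters: Latin letters, an i with diaeresis, a space and two CJK characters.
text = "na\u00efve \u65e5\u672c"

# The same characters stored as bytes in two different encodings.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
print(len(text), len(utf8), len(utf16))

# Pure ASCII text is byte-for-byte identical in UTF-8.
ascii_same = "plain".encode("utf-8") == "plain".encode("ascii")

# Text from a legacy 8-bit set is converted by decoding with the old
# encoding and encoding again as UTF-8.
legacy = b"caf\xe9"  # "cafe" with an acute accent, in ISO-8859-1
converted = legacy.decode("iso-8859-1").encode("utf-8")
print(ascii_same, converted)
```

The character count stays the same in every encoding; only the number of bytes used to store the characters changes.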
Tip:
If you have automated some or all of your dictionary construction, please take care to keep the character encoding consistent throughout the process. C coders should use the wide character type (wchar_t). Most other languages in common use now also support UTF-8 (Python, Perl, PHP, Java and Ruby at least). Shell scripts usually adopt the settings of the local environment. Please check that your gawk and sed handle multi-byte characters cleanly; you may need very recent versions.
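One way to keep the encoding consistent in a script is to name it explicitly every time a file is read or written, rather than relying on the environment's default. The following Python sketch converts a file from a legacy encoding to UTF-8; the function name `convert_to_utf8` and the default source encoding are assumptions made for this example.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="iso-8859-1"):
    """Read src using source_encoding and write the same text to dst as UTF-8.

    Returns the number of characters converted. Decoding is strict, so bytes
    that are invalid in source_encoding raise an error instead of being
    silently replaced.
    """
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
    return len(text)
```

A call such as `convert_to_utf8("old.txt", "new.txt")` (both file names hypothetical) leaves the original untouched, so the result can be inspected before it replaces anything.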
More information: if you are on a Linux (or similar) system, try man 7 unicode. You may also have some Unicode tools installed: try man 1 unicode. You may also need to set your LANG environment variable. Most Linux-type systems let you do this per process, so you can run a number of languages and locales concurrently. Examples are given later.