<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type"
content="text/html; charset=ISO-8859-1">
<title>Encodings and R</title>
</head>
<body>
<h1 align="center">Encodings and R</h1>
<p>
The use of encodings is raised sporadically on the R mailing lists,
with discussion of ideas to `do better'. R has been developed by
authors speaking English or a Western European language, and its
current mindset is the ISO Latin 1 (aka ISO 8859-1) character
set. Even these authors find some problems, for example the lack
of some currency symbols (notably the Euro, € if it displays for
you). Users of R in Central Europe need more characters and are
sometimes puzzled that Latin 2 (aka ISO 8859-2) works only
partially.
Other languages present much greater challenges, and there is a <a
href="http://jasp.ism.ac.jp/workshop0312/ismws.2003.pdf">project</a>
to `Japanize' R which (for obvious reasons) is little known outside
Japan.
</p>
<p>One of the challenges is that in European usage, <tt>nchar(x)</tt>
is the number of characters in the string and is also used for
adjusting layouts. In other encodings there can be different
values for<br>
</p>
<ol>
<li>The number of characters in a string</li>
<li>The number of bytes used to store a string and</li>
<li>The number of columns used to display the string -- some chars
may be double width even in a monospaced font.<br>
</li>
</ol>
Fortunately <tt>nchar</tt> is little used at R level (and then often
just to see if a string is empty), but both it and its C-level
equivalents are used in all three meanings.<br>
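As a concrete illustration, in a later R (the <samp>type</samp>
argument of <samp>nchar()</samp> described further below was added in
R 2.1.0) running in a UTF-8 locale, all three counts can differ for
the same string:

```r
## The three possible meanings of string "length", via nchar()'s
## type argument (R >= 2.1.0, assuming a UTF-8 locale).
## U+6F22 is a CJK character: 1 character, 3 bytes in UTF-8, 2 columns wide.
x <- "\u6f22"
nchar(x, type = "chars")  # 1 -- number of characters
nchar(x, type = "bytes")  # 3 -- bytes in the UTF-8 representation
nchar(x, type = "width")  # 2 -- display columns in a monospaced font
```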
<br>
<span style="font-weight: bold;">Update:</span> This document was first
written in December 2003: see <a href="#Encodings_in_R_2.1.0">below</a>
for the changes made for R 2.1.0.<br>
<h2>Encoding in R 1.8.x</h2>
<p>The default behaviour is to treat characters as a stream of 8-bit
bytes, and not to interpret them other than to assume that each byte
represents one character. The only exceptions are
</p>
<ul>
<li>The connections mechanism allows the remapping of input files to
the `native' encoding. Since the encoding defaults to
<samp>getOption("encoding")</samp> it is possible to use this within
<samp>read.table</samp> fairly easily. Note that this is a
byte-level remapping, and that not all of R's input goes through the
connections mechanism.</li>
<li>Those graphical devices which name glyphs, notably
<samp>postscript</samp> and <samp>pdf</samp>, do have to deal with
encoding, and they allow the user to specify the byte-level mapping of
code to glyphs. This has been one of the problem areas as the
standard Adobe font metrics included in R only cover ISO Latin 1 and
not for example the Euro (although the URW font metrics supplied do
have it). Similarly, the Adobe fonts do not cover all of ISO
Latin 2.</li>
</ul>
With these exceptions, character encoding is the responsibility of the
environment provided by the OS, so<br>
<ul>
<li>What glyph is displayed on a graphics device depends on the
encoding of the font selected. <br>
</li>
<li>How output is displayed in the terminal depends on the font and
locale selected.<br>
</li>
<li>What numeric code is generated by keystrokes depends on the
keyboard mapping or locale in use.<br>
</li>
</ul>
<h2>Towards Unicode?</h2>
<p>
It seems generally agreed that Unicode is the way to cope with all
known character sets. There is a comprehensive <a
href="http://www.unicode.org/faq/">FAQ</a>. Unicode defines a
numbering of characters up to 31 bits although it seems agreed that
only 21 bits will ever be used. However, to use it as an
encoding would be rather wasteful, and most people seem to use UTF-8
(see this <a
href="http://www.cl.cam.ac.uk/%7Emgk25/unicode.html">FAQ</a>, rather
Unix-oriented), in which each character is represented as 1,2,...,6
bytes (and how many can be deduced from the first byte). As
7-bit ASCII characters are represented as a single byte (with the high
bit zero) there is no storage overhead unless non-American characters
are used.
An alternative encoding is UTF-16, which is a two-byte encoding of
most characters and a pair of two-bytes for others (`surrogate
pairs'). UTF-16 without surrogates is sometimes known as UCS-2,
and was the Unicode standard prior to version 3.0. (Note that
the ISO C99 wide characters need not be encoded as UCS-2.)
UTF-16 is big-endian unless otherwise specified (as UTF-16LE).
There is the concept of a <a
href="http://www.unicode.org/faq/utf_bom.html">BOM</a>, a non-printing
first character that can be used to determine the endian-ness (and
which Microsoft code expects to see in UTF-16 files).</p>
<p>Not only can a single
character be stored in a variable number of bytes but it can be
displayed in 1, 2 or even 0 columns.</p>
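This variable-width property can be seen directly with
<samp>charToRaw()</samp> and <samp>utf8ToInt()</samp> (functions from
a later R than the versions discussed here):

```r
## How many bytes UTF-8 uses per character: 1 for ASCII, more otherwise.
## The \uxxxx escapes yield UTF-8-encoded strings in modern R.
length(charToRaw("A"))       # 1 byte  -- plain ASCII
length(charToRaw("\u00e9"))  # 2 bytes -- e-acute, U+00E9
length(charToRaw("\u20ac"))  # 3 bytes -- Euro sign, U+20AC
length(charToRaw("\u6f22"))  # 3 bytes -- a CJK character
utf8ToInt("\u20ac")          # 8364, i.e. 0x20AC: the Unicode code point
```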
<p>
Linux and other systems based on <samp>glibc</samp> are moving towards
UTF-8 support: if the locale is set to <samp>en_GB.utf8</samp> then
the run-time assumes UTF-8 encoding is required. <a
href="ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html">Here</a>
is a somewhat outdated Linux HOWTO: its advice is to use wide
characters internally and ISO C99 facilities to convert to and from
external representations.<br>
</p>
<p>Major Unix distributions
(e.g. <a href="http://wwws.sun.com/software/whitepapers/wp-unicode/">Solaris</a>
2.8) are also incorporating UTF-8 support. It appears that the Mac part
of MacOS X uses UTF-16. </p>
<p>Windows has long supported `wide characters', that is 2-byte
representations of characters, and provides fonts covering a very wide
range of glyphs (at least under NT-based versions of Windows).
This appears to be little-endian UCS-2, and it is said that internally
Windows NT uses wide characters, converting the usual byte-based
characters to and fro as needed. Some Asian versions of Windows
use a double-byte character set (DBCS) which appears to represent
characters in one or two bytes: this is the meaning of
<samp>char</samp> in Japanese-language versions of Windows.
Long filenames are stored in `Unicode', and are sometimes
automatically translated to the `OEM' character set (that is ASCII
plus an 8-bit extension set by the code page). Windows 2000 and later
have limited support for the surrogate pairs of UTF-16. Translations
from `Unicode' to UTF-8 and vice versa by functions
<samp>WideCharToMultiByte</samp> and <samp>MultiByteToWideChar</samp>
are supported in NT-based Windows versions, and in earlier ones with
the `Microsoft Layer for Unicode'.</p>
<h2>Implementation issues</h2>
If R were to use UTF-8 internally we would need to handle at least the
following issues<br>
<ul>
<li>Conversion to UTF-8 on input. This would be easy for
connections-based input (although a more general way to describe the
source encoding would be required), but all the console/keyboard-based
input routines would need to be modified. There would need to be
a more comprehensive way to specify encodings. Possibilities are to
use <a
href="http://www.gnu.org/software/libiconv/"><samp>libiconv</samp></a>
(if installed, or to install it ourselves) or a DIY approach like <a
href="http://www.tcl.tk/doc/howto/i18n.html">Tcl/Tk</a>.<br>
</li>
<li>Conversion of text output. This would be easy for
connections-based output, but dialog-box based output would need to be
handled, for example. It is not clear what to do with characters
which cannot be mapped -- the graphical devices currently map to
space.<br>
</li>
<li>Handling of file names. It is quite common to read,
manipulate and process file names. If these are allowed to be
UTF-8 this would be straightforward, but are they? Probably
usually not. Note that Unix kernels expect single bytes for NUL
and / at least, so cannot work with UTF-16 file names. <br> On
<a
href="http://developer.apple.com/technotes/tn2002/tn2078.html">MacOS
X</a> and Windows the encoding of file names depends on the file
system: the modern file systems use UCS-2 file names.<br>
</li>
<li>Graphical text output. This boils down to either selecting
suitable fonts or converting to the encoding of the fonts. I
suspect that under Windows a 2-byte encoding would be used, and X
servers can make use of ISO10646-1 fonts but the present device
would need its font-handling <a
href="http://www.debian.org/doc/manuals/intro-i18n/ch-output.en.html#s-output-x-xlib">internationalized</a>.<br>
</li>
<li>Text manipulation, for example <samp>match</samp> and
<samp>grep</samp> and <samp>tolower</samp>. For some of these,
UTF-8 versions are readily available; others we would have to
rewrite. A lucky few, like <samp>match</samp>, would work
directly on the encoded strings. For PCRE, only UTF-8 is available as
there is no wide-character version, whereas the GNU <samp>regex</samp>
has a wide-character version but not a UTF-8 one (and according to <a
href="http://www.cl.cam.ac.uk/%7Emgk25/unicode.html#mod">Markus
Kuhn</a> is 100x slower as a result).<br> String collation is also an
issue in a few places, but <samp>strcoll</samp> should be UTF-8-aware
on suitable OSes.<br> Most widespread is the use of
<samp>snprintf</samp>, <samp>strncmp</samp> and simple loops to, e.g.,
map <samp>\</samp> to <samp>/</samp> (and the latter is fine as no
ASCII character can occur as part of a multibyte sequence). The use of
classification types such as <samp>isalpha</samp> would need
replacement (and probably coercion to wide characters would be
easiest).
<samp>substr(ing)</samp> and <samp>strsplit</samp> will need to be
aware of character boundaries. Note that Unicode has three
cases, not two (the extra one being `title').<br>
</li>
<li>We would need to support the <samp>\uxxxx</samp> format for
arbitrary Unicode characters.</li>
<li>The format for the distribution of R sources. Fortunately
only a few files are not in ASCII: some <samp>.Rd</samp> files, as
well as the <samp>THANKS</samp> file.<br>
</li>
<li>Help files. Most modern Web browsers can display UTF-8, and Perl
5.8 is apparently aware of UTF-8 (and uses it internally) so it
<em>may</em> be fairly easy to make use of our existing
<samp>Rdconv</samp>. I have just added a
<samp>charset=iso-8859-1</samp> to the header of the converted HTML
help files, and this would need to be changed. Since LaTeX cannot
handle Unicode we would have to convert the encoding of latex help
files or use Lambda (and tell it they were in UTF-8).<br>
</li>
<li>Environment variables could have both names and values in UTF-8.<br>
</li>
</ul>
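The <samp>\uxxxx</samp> escapes were in fact added in a later R; a
brief sketch of how they behave there, together with the companion
functions <samp>intToUtf8()</samp> and <samp>utf8ToInt()</samp>:

```r
## \uxxxx denotes the Unicode character with the given hex code point;
## intToUtf8() and utf8ToInt() convert between code points and strings.
x <- "\u20ac"                     # the Euro sign
utf8ToInt(x)                      # 8364 == 0x20AC
identical(intToUtf8(0x20ac), x)   # TRUE: round trip through the code point
nchar(x)                          # 1 character, however many bytes are used
```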
The API for extending R would be problematical. There are a few
hundred R extensions written in C and FORTRAN, and a few of them
manipulate character vectors. They would not be expecting UTF-8
encoding (and probably have not thought about encodings at all).
Possible ways forward are<br>
<ul>
<li>To map to a single-byte encoding (Latin1?) and back again when .C
does
the copying.</li>
<li>Just to pass through the stream of bytes.</li>
</ul>
This does raise the issue of whether the <samp>CHAR</samp> internal
type should be used for UTF-8 or a new type created. It would
probably be better to create a new type for raw bytes.<br>
<br>
Eiji Nakama's <a
href="http://www.stat.auckland.ac.nz/%7Eihaka/Papers/Nakama.pdf">paper</a>
on `Japanizing R' seems to take the earlier multi-byte character
approach rather than UTF-8 or UCS-2, except for Windows fonts.
Functions such as <samp>isalpha</samp> do not work correctly in MBCSs
(including UTF-8).<br>
<br>
The Debian <a href="http://www.debian.org/doc/manuals/intro-i18n/">guide
to internationalization</a> is a useful background resource. Note
that internationalization is often abbreviated as 'i18n', and
localization (support for a particular locale) as 'L10n'. The
main other internationalization/localization issue is to allow for the
translation of messages (and to translate them).<br>
<br>
<h2><a name="Encodings_in_R_2.1.0"></a>Encodings in R 2.1.0</h2>
Work started in December 2004 on implementing UTF-8 support for R
2.1.0, expected to be released in April 2005. Currently
implemented are:
<ul>
<li>The parser has been made aware of multi-byte characters in UTF-8
and so works in character (rather than byte) units.
</li>
<li>An internationalized version of the regexp code. For the
basic and extended regexps we use the code from
<samp>glibc-2.3.3</samp>, which internally uses widechars and so supports all
multi-byte character sets, e.g. UTF-8. For the Perl versions
we use PCRE, which has UTF-8 (but not general MBCS) support
available.
</li>
<li>Replacement versions of <samp>chartr</samp>,
<samp>toupper</samp> and <samp>tolower</samp> work <em>via</em>
conversion to widechar and so handle any MBCS that the OS supports
as the current locale.
</li>
<li><samp>substr()</samp> and <samp>make.names()</samp> work with
characters not bytes.
</li>
<li><samp>nchar()</samp> has an additional argument to return the
number of bytes, the number of characters or the display
width. It was often used in conjunction with
<samp>substr()</samp> to truncate character strings: that
should be done in terms of display width for which there is a
new function <samp>strtrim()</samp>.
</li>
<li>A new function <samp>iconv()</samp> allows character vectors to
be mapped between encodings (where it is available: <a
href="http://www.gnu.org/software/libiconv/">GNU libiconv</a> has
been grafted on for the Windows build).
</li>
<li>The '<samp>encoding</samp>' argument of connections has been
changed from a numeric vector to a character string naming an
encoding that <samp>iconv</samp> knows about, and re-encoding on the
fly can now be done on both input and output. Note that this
does not apply to the 'terminal' connections nor to text connections,
but does apply to all file-like connections. If input is redirected from
a file (or pipe), the input encoding can be specified by the
command-line flag <samp>--encoding</samp>.
</li>
<li>The <samp>postscript()</samp> and <samp>pdf()</samp> devices
handle UTF-8 strings by remapping to Latin1 (this is currently
hardcoded).
</li>
<li>A start has been made on converting the <samp>X11()</samp>
device and the X11-based data editor using Nakama's Japanization
patches, and on adding X input methods to the data editor, so it
does now work in a (Western) UTF-8 locale.
</li>
<li><samp>scan()</samp> needs single-byte chars for its decimal,
comment and separator characters -- this is now enforced. It still
uses <samp>isspace</samp> and <samp>isdigit</samp>, so only ASCII
space and digit chars are recognized (but this seems a minor
problem).
</li>
<li><samp>abbreviate()</samp> is a problem: its algorithm is
hardcoded for English (e.g. which bytes are vowels) and it now warns
if given non-ASCII text.
</li>
<li><samp>print()</samp>ing looks for valid characters and only
escapes non-printable characters (rather than bytes). It does so
by converting to widechars and using the <samp>wctype</samp> functions
in the current locale.
</li>
<li>UTF-8 strings are passed to and from the <samp>tcltk</samp>
package (this applies in any MBCS).
</li>
<li>
There is some support for <samp>pch=n</samp> > 127 and
<samp>pch="c"</samp> in UTF-8 locales, where a number is taken to be
the Unicode character number, and the first MBCS character is
taken.
</li>
<li>
The replacement for <samp>strptime</samp> has been rewritten to work
a character at a time, using widechars internally.
</li>
<li>
The Hershey fonts are encoded in Latin-1, so the
<samp>vfont</samp> support has been rewritten to re-encode to
Latin-1.
</li>
<li>
A new function <samp>localeToCharset</samp> attempts to deduce
plausible character sets from the locale name (on Unix and on
Windows). This is used by <samp>source</samp> to test out plausible
encodings if the (new) argument <samp>encoding = "unknown"</samp> is
specified.
</li>
<li>
<samp>.Rd</samp> has a new directive <samp>\encoding{}</samp> to set
the encoding to be assumed for the file and hence its HTML
translation (and also this is given as a comment in the example
file). Note that one has to be careful here, as some
implementations of <samp>iconv</samp> do not allow any 8-bit chars
in the <samp>C</samp> locale, and the lack of standards for charset
names is also a problem.
</li>
<li>
The Windows console and data editor have been modified to work with
MBCS character sets, as well as having support for double-width
characters.
</li>
<li>
<samp>readChar</samp> and <samp>writeChar</samp> work in characters
not bytes.
</li>
<li>
<samp>.C</samp> supports a new argument <samp>ENCODING=</samp> to
specify the encoding expected for character strings.
</li>
<li>
<samp>delimMatch</samp> (<samp>tools</samp>) returns the position and
match length in characters not bytes, and allows multi-byte delimiters.
</li>
</ul>
For many of these features R needs to be configured with
<samp>--enable-utf8</samp>.
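Two of these features, <samp>iconv()</samp> and re-encoding on
connections, can be sketched together. This assumes a later R with a
UTF-8 (or other Latin-1-capable) native encoding and an
<samp>iconv</samp> that knows the encoding names used:

```r
## iconv(): map a character vector between encodings.
x <- "caf\u00e9"                 # "cafe" with e-acute, stored as UTF-8
y <- iconv(x, from = "UTF-8", to = "latin1")
length(charToRaw(x))             # 5 bytes: e-acute takes 2 bytes in UTF-8
length(charToRaw(y))             # 4 bytes: it is a single byte in Latin-1

## Re-encoding on a connection: write a Latin-1 file, read it back.
tmp <- tempfile()
con <- file(tmp, open = "w", encoding = "latin1")
writeLines(x, con)               # converted to Latin-1 on output
close(con)
con <- file(tmp, open = "r", encoding = "latin1")
z <- readLines(con)              # converted back to the native encoding
close(con)
z == x                           # TRUE
unlink(tmp)
```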
<h3>Implementation details</h3>
The C code often handles character strings as a whole. We have
identified the following places where character-level access is used:
<ul>
<li>In the parser to identify tokens. (<samp>gram.y</samp>)</li>
<li><samp>do_nchar, do_substr, do_substrgets, do_strsplit,
do_abbrev, do_makenames, do_grep, do_gsub, do_regexpr, do_tolower,
do_chartr, do_agrep, do_strtrim</samp> (<samp>character.c</samp>),
<samp>do_pgrep, do_pgsub, do_pregexpr</samp>
(<samp>pcre.c</samp>)</li>
<li><samp>GEText, GEMetricInfo</samp>. (<samp>engine.c</samp>)</li>
<li><samp>RenderStr</samp>. (<samp>plotmath.c</samp>)</li>
<li><samp>RStrlen</samp>, <samp>EncodeString</samp>
(<samp>printutil.c</samp>)</li>
<li>The dataentry editor (various functions in <samp>dataentry.c</samp>)</li>
<li>Graphics devices in handling encoded text, and in metric
info. (Currently <samp>devX11.c, rotated.c</samp> and
<samp>devPS.c</samp> have been changed, and <samp>devPicTeX.c</samp>
is tied to TeX which is a byte-based program.)
</li>
<li>
The ASCII versions of <samp>load</samp> and <samp>save</samp>. As
these are a reversible representation of objects in ASCII, it does
not matter if they are handled as byte streams.
</li>
<li>
New wrapper functions <samp>Rf_strchr</samp>,
<samp>Rf_strrchr</samp> and <samp>R_fixslash</samp> cover
comparisons with single ASCII characters.<br><br>
<samp>backquotify</samp> (<samp>deparse.c</samp>),
<samp>do_dircreate</samp> (<samp>platform.c</samp>), <samp>do_getwd</samp>,
<samp>do_basename</samp>, <samp>do_dirname</samp> and
<samp>isBlankString</samp> (<samp>util.c</samp>)
are now MBCS-aware.
</li>
</ul>
<p>
There are many other places which do a comparison with a single ASCII
character (such as . or / or \ or LF) and so cause no problem in UTF-8
but might in other MBCSs. These include <samp>filbuf</samp>
(<samp>platform.c</samp>, which looks for CR and LF and these seem
safe), <samp>fillBuffer</samp> (<samp>scan.c</samp>) and there are
others.
<p>
Encodings which are likely to cause problems include
<ul>
<li> Vietnamese (VISCII). This uses 186 characters including the
control characters <samp>0x02, 0x05, 0x06, 0x14, 0x19, 0x1e</samp>:
the Windows GUI makes use of these as control characters.
</li>
<li>Big5, GBK, Shift-JIS. These are all 1- or 2-byte encodings
including ASCII as 1-byte chars (except that Shift-JIS replaces
backslash by ¥) but whose second byte overlaps the ASCII range.
</li>
</ul>
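The overlap can be demonstrated with a later R's <samp>iconv()</samp>
(assuming the system <samp>iconv</samp> accepts the name
<samp>SHIFT_JIS</samp>). U+8868 is the classic example: its trail byte
is the ASCII code for backslash, so naive byte-wise searches for
<samp>\</samp> can fire in the middle of a character:

```r
## U+8868, a common CJK character, is 0x95 0x5C in Shift-JIS;
## the second byte 0x5C is ASCII backslash.
x <- iconv("\u8868", from = "UTF-8", to = "SHIFT_JIS")
charToRaw(x)   # 95 5c
```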
<p> <samp>fillBuffer</samp> (<samp>scan.c</samp>) has now been
rewritten to be aware of double-byte character sets and to only test
the lead byte.
<h3>Windows</h3>
Windows does things somewhat differently. `Standard' versions of
Windows have only single-byte locales, with the interpretation of
those bytes being determined by <em>code pages</em>. However, `East
Asian' versions (an optional install at least on Windows XP) use
double-byte locales in which characters can be represented by one or
two bytes (and can be one or two columns wide).
Windows also has `Unicode' (UCS-2) applications in which all
information is transferred as 16-bit wide characters, and the locale
does not affect the interpretation. Windows 2000 and later have
optional support for surrogate pairs (UTF-16) but this is not normally
enabled. (See <a
href="http://www.i18nguy.com/surrogates.html">here</a> for how to
enable it.)
Currently R-devel has four levels of MBCS support under Windows.
<ul>
<li> By default, all character strings are interpreted as single bytes.
</li>
<li> If <samp>SUPPORT_MBCS</samp> is defined in <samp>MkRules</samp>
and in <samp>config.h</samp>, <samp>R.dll</samp> will recognize
multi-byte characters if run in an MBCS locale and generally (but not
always, notably in <samp>scan</samp>) treat them as whole units.
</li>
<li> If in addition <samp>SUPPORT_GUI_MBCS</samp> is defined in
<samp>MkRules</samp>, <samp>RGui</samp> is compiled to be aware of
multi-byte characters if run in an MBCS locale, and cursor movements
will work in whole characters, with the cursor width adapting to the
current character's width.
</li>
<li> If <samp>SUPPORT_UTF8</samp> is defined in addition to
<samp>SUPPORT_MBCS</samp>, most of <samp>R.dll</samp> will assume it
is running in a UTF-8 locale. As there are no such locales under
Windows, this is only useful with a custom front-end that
communicates in UTF-8 (and even then there are issues with file
names and content, and environment variables).
</li>
</ul>
<h2>Localization of messages</h2>
As from 2005-01-25, R uses GNU <samp>gettext</samp> where available.
So far only the start-up message is marked for translation, as a
proof-of-concept: there are several thousand C-level messages that
could potentially be translated.
The same mechanism could be applied to R packages, provided they call
<samp>dgettext</samp> with a <samp>PACKAGE</samp> specific to the
package, and install their own <samp>PACKAGE.mo</samp> files, say via
an <samp>inst/po</samp> directory. The <samp>splines</samp> package
has been converted to show how this might be done: it only has one
error message.
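At R level the same machinery is exposed through
<samp>gettext()</samp> and <samp>gettextf()</samp> (added around this
time); a sketch, where the domain name <samp>R-mypkg</samp> is purely
illustrative:

```r
## gettextf() looks up a translation of the format string for the
## current language, then applies sprintf-style formatting; with no
## translation installed it falls back to the original English text.
## The domain "R-mypkg" is a made-up example.
msg <- gettextf("%d variables were dropped", 3L, domain = "R-mypkg")
msg   # "3 variables were dropped"
```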
<br><br>
Brian Ripley<br>
2004-01-11, 2005-01-25<br>
</p>
</body>
</html>