<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type"
content="text/html; charset=ISO-8859-1">
<title>Encodings and R</title>
</head>
<body>
<h1 align="center">Encodings and R</h1>
<p>
The use of encodings is raised sporadically on the R mailing lists,
with discussion of ideas to `do better'. R has been developed by
authors speaking English or a Western European language, and its
current mindset is the ISO Latin 1 (aka ISO 8859-1) character
set. Even these authors find some problems, for example the lack
of some currency symbols (notably the Euro, € if it displays for
you). Users of R in Central Europe need more characters and are
sometimes puzzled that Latin 2 (aka ISO 8859-2) works only
partially.
Other languages present much greater challenges, and there is a <a
href="http://jasp.ism.ac.jp/workshop0312/ismws.2003.pdf">project</a>
to `Japanize' R which (for obvious reasons) is little known outside
Japan.
</p>
<p>One of the challenges is that in European usage, <tt>nchar(x)</tt>
is the number of characters in the string and is also used for
adjusting layouts. In other encodings there can be different
values for<br>
</p>
<ol>
<li>The number of characters in a string</li>
<li>The number of bytes used to store a string and</li>
<li>The number of columns used to display the string -- some chars
may be double width even in a monospaced font.<br>
</li>
</ol>
Fortunately <tt>nchar</tt> is little used at R level (and then often
just to see if a string is empty), but both it and its C-level
equivalents are used in all three meanings.<br>
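As a concrete illustration, in a later R (the <samp>type</samp>
argument of <samp>nchar()</samp> described further below was added in
R 2.1.0) running in a UTF-8 locale, all three counts can differ for
the same string:

```r
## The three possible meanings of string "length", via nchar()'s
## type argument (R >= 2.1.0, assuming a UTF-8 locale).
## U+6F22 is a CJK character: 1 character, 3 bytes in UTF-8, 2 columns wide.
x <- "\u6f22"
nchar(x, type = "chars")  # 1 -- number of characters
nchar(x, type = "bytes")  # 3 -- bytes in the UTF-8 representation
nchar(x, type = "width")  # 2 -- display columns in a monospaced font
```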
<br>
<span style="font-weight: bold;">Update:</span> This document was first
written in December 2003: see <a href="#Encodings_in_R_2.1.0">below</a>
for the changes made for R 2.1.0.<br>
<h2>Encoding in R 1.8.x</h2>
<p>The default behaviour is to treat characters as a stream of 8-bit
bytes, and not to interpret them other than to assume that each byte
represents one character. The only exceptions are
</p>
<ul>
<li>The connections mechanism allows the remapping of input files to
the `native' encoding. Since the encoding defaults to
<samp>getOption("encoding")</samp> it is possible to use this within
<samp>read.table</samp> fairly easily. Note that this is a
byte-level remapping, and that not all of R's input goes through the
connections mechanism.</li>
<li>Those graphical devices which name glyphs, notably
<samp>postscript</samp> and <samp>pdf</samp>, do have to deal with
encoding, and they allow the user to specify the byte-level mapping of
code to glyphs. This has been one of the problem areas as the
standard Adobe font metrics included in R only cover ISO Latin 1 and
not for example the Euro (although the URW font metrics supplied do
have it). Similarly, the Adobe fonts do not cover all of ISO
Latin 2.</li>
</ul>
With these exceptions, character encoding is the responsibility of the
environment provided by the OS, so<br>
<ul>
<li>What glyph is displayed on a graphics device depends on the
encoding of the font selected. <br>
</li>
<li>How output is displayed in the terminal depends on the font and
locale selected.<br>
</li>
<li>What numeric code is generated by keystrokes depends on the
keyboard mapping or locale in use.<br>
</li>
</ul>
<h2>Towards Unicode?</h2>
<p>
It seems generally agreed that Unicode is the way to cope with all
known character sets. There is a comprehensive <a
href="http://www.unicode.org/faq/">FAQ</a>. Unicode defines a
numbering of characters up to 31 bits although it seems agreed that
only 21 bits will ever be used. However, to use it as an
encoding would be rather wasteful, and most people seem to use UTF-8
(see this <a
href="http://www.cl.cam.ac.uk/%7Emgk25/unicode.html">FAQ</a>, rather
Unix-oriented), in which each character is represented as 1,2,...,6
bytes (and how many can be deduced from the first byte). As
7-bit ASCII characters are represented as a single byte (with the high
bit zero) there is no storage overhead unless non-American characters
are used.
An alternative encoding is UTF-16, which is a two-byte encoding of
most characters and a pair of two-bytes for others (`surrogate
pairs'). UTF-16 without surrogates is sometimes known as UCS-2,
and was the Unicode standard prior to version 3.0. (Note that
the ISO C99 wide characters need not be encoded as UCS-2.)
UTF-16 is big-endian unless otherwise specified (as UTF-16LE).
There is the concept of a <a
href="http://www.unicode.org/faq/utf_bom.html">BOM</a>, a non-printing
first character that can be used to determine the endian-ness (and
which Microsoft code expects to see in UTF-16 files).</p>
<p>Not only can a single
character be stored in a variable number of bytes but it can be
displayed in 1, 2 or even 0 columns.</p>
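This variable-width property can be seen directly with
<samp>charToRaw()</samp> and <samp>utf8ToInt()</samp> (functions from
a later R than the versions discussed here):

```r
## How many bytes UTF-8 uses per character: 1 for ASCII, more otherwise.
## The \uxxxx escapes yield UTF-8-encoded strings in modern R.
length(charToRaw("A"))       # 1 byte  -- plain ASCII
length(charToRaw("\u00e9"))  # 2 bytes -- e-acute, U+00E9
length(charToRaw("\u20ac"))  # 3 bytes -- Euro sign, U+20AC
length(charToRaw("\u6f22"))  # 3 bytes -- a CJK character
utf8ToInt("\u20ac")          # 8364, i.e. 0x20AC: the Unicode code point
```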
<p>
Linux and other systems based on <samp>glibc</samp> are moving towards
UTF-8 support: if the locale is set to <samp>en_GB.utf8</samp> then
the run-time assumes UTF-8 encoding is required. <a
href="ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html">Here</a>
is a somewhat outdated Linux HOWTO: its advice is to use wide
characters internally and ISO C99 facilities to convert to and from
external representations.<br>
</p>
<p>Major Unix distributions
(e.g. <a href="http://wwws.sun.com/software/whitepapers/wp-unicode/">Solaris</a>
2.8) are also incorporating UTF-8 support. It appears that the Mac part
of MacOS X uses UTF-16. </p>
<p>Windows has long supported `wide characters', that is 2-byte
representations of characters, and provides fonts covering a very wide
range of glyphs (at least under NT-based versions of Windows).
This appears to be little-endian UCS-2, and it is said that internally
Windows NT uses wide characters, converting the usual byte-based
characters to and fro as needed. Some Asian versions of Windows
use a double-byte character set (DBCS) which appears to represent
characters in one or two bytes: this is the meaning of
<samp>char</samp> in Japanese-language versions of Windows.
Long filenames are stored in `Unicode', and are sometimes
automatically translated to the `OEM' character set (that is ASCII
plus an 8-bit extension set by the code page). Windows 2000 and later
have limited support for the surrogate pairs of UTF-16. Translations
from `Unicode' to UTF-8 and vice versa by functions
<samp>WideCharToMultiByte</samp> and <samp>MultiByteToWideChar</samp>
are supported in NT-based Windows versions, and in earlier ones with
the `Microsoft Layer for Unicode'.</p>
<h2>Implementation issues</h2>
If R were to use UTF-8 internally we would need to handle at least the
following issues<br>
<ul>
<li>Conversion to UTF-8 on input. This would be easy for
connections-based input (although a more general way to describe the
source encoding would be required), but all the console/keyboard-based
input routines would need to be modified. There would need to be
a more comprehensive way to specify encodings. Possibilities are to
use <a
href="http://www.gnu.org/software/libiconv/"><samp>libiconv</samp></a>
(if installed, or to install it ourselves) or a DIY approach like <a
href="http://www.tcl.tk/doc/howto/i18n.html">Tcl/Tk</a>.<br>
</li>
<li>Conversion of text output. This would be easy for
connections-based output, but dialog-box based output would need to be
handled, for example. It is not clear what to do with characters
which cannot be mapped -- the graphical devices currently map to
space.<br>
</li>
<li>Handling of file names. It is quite common to read,
manipulate and process file names. If these are allowed to be
UTF-8 this would be straightforward, but are they? Probably
usually not. Note that Unix kernels expect single bytes for NUL
and / at least, so cannot work with UTF-16 file names. <br> On
<a
href="http://developer.apple.com/technotes/tn2002/tn2078.html">MacOS
X</a> and Windows the encoding of file names depends on the file
system: the modern file systems use UCS-2 file names.<br>
</li>
<li>Graphical text output. This boils down to either selecting
suitable fonts or converting to the encoding of the fonts. I
suspect that under Windows a 2-byte encoding would be used, and X
servers can make use of ISO10646-1 fonts but the present device
would need its font-handling <a
href="http://www.debian.org/doc/manuals/intro-i18n/ch-output.en.html#s-output-x-xlib">internationalized</a>.<br>
</li>
<li>Text manipulation, for example <samp>match</samp> and
<samp>grep</samp> and <samp>tolower</samp>. For some of these,
UTF-8 versions are readily available; others we would have to
rewrite. A lucky few, like <samp>match</samp>, would work
directly on the encoded strings. For PCRE, only UTF-8 is available as
there is no wide-character version, whereas the GNU <samp>regex</samp>
has a wide-character version but not a UTF-8 one (and according to <a
href="http://www.cl.cam.ac.uk/%7Emgk25/unicode.html#mod">Markus
Kuhn</a> is 100x slower as a result).<br> String collation is also an
issue in a few places, but <samp>strcoll</samp> should be UTF-8-aware
on suitable OSes.<br> Most widespread is the use of
<samp>snprintf</samp>, <samp>strncmp</samp> and simple loops to, e.g.,
map <samp>\</samp> to <samp>/</samp> (and the latter is fine as no
ASCII character can occur as part of a multibyte sequence). The use of
classification types such as <samp>isalpha</samp> would need
replacement (and probably coercion to wide characters would be
easiest).
<samp>substr(ing)</samp> and <samp>strsplit</samp> will need to be
aware of character boundaries. Note that Unicode has three
cases, not two (the extra one being `title').<br>
</li>
<li>We would need to support the <samp>\uxxxx</samp> format for
arbitrary Unicode characters.</li>
<li>The format for the distribution of R sources. Fortunately
only a few files are not in ASCII: some <samp>.Rd</samp> files, as
well as the <samp>THANKS</samp> file.<br>
</li>
<li>Help files. Most modern Web browsers can display UTF-8, and Perl
5.8 is apparently aware of UTF-8 (and uses it internally) so it
<em>may</em> be fairly easy to make use of our existing
<samp>Rdconv</samp>. I have just added a
<samp>charset=iso-8859-1</samp> to the header of the converted HTML
help files, and this would need to be changed. Since LaTeX cannot
handle Unicode we would have to convert the encoding of latex help
files or use Lambda (and tell it they were in UTF-8).<br>
</li>
<li>Environment variables could have both names and values in UTF-8.<br>
</li>
</ul>
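The <samp>\uxxxx</samp> escapes were in fact added in a later R; a
brief sketch of how they behave there, together with the companion
functions <samp>intToUtf8()</samp> and <samp>utf8ToInt()</samp>:

```r
## \uxxxx denotes the Unicode character with the given hex code point;
## intToUtf8() and utf8ToInt() convert between code points and strings.
x <- "\u20ac"                     # the Euro sign
utf8ToInt(x)                      # 8364 == 0x20AC
identical(intToUtf8(0x20ac), x)   # TRUE: round trip through the code point
nchar(x)                          # 1 character, however many bytes are used
```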
The API for extending R would be problematical. There are a few
hundred R extensions written in C and FORTRAN, and a few of them
manipulate character vectors. They would not be expecting UTF-8
encoding (and probably have not thought about encodings at all).
Possible ways forward are<br>
<ul>
<li>To map to a single-byte encoding (Latin1?) and back again when .C
does
the copying.</li>
<li>Just to pass through the stream of bytes.</li>
</ul>
This does raise the issue of whether the <samp>CHAR</samp> internal
type should be used for UTF-8 or a new type created. It would
probably be better to create a new type for raw bytes.<br>
<br>
Eiji Nakama's <a
href="http://www.stat.auckland.ac.nz/%7Eihaka/Papers/Nakama.pdf">paper</a>
on `Japanizing R' seems to take the earlier multi-byte character
approach rather than UTF-8 or UCS-2, except for Windows fonts.
Functions such as <samp>isalpha</samp> do not work correctly in MBCSs
(including UTF-8).<br>
<br>
The Debian <a href="http://www.debian.org/doc/manuals/intro-i18n/">guide
to internationalization</a> is a useful background resource. Note
that internationalization is often abbreviated as 'i18n', and
localization (support for a particular locale) as 'L10n'. The
main other internationalization/localization issue is to allow for the
translation of messages (and to translate them).<br>
<br>
<h2><a name="Encodings_in_R_2.1.0"></a>Encodings in R 2.1.0</h2>
Work started in December 2004 on implementing UTF-8 support for R
2.1.0, expected to be released in April 2005. Currently
implemented are:
<ul>
<li>The parser has been made aware of multi-byte characters in UTF-8
and so works in character (rather than byte) units.
</li>
<li>An internationalized version of the regexp code. For the
basic and extended regexps we use the code from
<samp>glibc-2.3.3</samp>, which internally uses widechars and so supports all
multi-byte character sets, e.g. UTF-8. For the Perl versions
we use PCRE, which has UTF-8 (but not general MBCS) support
available.
</li>
<li>Replacement versions of <samp>chartr</samp>,
<samp>toupper</samp> and <samp>tolower</samp> work <em>via</em>
conversion to widechar and so handle any MBCS that the OS supports
as the current locale.
</li>
<li><samp>substr()</samp> and <samp>make.names()</samp> work with
characters not bytes.
</li>
<li><samp>nchar()</samp> has an additional argument to return the
number of bytes, the number of characters or the display
width. It was often used in conjunction with
<samp>substr()</samp> to truncate character strings: that
should be done in terms of display width for which there is a
new function <samp>strtrim()</samp>.
</li>
<li>A new function <samp>iconv()</samp> allows character vectors to
be mapped between encodings (where it is available: <a
href="http://www.gnu.org/software/libiconv/">GNU libiconv</a> has
been grafted on for the Windows build).
</li>
<li>The '<samp>encoding</samp>' argument of connections has been
changed from a numeric vector to a character string naming an
encoding that <samp>iconv</samp> knows about, and re-encoding on the
fly can now be done on both input and output. Note that this
does not apply to the 'terminal' connections nor to text connections,
but does apply to all file-like connections. If input is redirected from
a file (or pipe), the input encoding can be specified by the
command-line flag <samp>--encoding</samp>.
</li>
<li>The <samp>postscript()</samp> and <samp>pdf()</samp> devices
handle UTF-8 strings by remapping to Latin1 (this is currently
hardcoded).
</li>
<li>A start has been made on converting the <samp>X11()</samp>
device and the X11-based data editor using Nakama's Japanization
patches, and on adding X input methods to the data editor, so it
does now work in a (Western) UTF-8 locale.
</li>
<li><samp>scan()</samp> needs single-byte chars for its decimal,
comment and separator characters -- this is now enforced. It still
uses <samp>isspace</samp> and <samp>isdigit</samp>, so only ASCII
space and digit chars are recognized (but this seems a minor
problem).
</li>
<li><samp>abbreviate()</samp> is a problem: its algorithm is
hardcoded for English (e.g. which bytes are vowels) and it now warns
if given non-ASCII text.
</li>
<li><samp>print()</samp>ing looks for valid characters and only
escapes non-printable characters (rather than bytes). It does so
by converting to widechars and using the <samp>wctype</samp> functions
in the current locale.
</li>
<li>UTF-8 strings are passed to and from the <samp>tcltk</samp>
package (this applies in any MBCS).
</li>
<li>
There is some support for <samp>pch=n</samp> > 127 and
<samp>pch="c"</samp> in UTF-8 locales, where a number is taken to be
the Unicode character number, and the first MBCS character is
taken.
</li>
<li>
The replacement for <samp>strptime</samp> has been rewritten to work
a character at a time, using widechars internally.
</li>
<li>
The Hershey fonts are encoded in Latin-1, so the
<samp>vfont</samp> support has been rewritten to re-encode to
Latin-1.
</li>
<li>
A new function <samp>localeToCharset</samp> attempts to deduce
plausible character sets from the locale name (on Unix and on
Windows). This is used by <samp>source</samp> to test out plausible
encodings if the (new) argument <samp>encoding = "unknown"</samp> is
specified.
</li>
<li>
<samp>.Rd</samp> has a new directive <samp>\encoding{}</samp> to set
the encoding to be assumed for the file and hence its HTML
translation (and also this is given as a comment in the example
file). Note that one has to be careful here, as some
implementations of <samp>iconv</samp> do not allow any 8-bit chars
in the <samp>C</samp> locale, and the lack of standards for charset
names is also a problem.
</li>
<li>
The Windows console and data editor have been modified to work with
MBCS character sets, as well as having support for double-width
characters.
</li>
<li>
<samp>readChar</samp> and <samp>writeChar</samp> work in characters
not bytes.
</li>
<li>
<samp>.C</samp> supports a new argument <samp>ENCODING=</samp> to
specify the encoding expected for character strings.
</li>
<li>
<samp>delimMatch</samp> (<samp>tools</samp>) returns the position and
match length in characters not bytes, and allows multi-byte delimiters.
</li>
</ul>
For many of these features R needs to be configured with
<samp>--enable-utf8</samp>.
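Two of these features, <samp>iconv()</samp> and re-encoding on
connections, can be sketched together. This assumes a later R with a
UTF-8 (or other Latin-1-capable) native encoding and an
<samp>iconv</samp> that knows the encoding names used:

```r
## iconv(): map a character vector between encodings.
x <- "caf\u00e9"                 # "cafe" with e-acute, stored as UTF-8
y <- iconv(x, from = "UTF-8", to = "latin1")
length(charToRaw(x))             # 5 bytes: e-acute takes 2 bytes in UTF-8
length(charToRaw(y))             # 4 bytes: it is a single byte in Latin-1

## Re-encoding on a connection: write a Latin-1 file, read it back.
tmp <- tempfile()
con <- file(tmp, open = "w", encoding = "latin1")
writeLines(x, con)               # converted to Latin-1 on output
close(con)
con <- file(tmp, open = "r", encoding = "latin1")
z <- readLines(con)              # converted back to the native encoding
close(con)
z == x                           # TRUE
unlink(tmp)
```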
<h3>Implementation details</h3>
The C code often handles character strings as a whole. We have
identified the following places where character-level access is used:
<ul>
<li>In the parser to identify tokens. (<samp>gram.y</samp>)</li>
<li><samp>do_nchar, do_substr, do_substrgets, do_strsplit,
do_abbrev, do_makenames, do_grep, do_gsub, do_regexpr, do_tolower,
do_chartr, do_agrep, do_strtrim</samp> (<samp>character.c</samp>),
<samp>do_pgrep, do_pgsub, do_pregexpr</samp>
(<samp>pcre.c</samp>)</li>
<li><samp>GEText, GEMetricInfo</samp>. (<samp>engine.c</samp>)</li>
<li><samp>RenderStr</samp>. (<samp>plotmath.c</samp>)</li>
<li><samp>RStrlen</samp>, <samp>EncodeString</samp>
(<samp>printutil.c</samp>)</li>
<li>The dataentry editor (various functions in <samp>dataentry.c</samp>)</li>
<li>Graphics devices in handling encoded text, and in metric
info. (Currently <samp>devX11.c, rotated.c</samp> and
<samp>devPS.c</samp> have been changed, and <samp>devPicTeX.c</samp>
is tied to TeX which is a byte-based program.)
</li>
<li>
The ASCII versions of <samp>load</samp> and <samp>save</samp>. As
these are a reversible representation of objects in ASCII, it does
not matter if they are handled as byte streams.
</li>
<li>
New wrapper functions <samp>Rf_strchr</samp>,
<samp>Rf_strrchr</samp> and <samp>R_fixslash</samp> cover
comparisons with single ASCII characters.<br><br>
<samp>backquotify</samp> (<samp>deparse.c</samp>),
<samp>do_dircreate</samp> (<samp>platform.c</samp>), <samp>do_getwd</samp>,
<samp>do_basename</samp>, <samp>do_dirname</samp> and
<samp>isBlankString</samp> (<samp>util.c</samp>)
are now MBCS-aware.
</li>
</ul>
<p>
There are many other places which do a comparison with a single ASCII
character (such as . or / or \ or LF) and so cause no problem in UTF-8
but might in other MBCSs. These include <samp>filbuf</samp>
(<samp>platform.c</samp>, which looks for CR and LF and these seem
safe), <samp>fillBuffer</samp> (<samp>scan.c</samp>) and there are
others.
<p>
Encodings which are likely to cause problems include
<ul>
<li> Vietnamese (VISCII). This uses 186 characters including the
control characters <samp>0x02, 0x05, 0x06, 0x14, 0x19, 0x1e</samp>:
the Windows GUI makes use of these as control characters.
</li>
<li>Big5, GBK, Shift-JIS. These are all 1- or 2-byte encodings
including ASCII as 1-byte chars (except that Shift-JIS replaces
backslash by ¥) but whose second byte overlaps the ASCII range.
</li>
</ul>
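The overlap can be demonstrated with a later R's <samp>iconv()</samp>
(assuming the system <samp>iconv</samp> accepts the name
<samp>SHIFT_JIS</samp>). U+8868 is the classic example: its trail byte
is the ASCII code for backslash, so naive byte-wise searches for
<samp>\</samp> can fire in the middle of a character:

```r
## U+8868, a common CJK character, is 0x95 0x5C in Shift-JIS;
## the second byte 0x5C is ASCII backslash.
x <- iconv("\u8868", from = "UTF-8", to = "SHIFT_JIS")
charToRaw(x)   # 95 5c
```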
<p> <samp>fillBuffer</samp> (<samp>scan.c</samp>) has now been
rewritten to be aware of double-byte character sets and to only test
the lead byte.
<h3>Windows</h3>
Windows does things somewhat differently. `Standard' versions of
Windows have only single-byte locales, with the interpretation of
those bytes being determined by <em>code pages</em>. However, `East
Asian' versions (an optional install at least on Windows XP) use
double-byte locales in which characters can be represented by one or
two bytes (and can be one or two columns wide).
Windows also has `Unicode' (UCS-2) applications in which all
information is transferred as 16-bit wide characters, and the locale
does not affect the interpretation. Windows 2000 and later have
optional support for surrogate pairs (UTF-16) but this is not normally
enabled. (See <a
href="http://www.i18nguy.com/surrogates.html">here</a> for how to
enable it.)
Currently R-devel has four levels of MBCS support under Windows.
<ul>
<li> By default, all character strings are interpreted as single bytes.
</li>
<li> If <samp>SUPPORT_MBCS</samp> is defined in <samp>MkRules</samp>
and in <samp>config.h</samp>, <samp>R.dll</samp> will recognize
multi-byte characters if run in an MBCS locale and generally (but not
always, notably in <samp>scan</samp>) treat them as whole units.
</li>
<li> If in addition <samp>SUPPORT_GUI_MBCS</samp> is defined in
<samp>MkRules</samp>, <samp>RGui</samp> is compiled to be aware of
multi-byte characters if run in an MBCS locale, and cursor movements
will work in whole characters, with the cursor width adapting to the
current character's width.
</li>
<li> If <samp>SUPPORT_UTF8</samp> is defined in addition to
<samp>SUPPORT_MBCS</samp>, most of <samp>R.dll</samp> will assume it
is running in a UTF-8 locale. As there are no such locales under
Windows, this is only useful with a custom front-end that
communicates in UTF-8 (and even then there are issues with file
names and content, and environment variables).
</li>
</ul>
<h2>Localization of messages</h2>
As from 2005-01-25, R uses GNU <samp>gettext</samp> where available.
So far only the start-up message is marked for translation, as a
proof-of-concept: there are several thousand C-level messages that
could potentially be translated.
The same mechanism could be applied to R packages, provided they call
<samp>dgettext</samp> with a <samp>PACKAGE</samp> specific to the
package, and install their own <samp>PACKAGE.mo</samp> files, say via
an <samp>inst/po</samp> directory. The <samp>splines</samp> package
has been converted to show how this might be done: it only has one
error message.
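At R level the same machinery is exposed through
<samp>gettext()</samp> and <samp>gettextf()</samp> (added around this
time); a sketch, where the domain name <samp>R-mypkg</samp> is purely
illustrative:

```r
## gettextf() looks up a translation of the format string for the
## current language, then applies sprintf-style formatting; with no
## translation installed it falls back to the original English text.
## The domain "R-mypkg" is a made-up example.
msg <- gettextf("%d variables were dropped", 3L, domain = "R-mypkg")
msg   # "3 variables were dropped"
```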
<br><br>
Brian Ripley<br>
2004-01-11, 2005-01-25<br>
</p>
</body>
</html>