What is the state of Unicode support in GAP?

Markus Pfeiffer edited this page Oct 20, 2015 · 3 revisions

Unicode support in strings

Support for Unicode characters (better called code points to avoid confusion) and strings is provided in the GAPDoc package.

Since GAP predates the standardisation of Unicode, GAP strings have the following properties according to GAP's documentation:

A string is a dense list (see IsList (21.1-1), IsDenseList (21.1-2))
of characters (see IsChar (27.1-1)); thus strings are always homogeneous
(see IsHomogeneousList (21.1-3)).

According to Frank Lübeck (@frankluebeck), the author of the GAPDoc package:

On all currently supported operating systems there are 256 objects in GAP in the filter IsChar. They are in bijection with the integers in the range [0..255] (via CharInt and IntChar):

gap> List([0..255],CharInt);
gap> List(last,IntChar) = [0..255];
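
One consequence of this byte-sized character model is that a non-ASCII code point stored as UTF-8 occupies several entries of a GAP string. The following is a minimal sketch of that effect; the variable name `s` and the choice of the letter ö are only illustrative, and the bytes are written as octal escapes so that the example does not depend on the encoding of the file it is pasted into.

```gap
# The UTF-8 encoding of U+00F6 (o with diaeresis) is the two bytes
# 0xC3 0xB6, written here as octal escapes \303 \266.
s := "\303\266";

# A GAP string is a dense, homogeneous list of IsChar objects.
Print(IsString(s) and IsDenseList(s) and IsHomogeneousList(s), "\n");  # true
Print(ForAll(s, IsChar), "\n");                                        # true

# One code point, but two list entries, each a byte in [0..255].
Print(Length(s), "\n");          # 2
Print(List(s, IntChar), "\n");   # [ 195, 182 ]
```

So `Length` on a plain GAP string counts bytes, not code points.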

Some functionality for Unicode strings is provided by the GAPDoc package, documented in the GAPDoc Manual. This includes encoding and decoding support as well as translation functions into LaTeX and HTML.
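
As a rough illustration of the decoding and encoding support, the sketch below follows the style of the examples in the GAPDoc manual. It assumes that the GAPDoc package is available and uses its functions `Unicode`, `IntListUnicodeString` and `Encode`; the variable names are made up for this example.

```gap
LoadPackage("GAPDoc");

# Decode a Latin-1 byte string into a Unicode string object.
# "\366" is the octal escape for byte 246 (o with diaeresis in Latin-1).
ustr := Unicode("a and \366", "latin1");

# The code points as a list of integers.
codepoints := IntListUnicodeString(ustr);

# Re-encode as UTF-8; the result is an ordinary GAP string of bytes.
utf8 := Encode(ustr, "UTF-8");

# Code point 246 needs two bytes in UTF-8, so the byte string is one
# entry longer than the list of code points.
Print(Length(codepoints), " code points, ", Length(utf8), " bytes\n");
# prints: 7 code points, 8 bytes

# Encode also accepts other target names such as "LaTeX"; see the
# GAPDoc manual for the exact output conventions.
latex := Encode(ustr, "LaTeX");
```

Keeping text as a Unicode string object while working with it, and calling `Encode` only at output time, avoids mixing up byte counts and code point counts.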

Also, Frank Lübeck has gone through the effort of converting GAP's source files to UTF-8 encoding:

I have changed (almost) all files in the GAP distribution to UTF-8 encoding in 2010. So, you can do what you suggest in comments or printable strings. GAP also uses some heuristics to detect if it is running in a UTF-8 terminal and adjusts some viewing, printing and the display of help pages accordingly. Try, as an example,

?ClassMultiplicationCoefficient

in a UTF-8 terminal.

Unicode support in GAP source

Using Unicode in comments is possible, but may run into some of the problems listed for identifiers below. Unicode in identifiers in GAP code is not supported, and is even discouraged, for the following reasons:

  • Non-ASCII characters are usually not easily entered in an editor.
  • Non-ASCII characters may be displayed incorrectly or not at all, or may look (almost) the same as ASCII characters. For example, in your browser Α and A will most likely look (almost) the same, but they are in fact different code points.
  • Non-ASCII characters can be hard to copy-and-paste.
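
The look-alike problem can be made concrete with a small sketch using GAPDoc's Unicode strings. Greek capital alpha (U+0391) is taken here as an assumed example of a character that resembles the Latin capital A (U+0041); the code points are given as integers so that the example does not depend on how this page or your editor renders the characters. The variable names are hypothetical.

```gap
LoadPackage("GAPDoc");

latinA := Unicode([65]);    # U+0041 LATIN CAPITAL LETTER A
greekA := Unicode([913]);   # U+0391 GREEK CAPITAL LETTER ALPHA

# The two strings render almost identically but are not equal.
Print(latinA = greekA, "\n");                  # false

# They also differ in their UTF-8 byte length.
Print(Length(Encode(latinA, "UTF-8")), "\n");  # 1
Print(Length(Encode(greekA, "UTF-8")), "\n");  # 2
```

Two identifiers that differ only in such a character would look the same on screen while being different names.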