
<html>

<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<title>Collation Attributes</title>
</head>

<body bgcolor="#FFFFFF">

<h2>Collation Attributes</h2>
<h2><i>(Draft 2003-02-18, MED)</i></h2>
<p>When collating or matching text, a number of attributes can be used to affect
the desired result. The following describes the attributes, their values, their
effects, their normal usage, and the string comparison performance and sort key
length implications. It also includes single-letter abbreviations for both the
attributes and their values. These abbreviations allow a 'short-form'
specification of a set of collation options, such as
&quot;UCA4.0.0_AS_LSV_S&quot;, which can be used to specify that the desired
options are: UCA version 4.0.0; ignore spaces, punctuation and symbols; use
Swedish linguistic conventions; compare case-insensitively.</p>
<p>A number of attribute values are common across different attributes; these
include <b>Default</b> (abbreviated as D), <b>On</b> (O), and <b>Off</b> (X).
Unless otherwise stated, the examples use the UCA alone with default settings.</p>
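<p>As a rough illustration of how such a short-form string could be taken
apart, the following Python sketch splits a specification into attribute/value
pairs. The grammar is an assumption, since this document does not define one:
items are separated by underscores, an item beginning with &quot;UCA&quot; gives
the version, and otherwise the first letter is the attribute abbreviation and
the remainder is its value. The name <code>parse_short_form</code> is
hypothetical.</p>

```python
# Attribute abbreviations taken from the table in this document.
ATTRIBUTES = {
    "L": "Language", "R": "Region", "V": "Variant",
    "S": "Strength", "K": "Case_Level", "C": "Case_First",
    "A": "Alternate", "T": "Variable_Top", "N": "Normalization",
    "F": "French", "H": "Hiragana",
}


def parse_short_form(spec):
    """Split a short-form string into a dict of attribute name -> raw value.

    Assumed grammar (not specified by the document): items are joined by
    '_', an item starting with 'UCA' is the version, and otherwise the
    first character is the attribute abbreviation and the rest its value.
    """
    options = {}
    for item in spec.split("_"):
        if not item:
            continue  # tolerate doubled or trailing underscores
        if item.startswith("UCA"):
            options["UCA_Version"] = item[3:]
            continue
        name = ATTRIBUTES.get(item[0])
        if name is None:
            raise ValueError(f"unknown attribute abbreviation: {item[0]!r}")
        options[name] = item[1:]
    return options
```

<p>Under these assumptions, the hypothetical input
<code>&quot;UCA4.0.0_AS_LSV_S2&quot;</code> would yield the version 4.0.0 together
with Alternate=S, Language=SV, and Strength=2.</p>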
<h3>Main References</h3>
<ul>
  <li>For a full list of supported locales in ICU, see <a href="http://oss.software.ibm.com/cgi-bin/icu/lx">Locale
    Explorer</a>, which also contains an on-line demo showing sorting for each
    locale. The demo allows you to try different attribute values, to see how
    they affect sorting.</li>
  <li>To see tabular results for different locales in ICU (with the tailored
    characters marked), see the <a href="http://oss.software.ibm.com/icu/charts/collation/">ICU
    Collation Charts</a>. For a view of the UCA table itself, see the <a href="http://www.unicode.org/charts/collation/">Unicode
    Collation Charts</a>.</li>
  <li>For the UCA specification, see <a href="http://www.unicode.org/reports/tr10/">UTS
    #10: Unicode Collation Algorithm</a>.</li>
  <li>For more detail on the precise effects of these options, see <a href="http://oss.software.ibm.com/icu/userguide/Collate_Customization.html">Collation
    Customization</a>.</li>
</ul>
<table border="1" cellspacing="0" cellpadding="4">
  <tr>
    <th align="left" valign="top">Attribute</th>
    <th align="left" valign="top">Ab.</th>
    <th align="left" valign="top">Possible Values</th>
    <th align="left" valign="top">Description</th>
  </tr>
  <tr>
    <td valign="top">Locale</td>
    <td valign="top">L</td>
    <td valign="top"><i>&lt;locale&gt;</i></td>
    <td valign="top">The Locale attribute is typically the most important
      attribute for correct sorting and matching, according to the user
      expectations in different countries and regions. The default UCA ordering
      will only sort a few languages such as English and Italian correctly
      (&quot;correctly&quot; meaning according to the normal expectations for
      users of the languages). For other languages, a locale needs to be
      supplied so as to choose a collator that is correctly <i>tailored</i>
      for that locale.
      <p>The choice of a locale will automatically preset the values for all of
      the attributes to something that is reasonable for that locale. Thus most
      of the time the other attributes do not need to be explicitly set. In some
      cases, the choice of locale will make a difference in string comparison
      performance and/or sort key length.</p>
      <p>In short attribute names, a locale of the form <i>&lt;language&gt;_&lt;region&gt;_&lt;variant&gt;</i>
      is represented by the following items:</p>
      <ul>
        <li><i>L&lt;language&gt;</i></li>
        <li><i>R&lt;region&gt;</i></li>
        <li><i>V&lt;variant&gt;</i></li>
      </ul>
      <p>If no language, region, or variant is selected, the collator will use
      the default UCA ordering. <a href="http://oss.software.ibm.com/cgi-bin/icu/lx">Locale
      Explorer</a> shows the languages, regions, and variants that ICU supports,
      and provides a demo of how they will differ in terms of sorted output.</p>
      <table border="0" cellspacing="0" cellpadding="4">
        <tr>
          <td><i>Example:</i>&nbsp;</td>
          <td>Locale=&quot;sv&quot; (Swedish)</td>
          <td>&quot;Kypper&quot; &lt; &quot;Köpfe&quot;</td>
        </tr>
        <tr>
          <td></td>
          <td>Locale=&quot;de&quot; (German)</td>
          <td>&quot;Köpfe&quot; &lt; &quot;Kypper&quot;</td>
        </tr>
      </table>
    </td>
  </tr>
  <tr>
    <td valign="top">Strength</td>
    <td valign="top">S</td>
    <td valign="top">1, 2, 3, 4, I, D</td>
    <td valign="top">The Strength attribute determines whether accents or case
      are taken into account when collating or matching text. (In writing
      systems without case or accents, it controls similarly important
      features.) The default strength setting usually does not need to be
      changed for collating (sorting), but often needs to be changed when <i>matching</i>
      (e.g. SELECT). The possible values include Default (D), Primary (1),
      Secondary (2), Tertiary (3), Quaternary (4), and Identical (I).
      <p>For example, people may choose to ignore case, or to ignore both
      accents and case, when searching for text:</p>
      <ul>
        <li>to ignore accents and case, use Strength=Primary (1);</li>
        <li>to ignore case, use Strength=Secondary (2);</li>
        <li>to ignore neither accents nor case, use Strength=Tertiary (3).</li>
      </ul>
      <p>Almost all characters are distinguished by the first three levels, and
      in most locales the default value is thus Tertiary. However, if Alternate
      is set to be Shifted, then the Quaternary strength (4) can be used to
      break ties among whitespace, punctuation, and symbols that would otherwise
      be ignored. If very fine distinctions among characters are required, then
      the Identical strength (I) can be used. For example, Identical strength
      distinguishes between the <span style="font-variant: small-caps">Mathematical
      Bold Small A</span> and the <span style="font-variant: small-caps">Mathematical
      Italic Small A</span>; for more examples, look at the cells with white
      backgrounds in the collation charts. However, using levels higher than
      Tertiary, especially the Identical strength, will result in
      significantly longer sort keys and slower string comparison performance
      for equal strings.</p>
      <table border="0" cellspacing="0" cellpadding="4">
        <tr>
          <td><i>Example:</i>&nbsp;</td>
          <td>S=1</td>
          <td>role = Role = rôle</td>
        </tr>
        <tr>
          <td></td>
          <td>S=2</td>
          <td>role = Role &lt; rôle</td>
        </tr>
        <tr>
          <td></td>
          <td>S=3</td>
          <td>role &lt; Role &lt; rôle</td>
        </tr>
      </table>
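      <p>The layering of the first three strength levels can be imitated with
      a toy comparison key. This is only a sketch for unaccented and accented
      Latin letters, not the UCA: it assumes that base letters stand in for
      primary weights, combining marks for secondary weights, and an
      uppercase flag for tertiary weights. The name <code>toy_key</code> is
      hypothetical.</p>

```python
import unicodedata


def toy_key(text, strength):
    """Build a simplified key truncated to the given strength (1, 2 or 3).

    Level 1 keeps base letters, level 2 adds accents, level 3 adds case.
    This is a toy model for Latin text, not the UCA.
    """
    decomposed = unicodedata.normalize("NFD", text)
    base = [c for c in decomposed if not unicodedata.combining(c)]
    primary = "".join(base).lower()
    secondary = tuple(c for c in decomposed if unicodedata.combining(c))
    tertiary = tuple(c.isupper() for c in base)
    # Earlier levels dominate later ones because tuples compare left to right.
    return (primary, secondary, tertiary)[:strength]
```

      <p>Comparing the keys of &quot;role&quot;, &quot;Role&quot; and
      &quot;rôle&quot; at strengths 1, 2 and 3 gives the same relations as the
      example table above.</p>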
    </td>
  </tr>
  <tr>
    <td valign="top">Case_Level</td>
    <td valign="top">K</td>
    <td valign="top">X, O, D</td>
    <td valign="top">
      <p align="left">The Case_Level attribute is used when ignoring accents <i>but
      not</i> case. In such a situation, set Strength to be Primary, and
      Case_Level to be On. In most locales, this setting is Off by default.
      There is a small string comparison performance and sort key impact if this
      attribute is set to be On.</p>
      <table border="0" cellspacing="0" cellpadding="4">
        <tr>
          <td><i>Example:</i>&nbsp;</td>
          <td>S=1, K=X</td>
          <td>role = Role = rôle</td>
        </tr>
        <tr>
          <td></td>
          <td>S=1, K=O</td>
          <td>role = rôle &lt; Role</td>
        </tr>
      </table>
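      <p>A minimal sketch of the same idea in Python: the key below keeps only
      a primary level (accents dropped) and optionally appends a case level.
      It is a toy model for Latin text under the assumption that an uppercase
      flag is an adequate stand-in for the case weight; the name
      <code>primary_key</code> is hypothetical.</p>

```python
import unicodedata


def primary_key(text, case_level=False):
    """Toy primary-strength key, optionally followed by a case level."""
    decomposed = unicodedata.normalize("NFD", text)
    # Dropping combining marks means accents are ignored entirely.
    base = [c for c in decomposed if not unicodedata.combining(c)]
    key = ("".join(base).lower(),)
    if case_level:
        # Lowercase (False) sorts before uppercase (True).
        key += (tuple(c.isupper() for c in base),)
    return key
```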
    </td>
  </tr>
  <tr>
    <td valign="top">Case_First</td>
    <td valign="top">C</td>
    <td valign="top">X, L, U, D</td>
    <td valign="top">
      <p align="left">The Case_First attribute is used to control whether
      uppercase letters come before lowercase letters or vice versa, in the
      absence of other differences in the strings. The possible values are
      Uppercase_First (U) and Lowercase_First (L), plus the standard Default and
      Off. There is almost no difference between the Off and Lowercase_First
      options in terms of results, so typically users will not use
      Lowercase_First: only Off or Uppercase_First. (People interested in the
      detailed differences between X and L should consult the <a href="http://oss.software.ibm.com/icu/userguide/Collate_Customization.html">Collation
      Customization</a>).</p>
      <p align="left">Specifying either L or U won't affect string comparison
      performance, but will affect the sort key length.</p>
      <table border="0" cellspacing="0" cellpadding="4">
        <tr>
          <td><i>Example:</i>&nbsp;</td>
          <td>C=X or C=L</td>
          <td>&quot;china&quot; &lt; &quot;China&quot; &lt; &quot;denmark&quot;
            &lt; &quot;Denmark&quot;</td>
        </tr>
        <tr>
          <td></td>
          <td>C=U</td>
          <td>&quot;China&quot; &lt; &quot;china&quot; &lt; &quot;Denmark&quot;
            &lt; &quot;denmark&quot;</td>
        </tr>
      </table>
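      <p>The effect on ordering can be sketched with a toy sort key in Python.
      The key compares the strings case-insensitively first and uses case
      only as a tie-breaker; it is an illustration under that assumption, not
      the ICU implementation, and <code>case_first_key</code> is a
      hypothetical name.</p>

```python
def case_first_key(text, upper_first=False):
    """Toy key: compare letters case-insensitively, then break ties by case."""
    # False sorts before True, so flag the case that should come second.
    case_flags = tuple(c.isupper() != upper_first for c in text)
    return (text.lower(), case_flags)


words = ["Denmark", "china", "denmark", "China"]
lower_first = sorted(words, key=case_first_key)
upper_first = sorted(words, key=lambda w: case_first_key(w, upper_first=True))
```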
    </td>
  </tr>
  <tr>
    <td valign="top">Alternate</td>
    <td valign="top">A</td>
    <td valign="top">N, S, D</td>
    <td valign="top">The Alternate attribute is used to control the handling of
      the so-called <i>variable</i> characters in the UCA: whitespace,
      punctuation and symbols. If Alternate is set to Non-Ignorable (N), then
      differences among these characters are of the same importance as
      differences among letters. If Alternate is set to Shifted (S), then these
      characters are of only minor importance. The Shifted value is often used
      in combination with Strength set to Quaternary. In such a case,
      white-space, punctuation, and symbols are considered when comparing
      strings, but only if all other aspects of the strings (base letters,
      accents, and case) are identical. If Alternate is not set to Shifted, then
      there is no difference between a Strength of 3 and a Strength of 4.
      <p>For more information and examples, see <a href="http://www.unicode.org/reports/tr10/#Variable_Weighting">Variable_Weighting</a>
      in the UCA. The reason the Alternate values are not simply On and Off is
      that additional Alternate values may be added in the future. The UCA
      option <b>Blanked</b> is expressed with Strength set to 3, and Alternate
      set to Shifted.</p>
      <p>The default for most locales is Non-Ignorable. If Shifted is selected,
      it may be slower if there are many strings that are the same except for
      punctuation; sort key length will not be affected unless the strength
      level is also increased.</p>
      <table border="0" cellspacing="0" cellpadding="4">
        <tr>
          <td><i>Example:</i>&nbsp;</td>
          <td>S=3, A=N</td>
          <td>di Silva &lt; Di Silva &lt; diSilva &lt; U.S.A. &lt; USA</td>
        </tr>
        <tr>
          <td></td>
          <td>S=3, A=S</td>
          <td>di Silva = diSilva &lt; Di Silva &lt; U.S.A. = USA</td>
        </tr>
        <tr>
          <td></td>
          <td>S=4, A=S</td>
          <td>di Silva &lt; diSilva &lt; Di Silva &lt; U.S.A. &lt; USA</td>
        </tr>
      </table>
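      <p>The three rows of the example can be reproduced with a toy key in
      Python. The sketch makes several simplifying assumptions that are not
      part of the UCA: only ASCII whitespace and punctuation count as
      variable, code point order stands in for collation weights, and accents
      are not modeled. The names <code>toy_key</code>, <code>VARIABLE</code>
      and <code>HIGH</code> are hypothetical.</p>

```python
import string

# Assumption: only ASCII whitespace and punctuation are "variable".
VARIABLE = set(string.whitespace + string.punctuation)
# Stand-in for the high quaternary weight given to non-variable characters.
HIGH = 0x110000


def toy_key(text, shifted, strength=3):
    """Toy key with a primary level, a case level and an optional 4th level."""
    # Shifted: variable characters are dropped from the first levels.
    kept = [c for c in text if not (shifted and c in VARIABLE)]
    primary = "".join(kept).lower()
    case = tuple(c.isupper() for c in kept)
    key = (primary, case)
    if shifted and strength >= 4:
        # Variable characters come back as a final tie-breaking level.
        key += (tuple(ord(c) if c in VARIABLE else HIGH for c in text),)
    return key
```

      <p>With <code>shifted=False</code> the space and the full stops take
      part in the primary comparison; with <code>shifted=True</code> they only
      matter once <code>strength</code> is raised to 4.</p>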
    </td>
  </tr>
  <tr>
    <td valign="top">Variable_Top</td>
    <td valign="top">T</td>
    <td valign="top"><i>&lt;string&gt;</i></td>
    <td valign="top">The Variable_Top attribute is only meaningful if the
      Alternate attribute is not set to Non-Ignorable. In such a case, it
      controls which characters count as ignorable. The string value specifies
      the &quot;highest&quot; character (in UCA order) that is to be
      considered ignorable.
      <p>Thus, for example, if a user wanted white-space to be ignorable, but
      not any visible characters, then s/he would use the value Variable_Top=&quot;\u0020&quot;
      (space). The string should only be a single character. All characters of
      the same primary weight are equivalent, so Variable_Top=&quot;\u3000&quot;
      (ideographic space) has the same effect as Variable_Top=&quot;\u0020&quot;.</p>
      <p>This setting (alone) has little impact on string comparison
      performance; setting it lower or higher will make sort keys slightly
      shorter or longer respectively.</p>
      <table border="0" cellspacing="0" cellpadding="4">
        <tr>
          <td><i>Example:</i>&nbsp;</td>
          <td>S=3, A=N</td>
          <td>di Silva &lt; diSilva &lt; U.S.A. &lt; USA</td>
        </tr>
        <tr>
          <td></td>
          <td>S=3, A=S</td>
          <td>di Silva = diSilva &lt; U.S.A. = USA</td>
        </tr>
        <tr>
          <td></td>
          <td>S=3, A=S, T=&quot; &quot;</td>
          <td>di Silva &lt; diSilva &lt; U.S.A. = USA</td>
        </tr>
      </table>
    </td>
  </tr>
  <tr>
    <td valign="top">Normalization</td>
    <td valign="top">N</td>
    <td valign="top">X, O, D</td>
    <td valign="top">
      <p align="left">The Normalization setting determines whether text is
      thoroughly normalized or not in comparison. Even if the setting is off
      (which is the default for many locales), text as represented in common
      usage will compare correctly (for details, see <a href="http://www.unicode.org/notes/tn5/">UTN
      #5</a>). Only if the accent marks are in non-canonical order will there be
      a problem. If the setting is On, then the best results are guaranteed for
      all possible text input.</p>
      <p align="left">There is a medium string comparison performance cost if
      this attribute is On, depending on the frequency of sequences that require
      normalization. There is no significant effect on sort key length.</p>
      <table border="0" cellspacing="0" cellpadding="4">
        <tr>
          <td><i>Example:</i>&nbsp;</td>
          <td>N=X</td>
          <td><font face="Arial Unicode MS">ä = a + &#9676;&#776; &lt; ä
            + &#9676;&#803; &lt; &#7841; + &#9676;&#776;</font></td>
        </tr>
        <tr>
          <td></td>
          <td>N=O</td>
          <td><font face="Arial Unicode MS">ä = a + &#9676;&#776; &lt; ä
            + &#9676;&#803; = &#7841; + &#9676;&#776;</font></td>
        </tr>
      </table>
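      <p>The canonical-order problem can be seen directly with Python's
      standard <code>unicodedata</code> module. The sketch below does not
      perform collation; it only shows that two sequences with the accent
      marks in different order are distinct as code points but become
      identical after canonical decomposition (NFD). The helper name
      <code>canonically_equal</code> is hypothetical.</p>

```python
import unicodedata


def canonically_equal(a, b):
    """Return True if two strings are identical after NFD normalization."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# a-diaeresis followed by a combining dot below
umlaut_then_dot = "\u00e4\u0323"
# a-with-dot-below followed by a combining diaeresis
dot_then_umlaut = "\u1ea1\u0308"

differ_as_code_points = umlaut_then_dot != dot_then_umlaut
same_after_nfd = canonically_equal(umlaut_then_dot, dot_then_umlaut)
```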
    </td>
  </tr>
  <tr>
    <td valign="top">French</td>
    <td valign="top">F</td>
    <td valign="top">X, O, D</td>
    <td valign="top">
      <p align="left">The French attribute controls whether accent differences
      are compared from the back of the string, as is the convention in French
      sorting. This attribute is automatically set to On for the
      French locales and a few others. Users normally would not need to
      explicitly set this attribute. There is a string comparison performance
      cost when it is set On, but sort key length is unaffected.</p>
      <table border="0" cellspacing="0" cellpadding="4">
        <tr>
          <td><i>Example:</i>&nbsp;</td>
          <td>F=X</td>
          <td>cote &lt; coté &lt; côte &lt; côté</td>
        </tr>
        <tr>
          <td></td>
          <td>F=O</td>
          <td>cote &lt; côte &lt; coté &lt; côté</td>
        </tr>
      </table>
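      <p>Backwards accent comparison can be sketched in Python as follows.
      This is a toy model, not the ICU implementation: it assumes that each
      base letter carries a (possibly empty) string of combining marks as its
      secondary weight, and that the French setting simply reverses that
      sequence before comparison. The name <code>accent_key</code> is
      hypothetical.</p>

```python
import unicodedata


def accent_key(text, french=False):
    """Toy key: base letters first, then per-letter accents (optionally reversed)."""
    base, accents = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if accents:  # a mark with no preceding base letter is ignored
                accents[-1] += ch
        else:
            base.append(ch)
            accents.append("")
    if french:
        accents.reverse()  # the last accent difference now decides first
    return ("".join(base), tuple(accents))


words = ["c\u00f4t\u00e9", "cot\u00e9", "c\u00f4te", "cote"]
forward = sorted(words, key=accent_key)
backward = sorted(words, key=lambda w: accent_key(w, french=True))
```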
    </td>
  </tr>
  <tr>
    <td valign="top">Hiragana</td>
    <td valign="top">H</td>
    <td valign="top">X, O, D</td>
    <td valign="top">
      <p align="left">Compatibility with JIS X 4061 requires the introduction of
      an additional level to distinguish Hiragana and Katakana characters. If
      compatibility with that standard is required, then this attribute should
      be set On, and the strength set to Quaternary. This will affect sort key
      length and string comparison performance.</p>
      <table border="0" cellspacing="0" cellpadding="4">
        <tr>
          <td><i>Example:</i>&nbsp;</td>
          <td>H=X, S=4</td>
          <td>&#12365;&#12421;&#12358; = &#12461;&#12517;&#12454; &lt;
            &#12365;&#12422;&#12358; = &#12461;&#12518;&#12454;</td>
        </tr>
        <tr>
          <td></td>
          <td>H=O, S=4</td>
          <td>&#12365;&#12421;&#12358; &lt; &#12461;&#12517;&#12454; &lt;
            &#12365;&#12422;&#12358; &lt; &#12461;&#12518;&#12454;</td>
        </tr>
      </table>
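      <p>The extra level can be imitated with a toy key in Python. The sketch
      assumes that folding Katakana onto Hiragana by a fixed code point offset
      is an adequate stand-in for the shared lower-level weights, and that a
      per-character script flag stands in for the additional level; it does
      not implement JIS X 4061. The name <code>kana_key</code> is
      hypothetical.</p>

```python
def kana_key(text, hiragana_level=False):
    """Toy key: kana folded to Hiragana, optionally followed by a script level."""
    folded, is_katakana = [], []
    for ch in text:
        code = ord(ch)
        # Katakana U+30A1..U+30F6 mirrors Hiragana U+3041..U+3096.
        katakana = 0x30A1 <= code <= 0x30F6
        folded.append(chr(code - 0x60) if katakana else ch)
        is_katakana.append(katakana)
    key = ("".join(folded),)
    if hiragana_level:
        # Hiragana (False) sorts before Katakana (True).
        key += (tuple(is_katakana),)
    return key
```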
    </td>
  </tr>
</table>
<h3>Summary of Value Abbreviations:</h3>
<table border="1" cellspacing="0" cellpadding="4">
  <tr>
    <th align="left">Value</th>
    <th align="left">Abb.</th>
  </tr>
  <tr>
    <td>Default</td>
    <td>D</td>
  </tr>
  <tr>
    <td>On</td>
    <td>O</td>
  </tr>
  <tr>
    <td>Off</td>
    <td>X</td>
  </tr>
  <tr>
    <td>Primary</td>
    <td>1</td>
  </tr>
  <tr>
    <td>Secondary</td>
    <td>2</td>
  </tr>
  <tr>
    <td>Tertiary</td>
    <td>3</td>
  </tr>
  <tr>
    <td>Quaternary</td>
    <td>4</td>
  </tr>
  <tr>
    <td>Identical</td>
    <td>I</td>
  </tr>
  <tr>
    <td>Shifted</td>
    <td>S</td>
  </tr>
  <tr>
    <td>Non-Ignorable</td>
    <td>N</td>
  </tr>
  <tr>
    <td>Lowercase_First</td>
    <td>L</td>
  </tr>
  <tr>
    <td>Uppercase_First</td>
    <td>U</td>
  </tr>
</table>
<h3>Space Padding</h3>
<p>In many database products, fields are padded with null. To get correct
results, the input to a Collator should omit any superfluous trailing padding
spaces. The problem arises with contractions, expansions, or normalization.
Suppose that there are two fields, one containing &quot;aed&quot; and the other
with &quot;äd&quot;. A traditional German sort will compare &quot;ä&quot; as
if it were &quot;ae&quot; (on a primary level), so the order will be &quot;äd&quot;
&lt; &quot;aed&quot;. But if both fields are padded with spaces to a length of
4, then this will reverse the order, since &quot;äd&quot; will compare as if it
were one character longer. In other words, when you start with strings 1 and 2</p>
<table border="1">
  <tr>
    <td width="5%">1.</td>
    <td width="10%" align="center">a</td>
    <td width="10%" align="center">e</td>
    <td width="10%" align="center">d</td>
    <td width="10%" align="center">&lt;space&gt;</td>
  </tr>
  <tr>
    <td width="5%">2.</td>
    <td align="center">ä</td>
    <td align="center">d</td>
    <td align="center">&lt;space&gt;</td>
    <td align="center">&lt;space&gt;</td>
  </tr>
</table>
<p>they end up being compared on a primary level as if they were 1' and 2'</p>
<table border="1">
  <tr>
    <td width="5%">1'.</td>
    <td width="10%" align="center">a</td>
    <td width="10%" align="center">e</td>
    <td width="10%" align="center">d</td>
    <td width="10%" align="center">&lt;space&gt;</td>
    <td width="10%" align="center">&nbsp;</td>
  </tr>
  <tr>
    <td width="5%">2'.</td>
    <td align="center">a</td>
    <td align="center">e</td>
    <td align="center">d</td>
    <td align="center">&lt;space&gt;</td>
    <td align="center">&lt;space&gt;</td>
  </tr>
</table>
<p>Since 2' has an extra character (the extra space), it counts as having a
primary difference when it shouldn't. The correct result occurs when the
trailing padding spaces are removed, as in 1&quot; and 2&quot;</p>
<table border="1">
  <tr>
    <td width="5%">1&quot;.</td>
    <td width="10%" align="center">a</td>
    <td width="10%" align="center">e</td>
    <td width="10%" align="center">d</td>
  </tr>
  <tr>
    <td width="5%">2&quot;.</td>
    <td align="center">a</td>
    <td align="center">e</td>
    <td align="center">d</td>
  </tr>
</table>
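<p>The effect can be sketched in Python with a toy primary key that expands
&quot;ä&quot; to &quot;ae&quot;, as in the traditional German example above. The
helpers <code>toy_primary_key</code> and <code>strip_padding</code> are
hypothetical, and the choice of space and null as padding characters is an
assumption; a real application would pass the stripped strings to its
collator.</p>

```python
def toy_primary_key(text):
    """Toy primary key: expand a-diaeresis to 'ae' (traditional German)."""
    return text.replace("\u00e4", "ae")


def strip_padding(field, pad_chars=" \x00"):
    """Remove trailing padding characters before the field is collated."""
    return field.rstrip(pad_chars)


field_1 = "aed "        # "aed" padded to length 4
field_2 = "\u00e4d  "   # a-diaeresis + "d" padded to length 4

# Padded: the expansion makes the second field one character longer.
padded_equal = toy_primary_key(field_1) == toy_primary_key(field_2)
# Stripped: both fields have the same primary key.
stripped_equal = (toy_primary_key(strip_padding(field_1))
                  == toy_primary_key(strip_padding(field_2)))
```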
<h3>Additional References</h3>
<ul>
  <li><a href="http://oss.software.ibm.com/icu/apiref/ucol_8h.html#a66">http://oss.software.ibm.com/icu/apiref/ucol_8h.html#a66</a></li>
  <li><a href="http://oss.software.ibm.com/icu/apiref/ucol_8h.html#a65">http://oss.software.ibm.com/icu/apiref/ucol_8h.html#a65</a></li>
  <li><a href="http://oss.software.ibm.com/icu/docs/">http://oss.software.ibm.com/icu/docs/</a>
    (see &quot;Collation in ICU&quot;)</li>
  <li><a href="http://oss.software.ibm.com/icu/userguide/Collate_Intro.html">http://oss.software.ibm.com/icu/userguide/Collate_Intro.html</a></li>
  <li><a href="http://oss.software.ibm.com/icu/apiref/">http://oss.software.ibm.com/icu/apiref/</a>
    (see Collator, StringSearch)</li>
  <li><a href="http://oss.software.ibm.com/icu4j/doc/">http://oss.software.ibm.com/icu4j/doc/</a>
    (see Collator, Collation*, <a href="http://oss.software.ibm.com/icu4j/doc/com/ibm/icu/text/SearchIterator.html" target="classFrame">SearchIterator</a>)</li>
  <li><a href="http://oss.software.ibm.com/icu/charts/collation/">ICU Collation
    Charts</a></li>
  <li><a href="http://www.unicode.org/charts/collation/">Unicode Collation
    Charts</a></li>
</ul>

</body>

</html>