<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<title>Collation Attributes</title>
</head>
<body bgcolor="#FFFFFF">
<h2>Collation Attributes</h2>
<h2><i>(Draft 2003-02-18, MED)</i></h2>
<p>When collating or matching text, a number of attributes can be used to affect
the desired result. The following describes the attributes, their values, their
effects, their normal usage, and the string comparison performance and sort key
length implications. It also includes single-letter abbreviations for both the
attributes and their values. These abbreviations allow a 'short-form'
specification of a set of collation options, such as
"UCA4.0.0_AS_LSV_S", which can be used to specify that the desired
options are: UCA version 4.0.0; ignore spaces, punctuation and symbols; use
Swedish linguistic conventions; compare case-insensitively.</p>
<p>A number of attribute values are common across different attributes; these
include <b>Default</b> (abbreviated as D), <b>On</b> (O), and <b>Off</b> (X).
Unless otherwise stated, the examples use the UCA alone with default settings.</p>
<h3>Main References</h3>
<ul>
<li>For a full list of supported locales in ICU, see <a href="http://oss.software.ibm.com/cgi-bin/icu/lx">Locale
Explorer</a>, which also contains an on-line demo showing sorting for each
locale. The demo allows you to try different attribute values, to see how
they affect sorting.</li>
<li>To see tabular results for different locales in ICU (with the tailored
characters marked), see the <a href="http://oss.software.ibm.com/icu/charts/collation/">ICU
Collation Charts</a>. For a view of the UCA table itself, see the <a href="http://www.unicode.org/charts/collation/">Unicode
Collation Charts</a>.</li>
<li>For the UCA specification, see <a href="http://www.unicode.org/reports/tr10/">UTS
#10: Unicode Collation Algorithm</a>.</li>
<li>For more detail on the precise effects of these options, see <a href="http://oss.software.ibm.com/icu/userguide/Collate_Customization.html">Collation
Customization</a>.</li>
</ul>
<table border="1" cellspacing="0" cellpadding="4">
<tr>
<th align="left" valign="top">Attribute</th>
<th align="left" valign="top">Ab.</th>
<th align="left" valign="top">Possible Values</th>
<th align="left" valign="top">Description</th>
</tr>
<tr>
<td valign="top">Locale</td>
<td valign="top">L</td>
<td valign="top"><i>&lt;locale&gt;</i></td>
<td valign="top">The Locale attribute is typically the most important
attribute for correct sorting and matching, according to the user
expectations in different countries and regions. The default UCA ordering
sorts only a few languages, such as English and Italian, correctly
("correctly" meaning according to the normal expectations of users of
those languages). For any other language, you need to supply a locale so
that the collator chosen is correctly <i>tailored</i> for that locale.
<p>The choice of a locale will automatically preset the values for all of
the attributes to something that is reasonable for that locale. Thus most
of the time the other attributes do not need to be explicitly set. In some
cases, the choice of locale will make a difference in string comparison
performance and/or sort key length.</p>
<p>In short attribute names, <i>L&lt;language&gt;_&lt;region&gt;_&lt;variant&gt;</i>
is represented by:</p>
<ul>
<li><i>L&lt;language&gt;</i></li>
<li><i>R&lt;region&gt;</i></li>
<li><i>V&lt;variant&gt;</i></li>
</ul>
<p>If no language, region, or variant is selected, the collator will use
the default UCA ordering. <a href="http://oss.software.ibm.com/cgi-bin/icu/lx">Locale
Explorer</a> shows the language, regions, and variants that ICU supports,
and provides a demo of how they will differ in terms of sorted output.</p>
<table border="0" cellspacing="0" cellpadding="4">
<tr>
<td><i>Example:</i> </td>
<td>Locale="sv" (Swedish)</td>
<td>"Kypper" < "Köpfe"</td>
</tr>
<tr>
<td></td>
<td>Locale="de" (German)</td>
<td>"Köpfe" < "Kypper"</td>
</tr>
</table>
</td>
</tr>
<tr>
<td valign="top">Strength</td>
<td valign="top">S</td>
<td valign="top">1, 2, 3, 4, I, D</td>
<td valign="top">The Strength attribute determines whether accents or case
are taken into account when collating or matching text. (In writing
systems without case or accents, it controls similarly important
features.) The default strength setting usually does not need to be
changed for collating (sorting), but often needs to be changed when <i>matching</i>
(e.g. SELECT). The possible values include Default (D), Primary (1),
Secondary (2), Tertiary (3), Quaternary (4), and Identical (I).
<p>For example, people may choose to ignore accents or ignore accents and
case when searching for text.</p>
<ul>
<li>to ignore accents and case, use Strength=Primary (1);</li>
<li>to ignore case but not accents, use Strength=Secondary (2);</li>
<li>to ignore neither accents nor case, use Strength=Tertiary (3).</li>
</ul>
<p>Almost all characters are distinguished by the first three levels, and
in most locales the default value is thus Tertiary. However, if Alternate
is set to be Shifted, then the Quaternary strength (4) can be used to
break ties among whitespace, punctuation, and symbols that would otherwise
be ignored. If very fine distinctions among characters are required, then
the Identical strength (I) can be used. For example, Identical strength
distinguishes between the <span style="font-variant: small-caps">Mathematical
Bold Small A</span> and the <span style="font-variant: small-caps">Mathematical
Italic Small A</span>; for more examples, look at the cells with white
backgrounds in the collation charts. However, using levels higher than
Tertiary, especially the Identical strength, will result in
significantly longer sort keys and slower string comparison performance
for equal strings.</p>
<table border="0" cellspacing="0" cellpadding="4">
<tr>
<td><i>Example:</i> </td>
<td>S=1</td>
<td>role = Role = rôle</td>
</tr>
<tr>
<td></td>
<td>S=2</td>
<td>role = Role < rôle</td>
</tr>
<tr>
<td></td>
<td>S=3</td>
<td>role < Role < rôle</td>
</tr>
</table>
</td>
</tr>
<tr>
<td valign="top">Case_Level</td>
<td valign="top">K</td>
<td valign="top">X, O, D</td>
<td valign="top">
<p align="left">The Case_Level attribute is used when ignoring accents <i>but
not</i> case. In such a situation, set Strength to be Primary, and
Case_Level to be On. In most locales, this setting is Off by default.
There is a small impact on string comparison performance and sort key
length if this attribute is set to be On.</p>
<table border="0" cellspacing="0" cellpadding="4">
<tr>
<td><i>Example:</i> </td>
<td>S=1, K=X</td>
<td>role = Role = rôle</td>
</tr>
<tr>
<td></td>
<td>S=1, K=O</td>
<td>role = rôle < Role</td>
</tr>
</table>
</td>
</tr>
<tr>
<td valign="top">Case_First</td>
<td valign="top">C</td>
<td valign="top">X, L, U, D</td>
<td valign="top">
<p align="left">The Case_First attribute is used to control whether
uppercase letters come before lowercase letters or vice versa, in the
absence of other differences in the strings. The possible values are
Uppercase_First (U) and Lowercase_First (L), plus the standard Default and
Off. There is almost no difference between the Off and Lowercase_First
options in terms of results, so typically users will not use
Lowercase_First: only Off or Uppercase_First. (People interested in the
detailed differences between X and L should consult the <a href="http://oss.software.ibm.com/icu/userguide/Collate_Customization.html">Collation
Customization</a>).</p>
<p align="left">Specifying either L or U won't affect string comparison
performance, but will affect the sort key length.</p>
<table border="0" cellspacing="0" cellpadding="4">
<tr>
<td><i>Example:</i> </td>
<td>C=X or C=L</td>
<td>"china" < "China" < "denmark"
< "Denmark"</td>
</tr>
<tr>
<td></td>
<td>C=U</td>
<td>"China" < "china" < "Denmark"
< "denmark"</td>
</tr>
</table>
</td>
</tr>
<tr>
<td valign="top">Alternate</td>
<td valign="top">A</td>
<td valign="top">N, S, D</td>
<td valign="top">The Alternate attribute is used to control the handling of
the so-called <i>variable </i>characters in the UCA: whitespace,
punctuation and symbols. If Alternate is set to Non-Ignorable (N), then
differences among these characters are of the same importance as
differences among letters. If Alternate is set to Shifted (S), then these
characters are of only minor importance. The Shifted value is often used
in combination with Strength set to Quaternary. In such a case,
white-space, punctuation, and symbols are considered when comparing
strings, but only if all other aspects of the strings (base letters,
accents, and case) are identical. If Alternate is not set to Shifted, then
there is no difference between a Strength of 3 and a Strength of 4.
<p>For more information and examples, see <a href="http://www.unicode.org/reports/tr10/#Variable_Weighting">Variable_Weighting</a>
in the UCA. The reason the Alternate values are not simply On and Off is
that additional Alternate values may be added in the future. The UCA
option <b>Blanked</b> is expressed with Strength set to 3, and Alternate
set to Shifted.</p>
<p>The default for most locales is Non-Ignorable. If Shifted is selected,
it may be slower if there are many strings that are the same except for
punctuation; sort key length will not be affected unless the strength
level is also increased.</p>
<table border="0" cellspacing="0" cellpadding="4">
<tr>
<td><i>Example:</i> </td>
<td>S=3, A=N</td>
<td>di Silva < Di Silva < diSilva < U.S.A. < USA</td>
</tr>
<tr>
<td></td>
<td>S=3, A=S</td>
<td>di Silva = diSilva < Di Silva < U.S.A. = USA</td>
</tr>
<tr>
<td></td>
<td>S=4, A=S</td>
<td>di Silva < diSilva < Di Silva < U.S.A. < USA</td>
</tr>
</table>
</td>
</tr>
<tr>
<td valign="top">Variable_Top</td>
<td valign="top">T</td>
<td valign="top"><i>&lt;string&gt;</i></td>
<td valign="top">The Variable_Top attribute is only meaningful if the
Alternate attribute is not set to Non-Ignorable. In such a case, it
controls which characters count as ignorable. The string value specifies
the character with the "highest" weight (in UCA order) that is to be
considered ignorable.
<p>Thus, for example, if a user wanted white-space to be ignorable, but
not any visible characters, then s/he would use the value Variable_Top="\u0020"
(space). The string should only be a single character. All characters of
the same primary weight are equivalent, so Variable_Top="\u3000"
(ideographic space) has the same effect as Variable_Top="\u0020".</p>
<p>This setting (alone) has little impact on string comparison
performance; setting it lower or higher will make sort keys slightly
shorter or longer, respectively.</p>
<table border="0" cellspacing="0" cellpadding="4">
<tr>
<td><i>Example:</i> </td>
<td>S=3, A=N</td>
<td>di Silva < diSilva < U.S.A. < USA</td>
</tr>
<tr>
<td></td>
<td>S=3, A=S</td>
<td>di Silva = diSilva < U.S.A. = USA</td>
</tr>
<tr>
<td></td>
<td>S=3, A=S, T=" "</td>
<td>di Silva < diSilva < U.S.A. = USA</td>
</tr>
</table>
</td>
</tr>
<tr>
<td valign="top">Normalization</td>
<td valign="top">N</td>
<td valign="top">X, O, D</td>
<td valign="top">
<p align="left">The Normalization setting determines whether text is
thoroughly normalized or not in comparison. Even if the setting is off
(which is the default for many locales), text as represented in common
usage will compare correctly (for details, see <a href="http://www.unicode.org/notes/tn5/">UTN
#5</a>). Only if the accent marks are in non-canonical order will there be
a problem. If the setting is On, then the best results are guaranteed for
all possible text input.</p>
<p align="left">There is a medium string comparison performance cost if
this attribute is On, depending on the frequency of sequences that require
normalization. There is no significant effect on sort key length.</p>
<table border="0" cellspacing="0" cellpadding="4">
<tr>
<td><i>Example:</i> </td>
<td>N=X</td>
<td><font face="Arial Unicode MS">ä = a + ◌̈ < ä
+ ◌̣ < ạ + ◌̈</font></td>
</tr>
<tr>
<td></td>
<td>N=O</td>
<td><font face="Arial Unicode MS">ä = a + ◌̈ < ä
+ ◌̣ = ạ + ◌̈</font></td>
</tr>
</table>
</td>
</tr>
<tr>
<td valign="top">French</td>
<td valign="top">F</td>
<td valign="top">X, O, D</td>
<td valign="top">
<p align="left">The French attribute controls whether strings that differ
in accents are sorted from the back of the string, as is done in French.
This attribute is automatically set to On for the
French locales and a few others. Users normally would not need to
explicitly set this attribute. There is a string comparison performance
cost when it is set On, but sort key length is unaffected.</p>
<table border="0" cellspacing="0" cellpadding="4">
<tr>
<td><i>Example:</i> </td>
<td>F=X</td>
<td>cote < coté < côte < côté</td>
</tr>
<tr>
<td></td>
<td>F=O</td>
<td>cote < côte < coté < côté</td>
</tr>
</table>
</td>
</tr>
<tr>
<td valign="top">Hiragana</td>
<td valign="top">H</td>
<td valign="top">X, O, D</td>
<td valign="top">
<p align="left">Compatibility with JIS X 4061 requires the introduction of
an additional level to distinguish Hiragana and Katakana characters. If
compatibility with that standard is required, then this attribute should
be set On, and the strength set to Quaternary. This will affect sort key
length and string comparison performance.</p>
<table border="0" cellspacing="0" cellpadding="4">
<tr>
<td><i>Example:</i> </td>
<td>H=X, S=4</td>
<td>きゅう = キュウ <
きゆう = キユウ</td>
</tr>
<tr>
<td></td>
<td>H=O, S=4</td>
<td>きゅう < キュウ <
きゆう < キユウ</td>
</tr>
</table>
</td>
</tr>
</table>
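<p>As a rough illustration of the Strength, Normalization, and Locale rows
above, the following sketch uses the JDK's built-in <code>java.text.Collator</code>
rather than ICU. That choice is an assumption made for the example: the JDK
class exposes only strength and decomposition (its nearest equivalent of the
Normalization attribute), its rules are not the UCA, and the class and helper
names below do not come from this document. Results for a tailored locale such
as Swedish depend on the locale data shipped with the runtime, so no output is
claimed for that line.</p>

```java
import java.text.Collator;
import java.util.Locale;

public class CollationAttributesSketch {

    // Hypothetical helper: a JDK collator for the given locale and strength.
    static Collator collator(Locale locale, int strength) {
        Collator c = Collator.getInstance(locale);
        c.setStrength(strength);
        return c;
    }

    // Hypothetical helper: render a comparison as "<", "=" or ">".
    static String relation(Collator c, String a, String b) {
        int r = c.compare(a, b);
        return r < 0 ? "<" : (r > 0 ? ">" : "=");
    }

    public static void main(String[] args) {
        String role = "role";
        String upperRole = "Role";
        String accented = "r\u00F4le"; // "role" with o-circumflex

        // Strength: S=1 ignores accents and case, S=2 ignores case only,
        // S=3 ignores neither (compare with the Strength example table).
        int[] strengths = {Collator.PRIMARY, Collator.SECONDARY, Collator.TERTIARY};
        for (int i = 0; i < strengths.length; i++) {
            Collator c = collator(Locale.ENGLISH, strengths[i]);
            System.out.println("S=" + (i + 1) + ": role " + relation(c, role, upperRole)
                    + " Role " + relation(c, upperRole, accented) + " r\u00F4le");
        }

        // Normalization: canonical decomposition is the JDK's closest match
        // to N=O. Both strings carry the same two marks in a different order.
        Collator normalizing = collator(Locale.ENGLISH, Collator.TERTIARY);
        normalizing.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        System.out.println("N=O: " + relation(normalizing, "a\u0308\u0323", "\u1EA1\u0308"));

        // Locale: the same pair of words under German and Swedish collators.
        Collator german = collator(Locale.GERMAN, Collator.TERTIARY);
        Collator swedish = collator(Locale.forLanguageTag("sv"), Collator.TERTIARY);
        System.out.println("de: K\u00F6pfe " + relation(german, "K\u00F6pfe", "Kypper") + " Kypper");
        System.out.println("sv: K\u00F6pfe " + relation(swedish, "K\u00F6pfe", "Kypper") + " Kypper");
    }
}
```

<p>The attributes that have no counterpart in <code>java.text.Collator</code>
(Case_Level, Case_First, Alternate, Variable_Top, French, Hiragana) are left
out of the sketch on purpose; they would need an ICU collator.</p>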
<h3>Summary of Value Abbreviations:</h3>
<table border="1" cellspacing="0" cellpadding="4">
<tr>
<th align="left">Value</th>
<th align="left">Abb.</th>
</tr>
<tr>
<td>Default</td>
<td>D</td>
</tr>
<tr>
<td>On</td>
<td>O</td>
</tr>
<tr>
<td>Off</td>
<td>X</td>
</tr>
<tr>
<td>Primary</td>
<td>1</td>
</tr>
<tr>
<td>Secondary</td>
<td>2</td>
</tr>
<tr>
<td>Tertiary</td>
<td>3</td>
</tr>
<tr>
<td>Quaternary</td>
<td>4</td>
</tr>
<tr>
<td>Identical</td>
<td>I</td>
</tr>
<tr>
<td>Shifted</td>
<td>S</td>
</tr>
<tr>
<td>Non-Ignorable</td>
<td>N</td>
</tr>
<tr>
<td>Lower-First</td>
<td>L</td>
</tr>
<tr>
<td>Upper-First</td>
<td>U</td>
</tr>
</table>
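<p>To make the abbreviation scheme concrete, here is a minimal sketch of how a
short-form specification might be split into attribute/value pairs. It is not
ICU's parser. It assumes that tokens are separated by underscores, that a token
starting with "UCA" carries the version, and that every other token is one
attribute letter followed by its value. The class name, method name, and result
keys are hypothetical, and the strength value 2 in the sample string is
supplied only for illustration.</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ShortFormSketch {

    // Attribute abbreviations as listed in the attribute table of this document.
    static final Map<Character, String> ATTRIBUTES = new LinkedHashMap<>();
    static {
        ATTRIBUTES.put('L', "Language");
        ATTRIBUTES.put('R', "Region");
        ATTRIBUTES.put('V', "Variant");
        ATTRIBUTES.put('S', "Strength");
        ATTRIBUTES.put('K', "Case_Level");
        ATTRIBUTES.put('C', "Case_First");
        ATTRIBUTES.put('A', "Alternate");
        ATTRIBUTES.put('T', "Variable_Top");
        ATTRIBUTES.put('N', "Normalization");
        ATTRIBUTES.put('F', "French");
        ATTRIBUTES.put('H', "Hiragana");
    }

    // Split a short-form string into attribute name -> abbreviated value.
    // Assumption: values themselves never contain an underscore.
    static Map<String, String> parse(String shortForm) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String token : shortForm.split("_")) {
            if (token.isEmpty()) {
                continue; // tolerate doubled or trailing underscores
            }
            if (token.startsWith("UCA")) {
                result.put("UCA_Version", token.substring(3));
                continue;
            }
            String name = ATTRIBUTES.get(token.charAt(0));
            if (name == null) {
                throw new IllegalArgumentException("Unknown attribute in token: " + token);
            }
            result.put(name, token.substring(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints the pairs in the order they appear in the input string.
        System.out.println(parse("UCA4.0.0_AS_LSV_S2"));
    }
}
```

<p>The values are kept in abbreviated form because their meaning depends on
the attribute: S means Shifted after A (Alternate), while L means
Lower-First after C (Case_First).</p>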
<h3>Space Padding</h3>
<p>In many database products, fields are padded with null. To get correct
results, the input to a Collator should omit any superfluous trailing padding
spaces. The problem arises with contractions, expansions, or normalization.
Suppose that there are two fields, one containing "aed" and the other
with "äd". A traditional German sort will compare "ä" as
if it were "ae" (on a primary level), so the order will be "äd"
< "aed". But if both fields are padded with spaces to a length of
4, then this will reverse the order, since "äd" will compare as if it were
one character longer. In other words, when you start with strings 1 and 2</p>
<table border="1">
<tr>
<td width="5%">1.</td>
<td width="10%" align="center">a</td>
<td width="10%" align="center">e</td>
<td width="10%" align="center">d</td>
<td width="10%" align="center">&lt;space&gt;</td>
</tr>
<tr>
<td width="5%">2.</td>
<td align="center">ä</td>
<td align="center">d</td>
<td align="center">&lt;space&gt;</td>
<td align="center">&lt;space&gt;</td>
</tr>
</table>
<p>they end up being compared on a primary level as if they were 1' and 2'</p>
<table border="1">
<tr>
<td width="5%">1'.</td>
<td width="10%" align="center">a</td>
<td width="10%" align="center">e</td>
<td width="10%" align="center">d</td>
<td width="10%" align="center">&lt;space&gt;</td>
<td width="10%" align="center"> </td>
</tr>
<tr>
<td width="5%">2'.</td>
<td align="center">a</td>
<td align="center">e</td>
<td align="center">d</td>
<td align="center">&lt;space&gt;</td>
<td align="center">&lt;space&gt;</td>
</tr>
</table>
<p>Since 2' has an extra character (the extra space), it counts as having a
primary difference when it shouldn't. The correct result occurs when the
trailing padding spaces are removed, as in 1" and 2"</p>
<table border="1">
<tr>
<td width="5%">1".</td>
<td width="10%" align="center">a</td>
<td width="10%" align="center">e</td>
<td width="10%" align="center">d</td>
</tr>
<tr>
<td width="5%">2".</td>
<td align="center">a</td>
<td align="center">e</td>
<td align="center">d</td>
</tr>
</table>
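<p>The padding removal described above could be done before the strings reach
the collator, along the following lines. This is only a sketch: it uses the
JDK's <code>java.text.Collator</code> as a stand-in for whatever collator is in
use, it assumes that padding consists of trailing spaces or NUL characters, and
the class and method names are hypothetical.</p>

```java
import java.text.Collator;
import java.util.Locale;

public class PaddingSketch {

    // Remove trailing padding (spaces or NUL characters) from a fixed-width
    // field. Interior and leading spaces are left untouched.
    static String stripPadding(String field) {
        int end = field.length();
        while (end > 0) {
            char c = field.charAt(end - 1);
            if (c != ' ' && c != '\0') {
                break;
            }
            end--;
        }
        return field.substring(0, end);
    }

    // Compare two padded fields on their unpadded content only.
    static int comparePadded(Collator collator, String a, String b) {
        return collator.compare(stripPadding(a), stripPadding(b));
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.GERMAN);
        String field1 = "aed ";       // "aed" padded to four characters
        String field2 = "\u00E4d  ";  // a-umlaut + "d" padded to four characters
        System.out.println("[" + stripPadding(field1) + "] [" + stripPadding(field2) + "]");
        System.out.println(comparePadded(collator, field1, field2));
    }
}
```

<p>Stripping only at the end of the field keeps meaningful interior spaces,
such as the one in "di Silva", available to the collator.</p>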
<h3>Additional References</h3>
<ul>
<li><a href="http://oss.software.ibm.com/icu/apiref/ucol_8h.html#a66">http://oss.software.ibm.com/icu/apiref/ucol_8h.html#a66</a></li>
<li><a href="http://oss.software.ibm.com/icu/apiref/ucol_8h.html#a65">http://oss.software.ibm.com/icu/apiref/ucol_8h.html#a65</a></li>
<li><a href="http://oss.software.ibm.com/icu/docs/">http://oss.software.ibm.com/icu/docs/</a>
(see "Collation in ICU")</li>
<li><a href="http://oss.software.ibm.com/icu/userguide/Collate_Intro.html">http://oss.software.ibm.com/icu/userguide/Collate_Intro.html</a></li>
<li><a href="http://oss.software.ibm.com/icu/apiref/">http://oss.software.ibm.com/icu/apiref/</a>
(see Collator, StringSearch)</li>
<li><a href="http://oss.software.ibm.com/icu4j/doc/">http://oss.software.ibm.com/icu4j/doc/</a>
(see Collator, Collation*, <a href="http://oss.software.ibm.com/icu4j/doc/com/ibm/icu/text/SearchIterator.html" target="classFrame">SearchIterator</a>)</li>
<li><a href="http://oss.software.ibm.com/icu/charts/collation/">ICU Collation
Charts</a></li>
<li><a href="http://www.unicode.org/charts/collation/">Unicode Collation
Charts</a>.</li>
</ul>
</body>
</html>