1 file changed, +15 -2 lines changed

@@ -202,11 +202,24 @@ possible to UTF-8-encode a single code point in different ways, but that is
explicitly forbidden, and the shortest possible encoding should always be used
(and that is what Perl does). */
- /* Comments as to the meaning of each are given at their corresponding utf8.h
- * definitions. */
+ /* It turns out that just this one number is sufficient to derive all the basic
+ * macros for UTF-8 and UTF-EBCDIC. Everything follows from the fact that
+ * there are 6 bits of real information in a UTF-8 continuation byte vs. 5 bits
+ * in a UTF-EBCDIC one. */
#define UTF_ACCUMULATION_SHIFT 5
+ /* Also needed is how perl handles a start byte of 8 one bits. The decision
+ * was made to just append the minimal number of bytes after that so that code
+ * points up to 64 bits wide could be represented. In UTF-8, that was an extra
+ * 5 bytes, and in UTF-EBCDIC it's 6. The result is in UTF8_MAXBYTES defined
+ * above. This implementation has the advantage that you have everything you
+ * need in the first byte. Other ways of extending UTF-8 have been devised,
+ * some to arbitrarily high code points. But they require looking at the next
+ * byte(s) when the first one is 8 one bits. */
+
+ /* These others are for efficiency or for other decisions we've made */
+
#define isUTF8_POSSIBLY_PROBLEMATIC(c) \
_generic_isCC(c, _CC_UTF8_START_BYTE_IS_FOR_AT_LEAST_SURROGATE)
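
As a rough illustration of the comments added in this change, the C sketch below shows how a single shift value (6 for UTF-8, 5 for UTF-EBCDIC) yields the continuation-byte payload mask and the accumulation step, and how the byte counts quoted for a start byte of 8 one bits add up. The helper names (`payload_mask`, `accumulate`, `ff_sequence_length`, `ff_payload_bits`) are hypothetical and are not perl's macros. The sketch also assumes that a start byte of 8 one bits would otherwise announce an 8-byte sequence, to which the extra 5 or 6 bytes mentioned in the comment are appended.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; these are not perl's macros. */

/* Mask selecting the payload bits of a continuation byte:
 * `shift` is 6 for UTF-8 and 5 for UTF-EBCDIC. */
static unsigned payload_mask(unsigned shift)
{
    return (1u << shift) - 1u;
}

/* Fold one continuation byte into a partially decoded code point. */
static uint64_t accumulate(uint64_t cp, unsigned byte, unsigned shift)
{
    return (cp << shift) | (byte & payload_mask(shift));
}

/* Total sequence length for a start byte of 8 one bits.
 * Assumption: the usual pattern would give 8 bytes, and `extra` more
 * are appended (5 for UTF-8, 6 for UTF-EBCDIC, per the comment). */
static unsigned ff_sequence_length(unsigned extra)
{
    return 8u + extra;
}

/* Payload bits carried by the continuation bytes of such a sequence;
 * the start byte itself carries no payload. */
static unsigned ff_payload_bits(unsigned extra, unsigned shift)
{
    return (ff_sequence_length(extra) - 1u) * shift;
}
```

Under these assumptions, `accumulate(0xC3 & 0x1F, 0xA9, 6)` decodes the standard UTF-8 bytes C3 A9 to 0xE9. `ff_sequence_length(5)` is 13 and `ff_sequence_length(6)` is 14, giving 72 and 65 payload bits respectively, so either form can hold a 64-bit code point with the full length known from the first byte.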