Commit 82257e2

utfebcdic.h: Add comments
1 parent e56dfd9 commit 82257e2

1 file changed: +15 -2 lines changed

utfebcdic.h

@@ -202,11 +202,24 @@ possible to UTF-8-encode a single code point in different ways, but that is
 explicitly forbidden, and the shortest possible encoding should always be used
 (and that is what Perl does). */
 
-/* Comments as to the meaning of each are given at their corresponding utf8.h
- * definitions. */
+/* It turns out that just this one number is sufficient to derive all the basic
+ * macros for UTF-8 and UTF-EBCDIC. Everything follows from the fact that
+ * there are 6 bits of real information in a UTF-8 continuation byte vs. 5 bits
+ * in a UTF-EBCDIC one. */
 
 #define UTF_ACCUMULATION_SHIFT 5
 
+/* Also needed is how perl handles a start byte of 8 one bits. The decision
+ * was made to just append the minimal number of bytes after that so that code
+ * points up to 64 bits wide could be represented. In UTF-8, that was an extra
+ * 5 bytes, and in UTF-EBCDIC it's 6. The result is in UTF8_MAXBYTES defined
+ * above. This implementation has the advantage that you have everything you
+ * need in the first byte. Other ways of extending UTF-8 have been devised,
+ * some to arbitrarily high code points. But they require looking at the next
+ * byte(s) when the first one is 8 one bits. */
+
+/* These others are for efficiency or for other decisions we've made */
+
 #define isUTF8_POSSIBLY_PROBLEMATIC(c) \
     _generic_isCC(c, _CC_UTF8_START_BYTE_IS_FOR_AT_LEAST_SURROGATE)
 
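The new comments say that the accumulation shift alone is enough to derive the basic macros, and that a start byte of 8 one bits is followed by enough bytes to hold a 64-bit code point. The following is a minimal sketch of that idea, not Perl's actual macros. The helper names `continuation_mask`, `accumulate`, `decode`, and `ff_payload_bits` are hypothetical. The sketch also assumes that the caller already knows the sequence length, and for the 5-bit case it works on UTF-EBCDIC's intermediate (I8) byte values, ignoring the final mapping to EBCDIC bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers, parameterised on the number of payload bits in a
 * continuation byte: 6 for UTF-8, 5 for UTF-EBCDIC (intermediate I8 form). */

/* Mask selecting the payload bits of a continuation byte. */
static uint8_t continuation_mask(unsigned shift)
{
    return (uint8_t)((1u << shift) - 1);
}

/* Fold one continuation byte into the code point accumulated so far. */
static uint64_t accumulate(uint64_t cp, uint8_t byte, unsigned shift)
{
    return (cp << shift) | (uint64_t)(byte & continuation_mask(shift));
}

/* Decode a sequence whose total length (2..7 bytes) is already known.
 * The start byte keeps its payload below its run of leading one bits.
 * No validation is done; this is an illustration only. */
static uint64_t decode(const uint8_t *s, size_t len, unsigned shift)
{
    uint64_t cp = s[0] & (0xFFu >> (len + 1));
    for (size_t i = 1; i < len; i++)
        cp = accumulate(cp, s[i], shift);
    return cp;
}

/* Payload bits available in a sequence of total_bytes bytes whose start
 * byte is 8 one bits, so that every payload bit sits in a continuation. */
static unsigned ff_payload_bits(unsigned total_bytes, unsigned shift)
{
    return (total_bytes - 1) * shift;
}
```

With these helpers, `decode` on the bytes `E2 82 AC` with a shift of 6 yields 0x20AC, and so does `decode` on `E8 A5 AC` with a shift of 5. Assuming a start byte of 8 one bits would otherwise announce an 8-byte sequence, the extra 5 and 6 bytes mentioned in the comment give totals of 13 and 14 bytes; `ff_payload_bits(13, 6)` is 72 and `ff_payload_bits(14, 5)` is 65, both enough for 64 bits.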
