Smaller lengths array? #10

Open
@Andersama

Description

I found this code referenced inside imgui, and as far as I can tell there's no obvious reason the lengths array needs to contain 32 entries.

The reason is that the bits that control the length of a UTF-8 sequence are the leading 1s at the front of the first byte, followed by a terminating 0 (the terminator would mostly matter to software reading UTF-8 as a bitstream).

0 -> ascii -> 1 byte
10 -> continuation byte (one byte, never the start of a sequence)
110 -> 2 byte sequence
1110 -> 3 byte sequence
11110 -> 4 byte sequence

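111110 -> 5 byte sequence (in the original UTF-8 design, since removed)
1111110 -> 6 byte sequence (in the original UTF-8 design, since removed)
11111110 -> would be a 7 byte sequence if the pattern continued (never defined)
11111111 -> would be an 8 byte sequence if the pattern continued (never defined)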

However, UTF-8 currently encodes a code point in at most 4 bytes, so although the pattern could continue, there are no 5 byte sequences today. That means you could drop the terminating 0 from the 4 byte prefix, look only at the upper nibble, and work from there.
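
The prefix rule above can be sketched in plain C by counting the leading 1 bits of the first byte and mapping that count to a length. This is only an illustration under the 4 byte limit; `leading_ones`, `utf8_len_from_prefix`, and `check_prefix_rule` are hypothetical names, not part of the library.

```c
#include <assert.h>

/* Count the leading 1 bits of a byte (result is 0..8). */
static int leading_ones(unsigned char b)
{
    int n = 0;
    while (n < 8 && (b & (0x80u >> n)))
        n++;
    return n;
}

/* Hypothetical helper: sequence length implied by a first byte,
 * or 0 if the byte cannot start a sequence under the 4 byte limit. */
static int utf8_len_from_prefix(unsigned char b)
{
    int ones = leading_ones(b);
    if (ones == 0)
        return 1;               /* 0xxxxxxx: ASCII */
    if (ones >= 2 && ones <= 4)
        return ones;            /* 110, 1110, 11110 */
    return 0;                   /* 10xxxxxx or 11111xxx */
}

/* Small self-check over one byte from each class. */
static void check_prefix_rule(void)
{
    assert(utf8_len_from_prefix(0x41) == 1);
    assert(utf8_len_from_prefix(0xC3) == 2);
    assert(utf8_len_from_prefix(0xE2) == 3);
    assert(utf8_len_from_prefix(0xF0) == 4);
    assert(utf8_len_from_prefix(0x80) == 0);
    assert(utf8_len_from_prefix(0xF8) == 0);
}
```

Note that the rule needs the bit after the fourth 1 to tell `11110xxx` from `11111xxx`, which is exactly the bit a nibble-indexed table cannot see.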

Now, I'm not sure about the rest of the code that deals with errors, masks, and shifts, but presumably an equivalent function exists with a smaller table.

    static const char lengths[] = {
        1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 3, 4
    };
    
    static const int masks[]  = {0x00, 0x7f, 0x1f, 0x0f, 0x07};
    static const uint32_t mins[] = {4194304, 0, 128, 2048, 65536};
    static const int shiftc[] = {0, 18, 12, 6, 0};
    static const int shifte[] = {0, 6, 4, 2, 0};

    unsigned char *s = buf;
    int len = lengths[s[0] >> 4]; // grab just the upper nibble (the 4 bits that determine the length)
    // I kept the 0s, which appear to feed the error detection.
    // In theory the code past this point works the same way, except that first bytes
    // starting with 11111 now get length 4 and have to be rejected some other way.
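
To see exactly where a nibble-indexed table differs from one indexed by the top five bits, the sketch below compares the two for every byte value. The 32-entry table here is an assumption: it is rebuilt from the prefix rule in this post rather than copied from the repository, and `count_mismatches` is a hypothetical helper.

```c
#include <assert.h>

/* The 16-entry table proposed in this issue, indexed by the upper nibble. */
static const char lengths16[16] = {
    1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 3, 4
};

/* Assumption: a 32-entry table indexed by the top five bits,
 * reconstructed from the prefix rule (11111xxx maps to 0). */
static const char lengths32[32] = {
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 3, 3, 4, 0
};

/* Count the bytes where the two lookups disagree and record the
 * first and last such byte (-1 if there are none). */
static int count_mismatches(int *first, int *last)
{
    int n = 0;
    *first = *last = -1;
    for (int b = 0; b < 256; b++) {
        if (lengths16[b >> 4] != lengths32[b >> 3]) {
            if (n == 0)
                *first = b;
            *last = b;
            n++;
        }
    }
    return n;
}

/* Self-check: the lookups differ only on 0xF8..0xFF. */
static void check_tables(void)
{
    int first, last;
    assert(count_mismatches(&first, &last) == 8);
    assert(first == 0xF8 && last == 0xFF);
}
```

Under these assumptions the smaller table is equivalent for every byte except `0xF8` through `0xFF`, so only that range needs another route to the error flag.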

For the remaining decoding section here:

    *c  = (uint32_t)(s[0] & masks[len]) << 18;
    *c |= (uint32_t)(s[1] & 0x3f) << 12;
    *c |= (uint32_t)(s[2] & 0x3f) <<  6;
    *c |= (uint32_t)(s[3] & 0x3f) <<  0;
    *c >>= shiftc[len];
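
As a worked example of those five lines, the sketch below wraps them in a function and decodes a few known sequences. It assumes the input has at least 4 readable bytes and does no error checking at all; `decode_unchecked` and `check_decode` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

static const char lengths[16] = {
    1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 3, 4
};
static const int masks[]  = {0x00, 0x7f, 0x1f, 0x0f, 0x07};
static const int shiftc[] = {0, 18, 12, 6, 0};

/* Hypothetical helper: decode one code point from s without validation.
 * s must point at 4 or more readable bytes (pad short input with zeros). */
static uint32_t decode_unchecked(const unsigned char *s, int *len_out)
{
    int len = lengths[s[0] >> 4];
    uint32_t c;
    /* Always assemble as if 4 bytes were present ... */
    c  = (uint32_t)(s[0] & masks[len]) << 18;
    c |= (uint32_t)(s[1] & 0x3f) << 12;
    c |= (uint32_t)(s[2] & 0x3f) <<  6;
    c |= (uint32_t)(s[3] & 0x3f) <<  0;
    /* ... then shift away the bits of the bytes that were not used. */
    c >>= shiftc[len];
    *len_out = len;
    return c;
}

/* Self-check: E2 82 AC is U+20AC (the euro sign). */
static void check_decode(void)
{
    const unsigned char euro[4] = {0xE2, 0x82, 0xAC, 0x00};
    int len;
    assert(decode_unchecked(euro, &len) == 0x20AC);
    assert(len == 3);
}
```

For `E2 82 AC`, the four terms combine to `0x82B00`, and shifting right by `shiftc[3] = 6` leaves `0x20AC`.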

I haven't exactly tested this...but it strikes me that the masking operation doesn't need a table.

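    // the cast matters: s[0] is promoted to int, so without it the shift pair is a no-op
    *c  = (uint32_t)((unsigned char)(s[0] << len) >> len) << 18;
    // alternatively
    *c  = (uint32_t)(s[0] & (0xff >> len)) << 18;
    // both equal s[0] & masks[len] for valid first bytes (len 1..4), but not when len == 0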
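
Since the table-free mask is untested, here is a small exhaustive comparison against `masks[len]`, using the 16-entry lengths table from this issue. `mask_mismatches` is a hypothetical helper. The shift-pair form is written with an `(unsigned char)` cast, because without it the operand is promoted to `int` and the left shift discards nothing.

```c
#include <assert.h>

static const char lengths[16] = {
    1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 3, 4
};
static const int masks[] = {0x00, 0x7f, 0x1f, 0x0f, 0x07};

/* Count the bytes in [lo, hi] whose payload bits differ between the
 * table mask and the computed mask. Returns -1 if the two computed
 * forms ever disagree with each other. */
static int mask_mismatches(int lo, int hi)
{
    int n = 0;
    for (int b = lo; b <= hi; b++) {
        int len = lengths[b >> 4];
        int by_table = b & masks[len];
        int by_shift = b & (0xff >> len);
        int by_cast  = (unsigned char)(b << len) >> len;
        if (by_shift != by_cast)
            return -1;
        if (by_table != by_shift)
            n++;
    }
    return n;
}

/* Self-check: identical on ASCII and on the valid 2..4 byte first bytes. */
static void check_masks(void)
{
    assert(mask_mismatches(0x00, 0x7F) == 0);
    assert(mask_mismatches(0xC0, 0xF7) == 0);
}
```

The computed mask keeps one more bit than the table for lengths 2 to 4, but that bit is the terminating 0 of a valid first byte, so the results match. They differ for continuation bytes (`0x80` to `0xBF`, where `masks[0]` is 0) and for `0xF8` to `0xFF`, where the extra bit is set.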

Similar ideas could apply to the shiftc and shifte tables, since their entries are multiples of 6 and 2.

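    //*c >>= shiftc[len];
    *c >>= 24 - 6*len;  // same as the table for len 1..4; gives 24 (not 0) when len == 0
    //*e >>= shifte[len];
    *e >>= 8 - 2*len;   // same as the table for len 1..4; gives 8 (not 0) when len == 0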
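
A quick check of the two formulas against the tables, for every length the lookup can return. The helper names (`shiftc_arith`, `shifte_arith`, `shift_disagreements`) are hypothetical; the tables are the ones quoted in this issue.

```c
#include <assert.h>

static const int shiftc[] = {0, 18, 12, 6, 0};
static const int shifte[] = {0, 6, 4, 2, 0};

/* The arithmetic replacements suggested above. */
static int shiftc_arith(int len) { return 24 - 6 * len; }
static int shifte_arith(int len) { return 8 - 2 * len; }

/* Return a bitmask with bit `len` set wherever either formula
 * disagrees with its table entry. */
static unsigned shift_disagreements(void)
{
    unsigned bits = 0;
    for (int len = 0; len <= 4; len++) {
        if (shiftc_arith(len) != shiftc[len] ||
            shifte_arith(len) != shifte[len])
            bits |= 1u << len;
    }
    return bits;
}

/* Self-check: only len == 0 disagrees. */
static void check_shifts(void)
{
    assert(shift_disagreements() == 1u);
}
```

The formulas reproduce the tables for lengths 1 to 4. For `len == 0` they give 24 and 8 where both tables hold 0, so that case would need separate handling if the error value is expected to survive the final shift.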

I'm guessing you've probably written code like this already, but I was curious.
