[avx2] Optimize byte packing during decoding

During decoding, `decode_avx2` returns a 32-byte value along with a 32-bit mask indicating which bytes are valid (i.e. not decoded from whitespace). Currently, the valid bytes are packed together using a simple loop over all 32 bytes. There are likely more efficient ways to do this. (On AVX-512 you'd use the VPCOMPRESSB instruction, but that's not available here.)
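
To make the problem concrete, here is a rough sketch of the packing step in plain C. It is not the project's actual code: the names `pack_loop` and `pack_branchless` are hypothetical, and it assumes that bit `i` of the mask corresponds to byte `i` and that a set bit means "valid". The first function is the obvious loop; the second is one possible cheap improvement that drops the per-byte branch by always storing and only advancing the output index when the byte is valid.

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline (hypothetical): copy byte i only when bit i of the mask is set.
 * Returns the number of bytes written to out. */
static size_t pack_loop(const uint8_t in[32], uint32_t mask, uint8_t *out) {
    size_t n = 0;
    for (int i = 0; i < 32; i++) {
        if (mask & (UINT32_C(1) << i)) {
            out[n++] = in[i];
        }
    }
    return n;
}

/* Branchless variant (hypothetical): store every byte unconditionally and
 * advance the output index by the mask bit. An invalid byte is overwritten
 * by the next store. out must have room for 32 bytes, since bytes past the
 * returned length may be clobbered. */
static size_t pack_branchless(const uint8_t in[32], uint32_t mask, uint8_t *out) {
    size_t n = 0;
    for (int i = 0; i < 32; i++) {
        out[n] = in[i];
        n += (mask >> i) & 1u;
    }
    return n;
}
```

Both functions should produce the same packed prefix and length. The branchless version is still scalar, so it is only a starting point, and whether it actually helps would need benchmarking.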