FlatBuffers 64 for C++ by dbaileychess · Pull Request #7935 · google/flatbuffers

dbaileychess · 2023-05-05T00:09:38Z

This introduces 64-bit FlatBuffers.

This allows buffers to exceed the 2 GiB limit imposed by the addressable range of the uoffset_t (aka uint32_t) used for offsets. This adds a uoffset64_t (aka uint64_t) as a possible offset backing type, allowing the addressable range to be much larger.

Overview

The buffer is now conceptually two regions of contiguous memory:

[           binary           ]
[32-bit region][64-bit region]

Historically, the 32-bit region was the whole of the FlatBuffer. All 32-bit offsets (Offset) are relative to the end of the 32-bit region, and thus can only address objects within that region. The new 64-bit offset (Offset64) is relative to the end of the 64-bit region (conceptually, the tail of the whole buffer), so an Offset64 can address any object within the buffer.

This leads to an important concept for using 64-bit FlatBuffers:

All 64-bit offsets MUST be serialized to the binary first, before adding any 32-bit offsets

Doing otherwise triggers an assertion.
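The ordering rule can be pictured with a small toy model. The sketch below is not the FlatBuffers implementation: `OrderCheckedBuilder`, `AddFar`, and `AddNear` are hypothetical names, and it throws an exception where the real builder asserts, only so that the violation can be observed without aborting the program.

```cpp
#include <cstddef>
#include <stdexcept>

// Toy model of the "64-bit offsets first" rule. All names are hypothetical;
// this is not how FlatBufferBuilder64 is implemented.
class OrderCheckedBuilder {
 public:
  // Record an object that will be referenced through a 64-bit offset.
  // Allowed only while nothing 32-bit-addressed has been written yet.
  void AddFar(std::size_t size) {
    if (near_started_) {
      throw std::logic_error("64-bit data must be written before 32-bit data");
    }
    far_bytes_ += size;
  }

  // Record an object referenced through a 32-bit offset. After the first
  // call, the 64-bit region is treated as closed.
  void AddNear(std::size_t size) {
    near_started_ = true;
    near_bytes_ += size;
  }

  std::size_t far_bytes() const { return far_bytes_; }
  std::size_t near_bytes() const { return near_bytes_; }

 private:
  bool near_started_ = false;
  std::size_t far_bytes_ = 0;
  std::size_t near_bytes_ = 0;
};
```

Calling `AddFar` and then `AddNear` succeeds; calling `AddFar` again afterwards throws, which mirrors the situation in which the real builder would assert.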

Schema

Two new attributes are added:

offset64

This will generate methods to produce Offset64 return types. This can be used on strings and vectors.

vector64

This implies offset64 but also expands the type of the length field of a vector from 32 bits to 64 bits. This allows you to store a single large vector that is > 2 GiB. This also works with nested_flatbuffer attributes.

Code Generation

Only C++ code generation is supported at the moment.

For the most part, this code is semantically similar to the 32-bit version; it just switches out some types. It also requires use of the new FlatBufferBuilder64 to handle the larger buffer.

Builder

There is a new FlatBufferBuilder64 that is used to build these large buffers. It has various methods to create the supported 64-bit enabled types: vectors and strings (sorry, no tables yet).

Semantically the builder is the same as the 32-bit one (it's just a templated version of it), so the API and flow of building are identical. The only difference is the inclusion of more template parameters to dictate the use of Offset64 or Offset.

For example there are now:

FlatBufferBuilder64 builder;
Offset<T> offset = builder.CreateVector(T t); // create a normal 32-bit vector with 32-bit length field.
Offset64<T> offset = builder.CreateVector<Offset64>(T t); // create a 64-bit vector with 32-bit length field.
Offset64<T> offset = builder.CreateVector<Offset64, Vector64>(T t); // create a 64-bit vector with 64-bit length field.

// Same with strings
Offset<String> offset = builder.CreateString("hi");
Offset64<String> offset = builder.CreateString<Offset64>("hi");
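How extra template parameters can select the offset and length types may be easier to see in a standalone toy. The sketch below models only the type plumbing and serializes nothing; `ToyBuilder` and the stripped-down `Offset`, `Offset64`, `Vector`, and `Vector64` here are stand-ins written for this illustration, not the library's definitions.

```cpp
#include <cstdint>
#include <type_traits>
#include <vector>

// Toy stand-ins; only the widths of the offset and length types are modeled.
template <typename T = void> struct Offset   { std::uint32_t o; };
template <typename T = void> struct Offset64 { std::uint64_t o; };

// A vector type parameterized on the type of its length field.
template <typename T, typename SizeT = std::uint32_t> struct Vector {
  using size_type = SizeT;
};
template <typename T> using Vector64 = Vector<T, std::uint64_t>;

struct ToyBuilder {
  // OffsetT selects the offset width and VectorT the length-field width;
  // T is deduced from the argument.
  template <template <typename...> class OffsetT = Offset,
            template <typename...> class VectorT = Vector, typename T>
  OffsetT<VectorT<T>> CreateVector(const std::vector<T>& v) {
    (void)v;  // a real builder would copy the elements into its buffer
    return OffsetT<VectorT<T>>{};
  }
};

// The three call shapes from the description, checked at compile time.
using IntVec = std::vector<int>;
static_assert(std::is_same<decltype(ToyBuilder{}.CreateVector(IntVec{})),
                           Offset<Vector<int, std::uint32_t>>>::value,
              "32-bit offset, 32-bit length");
static_assert(std::is_same<decltype(ToyBuilder{}.CreateVector<Offset64>(IntVec{})),
                           Offset64<Vector<int, std::uint32_t>>>::value,
              "64-bit offset, 32-bit length");
static_assert(std::is_same<decltype(ToyBuilder{}.CreateVector<Offset64, Vector64>(IntVec{})),
                           Offset64<Vector<int, std::uint64_t>>>::value,
              "64-bit offset, 64-bit length");
```

The defaults keep the plain call on the 32-bit path, so in this toy, code that never names Offset64 or Vector64 gets the same types it always did.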

Accessing

Accessing a 64-bit FlatBuffer is almost identical to accessing a 32-bit one. Only the returned types differ slightly for vectors: there is now a Vector64<T> (which is just a Vector<T, uoffset64_t>). It has an identical API and only operates on a different length type.

Compatibility

Adding either offset64 or vector64 to an existing field is an evolution error (it would fail the compatibility check), so backwards compatibility is preserved, as any 64-bit field would have to be a new field.

Implementation Notes

Here is what an annotated binary looks like for an example schema that uses various 64-bit fields.

Primarily the difference is the support for UOffset64 in the table definition. The associated vtable doesn't need special treatment since it natively supported various offsets.

root_table (RootTable):
  +0x1C | 14 00 00 00             | SOffset32  | 0x00000014 (20) Loc: 0x08          | offset to vtable
  +0x20 | D0 00 00 00 00 00 00 00 | UOffset64  | 0x00000000000000D0 (208) Loc: 0xF0 | offset to field `far_vector` (vector)
  +0x28 | 00 00 00 00             | uint8_t[4] | ....                               | padding
  +0x2C | D2 04 00 00             | uint32_t   | 0x000004D2 (1234)                  | table field `a` (Int)
  +0x30 | 8C 00 00 00 00 00 00 00 | UOffset64  | 0x000000000000008C (140) Loc: 0xBC | offset to field `far_string` (string)
  +0x38 | 00 00 00 00             | uint8_t[4] | ....                               | padding
  +0x3C | 40 00 00 00             | UOffset32  | 0x00000040 (64) Loc: 0x7C          | offset to field `near_string` (string)
  +0x40 | 70 00 00 00 00 00 00 00 | UOffset64  | 0x0000000000000070 (112) Loc: 0xB0 | offset to field `big_vector` (vector64)
  +0x48 | 08 00 00 00 00 00 00 00 | UOffset64  | 0x0000000000000008 (8) Loc: 0x50   | offset to field `big_struct_vector` (vector64)
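The `Loc` column in the dump above can be reproduced by hand: each unsigned offset is added to the position of the offset field itself, and the table's leading SOffset32 is subtracted from the table position to reach the vtable. The sketch below checks that arithmetic against the values in the dump; `FollowUOffset` and `FollowVTableOffset` are hypothetical helper names, not FlatBuffers functions.

```cpp
#include <cstdint>

// Position reached by following an unsigned offset stored at `field_pos`.
constexpr std::uint64_t FollowUOffset(std::uint64_t field_pos,
                                      std::uint64_t value) {
  return field_pos + value;
}

// Position of the vtable, given the table position and its SOffset32 value.
constexpr std::uint64_t FollowVTableOffset(std::uint64_t table_pos,
                                           std::int32_t value) {
  return table_pos - static_cast<std::uint64_t>(static_cast<std::int64_t>(value));
}

// Values taken from the annotated RootTable above.
static_assert(FollowVTableOffset(0x1C, 0x14) == 0x08, "vtable");
static_assert(FollowUOffset(0x20, 0xD0) == 0xF0, "far_vector (UOffset64)");
static_assert(FollowUOffset(0x30, 0x8C) == 0xBC, "far_string (UOffset64)");
static_assert(FollowUOffset(0x3C, 0x40) == 0x7C, "near_string (UOffset32)");
static_assert(FollowUOffset(0x40, 0x70) == 0xB0, "big_vector (UOffset64)");
static_assert(FollowUOffset(0x48, 0x08) == 0x50, "big_struct_vector (UOffset64)");
```

The 32-bit and 64-bit fields follow the same rule; only the width of the stored value differs.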

The other interesting case is the vector64 that now supports a uint64_t length field:

vector64 (RootTable.big_vector):
  +0xB0 | 04 00 00 00 00 00 00 00 | uint64_t   | 0x0000000000000004 (4)             | length of vector (# items)
  +0xB8 | 05                      | uint8_t    | 0x05 (5)                           | value[0]
  <2 regions omitted>
  +0xBB | 08                      | uint8_t    | 0x08 (8)                           | value[3]
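As the dump shows, the elements of a byte vector start right after the length field, so widening the length from 4 to 8 bytes simply shifts where the first element sits. A minimal sketch of a reader parameterized on the length type follows; `ToyByteVector` and its aliases are hypothetical, handle only byte elements, and are not the library's Vector or Vector64 classes.

```cpp
#include <cstddef>
#include <cstdint>

// Toy read-only view over a length-prefixed byte vector, parameterized on
// the type of the length field. Hypothetical; byte elements only.
template <typename SizeT>
class ToyByteVector {
 public:
  explicit ToyByteVector(const std::uint8_t* data) : data_(data) {}

  // Decode the little-endian length prefix one byte at a time, so the
  // result does not depend on the host byte order.
  SizeT size() const {
    SizeT n = 0;
    for (std::size_t i = 0; i < sizeof(SizeT); ++i) {
      n |= static_cast<SizeT>(data_[i]) << (8 * i);
    }
    return n;
  }

  // Elements start immediately after the length prefix. No bounds check.
  std::uint8_t operator[](SizeT i) const { return data_[sizeof(SizeT) + i]; }

 private:
  const std::uint8_t* data_;
};

using ToyVector32 = ToyByteVector<std::uint32_t>;  // 4-byte length field
using ToyVector64 = ToyByteVector<std::uint64_t>;  // 8-byte length field
```

Pointing a `ToyVector64` at the bytes starting at +0xB0 in the dump would read the 8-byte length and then index elements from +0xB8 onward.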

Fixes: #7537

aardappel

This is awesome! Appreciate the great care you've taken to not disrupt the 32-bit eco system much :) Overall complexity surprisingly low too. Very neat!

battre · 2023-05-11T10:04:36Z

FYI, the tests are now really slow...

time ./flatbuffers_unittests
ALL TESTS PASSED

real    1m24.133s
user    1m23.377s
sys     0m0.752s

aardappel · 2023-05-11T15:24:34Z

@battre @dbaileychess odd, they used to be really fast, worth seeing which test that is..

battre · 2023-05-11T16:01:48Z

I attached a debugger because I thought that the tests were in an infinite loop. When I triggered a break, I was in the resize here:

    std::vector<uint8_t> big_data;
    big_data.resize(big_vector_size);

dbaileychess · 2023-05-11T16:26:46Z

Yeah, I actually made a giant buffer to test and it takes a while to make. Let me fix it so the average case doesn't have to do it.

dbaileychess · 2023-05-11T16:41:13Z

Fixed in 66e9d98

Before:

[I] derekbailey@lysine ~/P/d/flatbuffers (master)> time ./flattests
ALL TESTS PASSED

________________________________________________________
Executed in    4.44 secs    fish           external
   usr time    1.31 secs    1.35 millis    1.31 secs
   sys time    3.11 secs    0.40 millis    3.11 secs

After:

[I] derekbailey@lysine ~/P/d/flatbuffers (master)> time ./flattests
ALL TESTS PASSED

________________________________________________________
Executed in  121.91 millis    fish           external
   usr time   93.57 millis    1.43 millis   92.14 millis
   sys time   28.39 millis    0.35 millis   28.04 millis

I guess my machine was beefy enough that the jump from 100 ms to 4 s wasn't bad enough for me to notice.

battre · 2023-05-11T17:11:18Z

Hm... My machine that needed 1m24.133s has 128 virtual cores and 512 GB of RAM :-) - But it's a virtual machine and I compiled it within a Chrome checkout. I wonder whether Chrome has special parameters for the memory allocator...

battre · 2023-05-11T17:14:18Z

That helped here as well!

time ./flatbuffers_unittests
ALL TESTS PASSED

real    0m0.236s
user    0m0.228s
sys     0m0.008s

Thank you.

* First working hack of adding 64-bit. Don't judge :)
* Made vector_downward work on 64 bit types
* vector_downward uses size_t, added offset64 to reflection
* cleaned up adding offset64 in parser
* Add C++ testing skeleton for 64-bit
* working test for CreateVector64
* working >2 GiB buffers
* support for large strings
* simplified CreateString<> to just provide the offset type
* generalize CreateVector template
* update test_64.afb due to upstream format change
* Added Vector64 type, which is just an alias for vector ATM
* Switch to Offset64 for Vector64
* Update for reflection bfbs output change
* Starting to add support for vector64 type in C++
* made a generic CreateVector that can handle different offsets and vector types
* Support for 32-vector with 64-addressing
* Vector64 basic builder + tests working
* basic support for json vector64 support
* renamed fields in test_64bit.fbs to better reflect their use
* working C++ vector64 builder
* Apply --annotate-sparse-vector to 64-bit tests
* Enable Vector64 for --annotate-sparse-vectors
* Merged from upstream
* Add `near_string` field for testing 32-bit offsets alongside
* keep track of where the 32-bit and 64-bit regions are for flatbufferbuilder
* move template<> outside class body for GCC
* update run.sh to build and run tests
* basic assertion for adding 64-bit offset at the wrong time
* started to separate `FlatBufferBuilder` into two classes, 1 64-bit aware, the other not
* add test for nested flatbuffer vector64, fix bug in alignment of big vectors
* fixed CreateDirect method by iterating by Offset64 first
* internal refactoring of flatbufferbuilder
* block not supported languages in the parser from using 64-bit
* evolution tests for adding a vector64 field
* conformity tests for adding/removing offset64 attributes
* ensure test is for a big buffer
* add parser error tests for `offset64` and `vector64` attributes
* add missing static that GCC only complains about
* remove stdint-uintn.h header that gets automatically added
* move 64-bit CalculateOffset internal
* fixed return size of EndVector
* various fixes on windows
* add SizeT to vector_downward
* minimze range of size changes in vector and builder
* reworked how tracking if 64-offsets are added
* Add ReturnT to EndVector
* small cleanups
* remove need for second Array definition
* combine IndirectHelpers into one definition
* started support for vector of struct
* Support for 32/64-vectors of structs + Offset64
* small cleanups
* add verification for vector64
* add sized prefix for 64-bit buffers
* add fuzzer for 64-bit
* add example of adding many vectors using a wrapper table
* run the new -bfbs-gen-embed logic on the 64-bit tests
* remove run.sh and fix cmakelist issue
* fixed bazel rules
* fixed some PR comments
* add 64-bit tests to cmakelist

dbaileychess linked an issue May 5, 2023 that may be closed by this pull request

Possible design for 64-bit sized buffer support in FlatBuffers #7537

Closed

dbaileychess requested review from CasperN and aardappel May 5, 2023 00:09

github-actions bot added c++, codegen, java, json, php, python labels May 5, 2023

dbaileychess force-pushed the flatbuffers-64 branch from ecdee46 to 4e7ce80 Compare May 5, 2023 00:13

aardappel reviewed May 6, 2023

dbaileychess force-pushed the flatbuffers-64 branch from 73b48fa to 5c2d39a Compare May 8, 2023 20:29

dbaileychess added 18 commits May 9, 2023 08:56

First working hack of adding 64-bit. Don't judge :)

1f53a53

Made vector_downward work on 64 bit types

0033e15

vector_downward uses size_t, added offset64 to reflection

a91e67d

cleaned up adding offset64 in parser

8a4b283

Add C++ testing skeleton for 64-bit

00a5c35

working test for CreateVector64

b08d3ed

working >2 GiB buffers

e396b34

support for large strings

44b9b91

simplified CreateString<> to just provide the offset type

d02a187

generalize CreateVector template

a00d388

update test_64.afb due to upstream format change

7add255

Added Vector64 type, which is just an alias for vector ATM

9ef353a

Switch to Offset64 for Vector64

1ca9995

Update for reflection bfbs output change

181ba16

Starting to add support for vector64 type in C++

c130cd2

made a generic CreateVector that can handle different offsets and vector types

e90f0a5

Support for 32-vector with 64-addressing

35d1df0

Vector64 basic builder + tests working

b76bfa0

dbaileychess added 17 commits May 9, 2023 08:56

reworked how tracking if 64-offsets are added

a75b3fb

Add ReturnT to EndVector

6d7a8fe

small cleanups

b51a18d

remove need for second Array definition

38e98d1

combine IndirectHelpers into one definition

699b618

started support for vector of struct

2cc133b

Support for 32/64-vectors of structs + Offset64

46f2068

small cleanups

d2641a3

add verification for vector64

d075a35

add sized prefix for 64-bit buffers

040cea8

add fuzzer for 64-bit

cb64fef

add example of adding many vectors using a wrapper table

66c8059

run the new -bfbs-gen-embed logic on the 64-bit tests

e7f1afe

remove run.sh and fix cmakelist issue

6e52dbb

fixed bazel rules

b3e3fad

fixed some PR comments

46a0ea8

add 64-bit tests to cmakelist

1ac0550

dbaileychess force-pushed the flatbuffers-64 branch from 5c2d39a to 1ac0550 Compare May 9, 2023 16:00

dbaileychess merged commit 63b7b25 into master May 9, 2023

dbaileychess deleted the flatbuffers-64 branch May 11, 2023 19:17

PINTO0309 mentioned this pull request Sep 21, 2023

Ubuntu22.04, tensorflow v2.14.0rc1, ml_dtypes, flatc v23.5.26 PINTO0309/onnx2tf#506

Merged

dbaileychess merged 62 commits into master from flatbuffers-64

3 participants
