-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Possible design for 64-bit sized buffer support in FlatBuffers #7537
Comments
This would help with #7536, but I think it seems a little "special cased" for a language feature with changes to the schema language and everything. I think it's better to provide this as a library, we can provide some |
@CasperN I am not sure how you'd "do it as a library", given that this needs features in the schema parser, the code generators, the runtime code, etc. It be possible to do it in the runtime only but it be very unergonomic, and work differently in every language. And yes that linked issue is pretty much my use case #1 |
So the binary format would be
The first size prefixed table on the left would correspond to Without knowing the types, a Making a |
Basically I'm suggesting adding a new kind of implicit "root", there's
|
Which brings up the question, we use size prefixed flatbuffers quite a bit. How would those work with your proposal? Since 2 GB means that the highest bit is available, maybe use that to signal that this is a 64bit flatbuffer inside the size prefixed flatbuffer? |
@CasperN I also don't understand the relevance of @AustinSchuh yes, sadly, the buffer as a whole would need a new kind of "64-bit size prefixed".. it may be good to set the high bit, but users were already required to statically differentiate between size prefixed and not, so this new type is no different. |
I agree that the many tables should be independent (no sharing between elements of the big vector) to avoid those problems. The relevance of size-prefixed is that it tells you when the “root” flatbuffer ends and when the table of offsets begin. The “size” refers to the size of the table of type Root. The byte after that is the start of the vector of offsets.
The only difference between my suggestion and the original post is the observation that if the big vector is “top level” and not inside of the table, Root, then this can be implemented with even less effort.
Actually the simpler version is to not distinguish between the root table and the sub tables. Consider the following format:
• the first 8 bytes is the uint64 length in bytes of everything
• Second 8 bytes are the unit64 number of elements in the vector
• Then the uint64 offsets to the nested flatbuffers
• Then the nested flatbuffers
Applications can choose to make the 0th element a special “Root” table and all other elements a “Sub” table
…On Sep 19, 2022, 18:49 -0400, Wouter van Oortmerssen ***@***.***>, wrote:
@CasperN vector of 64bit offsets with many tables afterwards was essentially my original suggestion in the linked "projects" page. But that design has many problems, in particular around avoiding sharing of vtables, and any nested/shared objects, which would generate more 64-bit offsets needed inside the many tables. It also makes it harder to force ordering in the buffer. To that end, I simplified the design to only support nested_flatbuffer which eliminates all those problems in one go.
I also don't understand the relevance of size prefixed.
@AustinSchuh yes, sadly, the buffer as a whole would need a new kind of "64-bit size prefixed".. it may be good to set the high bit, but users were already required to statically differentiate between size prefixed and not, so this new type is no different.
And for the nested buffers there is no problem, since they were always sized anyway, and will now be automatically 64-bit sized.
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
We needed >2gb support and ended up doing something very similar. Our format was: The regular flatbuffer had normal int fields with an index into the offset table. We used the file identifier to detect if this was a plain flatbuffer or our custom format. |
@CasperN yes, but this makes it into a very "custom" format.. my original design tries to keep things as similar to existing FlatBuffers as possible, and gives some flexibility as to where references to these big objects/vectors are stored (doesn't have to be a in a single top level vector). That little flexibility comes at almost no cost. @BrianHarris cool, yes, that is essentially a container format (PAK file? :) around smaller FlatBuffers. It differs from what I am proposing that I want to use regular FlatBuffers for the "directory" part (and also allow larger than 2GB constituent parts). |
I actually disagree whether the cost/flexibility trade off is worth it. I think the only flexibility benefit is the ability to access the big vector via the root table (or one of the tables you can access via the root table). It's also nice that there isn't a proliferation of "root access types" (we'd have I think the cost would be worth it if this feature comes with a roadmap for more 64bit flatbuffers features: e.g. any table can have a 64bit offset to a big vector, nested flatbuffer, string, or a big string. In that context, I can absolutely see why the codegen must be aware of 64bit offsets and we'd want a uniform design for the different use cases. However, we're not going for such a large project. |
I agree with Casper that this can, and probably should, be done as a library/documentation/sample code. At least until "FlatBuffers2" One of the problems with the approach I took is writing out the data requires knowing about all the data blobs ahead of time, so we can pre-allocate the vector at the beginning of the buffer. It also requires us going back and filling in the offsets once we know the final size of the flatbuffer. This wasn't a problem for us because we never streamed this data out while writing it, so seeking on disk wasn't a concern. A nice feature of flatbuffers though is it's written depth-first, so you never have to seek backwards when writing, nor do you have to scan your source data to pre-calculate anything. Moving the vector to the end of the buffer would solve this problem though: You could find the root table by looking at the first offset (or if "number of offsets" is 0 then just looking immediately before it). |
Just chiming in - I'd love to see this implemented. At 2 different companies now I've seen custom hacks added specifically to support >4gb binary blobs contained within flatbuffers streams. One just wrote a metadata header that contained the size of the blob & wrote it directly to the stream after the flatbuffers message, leaving it to the streaming code to do the right thing. The other forked flatbuffers and made it possible to enable 64bit offsets on a per-schema basis, which is wasteful, but functional. |
I think 64 bit should just be the same format compiled with uoffset_t and soffset_t as 64 bit offsets. It relatively trivial to do, offers complete freedom for the end user, and the file format would at most be twice as much as you need to, which probably is of less concern at those scales (although it will affect cache locality and thus performance). Starting to add custom partially large data structures quickly gets complex both for the library and the end user. I have looked into a number of ways of doing it and never really found anything satisfying. It is similar to the jump from 8/16 bit memory banks, then x86 segmented 16 bit memory then 32-bit memory, then 64 bit memory. We never had segmented 32 bit memory but jumped straight to 64 bits (albeit CPU's don't actually address the full 64-bit range on the internal bus). Should we add an option for 64 bit, there would be need to be a format identifier bit and I would not expect the same generated code to work with both 32- and 64-bit formats. In fact, flatcc has prepared namespaces to generate code for different formats. I also suggest looking into StreamBuffers at the same time so at least readers are guaranteed to be able to read front to back generated buffers. This is a bigger change, but more important for larger buffers. It means you can write and transport buffers without keeping all in memory. https://github.com/dvidelabs/flatcc/blob/master/doc/binary-format.md#streambuffers |
Also, I can add that I do work with a large format read only database in FlatBuffers where the index is generated in a separate FlatBuffer linking to a number of on-disk buffers each limited to 2GB and mapped into memory as needed. This works fine, and I'm not sure a single 64-bit buffer would easily replace it due to file size, cache, memory and transport concerns, but it also isn't something you build in one day. |
I don't think we've invested into testing this very thoroughly, unfortunately. As @aardappel noted, "Given the amount of language implementations that we have, and the amount of code locations that would need touching to become 32/64 agnostic this would actually be a huuuge amount of work, and unrealistic to expect to be able to implement"
If we're writing logs to disk in a flatbuffer format, it'd be nice if we wrote things front-to-back so we can end the file with the vector of offsets and metadata. This probably can't be integrated in the normal flatbuffer builder though, and probably would be bad for streaming settings.
@aardappel I just thought about this one. How do you expect having a large flatbuffer vector to help with this? Presumably one of the big vector's elements would be a >2GB blob of external-to-flatbuffers data? Also, I just realized in @aardappel's original proposal, there may be more than one 64bit vector, so long as all of them are at the end. That is indeed different from the solution I and @BrianHarris were suggesting which only supports one large vector. I think we need more consensus on canonical use cases before we can shrink the design space... |
Yup, totally see that.. but a difficult problem to solve and likely independent of 64-bit support.
One 64-bit vector would be such a blob (as opposed to an element), see
Yes, I think that is crucial for some level of flexibility in what kind of things can be implemented with this. |
I have a suggestion for a more aggressive, but IMHO also more valuable change. Replace offsets from being i32 to a special type of LEB. This special type of LEB encodes the size of the number in the first two bits of the first/leading byte. Where:
The decoding logic looks something like following:
This LEB type let us work with padding and memory alignment, as the values are always 1, 2, 4, or 8 bytes long. The length of buffers and strings could also be encoded with such LEB. Same for vector of scalar values. For vector of offsets, it would be better to use the same pattern, but shift it by another 2 bits. First two bits encode the byte size of the vector values, next two bits encode the byte size of the vector length. The offsets in the vector I stored as u8/u16/u32/u64, based on the max value in the vector. A flag in the schema could identify if the new LEB values are used throughout the buffer and not only in some special places. This change will reduce buffer size for small buffer and will make almost infinite size buffers (62 bit pointers length and 60 bit vector size) possible. |
Not saying yes or no, but if you want to do LEB, you might want to follow QUIC |
Ah right, QUIC was my initial inspiration. I think in case of FlatBuffers where we use little endian representation explicitly, my suggestion is a bit more ergonomic. As we consume the bytes left to right, the byte size encoding should be in the first (less significant) byte. And encoding the byte size in the left side of the bit pattern of the first byte, will make the reading part of the value more cumbersome. Please let me know if I am mistaken. |
I'm a big fan of variable length encodings in general, but the problem is that pretty much all such formats, no matter how clever, as still significantly slower than a simple 32/64-bit read, to the point that it would not be suitable for FlatBuffers. It be great for a FlatBuffers-like format that was optimized for size but still random access, but not for all offsets in FlatBuffers. FlexBuffers of course has them, and their I tried to mitigate some of the inefficiency of decoding the payload size in every value by only doing it once for every vector, but even there the problem is that there is just no fast CPU instructions for reading a variable length integer with a dynamic size, beyond Even the slow conditional variable size offset decoding is appropriate for FlexBuffers since it is a slower to access format anyway (field access goes through a binary search!), but not so much for FlatBuffers. On top of that of course is that the motivation for this issue (unlike the "FlatBuffers2" thread) is to extend the existing format in a backwards compatible way, which this wouldn't do. |
Actually the reading of a variable length offset can be quite efficient as we have a simplified problem on our hands. The runtime value can be UInt64, the buffer is aligned and padded, so decoding of the value can look something like this:
I also think that it would be possible to introduce the feature in a backwards compatible way. Forward compatibility is a different story though. |
I have tested variable length encodings in various contexts, including a variant of robin hood hashing where an offset needed to be stored. In general the performance is rather poor if it is in a critical loop. |
@mikkelfj most variable length encodings rely on loops and conditions for decoding. My proposal is branch less. It is still definitely slower than just direct four byte read, but as performance penalties go, it should be not too bad. Having smaller length and offset values will produce smaller buffer and increase data locality, which in term might speed up overall performance. That sad, it is all theoretical until someone actually implements and profiles it. |
All I say is test it. I was surprised. I scrapped an entire optimized algorithm design and implementation because it wouldn't fly. It goes from sub-ns to 10 ns or so depending. I another test I found that nested branching is much faster than lookup if you have a choice. |
Totally agree regarding testing. And I am also not so sure regarding lookup performance. That sad, the mask can also be computed purely arithmetically:
So one can chose between branching, lookup and arithmetics. |
Note that I also have concerns regarding compatibility and implementation effort, but I'm focusing on the technical / performance aspect in the above, given that this is what is primarily differentiating flatbuffers from msgpack and protocol buffers. |
There is also a technical concern regarding unaligned access, and reading "too much ahead" which could be necessary for performance. In other contexts, when parsing, I often add some zero bytes at the end of the buffer for reading ahead, but this isn't really practical here. In terms of encoding, I followed and participated in QUIC v1 design and the change to using a single uniform VLE representation greatly simplified many aspects compared to various bit flags in many header fields that were in the early design. But here you also have to carefully deal with buffer overrun and unaligned access. |
Conditional assignment and arithmetic would usually beat any alternative unless the overhead is excessive. Sequential dependency chains like a = 1 | x; b = a >> 1 | 1; c = b >> 1 | 1; ... (just random pointless logic) are typically faster than alternatives. This is used in integer log2 algorithms. C does not have (branchless) conditional assignment, but a = x ? 1 : a; will typically optimize to a cmov like instruction. |
I would like to prioritize this to be the top feature to work on for 2023. I think the ideal end state would be what @mikkelfj suggests: However, that being said, @aardappel OP design seems straightforward and does avoid many complications with the nested_flatbuffers. Though this approach does involve collusion of the end user to be explicit in defining their schema for larger buffer support, as well as serializing the data in the correct order. I do like the option that non-supported languages can simply skip these 64-bit aware fields and still provide some level of compatibility. Lastly, we have @CasperN and @BrianHarris idea of implementing this as a library, with some sort of meta-format file that manages multiple independent flatbuffers. This might have the least amount of changes necessary, but also in the least flexible and most rigid of the three proposals. Things to think about that I am just laying out.
|
I think we could use this as an escape, but not for deciding the final format. I think we need an extra field for that. EDIT: the optional size prefix is also a point of confusion. This leads to:
I think this shouldn't be a big problem, but large vtables could be a problem, also for CPU cache. However, if we have a format identifier, we could sneak in support for this if we really need to. I.e. I think this is an orthogonal problem we do not need to address at the same time.
I have a long running idea of a) extending flatbuffers with Mix-In inheritance. There might be some optimizations that could use that field. b) Proper NULL values could be encoded with an optional bitmap that could be encoded there. In general such options (I know that proposal isn't popular, but it is an example). In general we could have flags to encode certain encoding extensions. It could also be used for dynamically extending the representation as you suggest. In either case I'm not really in favor of conflating this with 64-bit global format - both due to the complexity of it, and for performance reasons in the common case. But I do like the idea of reserving this field for flags of some sort. Again, this could be enabled via my proposed format identifier such the vtable field is only special in special FB formats. Finally, it is worth noting that the Arrow format also exists. It uses FB for metadata but a Google Drill like mix of row/column data that is better suitable for more homogenous (but potentially sparse) data in large volumes. |
@dbaileychess I think first thing to decide is how much engineering time you and any collaborators would have for this. If that is a lot, then there are options. I just think routing support for possible 64-bit offsets thru all APIs of all languages is a tremendous engineering exercise that I personally wouldn't want to bet on. My suggested approach marries low engineering effort with enough escape hatches for people to use the 64-bit space. Escape hatches are ugly, but remember this is a niche requirement, 99.9% of FB users really don't need more than 32-bits.
|
I would tend to agree, but I think could be solved by just detecting if you have a supported format or not. Branching after opening the file / buffer would not be good in general. |
Yeah, i haven't audited the path to see if it is ever used. Was just a suggestion to throw in the wind :)
Agreed, I don't think it is important to expand the size of vtables, I just wanted to bring it up if we had something that could tackle dynamic switching of offset sizes, as it could possible be applied to vtables as well.
Agreed.
Yep, I think for the cost-benefit tradeoff, your or the library approach is the best compared to the "native" 64-bit support. |
Are there any progress on implementing 64-bit size in the 2023 or is this idea still in design phase? |
This is actually in progress: https://github.com/google/flatbuffers/tree/flatbuffers-64 |
Nice I can only hope it'll make master soon as it would be great utility for saving bigger files 🏅 |
@machineko details of your usecase? |
Deep learning models serialization and transfer around the distributed system (sometimes with weights >20GB which right now is using json and <2GB using flatbuffers) |
Just wanted to say big thank you to all who are working on supporting 64 bit flatbuffers! I was caught up in different projects and am now returning to the project from #7536 (hopefully I can make it public in the next 2-3 weeks!), where I already thought of design changes for my library to partition the flatbuffers into < 2 GB chunks. However, I will try out https://github.com/google/flatbuffers/tree/flatbuffers-64 first for sure, since partitioning would complicate things for future users of my library a lot. I wanted to expose as little as possible from flatbuffers, since my library will probably be used by people with no knowledge on serialization and "low-level things". To give you a rough hint, the most popular file format to describe programs in my area is this one here: https://homes.esat.kuleuven.be/~nsmart/MPC/. I definitely want to change this :D |
I would like to open a invitation for people to start trying out the 64-bit work in the https://github.com/google/flatbuffers/tree/flatbuffers-64 feature branch. There is still stuff for me to do in it, but at this time I feel like most of the features are implemented and tested to a degree. I've only tested with clang and gcc on linux, I believe the mac and windows errors are just about narrowing of types that I will address. I don't have docs/tutorial yet, but I hope the tests and schema are straightforward to get the gist.
|
Everything is Green on that branch for now, and I'll try my best to keep it that way until we merge it in. |
@dbaileychess, thanks so much for this work! I find myself in need of 64-bit offsets, but I'm running into the current limitation of vectors of tables not supporting the I was wondering if support for faraway vectors of tables is being considered for the future, or whether there is a different option currently that would achieve something similar? |
I would have to double check, but I thought I put everything in and didn't have any abandon work. |
Looking at the commit again, I think I misunderstood: it adds support for 64-bit vectors of structs, but it seems 64-bit vectors of tables were never supported. I believe I've been able to update my schemas to support larger buffers, taking advantage of I basically went from something like this:
To this:
I still can't have a massive number of |
So this is why I argued it would be better to just have a pure 64-bit version of FlatBuffers. |
"2GB should be enough for anybody!" I said some 9 years ago when first coming up with FlatBuffers. Of course, even back then OSes were already 64-bit and machines had memory for bigger buffers than that, but because of the nature of FlatBuffers (largely immutable data written linearly) it seemed that if people would store >2GB of FlatBuffer data, that would be most efficient in multiple files (packets / buffers / ...) anyway. And of course 32-bit offset made the format a whole lot more compact.
Turns out, people nowadays have actual legit use-cases for storing more than 2GB in a single consecutive FlatBuffer:
Now, what would seem simple is to simply fork the entirety of FlatBuffers and have new files where every offset is now 64-bit. Given the amount of language implementations that we have, and the amount of code locations that would need touching to become 32/64 agnostic this would actually be a huuuge amount of work, and unrealistic to expect to be able to implement with volunteer maintainers. Also, a fork is not desirable, for many use case a buffer that may contain 64-bit sized items but is still largely compatible with the 32-bit format is much nicer.
So, here's a possible design that aims to enable the above use cases while causing the absolute least amount of churn for FlatBuffers implementations, and making support as easy as possible. This may come at some cost of generality, but since 64-bit buffers are still a bit niche I think this is the right tradeoff.
The idea is to only have 1 new "type variant": 64-bit sized scalar vectors.
Such a vector is designated by using the
vector64
attribute that may only be used on table fields. That table field will be implemented using a 64-bit offset to a vector with a 64-bit size field.The entire rest of FlatBuffers will stay 32-bit as before.
In terms of serialization, think of any buffer having a head and a tail. The head is the traditional FlatBuffer, the tail (what you serialize first) may ONLY consist of 64-bit vectors, thru
FlatBufferBuilder::CreateVector64
or similar. The builder can easily keep track that the tail is only such vectors (and assert if not), and that all the rest of the data (the head) still fits in 2GB (so all traditional offsets work). The offsets created in the tail may only be stored invector64
fields in the head.Now you may wonder how this helps with the first use case, where each record may have a variety of data in it (nested tables/strings..). The idea is that these will be stored as a
nested_flatbuffer
. This solves several problems, first it means that this record is forced to be self-contained, meaning it doesn't share strings or anything with the main buffer, which would now also need to become 64-bit, complicating the design significantly. It also allows us to restrict the supported 64-bit types to just scalar vectors.As an example:
So here,
my_records
indicates the first use case, and needs a table to wrap the vector. That doesn't seem ideal, but since we're typically referring to large vectors, this overhead should be minimal.my_gigantic_buffer
is the 2nd use case, where we simply want to store a binary blob >2GB.To serialize, you'd first create vectors for each of these
[ubyte]
, followed by a normal buffer storing all the references to it. Fields likemy_normal_vector
would need their vector to be serialized after the 64-bit vectors.In terms of forwards/backwards compatibility, we'd modify the schema parser such that initially occurrences of
vector64
errors for any language backend that doesn't have support for it, much like we've done for new union related features in the past. Since usingvector64
means adding a new field, we can thus never have the situation where an old implementation accidentally tries to read a 64-bit vector offset as 32-bit.Then, language can incrementally add support for these new 64-bit vectors, which should be relatively straightforward, e.g. for C++ this
CreateVector64
function does most of the work of tracking the use of these vectors, would return a new offset type incompatible with 32-bit offsets, and serialization functions for these new fields would only accept that offset type. Similarly, when reading, we will now have aVector64
type separate fromVector
for access.A language that can't spend the time for a full implementation but wants to be compatible with schemas that use
vector64
could simply add a quick check for thevector64
attribute and then omit accessors for that field, and would thus generate code that can still access the rest of that buffer (or simply buffers that don't happens to actually store any 64-bit vectors).Related:
WDYT?
@rw @mikkelfj @vglavnyy @mzaks @mustiikhalil @dnfield @dbaileychess @lu-wang-g @stewartmiles @alexames @paulovap @AustinSchuh @maxburke @svenk177 @jean-airoldie @krojew @iceb0y @evanw @KageKirin @llchan @schoetbi @evolutional
The text was updated successfully, but these errors were encountered: