-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
What would a "FlatBuffers2" binary format look like? #5875
Comments
Some quick thoughts: Ad 1. We would need to (potentially) deal with compiler/target platform problems when dealing with unaligned access. Is adding padding such a big problem at the moment? Ad 4. This sounds like a compatibility problem while migrating schemas. |
|
random thoughts: Overall, I like many of the suggestions, but there are too many optional and variable parameters in the proposal which would make things slow. There is a lot more branching going on, and code complexity. detractors to Wouters original comment:
my inclusive comments:
I really would like to remove nested buffers - I has caused me so much headache and they haven't been properly implemented elsewhere. We could have a packed buffer format where multiple buffers an be embedded in the same file or memory block and referenced by some identifier without actually storing the buffer inside the other buffer. It could be an 8-byte random identifier. Buffers could also be given a random initial identifier in this format. Just a thought. This means how buffers are stacked or packed is not critical as long as they can be located. |
We also need a proper NULL type for representing database data. |
Somehow I had persuaded myself that 9) doesn't need a format change and that it could be done by just building from the other end of the buffer and allocating space for the full message and vtable when creating the object. It would require a pretty massive API change though. Maybe I'm just hopeful. My understanding of the premise of flatbuffers (and Cap'n'Proto, which we evaluated before picking flatbuffers) is that compression algorithms are wonderful, so don't burden the format with being clever and compact. Protobufs attempt that, and end up requiring a separate serialization/deserialization step. Sharing vtables goes against that. Most of the rest of my feedback is at the API level. Happy to give it if there is interest. |
|
If that's an explicit opt-in, then it seems fine.
I absolutely agree with this. I would extend this a bit further to add a proper optional scalar fields. Adding bool flags if a scalar is present or not, as we need to right now if 0 is a legitimate value, is a quite frustrating workaround, given non-scalars have this built-in. |
We have optional scalar fields today. It just isn't plumbed up to be accessible by default to the user. The offset for the scalar fields in the vtable is 0 if they aren't populated. See Protobuf started out with required being the default and concluded that it was a bad idea. protocolbuffers/protobuf#2497 is a small part of the discussion. |
If something isn't publicly available, it doesn't exist from the user perspective. We should expose such information. Additionally, I have an impression we're reaching the classic dilemma of speed vs size. So the first question we need to answer is - which one do we favor? For me, FB was always about performance, so if we need to keep padding or sacrifice some internal improvements for the sake of being fast, I would personally stick with that. If some improvement doesn't impact performance, then let's consider it. |
Maybe a varint? |
Would it be possible to have most (all?) of these features be specified with flags at the beginning of a payload? I know that that gets into "framing format" territory, but providing a set of feature flags at the beginning of a table (or at the beginning of an entire buffer) could allow a lot of flexibility at little cost. It would probably just be an optional initial table at the beginning, containing metadata. |
My 5 cents.
|
My personal View on the following:
This is an amazing idea, actually. Which made me think if we can actually remove the vtables completely, since all the fields are going to be required, or optional as mentioned above can be presented as a Bool. This would allow us to remove the vtable, and use the Generated table that's already predefined. example: |
Regarding the optional / required table field status mentioned by @maxburke. I think changing current behaviour is a bad idea in regards to backward and forward compatibility. Marking fields as non optional is a "lie" we make for API convenience. There are no guaranties that a field is present, as we have no control over, who created the buffer, given your system is not a closed one. When we define a field as required, we commit to not being able to deprecate this field. We also can't set a new field as required unless we can guarantee that no older clients which don't know this field exists, will send buffers to newer clients which require this field to exist. Speaking of forwards and backwards compatibility, enums is another blind spot. Specifically the way enums are converted to JSON. I think in order to grant proper backwards and forwards compatibility enums need to be always represented as numbers, even though it looks nicer as text in JSON. But that breaks with new cases and also if a case has to be renamed. |
There are two options how it can be solved in current FlatBuffers implementation.
I guess for the new version this kind of feature could be addressed in a more straight forward way, by being able to define a table scalar field as optional in the schema directly. |
Switching to required by default does not break future compatibility any more than the current attribute. On the other hand we gain possible optimizations and a more intuitive schemas. Today some fields are truly optional and require annotating with an attribute to make required, while other cannot be annotated as such and it's up to the users to guess if they're optional or not. |
There are two options how one can design default behaviour:
Setting optional per default is going with safety first approach. You can add required keyword to the schema any time if you decide that it is good for you. Switching from required to optional however is potentially dangerous in the long run.
Optimizations from technical (performance) perspective? I don't see it, but please I am open for suggestions?
When in doubt always check for |
This will be a bit anecdotal, but every time I have been introducing FB to new people, optional by default was quite surprising to them.
Required things can always be stored inline. This is quite good for performance.
Scalar types don't have a notion of null. That's another thing I would love to see - take a look at Option in Rust. Also, verifying a buffer is a separate subject, so let's not mix it in. Side note, is anyone working on a verifier for Rust, like in cfb? |
Ok lets go with anecdotal 😀. In 2015 I worked on a city builder game where we stored user progress in FlatBuffers. The game was developed in Unity3D, but the town map itself was an isometric view. So a building position could be identified with a What I am trying to visualise with this colourful example, is that with required being default the evolution of a schema becomes a mines field, which only people with experience will be able to avoid. To be honest with you, when we switched from
Also regarding this:
I am not sure why removing required altogether is not the goal? ProtoBuffers version 3 did it and I personally would vote for doing it in FlatBuffers too. 😉 Anyways I think, I wrote enough. The decision lays with @aardappel. |
@mzaks I think there's a misunderstanding somewhere, because your example is quite wrong in this case. First of all, adding a new field to a table (required or not) will not break existing code (assuming the usual schema evolution guidelines). Second - there's domain data known to be required and that gets marked as required, which is perfectly ok. Arguing that everything should be optional is over-defensive and will only lead to frustrations of not having the ability to express the data model properly in messages (with the additional burden of fact-checking every piece of data). Third - saying that someone did something is not a valid technical argument discussion :) That being said, my opinion on optional data for a future protocol, if it ever happens, is:
|
I agree strongly with @rw that using a format specifier as part of the header would definitely be a great way to evolve the format. Lots of great comments on this thread, here's my thoughts to add to the mix:
|
Support for indexed tables where a flatbuffer table is annotated with "key" and "value" annotations. This sounded like a niche use case - but I believe it's an important one and likely to be occupied by something else if not addressed. |
I think 0-terminated strings are not an issue. Removing C-style string getters would solve it. |
It is possible if use
For vtable offsets and types ASN1 or variable-length encoding can be used. It may be a good idea to make the length of fbs-message aligned to 4 or 8 to pre-fetch data without extra checking. auto x = load_uint32(bytes); // it can be uint16 or uint64
auto offset = ((0u - (x & 1u)) & x | (x & 0xFFu) ) >> 1u; // [0x00-0x7F] or [0-0x7fffffff]
bool is_1 = !(x & 1u);
Fast FlexBuffer with writing to pre-allocated memory (without any memory allocations) can be useful as a fast logging core. |
@mikkelfj not sure why you refer to some of these as "slow": like I said, these are all intended to be codegen-ed (or become a template/macro arg) so will all be maximally fast, certainly no slower than any existing feature. None of these feature are intended to have dynamic checks. Not sure what you mean by strong JSON integration. JSON is text and not random access, so serves an entirely different use case than FlexBuffers. I am not sure what it would even mean to use JSON in this case, other than to store a string. A new format could mean a new way to do "drop nested tables" .. what does that mean? |
@AustinSchuh forward building would be a big format change, since now children typically come before parents, and unsigned child offsets would be flipped to always point downwards in memory instead. Retaining the feature that these offsets always point one way and never can form a cycle would be good, I think. A lot of people use FlatBuffers in cases where sparse / random access, or use with |
@maxburke the varints would be very much optional, as they are definitely slower to read. They would be a type, so you'd explicitly write either See above about compression and efficiency. Default required would be problematic for an evolving format. Protobuf came to the opposite conclusion after many years of experience, and FlatBuffers went along with that. |
Yes, and that means that the value is equal to the default, not that the value is not present. So can't be used to test for this purpose. I agree that not being able to differentiate this has been something many users would have wanted. On the other hand being able to access scalars "blindly" without checking for presence, and the storage savings from defaults is not something I would want to miss. |
That would make the distinction dynamic, and costly. Besides, we need to index into this table, which makes varints useless unless all the same size. The idea is that this is a static, codegen feature, associated with a particular type of table.
No, for the same reason. It's extremely important FlatBuffers stays fast, so can't rely on dynamic checks for different encodings. Besides, the idea is to specify them per type, not per buffer. |
I'd rather avoid anything except very core types in th FlatBuffer format, otherwise it never ends. We are just discussing how to add type aliases with attributes (#5597). These can be used to define new types with associated meaning, for example:
The above can only be used in structs. I think it would be nice to allow this in fields as well, and that could happen in a v2 format, but we don't have to define all kinds of specialized types. There are also GPS coordinates, 170 different timestamp formats, and so on. I'm happy with UTF-8 but that is about as advanced as I'd like to get. |
As to priorities, I'll keep this in mind, but it requires a bit more than an afterthought to put forward, and the plate is full atm. Forward writing is a must. I am going to do them one way or the other at some point. One thing I think has not been listed, which I think is important: As to 64-bits. Important, but takes up way too much space unless done very carefully. Maybe there needs to be some flags early in the file to indicate the offset scale. Generated code does not have to support all variants, but should fail on unsupported ones. |
I'm a bit torn on padding. But I think the uses cases where alignment matters are better handled by copying the data, otherwise you just end up copying the entire buffer from some I/O device. If we forego padding, the code becomes MUCH simpler and therefore also faster. When combined with forward writing (StreamBuffers), this allows for efficient streaming rewrites, for example from JSON to FlatBuffers in network sized chunks. Another feature I'm not sure is given much though is the ability to clone parts of a buffer without full metadata knowledge - that is a lot of code generation for clone logic that shouldn't really be necessary. If the vtable has information about in-place vs offset this should be possible. This might counteract the suggestion to remove table sizes (which are otherwise pretty useless except for verification, but verification might be an important argument too? And if we add flags to buffer header, we could indicate if the buffer is a DAG, a Tree, or a general graph (even if we generally disallow these, there are arguments for them when representation is more important than JSON translation and verification, but we do want to know what they are up front). |
Flatbuffers has two main compound types: |
@vglavnyy absolutely agree. Structs must remain padded internally as is with zero filled padding where practically possible (it is hard to guarantee in C but easy to achieve in praxis). Also, force_align as a useful facility here. The big question is if offset to a struct from buffer start must also be aligned. That is more tricky to achieve but C/C++ compilers would expect that. But it only works when the buffer is also aligned. I'm not really sure what to say here. Another thing is that if the buffer ends up having some alignment, I would like to see that in the header of the buffer because alignment so you can check if standard 8 byte alignment is enough via malloc, or you need to do more. This isn't entirely a schema thing since a possible 64-byte cacheline aligned array might never actually be added to the buffer. |
On verified UUIDs, I think a more general solution is to implement a mechanism for custom validation (e.g. plugins to code generation) and release a "common types" schema that uses that mechanism (much like |
Plugins are a great idea. That was the original motivation for flattools.
You should be able to change the generated code by rewriting a template
without having to understand how a large body of C++ code works.
Re: well known types
https://github.com/adsharma/fbs-schemas/blob/main/core.fbs
…On Wed, Jan 6, 2021, 6:43 AM Casper ***@***.***> wrote:
On verified UUIDs, I think a more general solution is to implement a
mechanism for custom validation (e.g. plugins to code generation) and
release a "common types" schema that uses that mechanism (much like
Google.Protobuf.WellKnownTypes).
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#5875 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAFA2A3ZGIL325V3TKX7MB3SYRZHHANCNFSM4MRLM4MA>
.
|
I support 9 RE #6188 something else very powerful would be the ability to construct nested buffers inline in a single meta operation. almost all of my buffer usage is nested, and i shed outer layers and pass around nested objects throughout my application server as i process messages. currently I have to construct each buffer independently, and then copy it into the next one... sometimes I have 3 or 4 layers of this, aka 3 or 4 copies. i've flirted with the idea of writing this myself (specifically expanding the simple single level sequential custom application serialization protocol i use for some of my networking). so i'd definitely be interested in joining a collaboration. |
@victorstewart FlatCC supports creating nested buffers while the parent buffer is being constructed. This ensures correct alignment regardless of the nested buffer content. I hardly ever use nested buffers though. |
I am late to this party (and if this exists then I would love to know!) Supporting Lexicographic ordering on structs for key demarked things would be good. We use flatbuffers for storing things in K/V stores but typically encode the keys external to flatbuffers. My dream would be
|
Even later: 1, 4, 8, 9, 10 (definitely always frame with a length field for streaming, more important than file id IMHO). |
Adding to item 15) in the original list ( Similarly, if you're going to have inline strings (item 8), then having the size field be varint (instead of having to declare 8/16/32) would make a lot of sense. |
This issue is stale because it has been open 6 months with no activity. Please comment or label |
This issue was automatically closed due to no activity for 6 months plus the 14 day notice period. |
For Java, Is it possible to create an array with 64 bit entries as a set of multiple sequential 32 bit arrays. The
function would just shift and mask |
How much space does this take? If trivial, should it be left in to leave the door open for read streaming?
I think it's okay to make an opinionated stand on this. People should use string_view and if they aren't, they can copy the string_view into a null terminated string. |
That sounds incredibly slow, since anything larger (like a string) would have to be lifted out element by element into a secondary buffer before it can even be converted to a string since anything can straddle a buffer boundary.. All code we currently have that can work with the underlying array would stop working or need copies.
This is a single 16-bit quantity in the vtable which is currently still in there in all implementations, and typically set to 0. So yes, it could still be used for some future functionality, maybe. |
FlatBuffer's binary format has been set in stone for 6.5 years now, because we value binary forwards/backwards compatibility highly, and because we have a large investment in 15? or so language implementations / parsers etc. that would not be easy to redo.
So "V2" in the sense of a new format that breaks backwards compatibility may never happen. But there is definitely a list of issues with the existing format that if a new format were to ever happen, would be nice to address. I realized I never made such a list. It would be nice to at least fantasize what such a format could look like :)
Please comment what you would like to see. Note that this list is purely for things that would change the binary encoding, or larger additions to the binary encoding. Anything that can be solved with code / new APIs outside of the binary format does not belong on this list.
string_view
recently, and both have been usingsize_t
arguments for a long time rather than relying onstrlen
. Other languages don't use it. For passing to super-old C APIs that expect 0-termination, either swap the terminating byte temporarily while passing that string, or copy."a"
goes from 12 bytes (2 vtable + 4 offset + 4 size + 1 string + 1 terminator) to 3 bytes (1 vtable + 1 size + 1 string). Of course this very inflexible and special purpose, but gives users more options for compact data storage. Again, like all format variation above, this comes at no runtime cost, just some codegen complexity.varint
type for fields. They could be added to the existing format but make a lot more sense in a system with no alignment.@rw @mikkelfj @vglavnyy @mzaks @mustiikhalil @dnfield @dbaileychess @lu-wang-g @stewartmiles @alexames @paulovap @AustinSchuh @maxburke @svenk177 @jean-airoldie @krojew @iceb0y @evanw @KageKirin @llchan @schoetbi @evolutional
The text was updated successfully, but these errors were encountered: