Exposing flatbuffer compilation internals #6428
Comments
fwiw, I'm totally biased towards M3. protobuf and capnproto do something like that: there's a CodeGenerator(Request|Response) exchange between the compiler and generator. |
A combination of M3 and M2 might have a few extra benefits in addition to
what you outline.
* Make flatbuffers internals easy to understand
* Modularization - parser, buffer manipulation, template based codegen
* Integrate with projects like pyserde
…On Fri, Jan 29, 2021, 7:39 AM Casper ***@***.***> wrote:
fwiw, I'm totally biased towards M3. protobuf
<https://developers.google.com/protocol-buffers/docs/reference/other> and
capnproto <https://capnproto.org/otherlang.html> do something like that:
There's a CodeGenerator(Request|Response) exchange between the compiler
and generator.
|
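A minimal sketch of the request/response exchange mentioned above, under stated assumptions: in protoc the two messages are serialized protobufs passed over the plugin's stdin/stdout, whereas here JSON and plain function calls stand in for that transport. The function names and the dictionary layout are hypothetical and are not part of flatc or protoc.

```python
import json


def compiler_build_request(schemas, parameter=""):
    """Compiler side: bundle already-resolved schemas into one serialized request."""
    request = {
        "files_to_generate": sorted(schemas),  # which schema files need output
        "parameter": parameter,                # free-form generator options
        "schemas": schemas,                    # hypothetical, simplified IR per file
    }
    return json.dumps(request).encode("utf-8")


def generator_handle(request_bytes):
    """Generator side: turn a serialized request into a serialized response."""
    request = json.loads(request_bytes.decode("utf-8"))
    files = []
    for name in request["files_to_generate"]:
        tables = request["schemas"][name]["tables"]
        body = "\n".join(f"class {table}: pass" for table in tables)
        files.append({"name": name.replace(".fbs", "_generated.py"), "content": body})
    return json.dumps({"files": files}).encode("utf-8")


# The compiler never sees generator internals and vice versa; only bytes cross over.
request = compiler_build_request({"monster.fbs": {"tables": ["Monster", "Weapon"]}})
response = json.loads(generator_handle(request))
```

The point of the design is that the generator can live in another process, another repository, or another language, as long as both sides agree on the two message types.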
Exposing intermediate representation looks like a good idea - it's essentially what modern compilers do. |
I'm leaning towards M3 from a flatcc perspective. I believe M2 is useful, but it will not happen for flatcc for portability reasons: scripting engines are not as portable as native code, unless the scripting engine runs in an external process on top of M2. That would also allow easier interop between flatc and flatcc. Even with out-of-process scripting, flatcc will maintain an internal code generator for C for portability, but it might be rewritten to rely on M3. There is, however, a chicken and egg problem in accessing the bfbs interface, so I am not fully convinced here even if it is attractive. To clarify: for flatcc and any language other than C, M3 is certainly of interest, and so is out-of-process M2, via M3. |
I mean, that's only true if you want to bootstrap the code generator for language X in that language, and I don't think that's such an important use case that we'd design around it. Even so, we can make the schema JSON-compatible and optionally output the IR in that form. |
For flattools/python case I'm not sure if a json representation is super
useful. Parsing the source is easier (there is a ply-python parser that
works well).
But a side effect of such a refactoring is that a standalone serde
independent of the parser is generated and that sounds interesting.
…On Sat, Jan 30, 2021, 3:05 PM Casper ***@***.***> wrote:
Except there is a chicken and egg problem for accessing the bfbs
interface, so I am not fully convinced here even if attractive.
I mean, that's only true if you want to bootstrap the code-generator for
language X in that language, and i don't think that's such an important use
case that we'd design around it. Even still, we can make the schema
json-compatible and optionally output the IR in that form.
|
Parsing the source is not easier because parsing is only 10% of what a Flatbuffer compiler does before generating code. |
Sure - there is more work to do beyond pure parsing. What I was getting at is the somewhat sorry world of json schema and validating that the input json is well formed. I much prefer parsing a fbs file vs parsing json and then validating that it conforms to some json schema. |
I think this thread is a little off topic. One can assume that flatc would only output valid bfbs/json/$favorite_serde_format, and that the resulting schema has been type checked, name resolved, and so on, fits a predefined schema, and is good for code generation in all respects.
How would this work? It seems like pyserde defines data types in python which may conflict with the flatbuffer schema. Which would be the source of truth? |
On Mon, Feb 1, 2021 at 10:32 AM Casper ***@***.***> wrote:
- Integrate with projects like pyserde
How would this work? It seems like pyserde defines data types in python
which may conflict with the flatbuffer schema. Which would be the source of
truth?
fbs file will be the source of truth. flatc.py can generate a dataclass
from the fbs file, which could be decorated with decorators provided by
pyserde.
|
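A rough sketch of the "fbs file is the source of truth, generate a dataclass from it" idea, under stated assumptions: the `monster` dictionary is a hypothetical stand-in for whatever a parser or IR would hand over, and `PY_TYPES` and `emit_dataclass` are illustrative names, not flattools or pyserde APIs. Adding pyserde decorators would be one more emitted line and is left out here.

```python
from dataclasses import fields, is_dataclass

# Hypothetical mapping from a few FlatBuffers scalar names to Python types.
PY_TYPES = {"int": "int", "short": "int", "float": "float", "bool": "bool", "string": "str"}


def emit_dataclass(table):
    """Return Python source text for one table description."""
    lines = ["@dataclass", f"class {table['name']}:"]
    for field in table["fields"]:
        lines.append(f"    {field['name']}: {PY_TYPES[field['type']]}")
    return "\n".join(lines) + "\n"


# Stand-in for a parsed 'table Monster { hp:short; name:string; }'.
monster = {
    "name": "Monster",
    "fields": [{"name": "hp", "type": "short"}, {"name": "name", "type": "string"}],
}
source = "from dataclasses import dataclass\n\n" + emit_dataclass(monster)

# Execute the generated source to show that it is valid Python.
namespace = {"__name__": "generated_monster"}
exec(source, namespace)
Monster = namespace["Monster"]
```

Because the Python types are derived from the schema, a conflict between the two definitions cannot arise: the dataclass is an output, never an input.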
I personally would prefer M2, but only if we committed to cleaning up / converting all the C++ generators we have into said scripting language. M1 is kinda useless, since it still restricts the languages a generator can be written in a great deal. It also requires we deal with cross-platform dynamic linking, and making sure there is a dll/so for each platform.. not much of an improvement over just static linking it into M3 is definitely useful. One thing I wouldn't like is if we ended up with An extension of M3/M3 is if we took care of invoking scripts from |
I imagine you pipe bfbs or json output from flatc into a script, and possibly add a wrapper script to simplify things. |
One of the things that would actually be worth it is allowing flatc to integrate with other already existing generators (gRPC, as an example). This is one of the points I talked about with the guys over at swift-grpc, since this would allow us to keep our code base consistent and not break each time there is a new version of gRPC,
which is what you suggested in M3, I believe. |
I think you mean M3 will allow the swift-grpc codebase to own the flatbuffers-swift-grpc generator, which will be invoked by flatc. Is that correct? |
Yeah, basically. I am not sure about the intricate details, but I would assume that's how it's done in protobuf. |
I think that use case, or more generally "invoking pre-existing code generators", is a good argument in favor of M3 over M2. The latter restricts tooling to be written in the scripting language which probably prohibits this kind of integration. (Though, I guess M2 and M3 aren't totally mutually exclusive, we can have an IR and also use it with an integrated scripting language.) |
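Note that we are a cross-platform project. Maintaining and supplying, say, bash/cmd/powershell scripts is a pain, or requiring Windows users to have Python installed just to run flatc. Hence why making flatc do it directly has some value. But yes, I'd prefer to not be in the business of scripting commands at all. |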
I agree, let's not get into the business of cross platform invocations. I think we should make the decision to prioritize M3 over M2, any objections? I think it's a good idea because M3 is easier than M2, doesn't actually prevent M2, and is needed for U7. If we go with M3, I can see this as a viable path:
And the next step for this conversation is to identify how
|
requiring Windows users to have Python installed just to run flatc
Agree that this is a significant pain point for any scripting language.
I've made some progress writing a transpiler that generates rust from a
small subset of python. That way we could ship self contained binaries.
The transpiler still has a long way to go, but something to think about
…On Fri, Feb 5, 2021, 1:10 PM Wouter van Oortmerssen < ***@***.***> wrote:
And possibly add a wrapper script to simplify things.
Note that we are a cross-platform project. Maintaining and supplying, say,
bash/cmd/powershell scripts is a pain, or requiring Windows users to have
Python installed just to run flatc. Hence why making flatc do it directly
has some value. But yes, I'd prefer to not be in the business of scripting
commands at all.
|
@CasperN Making code generators separate binaries? That said, this is the chicken and egg problem I mentioned before: if the code generator uses bfbs, it must necessarily be able to read flatbuffers. Parsing JSON is clumsy. It is not a problem for code generators targeting other languages as long as there is some other generator to create the bfbs interface. Maybe it is just necessary to bite the bullet and hand generate a bfbs interface for supported code generator languages. As to cross platform scripting: it is not difficult to write a small wrapper script as bat and bash files because they don't really need to be maintained and argument parsing is done by the shells. Calling natively is a lot of work in the general case but maybe a simple system call in C is enough. There are also binary path issues, but that is true regardless of method. |
Yes, let's move forward with M3. Like I said, I'd like there to be an enum (set by code or command-line flag) that controls the richness of bfbs information, where the first value is "just enough to make reflection work", then added attributes, then added comments, all the way to "give me everything". We then document clearly which fields are set/non-default at what level. Converting all current generators to work based on bfbs generated code is going to generate a lot of churn, and to some extent is going to create a chicken and egg situation since we require the C++ generator to be compiled to create Would be interesting if we could at some point relegate |
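One way to picture the proposed richness enum, as a sketch under stated assumptions: the level names, the `filter_field` helper, and the dictionary-shaped field are all hypothetical, and the real bfbs output is a FlatBuffer defined by reflection.fbs rather than a Python dict.

```python
from enum import IntEnum


class SchemaDetail(IntEnum):
    """Hypothetical richness levels, ordered from least to most information."""
    REFLECTION = 0  # just enough to make reflection work
    ATTRIBUTES = 1  # plus attributes
    COMMENTS = 2    # plus documentation comments
    EVERYTHING = 3  # give me everything


def filter_field(field, level):
    """Return a copy of `field` that keeps only what `level` asks for."""
    if level >= SchemaDetail.EVERYTHING:
        return dict(field)
    kept = {"name": field["name"], "type": field["type"]}
    if level >= SchemaDetail.ATTRIBUTES:
        kept["attributes"] = dict(field.get("attributes", {}))
    if level >= SchemaDetail.COMMENTS:
        kept["documentation"] = list(field.get("documentation", []))
    return kept


hp = {
    "name": "hp",
    "type": "short",
    "attributes": {"deprecated": ""},
    "documentation": ["Hit points."],
    "declaration_file": "monster.fbs",
}
minimal = filter_field(hp, SchemaDetail.REFLECTION)
with_attributes = filter_field(hp, SchemaDetail.ATTRIBUTES)
full = filter_field(hp, SchemaDetail.EVERYTHING)
```

Because the levels are ordered, the filtering stays in one place on the writer side; a reader only needs to know which fields may be left at their defaults at a given level.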
@aardappel What is the benefit of having enum levels? It causes a lot of switches in the generating code, and the reader can ignore surplus information, and size hardly matters. Maybe the benefit is in how much of the reader interface to implement? Or maybe it can simplify life for some limited schema parsers? |
@CasperN since I don't have a lot of time for moving on M3 atm., it would be helpful if you could help out with at least trying to understand the needs of flatcc wrt. a new bfbs format. Some notable examples:
On a related note: flatcc does not support mutual type recursion across definitions in separate fbs files, although it does within the same file. I know this is important for flatc as it is used in some Google internal systems. The reason it is not supported in flatcc is partially just that I didn't spend time on it, and partially because of complications with name shadowing and file uniqueness. However, I think it should be possible to support in some limited cases: a file can include an already included file and have access to those symbols even if the file is not physically included again; only the name shadowing will be processed differently for each file scope. The last part, I believe, is already handled by flatcc. It has a visibility map, so any symbol lookup is checked for both existence and visibility. |
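The visibility-map idea can be sketched roughly as follows. This is only an illustration and not flatcc's implementation: the `SymbolTable` class, its method names, and the choice to make visibility transitive through includes are all assumptions.

```python
class SymbolTable:
    """Symbols live in one global table; each file has a set of files it can see."""

    def __init__(self):
        self._definitions = {}  # symbol name -> file that defines it
        self._visible = {}      # file -> set of files whose symbols it may use

    def add_file(self, path, includes=()):
        # Assumption: a file sees itself plus everything its includes can see.
        visible = {path}
        for included in includes:
            visible |= self._visible.get(included, {included})
        self._visible[path] = visible

    def define(self, name, path):
        self._definitions[name] = path

    def lookup(self, name, from_file):
        """Check existence first, then visibility from the requesting file."""
        owner = self._definitions.get(name)
        if owner is None:
            return "undefined"
        if owner not in self._visible.get(from_file, set()):
            return "not visible"
        return owner


table = SymbolTable()
table.add_file("weapon.fbs")
table.add_file("monster.fbs", includes=["weapon.fbs"])
table.add_file("unrelated.fbs")
table.define("Weapon", "weapon.fbs")
```

Here `monster.fbs` resolves `Weapon` because it includes `weapon.fbs`, while `unrelated.fbs` gets a visibility error even though the symbol exists, without any file being parsed twice.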
As to chicken and egg: at least for building bfbs buffers, flatcc has a low level typeless builder interface on top of which it is relatively easy to build buffers. All generated builder code performs calls into this, and so does the generated JSON parser: https://github.com/dvidelabs/flatcc/blob/master/include/flatcc/flatcc_builder.h |
That is debatable.. for use cases where people want to use run-time reflection, having 5x the memory usage (and cache misses) for lots of metadata they don't use is not great. |
That said, I have been de-emphasizing the reflection use case of this data. The API for it is frankly horribly clumsy, and I'd prefer it if no-one used it. The "mini reflection" functionality in C++ is frankly nicer for that use case. But there are already existing users of bfbs based reflection, so we continue to offer it. |
Maybe we should just keep bfbs more or less as is, and create a separate schema for code generation. bfbs has issues with json though. |
@mikkelfj if you read my replies above, I am explicitly arguing that a separate schema would be undesirable. |
Exposing a stable internal structure would be great for many tools. In my previous company (Snap), we had a flatbuffers-based code generator that depended on flatc internals directly. That turned out to be a big hurdle when upgrading to new flatbuffers versions, because flatbuffers doesn't expose a stable C++ interface. My current open-source project depends on the flatbuffers internal representation, and I have to maintain a small C++ shim to expose the structures I need through JSON so that I can write the rest of the codegen in Swift: https://github.com/liuliu/dflat/blob/unstable/src/parser/dflats.cpp Exposing a backward-compatible representation like M3 suggests would help a lot of tools out there that the flatbuffers core doesn't know about to upgrade more smoothly. One other thing I would suggest is to add the ability to retain "unknown" attributes throughout the representation. These can be used in various extensions to carry extension-specific meanings (in my case, "primary", "index" or "unique", used to specify database constraints). This currently requires working inside the |
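To illustrate why retaining "unknown" attributes matters, here is a small sketch of an external generator that gives them meaning. Everything in it is hypothetical: the `person` dictionary stands in for an IR that preserved the attributes, `sql_type` is an invented convenience key, and the DDL mapping is just one possible interpretation, not what dflat does.

```python
import sqlite3


def sqlite_ddl(table):
    """Map the custom attributes 'primary', 'unique' and 'index' to SQLite DDL."""
    columns, indexes = [], []
    for field in table["fields"]:
        attributes = field.get("attributes", {})
        column = f"{field['name']} {field['sql_type']}"
        if "primary" in attributes:
            column += " PRIMARY KEY"
        if "unique" in attributes:
            column += " UNIQUE"
        columns.append(column)
        if "index" in attributes:
            indexes.append(
                f"CREATE INDEX idx_{table['name']}_{field['name']} "
                f"ON {table['name']} ({field['name']});"
            )
    return [f"CREATE TABLE {table['name']} ({', '.join(columns)});"] + indexes


person = {
    "name": "person",
    "fields": [
        {"name": "id", "sql_type": "INTEGER", "attributes": {"primary": ""}},
        {"name": "email", "sql_type": "TEXT", "attributes": {"unique": ""}},
        {"name": "age", "sql_type": "INTEGER", "attributes": {"index": ""}},
    ],
}
statements = sqlite_ddl(person)

# Run the statements against an in-memory database to confirm they are valid.
connection = sqlite3.connect(":memory:")
for statement in statements:
    connection.execute(statement)
index_names = [
    row[0]
    for row in connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND name LIKE 'idx_%'"
    )
]
```

The compiler itself never has to understand these attributes; it only has to carry them through to the representation unchanged.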
Did you use flatbuffers as an IDL or did you actually generate swift code
to parse flatbuffer encoded bytes?
For the former case, there is some support here:
https://github.com/adsharma/flattools/tree/master/lang/swift
Like Wouter noted, the problem with scripting languages is that they're not
as nice from a deployment point of view as small statically linked binaries
natively packaged for popular platforms.
…On Mon, Feb 15, 2021 at 11:17 AM Liu Liu ***@***.***> wrote:
Exposing stable internal structure would be great for many tools. In my
previous company (Snap), we had a flatbuffers-based code generator that
depends on flatc directly. That turns out to be a big hurdle to upgrade
along with new flatbuffers because flatbuffers doesn't expose a stable C++
interface.
My current open-source project depends on flatbuffers internal
representation, and I have to make a small C++ shim to expose structures I
need through JSON such that I can write the rest of the codegen in Swift:
https://github.com/liuliu/dflat/blob/unstable/src/parser/dflats.cpp
Exposing a backward-compatible representation like M3 suggested would help
a lot of tools out there that the flatbuffers core doesn't know to upgrade
smoother.
|
@adsharma in our use-case, we do both. We use flatbuffers' utilities to encode / decode bytes from database, and also use it as IDL to describe the data (which field is primary key, which field is indexed) and generate some additional accessors / mutators on top of what flatbuffers' Swift binding already provides. That's why so far, our integration all focused on emitting additional information when running |
@liuliu that's the same use case flattools addresses. The idea is to do template driven codegen. Has been used to generate json/yaml/SQL DDL among many output formats. https://github.com/adsharma/flattools/blob/master/templates/fbs_template_yaml.yaml.j2 is open source. The other templates are not. The templates assume a stable in-memory format like what's being discussed in this thread ( |
@adsharma thanks, will give it a deeper look at some point! @CasperN I need to read more about the new reflection API to have an informed opinion. Previously the reflection API just wasn't available (we've been using flatbuffers for 3~4 years now). In my case, we use FlatBuffers IDL as the main IDL, but to add more flavors, we need to have more attributes declared and passed down to codegen. |
Yeah - ditto. The code is from Jan 2017. Not sure if reflection API existed
then.
Secondly, I find a pure python parser easier to enhance and modify vs
calling into an API. So then the question becomes: does the compatibility
arise out of a shared grammar or shared in-memory parsed tree. I find them
more or less equivalent although you could argue that in-memory parsed tree
is better for compatibility across code generators written in multiple
languages.
…On Tue, Feb 16, 2021 at 10:28 AM Liu Liu ***@***.***> wrote:
@adsharma <https://github.com/adsharma> thanks, will give it a deeper
look at some point!
@CasperN <https://github.com/CasperN> I need to read more about the new
reflection API to have an informed opinion. Previously it just the
reflection API not available. For my case, we use FlatBuffers IDL as the
main IDL, but to add more flavors, we need to have more attributes declared
and passed down to codegen.
|
I don't think it's the grammar or parsed tree that's important but rather the type system that ends up being specified. Name resolution, importing, type checking, parsing, etc. are common features that aren't part of the grammar and should be shared. Though, maybe that's what you meant by "parsed tree". |
Ok so I poked around https://github.com/liuliu/dflat/blob/unstable/src/parser/dflats.cpp and it looks like it extracts I hope the same is true for flattools, but based on the discussion at the end of #6014, I suspect there will be subtle semantic differences despite sharing a grammar. |
Difficult question. It seems that mostly only C++ has reflection functionality at all in the form of I do know we (used to) have internal users of I think if languages want to implement reflection, they should feel free to choose which direction they take it. |
Ok here's my working list of "what to do"
I think reflection.fbs is actually in pretty good shape for being our IR; the few changes I can think of do not seem too difficult (though I'm sure more will come up when moving a code generator onto it). If so, then maybe working on documentation, ergonomics, and discovery of the reflection API for these use cases will be most important, i.e. we just needed a "so you want to make a code generator" page. |
Haha, yes, we need that too :) |
Eventually, after we do the last TODO list, I think we should aim for the following architecture (M3++ if you will):
Long term, I think we need to break up idl_parser.cpp, which imo has gotten far too complex. This refactoring effort will get the code generators off of Parser, isolating it, while also improving modularity. There should be a subsequent effort to simplify the Parser class, though maybe isolation behind the reflection API is sufficient. |
I think I could take a stab at converting the Lua generator over to this IR approach. Just so I know what would be involved. I would do something like:
|
@dbaileychess is the new version of the Lua generator still based on the current C++ code? Because then I would say it would be nicer to have a short cut path to pass the |
I was sort of aiming for the middle ground between the two: 1) Have a clean separation of the API, where the new generator just receives a filepath and/or memory, i.e. no flatc internals and 2) Avoid having a separate binary to deal with. |
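A minimal sketch of that middle ground, assuming a hypothetical `BfbsCodeGenerator` base class (not flatc's actual interface): the generator accepts either a file path or an in-memory buffer and never touches parser internals. The only FlatBuffers facts relied on are that reflection.fbs declares the file identifier "BFBS" and that a file identifier occupies bytes 4 to 7 of a buffer; the buffer used below is a fake placeholder, not a real schema.

```python
import os
import tempfile
from abc import ABC, abstractmethod

BFBS_IDENTIFIER = b"BFBS"  # file_identifier declared in reflection.fbs


class BfbsCodeGenerator(ABC):
    """Hypothetical generator API: input is bfbs bytes, output is files."""

    def generate_from_file(self, path):
        with open(path, "rb") as handle:
            return self.generate(handle.read())

    def generate(self, buffer):
        # The 4-byte file identifier follows the 4-byte root table offset.
        if len(buffer) < 8 or buffer[4:8] != BFBS_IDENTIFIER:
            raise ValueError("not a bfbs buffer")
        return self.emit(buffer)

    @abstractmethod
    def emit(self, buffer):
        """Return a dict mapping output file names to file contents."""


class SizeReportGenerator(BfbsCodeGenerator):
    """Toy generator; a real one would read the schema through reflection accessors."""

    def emit(self, buffer):
        return {"schema_report.txt": f"bfbs schema, {len(buffer)} bytes\n"}


# Fake placeholder buffer: root offset, identifier, padding.
fake_schema = b"\x0c\x00\x00\x00" + BFBS_IDENTIFIER + b"\x00" * 8
generator = SizeReportGenerator()
from_memory = generator.generate(fake_schema)

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "monster.bfbs")
    with open(path, "wb") as handle:
        handle.write(fake_schema)
    from_file = generator.generate_from_file(path)
```

With this shape, the same generator can be linked into the compiler and handed memory directly, or later moved into a separate binary that is handed a path, without changing its logic.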
So with the Lua generator converted over to using bfbs, the framework in place, and major gaps filled in the reflection.fbs I think we can proceed on converting more of the generators over. |
This issue is stale because it has been open 6 months with no activity. Please comment or label |
This issue was automatically closed due to no activity for 6 months plus the 14 day notice period. |
The maintainers had some conversations and discussions around a significant refactoring of flatc to expose more internals somehow.
However, we're not sure if there are enough use cases to justify such an effort, or even guide its technical direction. So, I'm now crowd sourcing ideas. So... Where does flatbuffers tooling fall short, and can these be addressed by exposing more information that currently exists in flatc? And also how do people feel about the priority of this effort relative to some other large projects such as implementing the ideas in #5875 or #6053
Some use cases:
flatbuffer -> string methods
Some methods
M1: dynamic linking
M2: scripting languages
reflection.fbs which may be extended or replaced)
Some positive side effects
Some preferences / potential requirements
If we go with M3, we should try to integrate with reflection.fbs rather than starting from scratch
R2: If we go with M2, we should reimplement our code generators in the scripting language.
reflection.fbs; optionally include attributes, comments, etc. This can be configured to "give enough for reflection" and "give enough for compilation / everything"
@dbaileychess @mikkelfj @aardappel @krojew @paulovap @adsharma
(I intend to keep this top comment up to date with discussion below)