Exposing flatbuffer compilation internals #6428

Closed
CasperN opened this issue Jan 29, 2021 · 52 comments

@CasperN
Collaborator

CasperN commented Jan 29, 2021

The maintainers have had some conversations and discussions about a significant refactoring of flatc to expose more of its internals.

However, we're not sure there are enough use cases to justify such an effort, or even to guide its technical direction, so I'm crowdsourcing ideas. Where does FlatBuffers tooling fall short, and can those shortfalls be addressed by exposing more of the information that already exists in flatc? Also, how do people feel about the priority of this effort relative to other large projects, such as implementing the ideas in #5875 or #6053?

Some use cases:

  • U1. Implementing new code generators for new languages in those languages
  • U2. Easier front end for projects like flattools and dflats
  • U3. Generated validators: example
  • U4. Generated fieldmasks (like this but not based on strings)
  • U5. Might make an XML to/from FlatBuffers tool easier to implement (Converting XML to Flatbuffers #6033)
  • U6. Generated custom flatbuffer -> string methods
  • U7. Invoke code generators owned by other codebases, e.g. swift-grpc

Some methods

  • M1: dynamic linking
    • This does not expand the set of languages flatc tooling can be developed in, it forces us to deal with cross-platform dynamic linking, and it does not seem to provide enough marginal benefit for the cost
  • M2: scripting languages
    • We will move forward with M3, since M2 does not support U7 and since M3 does not actually prevent M2, should we choose to pursue it in the future.
  • M3: exposing a JSON/bfbs intermediate representation (we currently have reflection.fbs, which may be extended or replaced; see the sketch after this list)
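To make M3 concrete, here is a minimal sketch of what an external C++ tool consuming the existing bfbs form of reflection.fbs looks like today, using the API generated from reflection.fbs (the path schema.bfbs is just a placeholder):

```cpp
// Minimal sketch: load a binary schema produced by e.g. `flatc -b --schema my.fbs`
// and walk it via the API generated from reflection.fbs.
#include <iostream>
#include <string>
#include "flatbuffers/reflection_generated.h"
#include "flatbuffers/util.h"

int main() {
  std::string bfbs;
  if (!flatbuffers::LoadFile("schema.bfbs", /*binary=*/true, &bfbs)) return 1;

  // The root of a .bfbs is a reflection::Schema table.
  const reflection::Schema *schema = reflection::GetSchema(bfbs.data());

  // Walk every table/struct declaration and its fields.
  for (const reflection::Object *obj : *schema->objects()) {
    std::cout << (obj->is_struct() ? "struct " : "table ")
              << obj->name()->str() << "\n";
    for (const reflection::Field *field : *obj->fields()) {
      std::cout << "  " << field->name()->str() << " : base_type="
                << static_cast<int>(field->type()->base_type()) << "\n";
    }
  }
  return 0;
}
```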

Some positive side effects

  • P1: It will make FlatBuffers internals easier to understand (idl_parser has grown pretty organically and is messy right now).

Some preferences / potential requirements

  • R1: If we go with M3, we should try to integrate with reflection.fbs rather than starting from scratch
  • R2: If we go with M2, we should reimplement our code generators in the scripting language.
  • R3: We should be able to specify multiple levels of information stored in reflection.fbs, optionally including attributes, comments, etc. This could be configured to range from "give enough for reflection" to "give enough for compilation / everything"
  • R4: The IR should be JSON-compatible, in case a generator is implemented in a language without existing FlatBuffers support

@dbaileychess @mikkelfj @aardappel @krojew @paulovap @adsharma

(I intend to keep this top comment up to date with discussion below)

@CasperN
Collaborator Author

CasperN commented Jan 29, 2021

FWIW, I'm totally biased towards M3. Protobuf and Cap'n Proto do something like that: there's a CodeGenerator(Request|Response) exchange between the compiler and the generator.
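For reference, protoc plugins receive a serialized CodeGeneratorRequest on stdin and answer with a CodeGeneratorResponse on stdout. A flatc analogue does not exist today; the sketch below only illustrates the shape such an exchange could take, with the .bfbs bytes standing in for the request and generated source on stdout standing in for the response:

```cpp
// Hypothetical sketch of an out-of-process generator in the protoc-plugin style.
// Nothing here is an existing flatc contract; it only illustrates the exchange.
#include <iostream>
#include <iterator>
#include <string>
#include "flatbuffers/reflection_generated.h"

int main() {
  // Read the whole "request" (the binary schema) from stdin.
  std::string bfbs((std::istreambuf_iterator<char>(std::cin)),
                   std::istreambuf_iterator<char>());

  const reflection::Schema *schema = reflection::GetSchema(bfbs.data());

  // The "response" is just generated code written to stdout in this sketch.
  for (const reflection::Object *obj : *schema->objects()) {
    std::cout << "// TODO: emit bindings for " << obj->name()->str() << "\n";
  }
  return 0;
}
```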

@adsharma

adsharma commented Jan 29, 2021 via email

@krojew
Contributor

krojew commented Jan 29, 2021

Exposing an intermediate representation looks like a good idea - it's essentially what modern compilers do.

@mikkelfj
Contributor

mikkelfj commented Jan 30, 2021

I'm leaning towards M3 from a flatcc perspective. I believe M2 is useful, but it will not happen for flatcc for portability reasons - scripting engines are not as portable as native code - except if the scripting engine runs in an external process on top of M2. That would also allow easier interop between flatc and flatcc. Even with out-of-process scripting, flatcc will maintain an internal code generator for C for portability, but it might be rewritten to rely on M3. Except there is a chicken-and-egg problem for accessing the bfbs interface, so I am not fully convinced here, even if it is attractive.

To clarify: for flatcc and any language other than C, M3 is certainly of interest, and so is out-of-process M2, via M3.

@CasperN
Collaborator Author

CasperN commented Jan 30, 2021

Except there is a chicken and egg problem for accessing the bfbs interface, so I am not fully convinced here even if attractive.

I mean, that's only true if you want to bootstrap the code generator for language X in that language, and I don't think that's such an important use case that we'd design around it. Even so, we can make the schema JSON-compatible and optionally output the IR in that form.

@adsharma

adsharma commented Jan 31, 2021 via email

@mikkelfj
Contributor

mikkelfj commented Feb 1, 2021

Parsing the source is not easier because parsing is only 10% of what a Flatbuffer compiler does before generating code.

@adsharma

adsharma commented Feb 1, 2021

Sure - there is more work to do beyond pure parsing. What I was getting at is the somewhat sorry world of JSON Schema and validating that the input JSON is well formed. I much prefer parsing a .fbs file to parsing JSON and then validating that it conforms to some JSON schema.

@CasperN
Collaborator Author

CasperN commented Feb 1, 2021

Sure - there is more work to do beyond pure parsing. What I was getting at is the somewhat sorry world of JSON Schema and validating that the input JSON is well formed. I much prefer parsing a .fbs file to parsing JSON and then validating that it conforms to some JSON schema.

I think this thread is a little off topic. One can assume that flatc would only output valid bfbs/json/$favorite_serde_format, and also that the resulting schema has been type-checked, name-resolved, etc., fits a predefined schema, and is good for code generation in all respects.

  • Integrate with projects like pyserde

How would this work? It seems like pyserde defines data types in Python, which may conflict with the FlatBuffers schema. Which would be the source of truth?

@adsharma

adsharma commented Feb 1, 2021 via email

@aardappel
Collaborator

I personally would prefer M2, but only if we committed to cleaning up / converting all the C++ generators we have into said scripting language.
If the C++ generators are going to continue being a thing, then M1/M3 are better.

M1 is kinda useless, since it still greatly restricts the languages a generator can be written in. It also requires us to deal with cross-platform dynamic linking and to make sure there is a dll/so for each platform... not much of an improvement over just static linking it into flatc, imho.

M3 is definitely useful. One thing I wouldn't like is if we ended up with reflection.fbs and generator.fbs and they looked pretty similar. We have already enhanced reflection.fbs a few times in the past for the sake of external generators. I'd be in favor of extending reflection.fbs further, with a command-line flag that controls the level of information in it, as opposed to starting from scratch for the purpose of generators.

An extension of M2/M3 is if we took care of invoking scripts from flatc. Imagine you say flatc -gen foo.py schema.fbs, and flatc would first write out the .bfbs, find whether Python is installed, and run the script for you. This saves the user from having to write a script that runs both flatc and Python, possibly for multiple platforms. Then again, I'm not sure we want to be in the business of running commands.

@mikkelfj
Contributor

mikkelfj commented Feb 2, 2021

Then again, not sure if we want to be in the business of running commands.

I imagine you pipe bfbs or JSON output from flatc into a script, and possibly add a wrapper script to simplify things.

@mustiikhalil
Collaborator

One of the things that would actually be worth it is allowing flatc to integrate with other, already existing generators (gRPC, for example). This is one of the points I discussed with the folks over at swift-grpc, since this would allow us to keep our code base consistent and not break each time there is a new version of gRPC.

FWIW, I'm totally biased towards M3. Protobuf and Cap'n Proto do something like that: there's a CodeGenerator(Request|Response) exchange between the compiler and the generator.

which is what you suggested in M3, I believe

@CasperN
Collaborator Author

CasperN commented Feb 5, 2021

One of the things that would actually be worth it is allowing flatc to integrate with other, already existing generators (gRPC, for example)

I think you mean M3 will allow the swift-grpc codebase to own the flatbuffers-swift-grpc generator, which will be invoked by flatc. Is that correct?

@mustiikhalil
Collaborator

mustiikhalil commented Feb 5, 2021

Yeah, basically. I am not sure about the intricate details, but I would assume that's how it's done in protobuf.

@CasperN
Collaborator Author

CasperN commented Feb 5, 2021

I think that use case, or more generally "invoking pre-existing code generators", is a good argument in favor of M3 over M2. The latter restricts tooling to being written in the scripting language, which probably prohibits this kind of integration. (Though I guess M2 and M3 aren't totally mutually exclusive; we can have an IR and also use it with an integrated scripting language.)

@aardappel
Collaborator

And possible add a wrapper script to simplify things.

Note that we are a cross-platform project. Maintaining and supplying, say, bash/cmd/PowerShell scripts is a pain, as is requiring Windows users to have Python installed just to run flatc. Hence making flatc do it directly has some value. But yes, I'd prefer not to be in the business of scripting commands at all.

@CasperN
Collaborator Author

CasperN commented Feb 10, 2021

I agree, let's not get into the business of cross-platform invocations. I think we should make the decision to prioritize M3 over M2 - any objections? I think it's a good idea because M3 is easier than M2, doesn't actually prevent M2, and is needed for U7.

If we go with M3, I can see this as a viable path:

  1. Figure out how to extend reflection.fbs to support code generators
  2. Refactor the code generators so they take a const Schema & (or whatever comes from step 1) instead of a const Parser&
    • This is a necessary "proof of concept" to verify step 1.
    • We could go so far as to make the code generators separate binaries that are shipped and invoked by flatc, but that seems unnecessary.
  3. ???
  4. Start investing in new tooling?

The next step for this conversation is to identify how reflection.fbs falls short for existing code generators. Off the top of my head...

  • We probably should finish the Object deprecation, particularly if we want to be able to print things out in JSON
  • We need to identify the presence of features that aren't available in all languages (AdvancedUnions, OptionalScalars, etc)
    • It should be a simple list of strings. (I guess enums and bitflags work too)
    • Code generators will track their own support of these features and will error out if they see an unrecognized one (see the sketch after this list)
  • We need to preserve "declared at path/to/file.fbs" info
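To make the feature-list idea concrete, the generator-side check could look roughly like the sketch below; the list of feature strings is hypothetical and would be whatever comes out of step 1:

```cpp
// Hypothetical sketch of a generator refusing schemas that use features it does
// not support. The `schema_features` list is assumed to be a vector of strings
// added to reflection.fbs in step 1; it does not exist in that form today.
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

void CheckSupportedFeatures(const std::vector<std::string> &schema_features) {
  // Features this particular generator knows how to handle.
  static const std::set<std::string> kSupported = {"AdvancedUnions",
                                                   "OptionalScalars"};
  for (const std::string &feature : schema_features) {
    if (kSupported.count(feature) == 0) {
      throw std::runtime_error("schema uses unsupported feature: " + feature);
    }
  }
}
```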

@adsharma

adsharma commented Feb 10, 2021 via email

@mikkelfj
Contributor

@CasperN
Agreed on M3. I'd like to see the IR sent to stdout and fbs accepted on stdin, at least as an option. This allows for piping, and it makes flatc and flatcc mostly interchangeable save for the actual CLI syntax.

Making code generators separate binaries?
The benefit would be that flatc, flatcc, or any other schema compiler would be able to take advantage of code generators. This goes both ways: maybe an existing code generator is abandonware, or is written in a language that does not work on some user platforms, and then maybe there is another implementation written for another schema compiler.

That said, this is the chicken and egg problem I mentioned before: if the code generator uses bfbs, it must necessarily be able to read flatbuffers. Parsing JSON is clumsy. It is not a problem for code generators targeting other languages as long as there is some other generator to create the bfbs interface.

Maybe it is just necessary to bite the bullet and hand generate a bfbs interface for supported code generator languages.

As to cross-platform scripting: it is not difficult to write small wrapper scripts as .bat and bash files, because they don't really need to be maintained and argument parsing is done by the shells. Calling natively is a lot of work in the general case, but maybe a simple system() call in C is enough. There are also binary path issues, but that is true regardless of method.

@aardappel
Collaborator

Yes, lets move forward with M3.

Like I said, I'd like there to be an enum (set by code or a command-line flag) that controls the richness of bfbs information, where the first value is "just enough to make reflection work", then added attributes, then added comments, all the way to "give me everything". We would then document clearly which fields are set/non-default at each level.
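A minimal sketch of such a knob, with made-up names, purely as an illustration:

```cpp
// Hypothetical richness levels for bfbs output; each level is a superset of
// the previous one. Names and granularity are illustrative only.
enum class BfbsDetail {
  kReflectionOnly,     // just enough to make runtime reflection work
  kWithAttributes,     // plus user-declared attributes
  kWithDocumentation,  // plus comments / doc strings
  kEverything          // everything the parser has (enough for code generation)
};
```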

Converting all current generators to work based on bfbs-generated code is going to generate a lot of churn, and to some extent is going to create a chicken-and-egg situation, since we require the C++ generator to be compiled to create reflection_generated.h. I agree we need a rich test case though, to ensure we are storing all information correctly. If we don't have a new test case for this direct bfbs usage, maybe we should start with one or a few languages.

Would be interesting if we could at some point relegate Parser to an internal data structure only :)

@CasperN CasperN changed the title from "Use cases for plugins / exposing flatbuffer compilation internals" to "Exposing flatbuffer compilation internals" on Feb 10, 2021
@mikkelfj
Contributor

@aardappel What is the benefit of having enum levels? It causes a lot of switches in the generating code, the reader can ignore surplus information, and size hardly matters.

Maybe the benefit is in how much of the reader interface to implement? Or maybe it can simplify life for some limited schema parsers?

@mikkelfj
Contributor

@CasperN since I don't have a lot of time for moving on M3 atm., it would be helpful if you could help out with at least trying to understand the needs of flatcc wrt. a new bfbs format.

Some notable examples:

  • The need for idempotency in generated files (shadowing names that an include should not be able to see), and the uniqueness of the schema filenames.
  • The need for unique fbs file detection: flatcc has decided that two schema files with the same uppercase-translated name are the same regardless of path, and therefore stops recursive inclusion based on that. We could choose another criterion, but I'd like it to be path-neutral. Notably, I believe a lowercase mapping is a saner choice, as it is less ambiguous in Unicode, but then again most case-insensitive file systems use upper case.

On a related note: flatcc does not support mutual type recursion across definitions in separate fbs files, although it does within the same file. I know this is important for flatc, as it is used in some Google-internal systems. The reason it is not supported in flatcc is partially just that I didn't spend time on it, and partially because of complications with name shadowing and file uniqueness. However, I think it should be possible to support in some limited cases: a file can include an already-included file and have access to those symbols even if the file is not physically included again - only the name shadowing will be processed differently for each file scope. The last part, I believe, is already handled by flatcc. It has a visibility map, so any symbol lookup is checked for both existence and visibility.

@mikkelfj
Contributor

As to chicken and egg: at least for building bfbs buffers, flatcc has a low-level, typeless builder interface on top of which it is relatively easy to build buffers. All generated builder code calls into this, and so does the generated JSON parser:

https://github.com/dvidelabs/flatcc/blob/master/include/flatcc/flatcc_builder.h

@aardappel
Collaborator

aardappel commented Feb 11, 2021

@mikkelfj

size hardly matters

That is debatable... for use cases where people want to use run-time reflection, having 5x the memory usage (and cache misses) for lots of metadata they don't use is not great.

@aardappel
Collaborator

That said, I have been de-emphasizing the reflection use case of this data. The API for it is frankly horribly clumsy, and I'd prefer it if no one used it. The "mini reflection" functionality in C++ is nicer for that use case. But there are already existing users of bfbs-based reflection, so we continue to offer it.

@mikkelfj
Contributor

Maybe we should just keep bfbs more or less as-is, and create a separate schema for code generation. bfbs has issues with JSON, though.

@aardappel
Collaborator

@mikkelfj if you read my replies above, I am explicitly arguing that a separate schema would be undesirable.

@liuliu
Contributor

liuliu commented Feb 15, 2021

Exposing a stable internal structure would be great for many tools. At my previous company (Snap), we had a FlatBuffers-based code generator that depended on flatc internals directly. That turned out to be a big hurdle when upgrading to newer FlatBuffers versions, because flatbuffers doesn't expose a stable C++ interface.

My current open-source project depends on the flatbuffers internal representation, and I had to write a small C++ shim to expose the structures I need through JSON so that I can write the rest of the codegen in Swift: https://github.com/liuliu/dflat/blob/unstable/src/parser/dflats.cpp

Exposing a backward-compatible representation like M3 suggests would help a lot of tools out there that the FlatBuffers core doesn't know about to upgrade more smoothly.

One other thing I would suggest is adding the ability to retain "unknown" attributes throughout the representation. These can be used by various extensions to carry extension-specific meanings (in my case, "primary", "index" or "unique" are used to specify database constraints). This currently requires working with the Parser object directly inside flatc, which is not ideal.
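As a sketch of what reading such attributes could look like on the consumer side, assuming the attribute is declared in the schema (attribute "primary";) and ends up in the serialized schema's KeyValue lists:

```cpp
// Sketch: reading a custom attribute (e.g. "primary") from fields in a .bfbs.
// Assumes the attribute was declared in the schema and serialized by flatc.
#include <iostream>
#include "flatbuffers/reflection_generated.h"

void PrintPrimaryKeyFields(const reflection::Schema *schema) {
  for (const reflection::Object *obj : *schema->objects()) {
    for (const reflection::Field *field : *obj->fields()) {
      const auto *attrs = field->attributes();  // may be null if none declared
      if (!attrs) continue;
      for (const reflection::KeyValue *kv : *attrs) {
        if (kv->key()->str() == "primary") {
          std::cout << obj->name()->str() << "." << field->name()->str()
                    << " is a primary key\n";
        }
      }
    }
  }
}
```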

@adsharma

adsharma commented Feb 15, 2021 via email

@liuliu
Contributor

liuliu commented Feb 16, 2021

@adsharma in our use case, we do both. We use FlatBuffers' utilities to encode/decode bytes from the database, and we also use it as an IDL to describe the data (which field is the primary key, which field is indexed) and to generate additional accessors/mutators on top of what FlatBuffers' Swift binding already provides. That's why, so far, our integration has focused on emitting additional information when running flatc and using that to generate more code in addition to FlatBuffers' own encoder/decoder.

@adsharma

@liuliu that's the same use case flattools addresses. The idea is to do template-driven codegen. It has been used to generate JSON/YAML/SQL DDL, among many other output formats.

https://github.com/adsharma/flattools/blob/master/templates/fbs_template_yaml.yaml.j2

is open source. The other templates are not.

The templates assume a stable in-memory format like what's being discussed in this thread (__fbs_meta__ and friends in the code). There is some output-language-specific metadata (reserved words in the target language, type maps, etc.) that gets added to the environment before invoking the template engine.

@CasperN
Collaborator Author

CasperN commented Feb 16, 2021

@adsharma @liuliu

In what ways does the reflection API fall short, such that you had to build your own schema reader / inspect flatc internals?

@liuliu
Contributor

liuliu commented Feb 16, 2021

@adsharma thanks, will give it a deeper look at some point!

@CasperN I need to read more about the new reflection API to have an informed opinion. Previously, the reflection API just wasn't available (we've been using FlatBuffers for 3-4 years now). In my case, we use the FlatBuffers IDL as the main IDL, but to add more flavors we need more attributes declared and passed down to codegen.

@adsharma

adsharma commented Feb 16, 2021 via email

@CasperN
Collaborator Author

CasperN commented Feb 16, 2021

does the compatibility arise out of a shared grammar or shared in-memory parsed tree?

I don't think it's the grammar or the parsed tree that's important, but rather the type system that ends up being specified. Name resolution, importing, type checking, parsing, etc. are common features that aren't part of the grammar and should be shared. Though maybe that's what you meant by "parsed tree".

@CasperN
Collaborator Author

CasperN commented Feb 18, 2021

OK, so I poked around https://github.com/liuliu/dflat/blob/unstable/src/parser/dflats.cpp and it looks like it extracts Parser internals, like our code generators do, then dumps them into JSON. If we successfully refactor at least one code generator to use the IR, dflats will work.

I hope the same is true for flattools, but based on the discussion at the end of #6014, I suspect there will be subtle semantic differences despite sharing a grammar.

@aardappel
Collaborator

@CasperN

Is the recommendation for languages to come up with their own reflection API, like mini-reflection, and not use reflection.fbs, which will be simultaneously a legacy reflection API and also the new code generation API? How important is backwards compatibility in the reflection API? Can we deprecate Object and remove JSON-exploding cycles?

Difficult question. It seems that only C++ has reflection functionality at all, in the form of reflection.fbs and the mini-reflection. For C++, the mini-reflection seems simpler, easier to use, and doesn't require a file, but it is also more limited. reflection.fbs is in theory nicer because, you know, it uses FlatBuffers itself :)

I do know we (used to) have internal users of reflection.fbs, so major breakage should probably be avoided, but small breakage (renaming/deprecating fields) is maybe acceptable.

I think if languages want to implement reflection, they should feel free to take it in whichever direction they like.

@CasperN
Collaborator Author

CasperN commented Feb 19, 2021

Ok here's my working list of "what to do"

I think reflection.fbs is actually in pretty good shape for being our IR; the few changes I can think of do not seem too difficult (though I'm sure more will come up when moving a code generator onto it). If so, then maybe working on documentation, ergonomics, and discovery of the reflection API for these use cases will be most important, i.e. we just need a "so you want to make a code generator" page.

@aardappel
Collaborator

"so you want to make a code generator"

Haha, yes, we need that too :)

@CasperN
Collaborator Author

CasperN commented Jun 7, 2021

Eventually, after we do the last TODO list, I think we should aim for the following architecture (M3++ if you will):

  • flatc will be the main interface for all users. It will read .fbs files and produce .bfbs files
  • flatc will invoke plugins, and pass them the reflection.fbs schema, transpiled to JSON/flexbuffers/flatbuffers
    • perhaps std::system would work, but we can figure this out
    • There will be a config file defining where the plugins live, which transpilation they use, the flag name to invoke them with, etc. (see the sketch after this list)
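A rough sketch of the invocation side, with the config structure and command layout entirely hypothetical:

```cpp
// Hypothetical sketch of flatc invoking an external generator plugin.
// PluginConfig and the command layout are illustrative, not an existing API.
#include <cstdlib>
#include <string>

struct PluginConfig {
  std::string flag_name;   // e.g. "--my-lang", the flag users pass to flatc
  std::string executable;  // e.g. "/usr/local/bin/flatc-gen-mylang"
};

// Write the .bfbs to a temp file first, then hand its path to the plugin.
int InvokePlugin(const PluginConfig &plugin, const std::string &bfbs_path,
                 const std::string &out_dir) {
  const std::string cmd = plugin.executable + " " + bfbs_path + " -o " + out_dir;
  // As mused above, std::system may be enough; escaping and error handling omitted.
  return std::system(cmd.c_str());
}
```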

Long term, I think we need to break up idl_parser.cpp, which imo has gotten far too complex. This refactoring effort will get the code generators off of Parser, isolating it, while also improving modularity. There should be a subsequent effort to simplify the Parser class, though maybe isolation behind the reflection API is sufficient.

@dbaileychess
Collaborator

I think I could take a stab at converting the Lua generator over to this IR approach.

Just so I know what would be involved, I would do something like:

  1. When flatc --lua ${SRCS} is invoked, have it actually do flatc -b --schema ${SRCS} and output the .bfbs to a temporary file.
  2. Then invoke the new Lua Generator and pass it the filepath of the temporary .bfbs file.
  3. This new generator will read the .bfbs in and deserialize it back into a reflection::Schema object.
  4. The generator will loop over the fields of the reflection::Schema and generate the corresponding Lua files (see the sketch after this list).
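A rough sketch of steps 3 and 4, with the actual Lua emission reduced to placeholders:

```cpp
// Sketch of a bfbs-driven Lua generator skeleton (steps 3-4). The emitted Lua
// is a placeholder; only the deserialize-and-iterate structure is the point.
#include <string>
#include "flatbuffers/reflection_generated.h"
#include "flatbuffers/util.h"

bool GenerateLuaFromBfbs(const std::string &bfbs_path, const std::string &out_dir) {
  std::string bfbs;
  if (!flatbuffers::LoadFile(bfbs_path.c_str(), /*binary=*/true, &bfbs)) return false;

  // Verify the buffer before trusting it.
  flatbuffers::Verifier verifier(
      reinterpret_cast<const uint8_t *>(bfbs.data()), bfbs.size());
  if (!reflection::VerifySchemaBuffer(verifier)) return false;

  const reflection::Schema *schema = reflection::GetSchema(bfbs.data());
  for (const reflection::Object *obj : *schema->objects()) {
    std::string code = "-- generated from " + obj->name()->str() + "\n";
    for (const reflection::Field *field : *obj->fields()) {
      code += "-- field: " + field->name()->str() + "\n";
    }
    // One output file per declaration, mirroring the current generators.
    flatbuffers::SaveFile((out_dir + "/" + obj->name()->str() + ".lua").c_str(),
                          code, /*binary=*/false);
  }
  return true;
}
```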

@aardappel
Collaborator

@dbaileychess is the new version of the Lua generator still based on the current C++ code? Because then I would say it would be nicer to have a shortcut path to pass the .bfbs in memory, which means you don't need to deal with temp files on multiple platforms, etc. I know we'll need those temp files eventually, but no need to complicate the first step. Or is part of the exercise to make the Lua generator its own exe? Maybe in a v2?

@dbaileychess
Collaborator

I was sort of aiming for a middle ground between the two: 1) have a clean separation of the API, where the new generator just receives a file path and/or memory, i.e. no flatc internals, and 2) avoid having a separate binary to deal with.
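One way to express that boundary, as a purely hypothetical signature rather than an existing flatc interface:

```cpp
// Hypothetical entry point for a bfbs-driven generator living inside flatc:
// it receives only the serialized schema plus output options, never a Parser.
// Names are illustrative only.
#include <cstddef>
#include <cstdint>
#include <string>

struct GeneratorOptions {
  std::string output_dir;
  std::string one_file_name;  // empty means one file per declaration
};

// Returns true on success; `bfbs` points at a verified reflection::Schema buffer.
bool GenerateLua(const uint8_t *bfbs, size_t bfbs_size,
                 const GeneratorOptions &options);
```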

@dbaileychess
Collaborator

So with the Lua generator converted over to using bfbs, the framework in place, and the major gaps in reflection.fbs filled, I think we can proceed with converting more of the generators over.

@github-actions

github-actions bot commented Mar 4, 2023

This issue is stale because it has been open 6 months with no activity. Please comment or label not-stale, or this will be closed in 14 days.

@github-actions github-actions bot added the stale label Mar 4, 2023
@github-actions

This issue was automatically closed due to no activity for 6 months plus the 14 day notice period.

@github-actions github-actions bot closed this as not planned Mar 18, 2023