Add the file a symbol is declared in to Reflection #6613

CasperN · 2021-05-03T23:23:17Z

Add the file a symbol is declared in to Reflection data.

If we make a code-generator to depend on Reflection,
it may need to know the name of .fbs files to name the
generated files properly.

#6428

(This should not be submitted until after the release)

If we move a code-generator to depend on Reflection, it may need to know which file something was declared in to properly name generated files.

CasperN · 2021-05-03T23:25:11Z

@aardappel

This is currently a bit broken since the .bfbs files contains absolute paths with idiosyncratic stuff from my laptop, which will break against the CI tests. I think the solution is to make paths relative to where flatc was invoked

aardappel · 2021-05-03T23:29:50Z

While I really appreciate small incremental PRs, it is also hard to say if this makes sense, since I can't tell how a codegen would use it. Do current codegens that need this information use paths relative to the current dir? or output dir? I thought they mostly use namespaces as directory names?

Also, rather than a string that is duplicated many times over, an index into a vector of unique filenames may be nicer?

CasperN · 2021-05-03T23:38:50Z

While I really appreciate small incremental PRs, it is also hard to say if this makes sense, since I can't tell how a codegen would use it. Do current codegens that need this information use paths relative to the current dir? or output dir? I thought they mostly use namespaces as directory names?

I was under the impression that current codegen generates a file for every .fbs file. In the future, I think flatc should output one bfbs which is consumed by the code generator to create whatever structure it wants -- perhaps one generated file per namespace.

flatc **.fbs --reflection-fbs > compilation.bfbs
${lang}_generator compilation.bfbs # ${lang}_generator decides on the output _generated files

Also, rather than a string that is duplicated many times over, an index into a vector of unique filenames may be nicer?

I thought about that, but decided using shared strings is more convenient with the same de-duplication.

aardappel · 2021-05-06T20:36:05Z

So then a bfbs based codegen has to go find all declarations with the same filename to replicate the behavior of current codegens? Would be simpler to emit 1 bfbs per fbs, but that will introduce a lot of duplication in the bfbs files.

It's somewhat tricky, because the previous model gave exact control over what gets re-generated. For example if you have a.fbs and b.fbs that both include c.fbs, then all 3 of these can be regenerated into a .h independently. But the bfbs must include the included definitions, and we may not want to regenerate the included definitions.

So we could have the situation where we write flatc a.fbs d.fbs --reflection-fbs > compilation.bfbs (where as before, c.fbs is included), but the codegen should then only emit for a and d, not c.

CasperN · 2021-05-07T19:43:41Z

So then a bfbs based codegen has to go find all declarations with the same filename to replicate the behavior of current codegens?

Yes. I think reading bfbs and reconstructing the symbols per file will be very easy.

Would be simpler to emit 1 bfbs per fbs, but that will introduce a lot of duplication in the bfbs files.

The 1-bfbs-per-file might be a better idea for highly parallel incremental compilation... I'm just going to assume having 1 big bfbs is sufficient for all current use cases.

It's somewhat tricky, because the previous model gave exact control over what gets re-generated. For example if you have a.fbs and b.fbs that both include c.fbs, then all 3 of these can be regenerated into a .h independently. But the bfbs must include the included definitions, and we may not want to regenerate the included definitions.

If we go with the model where compilation.bfbs is the only input to the code generator (which I slightly prefer, even if the code generator and parser are in the same binary), then we'd need to plumb through something like IdlOptions through the bfbs, and part of that data would include this option to not regenerate included definitions.

aardappel · 2021-05-07T22:58:23Z

and part of that data would include this option to not regenerate included definitions.

Well, that means you need to mark which definitions were in included, and in such a way that you can distinguish between "included" and "included but also appeared on the command-line to create this file"

CasperN · 2021-05-07T23:54:59Z

Well, that means you need to mark which definitions were in included, and in such a way that you can distinguish between "included" and "included but also appeared on the command-line to create this file"

I think the options/metadata can just include the list of files from the command line invocation.

CasperN · 2021-05-15T22:22:13Z

The current problem is that the source_filename argument comes from argv.
When I run make clean && make, monster_test.bfbs will have full paths in there.
When I run cd tests/; ./generate_code.sh, monster_test.bfbs will have relative paths.
This diff is breaking CI.

I could make cmake cd into tests to build the bfbs files, so it matches ./generate_code.sh.
That'll make tests pass but seems a little unsatisfactory -- since it pushes complexity into the flatc user.
I'm considering adding a --project_root flag to flatc and normalize paths inside flatc.
This also makes it easier to hide personal filesystem data if users pass around the reflection bfbs.

Thoughts @aardappel ?

Also, before this is submitted, I should add a --bfbs-sourcefiles flag to make this extra data optional.

aardappel · 2021-05-17T18:38:47Z

looping in @mikkelfj who was interested in the topic of bfbs for codegen, and @dbaileychess who may have an opinion.

The question is how to encode file paths.

Though if this is complicated, maybe that's a hint we shouldn't be adding file paths, but instead represent this information differently. We should check how current generators use paths, and see what would be enough information for that generator to run from a bfbs, but no more.

CasperN · 2021-05-17T19:42:58Z

Though if this is complicated, maybe that's a hint we shouldn't be adding file paths, but instead represent this information differently. We should check how current generators use paths, and see what would be enough information for that generator to run from a bfbs, but no more.

If we get signoff from all language maintainers (besides C++) that we can generate files based on namespaces, and not fbs filepaths, then I agree we should skip this whole problem.

However, I see some challenges w.r.t. incremental compilation: How will that work when generated files are disconnected from .fbs files? Maybe this will make blaze integration hard somehow?

Fwiw, protobuffers keeps filenames in there, "relative to the root of the source tree"
https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.compiler.plugin.pb
https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.descriptor.pb

mikkelfj · 2021-05-17T21:40:44Z

FlatCC currently depends on case-insentitive basename of the fbs file (mapped to uppercase although lowercase would have been better). FlatCC will not process an included fbs file if the same basename has already been seen, regardless of path. Output is placed in the current directory or the specified -o path and there will be no conflicts because output files are prefixed by basename. Also, in generated C code, dependencies are protected with include guards based on the basename. This together ensures that the generated output is always the same for a given fbs file, regardless of how it is included.

If FlatCC were to generate code from a new version of bfbs files, it would need the basename of the filename, but the path would not be important.

Note that paths can be ambigous due to hardlinks or symbolic mappings.

There is a limit to this approach because there could be two separate fbs files with the same basename, but it is simpler to disallow such ambiguity and rename files if necessary.

Note that this is orthogonal to namespaces. In C each symbol is prefixed by the namespace name which is unrelated to the file basename.

CasperN · 2021-05-17T22:29:09Z

If FlatCC were to generate code from a new version of bfbs files, it would need the basename of the filename, but the path would not be important.

@mikkelfj, what if you generate a file based on the namespace? I guess that'd be a pretty big, backwards incompatible, change.

Backwards compatibility seems like a compelling enough reason to support filenames. I'm now leaning towards normalizing paths w.r.t a given --project_root location.

aardappel · 2021-05-17T22:41:00Z

yes, most generators seem to place files relative to the -o path (which a bfbs file would still get specified externally if used for codegen), and namespace inside of that.. it should not need paths.

aardappel · 2021-05-17T22:41:57Z

we do need a bool to indicate any types that are from files that were included but not present on the command-line.

mikkelfj · 2021-05-18T08:29:50Z

I should add that this approach also is intended to (and do) work well for incremental builds. Notably it does not matter if output is generated by one instance of flatcc processing several included files, or several instances of flatcc for each fbs file, or several fbs files on the command line, and it does not matter from which directory flatcc is instantiated. It may result in output files being overridden by seperate instances of flatcc, but the output would be replaced by the same content then.

However, the relative paths are still important insided fbs include statements for the purpose of locating the included files.

By extension: if flatcc were to generate code from bfbs-v2 files, it should not matter if code is generated from one or several bfbs files, but one bfbs file can contain the data of multiple fbs files.

I don't think that bfbs files should include other bfbs files, but hold information for all included fbs files.
It could see a theoretical problem with very large sets of included files and potentially long incremental compilation times to build the bfbs file upon change. But the alternative seems to be one bfbs file per fbs file which isn't really helpful.

mikkelfj · 2021-05-18T08:33:01Z

@mikkelfj, what if you generate a file based on the namespace? I guess that'd be a pretty big, backwards incompatible, change.

That would not be practical. The primary concern is how the tool is being used in build systems, and idempotent output.

If you used namespace based files, you would have a conflict if two seperate instances of flatcc process separate fbs files that refer to the same namespace. Also, the namespace could potentially be very large in terms of symbolic content. This is akin to C++ std namespace where you don't put all of the namespace in one file.

mikkelfj · 2021-05-18T08:36:56Z

we do need a bool to indicate any types that are from files that were included but not present on the command-line.

I'm not sure if this is important, at least I don't think it matters to flatcc. However, visibility does matter: if root file A.fbs includes files B.fbs and C.fbs and B defines a symbol that might affect C, then that symbol should be invisible to C. Otherwise the output of C depends on whether it is included by A or not. An exception to this is attribute declarations where you can include a file that declares attributes and use it in other files. I tend to think that attribute declarations are not that useful in fbs files.

mikkelfj · 2021-05-18T08:41:57Z

Backwards compatibility seems like a compelling enough reason to support filenames. I'm now leaning towards normalizing paths w.r.t a given --project_root location.

I don't have an issue with normalizing paths relative to some root as long as it is not permitted to have conflicting basenames (case insensitive due to certain file systems). The path could still prove useful for some language backends. FlatCC would just extract the basename from that path.

FlatCC might also learn to create subdirectories based on relative paths at least as an option, but I do not see any compelling reason for it as long as basenames are unique.

mikkelfj · 2021-05-18T08:46:35Z

Also, on namespace based filenames: flatcc generates a large number of inline functions for each symbol - this is usually not a problem, but it starts to see some of the problems where C++ require precompiled headers. Therefore it is helpful to keep files relatively small such that end-user programs only need to include what they need.

…eclfile

CasperN · 2021-05-25T20:44:47Z

Ok, so given the above, I've renormalized filepaths relative to a project root, specified with the --project-root flag, which defaults to ./. Given --project-root foo, this converts foo/bar/baz.fbs to //bar/baz.fbs. I borrowed the // representing "project root" convention from protobuffers.

CasperN · 2021-06-08T13:45:14Z

WRT visibility and idempotentce, we also need to know which files a file has included, or some other kind of visibility map.

I think visibility checks should be on the side of flatc or flatcc, not on the code generator. Code generators wouldn't need this information so it shouldn't be in the IR. To properly import things in header files, you can back out the include graph by examining the types. We can provide a helper library for that. Whether flatc needs to implement visibility/idempotence checks, that should be a separate discussion and PR.

CasperN · 2021-06-10T17:55:38Z

Bump @mikkelfj @aardappel

mikkelfj · 2021-06-10T18:50:31Z

Not sure if you bumped for last comment on visibility or general review. For general review I have not checked the commits but if paths are added as discussed and if the basenames (case insenstive) do not conflict when stripping path prefix, then I'm OK with it. I am a bit concerned that flatcc may have to pull in a complex path library to normalize paths (I have one, I just don't want to publish it unless necessary - notably it gets complicated with Windows and drives). Instead flatcc might just record paths by their basename, at least initially.

Wrt. visibility:

A symbol can be visible in one context and not in another. The current reflection schema can only tell where the symbol is defined, not which other files the symbol may be available to. However, as @CasperN suggests, this can be checked at parser semantics level and indeed this is what flatcc does on its AST.

There are both type references and enum name references. Enums are bit more tricky as discussed below:

In flatcc the code generator has access to the root schema object which has a visiblity map:
https://github.com/dvidelabs/flatcc/blob/32aabef14ccc0e3ac15602c99476d25dcd230d19/src/compiler/symbols.h#L388-L394

Here is an example of where a check is made:
https://github.com/dvidelabs/flatcc/blob/32aabef14ccc0e3ac15602c99476d25dcd230d19/src/compiler/semantics.c#L312-L313

Assume and include structure such as A includes B and C. B and C includes nothing. If a a type is defined in C it is not visible in B and vice versa because include order is irrelevant. B and C also cannot see symbols defined in A. A can see symbols in both A and B because include statements are always first in a schema file. If there are name conflicts between B and C, then A should complain about this, but if the conflicting type is not referenced, the error is survivable.

If a type defined in C references a type defined in C or in A, that should raise an error at parse time before bfbs generation. This is how flatcc works. The code generators do not reference the visibility map for this problem.

However, there is an exception to this: the flatcc generated JSON parser checks if an enum definition is visible in the code generator. I don't remember exactly why, but I assume this is because enums can be can referenced at runtime in input JSON files and the code generator need to know which enums it should add to its lookup logic. This cannot be checked at the semantic stage.

CasperN · 2021-06-14T20:19:56Z

notably it gets complicated with Windows and drives).

Yea that's one risk that I'm a bit concerned about. I'm not at all familiar with windows but am hoping that its a reasonable assumption that all .fbs files can be normalized relative to some parent directory. Is it even possible to include things from different drives (without symlinks or something), I'd imagine our compiler would error...

assume and include structure such as A includes B and C...

Okay, I made an issue for this, #6697. However, aside from noting the visibility check should happen before the reflection/IR, its not relevant to this PR.

CasperN · 2021-06-15T22:00:18Z

@aardappel @mikkelfj ok it seems the last check failed due to CI rate limiting. Unless there are more comments, I'll submit this tomorrow.

mikkelfj · 2021-06-16T02:59:43Z

Okay, I made an issue for this, #6697. However, aside from noting the visibility check should happen before the reflection/IR, its not relevant to this PR.

I agree about the check per the above, but any thoughts on dealing with enum visibility for parser generation?

CasperN · 2021-06-17T15:49:58Z

I think the json parser should know the type of the enum its trying to read out of the JSON, based on the surrounding table/struct field, and try to match accordingly.

mikkelfj · 2021-06-17T16:05:09Z

@CasperN That is the default, but you can assign other enum values to an enum field, or to any integer field. These enums are found based enums defined in the same namespace, or if prefixed, in other namespaces. In the default case, the enum is necessarily visible if the field is valid. In the other cases it depends on the include path.

CasperN · 2021-06-17T16:16:11Z

you can assign other enum values to an enum field, or to any integer field

Oh, that's annoying that we allow this... I think we should just not check visibility based on .fbs include path. I think the future is to invoke something like

 json2fbs myschema.bfbs foo.json > foo.bfbs

in which case, all the .fbs visibility stuff was covered by whatever generated myschema.bfbs. Once we're after that stage, imo, the visibility rules in foo.json should be C++ name resolution rules. I.e. if assigning to foo::bar::Table::field you should look up the enum in ::foo::bar then ::foo then ::

mikkelfj · 2021-06-17T16:25:53Z

flatcc generates a fully compiled and optimized JSON parser, it cannot process a schema ad hoc.
Also, it can be very highly useful to assign enums to other integer fields.

mikkelfj · 2021-06-17T16:31:16Z

That is, flatcc takes a schema and generates a schema specific parser in C which is compiled before the parser ever sees a JSON file. If the parser were to be generated by a bfbs file, it would need enum visibility information stored in the bfbs file for the above reasons.

mikkelfj · 2021-06-17T16:51:38Z

From memory: The json parser parses any integer field by first trying an integer, then if that fails, parsing an unqualified enum name in the same namespace as the table or struct being parsed. If that fails, it attempts to parse a qualified enum name with the enum type as prefix, and if that fails, it attempts a namespace prefix also. I don't think it parses namespace dot unqualified enum name.

I do not recall what happens if two enum names conflict in this approach, or maybe enums must be qualified when the field is not an enum type. If the type is an enum type, it will of course prioritize the names of that enum.

In praxis this is done by maintaing an array of enum parsers for each integer field being parsed. Each fbs file produces a number of enum parsers that can be used be different fields. If an enum is defined in an included schema file, the parser will be defined in the correspodingly generated JSON parser file. An enum parser may call parsers from an included file. That is, a symbolic name parser can cover multiple enums at once by matching a qualified name and then calling the corresponding enum parser.

mikkelfj · 2021-06-17T16:59:43Z

Also, I'm not sure exactly what happens about searching parent namespaces, but I agree that in principle names should be searched inside out.

EDIT: I think that flatcc requires a fully qualified name if the name is not local, but it can be with or without enum type prefix depending on context and still being considered a local name.

mikkelfj · 2021-06-17T17:08:07Z

Anyway, scope resolution isn't really the issue here. That information is available in the bfbs file (assuming enums list what namespace they belong to). The issue is which enums are visible to a specific fbs file.

It could be solved by having a table of filenames at schema level where each file lists the files that it directly includes either by name or by index in the file table.

CasperN · 2021-06-17T17:17:17Z

Each fbs file produces a number of enum parsers that can be used be different fields.

I think you should either:

not generate parsers per .fbs file
create a reflection.bfbs for each .fbs file and generate enum parsers from those

mikkelfj · 2021-06-17T17:36:32Z

First option goes against the entire compilation model of flatcc.
The second option could almost work because it implies where an enum is defined, but it fails because a) it would not work with flatc generated bfbs files unless this approach is adopted universally, and b) because knowledge of other bfbs files are required, and thus you need a list of included files - which could work - and which actually better matches the flatcc compilation model, but comes at some complexity in reading the bfbs files.

The thing is, parsing an fbs file and generating all included content into a single bfbs file sort of goes against the flatcc compilation model, but nor really: The thing is that the output generated for a single fbs file does not, or should not, depend on whether it comes from one huge bfbs file or from a smaller bfbs file. This is because of idempotence. It may happen that output is overwritten by processing multiple bfbs files, but they will be overwritten witht the same content.

A list of fbs files that a given bfbs file covers, and what each individual fbs file includes, will cover that information.

If you also added a numeric index holding the parsing order of each type in the bfbs file, you could in principle largely recreate the input fbs files, although I do not argue that you should.

mikkelfj · 2021-06-17T17:44:36Z

Somewhat unrelated but to my last comment:

If you also added a numeric index holding the parsing order of each type in the bfbs file, you could in principle largely recreate the input fbs files, although I do not argue that you should.

Some users rely on fbs parsed into bfbs files to manage their own schemas for completely other purposes just because fbs is a convenient format and bfbs is easy to process. Adding the parsing order could be useful to those users, although this is of course a fringe use case. The file list might also be useful for the same reasons.

CasperN · 2021-06-17T18:59:38Z

Some users rely on fbs parsed into bfbs files to manage their own schemas for completely other purposes just because fbs is a convenient format and bfbs is easy to process. Adding the parsing order could be useful to those users, although this is of course a fringe use case. The file list might also be useful for the same reasons.

yea ok, fair enough and it sounds like you can't get your use case without the includes list.

I'll make a backwards-incompatible change and replace fbs_files: [string]; with fbs_files: [SchemaFile];
where

table SchemaFile {
  name: string (required, key);
  includes: [string];
}

mikkelfj · 2021-06-17T19:06:29Z

yes, that is fine. The only thing is - is the includes path relative to the including file, or to the project?

CasperN · 2021-06-17T19:18:48Z

To the project

Add the file a symbol is declared in to Reflection

6280eb7

If we move a code-generator to depend on Reflection, it may need to know which file something was declared in to properly name generated files.

github-actions bot added c++ codegen Involving generating code from schema labels May 3, 2021

Doc comments in reflection, and more precise tests

3c7bdde

Casper Neo added 3 commits May 21, 2021 18:40

Merge branch 'master' of https://github.com/google/flatbuffers into d…

0f6b8ae

…eclfile

Add --project-root flag to flatc, normalize declaraion_file to this root

82fcc97

fix --project-root stuff

ea6f474

Casper Neo added 2 commits May 25, 2021 20:54

posixpath

340a390

fix scripts

ae1e466

github-actions bot added the grpc label May 25, 2021

Casper Neo added 3 commits June 8, 2021 10:20

Remove accidental change to docs

02513c5

Merge branch 'declfile' of github.com:caspern/flatbuffers into declfile

5ef7010

Remove accidental change to docs

5889549

github-actions bot removed the documentation Documentation label Jun 8, 2021

Pool strings

af1c2ef

CasperN merged commit c58ae94 into google:master Jun 17, 2021

CasperN mentioned this pull request Jun 17, 2021

Exposing flatbuffer compilation internals #6428

Closed

CasperN deleted the declfile branch June 17, 2021 18:09

CasperN mentioned this pull request Jun 18, 2021

Replace filenames in reflection with filenames+includes. #6703

Merged

dbaileychess mentioned this pull request Sep 22, 2021

Make and generate_code.sh generate same outputs #6855

Merged

Add the file a symbol is declared in to Reflection #6613

Add the file a symbol is declared in to Reflection #6613

Uh oh!

Conversation

CasperN commented May 3, 2021

Uh oh!

CasperN commented May 3, 2021

Uh oh!

aardappel commented May 3, 2021

Uh oh!

CasperN commented May 3, 2021

Uh oh!

aardappel commented May 6, 2021

Uh oh!

CasperN commented May 7, 2021

Uh oh!

aardappel commented May 7, 2021

Uh oh!

CasperN commented May 7, 2021

Uh oh!

CasperN commented May 15, 2021

Uh oh!

aardappel commented May 17, 2021

Uh oh!

CasperN commented May 17, 2021

Uh oh!

mikkelfj commented May 17, 2021

Uh oh!

CasperN commented May 17, 2021

Uh oh!

aardappel commented May 17, 2021

Uh oh!

aardappel commented May 17, 2021

Uh oh!

mikkelfj commented May 18, 2021

Uh oh!

mikkelfj commented May 18, 2021

Uh oh!

mikkelfj commented May 18, 2021

Uh oh!

mikkelfj commented May 18, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mikkelfj commented May 18, 2021

Uh oh!

CasperN commented May 25, 2021

Uh oh!

CasperN commented Jun 8, 2021

Uh oh!

CasperN commented Jun 10, 2021

Uh oh!

mikkelfj commented Jun 10, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CasperN commented Jun 14, 2021

Uh oh!

CasperN commented Jun 15, 2021

Uh oh!

mikkelfj commented Jun 16, 2021

Uh oh!

CasperN commented Jun 17, 2021

Uh oh!

mikkelfj commented Jun 17, 2021

Uh oh!

CasperN commented Jun 17, 2021

Uh oh!

mikkelfj commented Jun 17, 2021

Uh oh!

mikkelfj commented Jun 17, 2021

Uh oh!

mikkelfj commented Jun 17, 2021

Uh oh!

mikkelfj commented Jun 17, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mikkelfj commented Jun 17, 2021

Uh oh!

CasperN commented Jun 17, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mikkelfj commented May 18, 2021 •

edited

Loading

mikkelfj commented Jun 10, 2021 •

edited

Loading

mikkelfj commented Jun 17, 2021 •

edited

Loading

CasperN commented Jun 17, 2021 •

edited

Loading