Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the file a symbol is declared in to Reflection #6613

Merged
merged 23 commits into from
Jun 17, 2021

Conversation

CasperN
Copy link
Collaborator

@CasperN CasperN commented May 3, 2021

Add the file a symbol is declared in to Reflection data.

If we make a code-generator to depend on Reflection,
it may need to know the name of .fbs files to name the
generated files properly.

#6428

(This should not be submitted until after the release)

If we move a code-generator to depend on Reflection,
it may need to know which file something was declared in
to properly name generated files.
@github-actions github-actions bot added c++ codegen Involving generating code from schema labels May 3, 2021
@CasperN
Copy link
Collaborator Author

CasperN commented May 3, 2021

@aardappel

This is currently a bit broken since the .bfbs files contains absolute paths with idiosyncratic stuff from my laptop, which will break against the CI tests. I think the solution is to make paths relative to where flatc was invoked

@aardappel
Copy link
Collaborator

While I really appreciate small incremental PRs, it is also hard to say if this makes sense, since I can't tell how a codegen would use it. Do current codegens that need this information use paths relative to the current dir? or output dir? I thought they mostly use namespaces as directory names?

Also, rather than a string that is duplicated many times over, an index into a vector of unique filenames may be nicer?

@CasperN
Copy link
Collaborator Author

CasperN commented May 3, 2021

While I really appreciate small incremental PRs, it is also hard to say if this makes sense, since I can't tell how a codegen would use it. Do current codegens that need this information use paths relative to the current dir? or output dir? I thought they mostly use namespaces as directory names?

I was under the impression that current codegen generates a file for every .fbs file. In the future, I think flatc should output one bfbs which is consumed by the code generator to create whatever structure it wants -- perhaps one generated file per namespace.

flatc **.fbs --reflection-fbs > compilation.bfbs
${lang}_generator compilation.bfbs # ${lang}_generator decides on the output _generated files

Also, rather than a string that is duplicated many times over, an index into a vector of unique filenames may be nicer?

I thought about that, but decided using shared strings is more convenient with the same de-duplication.

@aardappel
Copy link
Collaborator

So then a bfbs based codegen has to go find all declarations with the same filename to replicate the behavior of current codegens? Would be simpler to emit 1 bfbs per fbs, but that will introduce a lot of duplication in the bfbs files.

It's somewhat tricky, because the previous model gave exact control over what gets re-generated. For example if you have a.fbs and b.fbs that both include c.fbs, then all 3 of these can be regenerated into a .h independently. But the bfbs must include the included definitions, and we may not want to regenerate the included definitions.

So we could have the situation where we write flatc a.fbs d.fbs --reflection-fbs > compilation.bfbs (where as before, c.fbs is included), but the codegen should then only emit for a and d, not c.

@CasperN
Copy link
Collaborator Author

CasperN commented May 7, 2021

So then a bfbs based codegen has to go find all declarations with the same filename to replicate the behavior of current codegens?

Yes. I think reading bfbs and reconstructing the symbols per file will be very easy.

Would be simpler to emit 1 bfbs per fbs, but that will introduce a lot of duplication in the bfbs files.

The 1-bfbs-per-file might be a better idea for highly parallel incremental compilation... I'm just going to assume having 1 big bfbs is sufficient for all current use cases.

It's somewhat tricky, because the previous model gave exact control over what gets re-generated. For example if you have a.fbs and b.fbs that both include c.fbs, then all 3 of these can be regenerated into a .h independently. But the bfbs must include the included definitions, and we may not want to regenerate the included definitions.

If we go with the model where compilation.bfbs is the only input to the code generator (which I slightly prefer, even if the code generator and parser are in the same binary), then we'd need to plumb through something like IdlOptions through the bfbs, and part of that data would include this option to not regenerate included definitions.

@aardappel
Copy link
Collaborator

and part of that data would include this option to not regenerate included definitions.

Well, that means you need to mark which definitions were in included, and in such a way that you can distinguish between "included" and "included but also appeared on the command-line to create this file"

@CasperN
Copy link
Collaborator Author

CasperN commented May 7, 2021

Well, that means you need to mark which definitions were in included, and in such a way that you can distinguish between "included" and "included but also appeared on the command-line to create this file"

I think the options/metadata can just include the list of files from the command line invocation.

@CasperN
Copy link
Collaborator Author

CasperN commented May 15, 2021

The current problem is that the source_filename argument comes from argv.
When I run make clean && make, monster_test.bfbs will have full paths in there.
When I run cd tests/; ./generate_code.sh, monster_test.bfbs will have relative paths.
This diff is breaking CI.

I could make cmake cd into tests to build the bfbs files, so it matches ./generate_code.sh.
That'll make tests pass but seems a little unsatisfactory -- since it pushes complexity into the flatc user.
I'm considering adding a --project_root flag to flatc and normalize paths inside flatc.
This also makes it easier to hide personal filesystem data if users pass around the reflection bfbs.

Thoughts @aardappel ?


Also, before this is submitted, I should add a --bfbs-sourcefiles flag to make this extra data optional.

@aardappel
Copy link
Collaborator

looping in @mikkelfj who was interested in the topic of bfbs for codegen, and @dbaileychess who may have an opinion.

The question is how to encode file paths.

Though if this is complicated, maybe that's a hint we shouldn't be adding file paths, but instead represent this information differently. We should check how current generators use paths, and see what would be enough information for that generator to run from a bfbs, but no more.

@CasperN
Copy link
Collaborator Author

CasperN commented May 17, 2021

Though if this is complicated, maybe that's a hint we shouldn't be adding file paths, but instead represent this information differently. We should check how current generators use paths, and see what would be enough information for that generator to run from a bfbs, but no more.

If we get signoff from all language maintainers (besides C++) that we can generate files based on namespaces, and not fbs filepaths, then I agree we should skip this whole problem.

However, I see some challenges w.r.t. incremental compilation: How will that work when generated files are disconnected from .fbs files? Maybe this will make blaze integration hard somehow?

Fwiw, protobuffers keeps filenames in there, "relative to the root of the source tree"
https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.compiler.plugin.pb
https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.descriptor.pb

@mikkelfj
Copy link
Contributor

FlatCC currently depends on case-insentitive basename of the fbs file (mapped to uppercase although lowercase would have been better). FlatCC will not process an included fbs file if the same basename has already been seen, regardless of path. Output is placed in the current directory or the specified -o path and there will be no conflicts because output files are prefixed by basename. Also, in generated C code, dependencies are protected with include guards based on the basename. This together ensures that the generated output is always the same for a given fbs file, regardless of how it is included.

If FlatCC were to generate code from a new version of bfbs files, it would need the basename of the filename, but the path would not be important.

Note that paths can be ambigous due to hardlinks or symbolic mappings.

There is a limit to this approach because there could be two separate fbs files with the same basename, but it is simpler to disallow such ambiguity and rename files if necessary.

Note that this is orthogonal to namespaces. In C each symbol is prefixed by the namespace name which is unrelated to the file basename.

@CasperN
Copy link
Collaborator Author

CasperN commented May 17, 2021

If FlatCC were to generate code from a new version of bfbs files, it would need the basename of the filename, but the path would not be important.

@mikkelfj, what if you generate a file based on the namespace? I guess that'd be a pretty big, backwards incompatible, change.

Backwards compatibility seems like a compelling enough reason to support filenames. I'm now leaning towards normalizing paths w.r.t a given --project_root location.

@aardappel
Copy link
Collaborator

yes, most generators seem to place files relative to the -o path (which a bfbs file would still get specified externally if used for codegen), and namespace inside of that.. it should not need paths.

@aardappel
Copy link
Collaborator

we do need a bool to indicate any types that are from files that were included but not present on the command-line.

@mikkelfj
Copy link
Contributor

I should add that this approach also is intended to (and do) work well for incremental builds. Notably it does not matter if output is generated by one instance of flatcc processing several included files, or several instances of flatcc for each fbs file, or several fbs files on the command line, and it does not matter from which directory flatcc is instantiated. It may result in output files being overridden by seperate instances of flatcc, but the output would be replaced by the same content then.

However, the relative paths are still important insided fbs include statements for the purpose of locating the included files.

By extension: if flatcc were to generate code from bfbs-v2 files, it should not matter if code is generated from one or several bfbs files, but one bfbs file can contain the data of multiple fbs files.

I don't think that bfbs files should include other bfbs files, but hold information for all included fbs files.
It could see a theoretical problem with very large sets of included files and potentially long incremental compilation times to build the bfbs file upon change. But the alternative seems to be one bfbs file per fbs file which isn't really helpful.

@mikkelfj
Copy link
Contributor

@mikkelfj, what if you generate a file based on the namespace? I guess that'd be a pretty big, backwards incompatible, change.

That would not be practical. The primary concern is how the tool is being used in build systems, and idempotent output.

If you used namespace based files, you would have a conflict if two seperate instances of flatcc process separate fbs files that refer to the same namespace. Also, the namespace could potentially be very large in terms of symbolic content. This is akin to C++ std namespace where you don't put all of the namespace in one file.

@mikkelfj
Copy link
Contributor

we do need a bool to indicate any types that are from files that were included but not present on the command-line.

I'm not sure if this is important, at least I don't think it matters to flatcc. However, visibility does matter: if root file A.fbs includes files B.fbs and C.fbs and B defines a symbol that might affect C, then that symbol should be invisible to C. Otherwise the output of C depends on whether it is included by A or not. An exception to this is attribute declarations where you can include a file that declares attributes and use it in other files. I tend to think that attribute declarations are not that useful in fbs files.

@mikkelfj
Copy link
Contributor

mikkelfj commented May 18, 2021

Backwards compatibility seems like a compelling enough reason to support filenames. I'm now leaning towards normalizing paths w.r.t a given --project_root location.

I don't have an issue with normalizing paths relative to some root as long as it is not permitted to have conflicting basenames (case insensitive due to certain file systems). The path could still prove useful for some language backends. FlatCC would just extract the basename from that path.

FlatCC might also learn to create subdirectories based on relative paths at least as an option, but I do not see any compelling reason for it as long as basenames are unique.

@mikkelfj
Copy link
Contributor

Also, on namespace based filenames: flatcc generates a large number of inline functions for each symbol - this is usually not a problem, but it starts to see some of the problems where C++ require precompiled headers. Therefore it is helpful to keep files relatively small such that end-user programs only need to include what they need.

@CasperN
Copy link
Collaborator Author

CasperN commented May 25, 2021

Ok, so given the above, I've renormalized filepaths relative to a project root, specified with the --project-root flag, which defaults to ./. Given --project-root foo, this converts foo/bar/baz.fbs to //bar/baz.fbs. I borrowed the // representing "project root" convention from protobuffers.

@github-actions github-actions bot added the grpc label May 25, 2021
@CasperN
Copy link
Collaborator Author

CasperN commented Jun 8, 2021

WRT visibility and idempotentce, we also need to know which files a file has included, or some other kind of visibility map.

I think visibility checks should be on the side of flatc or flatcc, not on the code generator. Code generators wouldn't need this information so it shouldn't be in the IR. To properly import things in header files, you can back out the include graph by examining the types. We can provide a helper library for that. Whether flatc needs to implement visibility/idempotence checks, that should be a separate discussion and PR.

@github-actions github-actions bot removed the documentation Documentation label Jun 8, 2021
@CasperN
Copy link
Collaborator Author

CasperN commented Jun 10, 2021

Bump @mikkelfj @aardappel

@mikkelfj
Copy link
Contributor

mikkelfj commented Jun 10, 2021

Not sure if you bumped for last comment on visibility or general review. For general review I have not checked the commits but if paths are added as discussed and if the basenames (case insenstive) do not conflict when stripping path prefix, then I'm OK with it. I am a bit concerned that flatcc may have to pull in a complex path library to normalize paths (I have one, I just don't want to publish it unless necessary - notably it gets complicated with Windows and drives). Instead flatcc might just record paths by their basename, at least initially.

Wrt. visibility:

A symbol can be visible in one context and not in another. The current reflection schema can only tell where the symbol is defined, not which other files the symbol may be available to. However, as @CasperN suggests, this can be checked at parser semantics level and indeed this is what flatcc does on its AST.

There are both type references and enum name references. Enums are bit more tricky as discussed below:

In flatcc the code generator has access to the root schema object which has a visiblity map:
https://github.com/dvidelabs/flatcc/blob/32aabef14ccc0e3ac15602c99476d25dcd230d19/src/compiler/symbols.h#L388-L394

Here is an example of where a check is made:
https://github.com/dvidelabs/flatcc/blob/32aabef14ccc0e3ac15602c99476d25dcd230d19/src/compiler/semantics.c#L312-L313

Assume and include structure such as A includes B and C. B and C includes nothing. If a a type is defined in C it is not visible in B and vice versa because include order is irrelevant. B and C also cannot see symbols defined in A. A can see symbols in both A and B because include statements are always first in a schema file. If there are name conflicts between B and C, then A should complain about this, but if the conflicting type is not referenced, the error is survivable.

If a type defined in C references a type defined in C or in A, that should raise an error at parse time before bfbs generation. This is how flatcc works. The code generators do not reference the visibility map for this problem.

However, there is an exception to this: the flatcc generated JSON parser checks if an enum definition is visible in the code generator. I don't remember exactly why, but I assume this is because enums can be can referenced at runtime in input JSON files and the code generator need to know which enums it should add to its lookup logic. This cannot be checked at the semantic stage.

@CasperN
Copy link
Collaborator Author

CasperN commented Jun 14, 2021

notably it gets complicated with Windows and drives).

Yea that's one risk that I'm a bit concerned about. I'm not at all familiar with windows but am hoping that its a reasonable assumption that all .fbs files can be normalized relative to some parent directory. Is it even possible to include things from different drives (without symlinks or something), I'd imagine our compiler would error...

assume and include structure such as A includes B and C...

Okay, I made an issue for this, #6697. However, aside from noting the visibility check should happen before the reflection/IR, its not relevant to this PR.

@CasperN
Copy link
Collaborator Author

CasperN commented Jun 15, 2021

@aardappel @mikkelfj ok it seems the last check failed due to CI rate limiting. Unless there are more comments, I'll submit this tomorrow.

@mikkelfj
Copy link
Contributor

Okay, I made an issue for this, #6697. However, aside from noting the visibility check should happen before the reflection/IR, its not relevant to this PR.

I agree about the check per the above, but any thoughts on dealing with enum visibility for parser generation?

@CasperN
Copy link
Collaborator Author

CasperN commented Jun 17, 2021

I think the json parser should know the type of the enum its trying to read out of the JSON, based on the surrounding table/struct field, and try to match accordingly.

@CasperN CasperN merged commit c58ae94 into google:master Jun 17, 2021
@mikkelfj
Copy link
Contributor

@CasperN That is the default, but you can assign other enum values to an enum field, or to any integer field. These enums are found based enums defined in the same namespace, or if prefixed, in other namespaces. In the default case, the enum is necessarily visible if the field is valid. In the other cases it depends on the include path.

@CasperN
Copy link
Collaborator Author

CasperN commented Jun 17, 2021

you can assign other enum values to an enum field, or to any integer field

Oh, that's annoying that we allow this... I think we should just not check visibility based on .fbs include path. I think the future is to invoke something like

 json2fbs myschema.bfbs foo.json > foo.bfbs

in which case, all the .fbs visibility stuff was covered by whatever generated myschema.bfbs. Once we're after that stage, imo, the visibility rules in foo.json should be C++ name resolution rules. I.e. if assigning to foo::bar::Table::field you should look up the enum in ::foo::bar then ::foo then ::

@mikkelfj
Copy link
Contributor

flatcc generates a fully compiled and optimized JSON parser, it cannot process a schema ad hoc.
Also, it can be very highly useful to assign enums to other integer fields.

@mikkelfj
Copy link
Contributor

That is, flatcc takes a schema and generates a schema specific parser in C which is compiled before the parser ever sees a JSON file. If the parser were to be generated by a bfbs file, it would need enum visibility information stored in the bfbs file for the above reasons.

@mikkelfj
Copy link
Contributor

From memory: The json parser parses any integer field by first trying an integer, then if that fails, parsing an unqualified enum name in the same namespace as the table or struct being parsed. If that fails, it attempts to parse a qualified enum name with the enum type as prefix, and if that fails, it attempts a namespace prefix also. I don't think it parses namespace dot unqualified enum name.

I do not recall what happens if two enum names conflict in this approach, or maybe enums must be qualified when the field is not an enum type. If the type is an enum type, it will of course prioritize the names of that enum.

In praxis this is done by maintaing an array of enum parsers for each integer field being parsed. Each fbs file produces a number of enum parsers that can be used be different fields. If an enum is defined in an included schema file, the parser will be defined in the correspodingly generated JSON parser file. An enum parser may call parsers from an included file. That is, a symbolic name parser can cover multiple enums at once by matching a qualified name and then calling the corresponding enum parser.

@mikkelfj
Copy link
Contributor

mikkelfj commented Jun 17, 2021

Also, I'm not sure exactly what happens about searching parent namespaces, but I agree that in principle names should be searched inside out.

EDIT: I think that flatcc requires a fully qualified name if the name is not local, but it can be with or without enum type prefix depending on context and still being considered a local name.

@mikkelfj
Copy link
Contributor

Anyway, scope resolution isn't really the issue here. That information is available in the bfbs file (assuming enums list what namespace they belong to). The issue is which enums are visible to a specific fbs file.

It could be solved by having a table of filenames at schema level where each file lists the files that it directly includes either by name or by index in the file table.

@CasperN
Copy link
Collaborator Author

CasperN commented Jun 17, 2021

Each fbs file produces a number of enum parsers that can be used be different fields.

I think you should either:

  • not generate parsers per .fbs file
  • create a reflection.bfbs for each .fbs file and generate enum parsers from those

@mikkelfj
Copy link
Contributor

First option goes against the entire compilation model of flatcc.
The second option could almost work because it implies where an enum is defined, but it fails because a) it would not work with flatc generated bfbs files unless this approach is adopted universally, and b) because knowledge of other bfbs files are required, and thus you need a list of included files - which could work - and which actually better matches the flatcc compilation model, but comes at some complexity in reading the bfbs files.

The thing is, parsing an fbs file and generating all included content into a single bfbs file sort of goes against the flatcc compilation model, but nor really: The thing is that the output generated for a single fbs file does not, or should not, depend on whether it comes from one huge bfbs file or from a smaller bfbs file. This is because of idempotence. It may happen that output is overwritten by processing multiple bfbs files, but they will be overwritten witht the same content.

A list of fbs files that a given bfbs file covers, and what each individual fbs file includes, will cover that information.

If you also added a numeric index holding the parsing order of each type in the bfbs file, you could in principle largely recreate the input fbs files, although I do not argue that you should.

@mikkelfj
Copy link
Contributor

Somewhat unrelated but to my last comment:

If you also added a numeric index holding the parsing order of each type in the bfbs file, you could in principle largely recreate the input fbs files, although I do not argue that you should.

Some users rely on fbs parsed into bfbs files to manage their own schemas for completely other purposes just because fbs is a convenient format and bfbs is easy to process. Adding the parsing order could be useful to those users, although this is of course a fringe use case. The file list might also be useful for the same reasons.

@CasperN CasperN deleted the declfile branch June 17, 2021 18:09
@CasperN
Copy link
Collaborator Author

CasperN commented Jun 17, 2021

Some users rely on fbs parsed into bfbs files to manage their own schemas for completely other purposes just because fbs is a convenient format and bfbs is easy to process. Adding the parsing order could be useful to those users, although this is of course a fringe use case. The file list might also be useful for the same reasons.

yea ok, fair enough and it sounds like you can't get your use case without the includes list.

I'll make a backwards-incompatible change and replace fbs_files: [string]; with fbs_files: [SchemaFile];
where

table SchemaFile {
  name: string (required, key);
  includes: [string];
}

@mikkelfj
Copy link
Contributor

yes, that is fine. The only thing is - is the includes path relative to the including file, or to the project?

@CasperN
Copy link
Collaborator Author

CasperN commented Jun 17, 2021

To the project

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c++ codegen Involving generating code from schema grpc rust
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants