Add the file a symbol is declared in to Reflection #6613
If we move a code-generator to depend on Reflection, it may need to know which file something was declared in to properly name generated files.
This is currently a bit broken since the .bfbs files contain absolute paths with idiosyncratic details from my laptop, which will break the CI tests. I think the solution is to make paths relative to where flatc was invoked.
While I really appreciate small incremental PRs, it is also hard to say whether this makes sense, since I can't tell how a codegen would use it. Do current codegens that need this information use paths relative to the current dir, or to the output dir? I thought they mostly use namespaces as directory names. Also, rather than a string that is duplicated many times over, an index into a vector of unique filenames may be nicer?
I was under the impression that current codegen generates a file for every .fbs file. In the future, I think flatc should output one bfbs which is consumed by the code generator to create whatever structure it wants, perhaps one generated file per namespace.
flatc **.fbs --reflection-fbs > compilation.bfbs
${lang}_generator compilation.bfbs # ${lang}_generator decides on the output _generated files
I thought about that, but decided using shared strings is more convenient with the same de-duplication. |
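The index-based alternative raised above can be sketched as follows. This is only an illustration in Python, not code from flatc or the Reflection schema; `FilenameTable`, `intern`, and the sample symbol and file names are all hypothetical.

```python
class FilenameTable:
    """Stores each distinct filename once and hands out integer indices."""

    def __init__(self):
        self.names = []    # unique filenames, in first-seen order
        self._index = {}   # filename -> position in self.names

    def intern(self, filename):
        """Return the index for filename, adding it on first use."""
        if filename not in self._index:
            self._index[filename] = len(self.names)
            self.names.append(filename)
        return self._index[filename]


table = FilenameTable()
# Hypothetical symbols mapped to the index of the file that declared them.
declarations = {
    "Monster": table.intern("monster.fbs"),
    "Weapon": table.intern("monster.fbs"),
    "Vec3": table.intern("math.fbs"),
}
```

Shared strings, as chosen in the reply above, give the same de-duplication while letting each symbol keep a plain string field instead of an index that has to be looked up.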
So then a It's somewhat tricky, because the previous model gave exact control over what gets re-generated. For example if you have a.fbs and b.fbs that both include c.fbs, then all 3 of these can be regenerated into a So we could have the situation where we write |
Yes. I think reading bfbs and reconstructing the symbols per file will be very easy.
The 1-bfbs-per-file might be a better idea for highly parallel incremental compilation... I'm just going to assume having 1 big bfbs is sufficient for all current use cases.
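A minimal sketch of reconstructing the symbols per file from one combined bfbs, assuming each symbol carries the name of the file it was declared in. The pair layout and the a.fbs / b.fbs / c.fbs contents are made up for illustration; a real generator would read these values from the Reflection data rather than from a Python list.

```python
from collections import defaultdict


def symbols_by_file(symbols):
    """Group (symbol_name, declaration_file) pairs by declaring file."""
    grouped = defaultdict(list)
    for name, declaration_file in symbols:
        grouped[declaration_file].append(name)
    return dict(grouped)


# Hypothetical contents of one combined bfbs covering three schema files.
symbols = [
    ("TableA", "a.fbs"),
    ("TableB", "b.fbs"),
    ("Common", "c.fbs"),
    ("CommonEnum", "c.fbs"),
]
per_file = symbols_by_file(symbols)
```

A generator that wants one output per .fbs file would iterate over `per_file`; one that wants a different layout can ignore the grouping.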
If we go with the model where |
Well, that means you need to mark which definitions were included, and in such a way that you can distinguish between "included" and "included but also appeared on the command line to create this file".
I think the options/metadata can just include the list of files from the command line invocation. |
The current problem is that the I could make cmake cd into tests to build the bfbs files, so it matches Thoughts @aardappel ? Also, before this is submitted, I should add a |
looping in @mikkelfj who was interested in the topic of bfbs for codegen, and @dbaileychess who may have an opinion. The question is how to encode file paths. Though if this is complicated, maybe that's a hint we shouldn't be adding file paths, but instead represent this information differently. We should check how current generators use paths, and see what would be enough information for that generator to run from a bfbs, but no more. |
If we get signoff from all language maintainers (besides C++) that we can generate files based on namespaces, and not fbs filepaths, then I agree we should skip this whole problem. However, I see some challenges w.r.t. incremental compilation: How will that work when generated files are disconnected from .fbs files? Maybe this will make blaze integration hard somehow? Fwiw, protobuffers keeps filenames in there, "relative to the root of the source tree" |
FlatCC currently depends on the case-insensitive basename of the fbs file (mapped to uppercase, although lowercase would have been better). FlatCC will not process an included fbs file if the same basename has already been seen, regardless of path. Output is placed in the current directory or the specified -o path, and there will be no conflicts because output files are prefixed by basename. Also, in generated C code, dependencies are protected with include guards based on the basename. Together this ensures that the generated output is always the same for a given fbs file, regardless of how it is included. If FlatCC were to generate code from a new version of bfbs files, it would need the basename of the filename, but the path would not be important. Note that paths can be ambiguous due to hardlinks or symbolic mappings. There is a limit to this approach because there could be two separate fbs files with the same basename, but it is simpler to disallow such ambiguity and rename files if necessary. Note that this is orthogonal to namespaces. In C each symbol is prefixed by the namespace name, which is unrelated to the file basename.
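The basename rule described above can be approximated like this. It is a Python sketch of the idea, not FlatCC's C implementation; `basename_key` and `unique_includes` are hypothetical helpers, and stripping the `.fbs` extension from the key is an assumption.

```python
import ntpath  # splits on both "/" and "\\" regardless of the host OS


def basename_key(path):
    """Uppercased basename without extension, ignoring the directory part."""
    stem, _extension = ntpath.splitext(ntpath.basename(path))
    return stem.upper()


def unique_includes(paths):
    """Keep only the first path seen for each case-insensitive basename."""
    seen = set()
    kept = []
    for path in paths:
        key = basename_key(path)
        if key in seen:
            continue  # same basename already processed, whatever its path
        seen.add(key)
        kept.append(path)
    return kept


kept = unique_includes(["a/common.fbs", "b/COMMON.fbs", "monster.fbs"])
```

Because the key ignores the directory, two different files that share a basename collapse into one, which is why the comment above recommends disallowing such names.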
@mikkelfj, what if you generate a file based on the namespace? I guess that'd be a pretty big, backwards incompatible, change. Backwards compatibility seems like a compelling enough reason to support filenames. I'm now leaning towards normalizing paths w.r.t a given |
yes, most generators seem to place files relative to the |
we do need a |
I should add that this approach is also intended to (and does) work well for incremental builds. Notably, it does not matter if output is generated by one instance of flatcc processing several included files, by several instances of flatcc for each fbs file, or from several fbs files on the command line, and it does not matter from which directory flatcc is invoked. It may result in output files being overwritten by separate instances of flatcc, but the output would then be replaced by the same content. However, the relative paths are still important inside fbs include statements for the purpose of locating the included files. By extension: if flatcc were to generate code from bfbs-v2 files, it should not matter if code is generated from one or several bfbs files, but one bfbs file can contain the data of multiple fbs files. I don't think that bfbs files should include other bfbs files; they should hold information for all included fbs files.
That would not be practical. The primary concern is how the tool is being used in build systems, and idempotent output. If you used namespace-based files, you would have a conflict if two separate instances of flatcc process separate fbs files that refer to the same namespace. Also, the namespace could potentially be very large in terms of symbolic content. This is akin to the C++ std namespace, where you don't put all of the namespace in one file.
I'm not sure if this is important, at least I don't think it matters to flatcc. However, visibility does matter: if root file A.fbs includes files B.fbs and C.fbs and B defines a symbol that might affect C, then that symbol should be invisible to C. Otherwise the output of C depends on whether it is included by A or not. An exception to this is attribute declarations where you can include a file that declares attributes and use it in other files. I tend to think that attribute declarations are not that useful in fbs files. |
I don't have an issue with normalizing paths relative to some root as long as it is not permitted to have conflicting basenames (case insensitive due to certain file systems). The path could still prove useful for some language backends. FlatCC would just extract the basename from that path. FlatCC might also learn to create subdirectories based on relative paths at least as an option, but I do not see any compelling reason for it as long as basenames are unique. |
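A rough sketch of both conditions discussed here: expressing a path relative to a project root, and rejecting case-insensitive basename conflicts. It is written in Python with POSIX-style paths only; `relative_to_root` and `conflicting_basenames` are hypothetical helpers, and Windows drive handling is deliberately left out.

```python
import posixpath


def relative_to_root(path, root):
    """Express an absolute path relative to root, using forward slashes."""
    return posixpath.relpath(posixpath.normpath(path), posixpath.normpath(root))


def conflicting_basenames(paths):
    """Map each case-insensitive basename used by several paths to those paths."""
    by_basename = {}
    for path in paths:
        key = posixpath.basename(path).lower()
        by_basename.setdefault(key, set()).add(path)
    return {key: sorted(group) for key, group in by_basename.items() if len(group) > 1}


relative = relative_to_root("/home/me/proj/tests/../tests/monster.fbs", "/home/me/proj")
conflicts = conflicting_basenames(["a/Common.fbs", "b/common.fbs", "monster.fbs"])
```

A backend that only needs the basename can take it from the normalized path, while other backends can still use the directory part.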
Also, on namespace-based filenames: flatcc generates a large number of inline functions for each symbol. This is usually not a problem, but it starts to run into some of the problems for which C++ requires precompiled headers. Therefore it is helpful to keep files relatively small so that end-user programs only need to include what they need.
Ok, so given the above, I've renormalized filepaths relative to a project root, specified with the |
I think visibility checks should be on the side of |
Bump @mikkelfj @aardappel |
Not sure if you bumped for the last comment on visibility or for general review. For general review, I have not checked the commits, but if paths are added as discussed and if the basenames (case-insensitive) do not conflict when stripping the path prefix, then I'm OK with it. I am a bit concerned that flatcc may have to pull in a complex path library to normalize paths (I have one, I just don't want to publish it unless necessary; notably it gets complicated with Windows and drives). Instead flatcc might just record paths by their basename, at least initially.
Wrt. visibility: A symbol can be visible in one context and not in another. The current reflection schema can only tell where the symbol is defined, not which other files the symbol may be available to. However, as @CasperN suggests, this can be checked at the parser semantics level, and indeed this is what flatcc does on its AST. There are both type references and enum name references. Enums are a bit more tricky, as discussed below: In flatcc the code generator has access to the root schema object, which has a visibility map: Here is an example of where a check is made:
Assume an include structure such as A includes B and C. B and C include nothing. If a type is defined in C it is not visible in B and vice versa, because include order is irrelevant. B and C also cannot see symbols defined in A. A can see symbols in both B and C because include statements are always first in a schema file. If there are name conflicts between B and C, then A should complain about this, but if the conflicting type is not referenced, the error is survivable. If a type defined in C references a type defined in B or in A, that should raise an error at parse time, before bfbs generation. This is how flatcc works. The code generators do not reference the visibility map for this problem. However, there is an exception to this: the flatcc generated JSON parser checks if an enum definition is visible in the code generator.
I don't remember exactly why, but I assume this is because enums can be referenced at runtime in input JSON files and the code generator needs to know which enums it should add to its lookup logic. This cannot be checked at the semantic stage.
Yea, that's one risk that I'm a bit concerned about. I'm not at all familiar with Windows but am hoping that it's a reasonable assumption that all
Okay, I made an issue for this, #6697. However, aside from noting that the visibility check should happen before the reflection/IR, it's not relevant to this PR.
@aardappel @mikkelfj ok it seems the last check failed due to CI rate limiting. Unless there are more comments, I'll submit this tomorrow. |
I agree about the check per the above, but any thoughts on dealing with enum visibility for parser generation? |
I think the JSON parser should know the type of the enum it's trying to read out of the JSON, based on the surrounding table/struct field, and try to match accordingly.
@CasperN That is the default, but you can assign other enum values to an enum field, or to any integer field. These enums are found based on enums defined in the same namespace, or if prefixed, in other namespaces. In the default case, the enum is necessarily visible if the field is valid. In the other cases it depends on the include path.
Oh, that's annoying that we allow this... I think we should just not check visibility based on json2fbs myschema.bfbs foo.json > foo.bfbs in which case, all the |
flatcc generates a fully compiled and optimized JSON parser, it cannot process a schema ad hoc. |
That is, flatcc takes a schema and generates a schema-specific parser in C, which is compiled before the parser ever sees a JSON file. If the parser were to be generated from a bfbs file, it would need enum visibility information stored in the bfbs file for the above reasons.
From memory: The JSON parser parses any integer field by first trying an integer, then if that fails, parsing an unqualified enum name in the same namespace as the table or struct being parsed. If that fails, it attempts to parse a qualified enum name with the enum type as prefix, and if that fails, it attempts a namespace prefix also. I don't think it parses namespace dot unqualified enum name. I do not recall what happens if two enum names conflict in this approach, or maybe enums must be qualified when the field is not an enum type. If the type is an enum type, it will of course prioritize the names of that enum. In practice this is done by maintaining an array of enum parsers for each integer field being parsed. Each fbs file produces a number of enum parsers that can be used by different fields. If an enum is defined in an included schema file, the parser will be defined in the correspondingly generated JSON parser file. An enum parser may call parsers from an included file. That is, a symbolic name parser can cover multiple enums at once by matching a qualified name and then calling the corresponding enum parser.
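The lookup order recalled above might look roughly like this when written out. This is a Python sketch of the described order only, not flatcc's generated C parser; the `enums` layout, the `parse_integer_field` name, and the sample enums are assumptions. Conflicting unqualified names are resolved here by first match, which the comment above leaves unspecified.

```python
def parse_integer_field(token, field_namespace, enums):
    """Resolve a JSON token for an integer field.

    enums maps (namespace, enum_name) -> {member_name: value}.
    """
    # 1. Plain integer.
    try:
        return int(token)
    except ValueError:
        pass
    parts = token.split(".")
    if len(parts) == 1:
        # 2. Unqualified member of an enum in the field's own namespace.
        for (namespace, _enum_name), members in enums.items():
            if namespace == field_namespace and token in members:
                return members[token]
    elif len(parts) == 2:
        # 3. Enum type prefix within the field's namespace, e.g. "Color.Red".
        members = enums.get((field_namespace, parts[0]), {})
        if parts[1] in members:
            return members[parts[1]]
    else:
        # 4. Namespace and enum type prefix, e.g. "Other.Shape.Circle".
        namespace = ".".join(parts[:-2])
        members = enums.get((namespace, parts[-2]), {})
        if parts[-1] in members:
            return members[parts[-1]]
    raise ValueError(f"cannot resolve {token!r}")


# Hypothetical enums from two namespaces.
enums = {
    ("MyGame", "Color"): {"Red": 0, "Green": 1},
    ("Other", "Shape"): {"Circle": 5},
}
```

Steps 2 to 4 only make sense for enums that are visible from the schema file being compiled, which is why the visibility information matters for a generated parser.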
Also, I'm not sure exactly what happens about searching parent namespaces, but I agree that in principle names should be searched inside out. EDIT: I think that flatcc requires a fully qualified name if the name is not local, but it can be with or without enum type prefix depending on context and still being considered a local name. |
Anyway, scope resolution isn't really the issue here. That information is available in the bfbs file (assuming enums list what namespace they belong to). The issue is which enums are visible to a specific fbs file. It could be solved by having a table of filenames at schema level where each file lists the files that it directly includes either by name or by index in the file table. |
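One way to picture the file table suggested above, with each entry listing its direct includes by index. This Python sketch uses made-up field names (`name`, `includes`) rather than anything from the Reflection schema, and it assumes visibility follows includes transitively, which the thread does not confirm.

```python
# Hypothetical schema-level file table: A.fbs includes B.fbs and C.fbs.
files = [
    {"name": "A.fbs", "includes": [1, 2]},
    {"name": "B.fbs", "includes": []},
    {"name": "C.fbs", "includes": []},
]


def visible_files(files, index):
    """Indices of files whose symbols are visible from files[index].

    A file sees itself plus everything reachable through its includes.
    """
    seen = set()
    stack = [index]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(files[current]["includes"])
    return seen


from_a = visible_files(files, 0)
from_b = visible_files(files, 1)
```

With this table, a generator can decide whether an enum declared in C.fbs may be offered to a parser generated for B.fbs: index 2 is not in `from_b`, so it may not.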
I think you should either:
|
The first option goes against the entire compilation model of flatcc. Parsing an fbs file and generating all included content into a single bfbs file sort of goes against the flatcc compilation model, but not really: the output generated for a single fbs file does not, or should not, depend on whether it comes from one huge bfbs file or from a smaller bfbs file. This is because of idempotence. It may happen that output is overwritten by processing multiple bfbs files, but it will be overwritten with the same content. A list of fbs files that a given bfbs file covers, and what each individual fbs file includes, will cover that information. If you also added a numeric index holding the parsing order of each type in the bfbs file, you could in principle largely recreate the input fbs files, although I do not argue that you should.
Somewhat unrelated but to my last comment:
Some users rely on fbs parsed into bfbs files to manage their own schemas for completely other purposes just because fbs is a convenient format and bfbs is easy to process. Adding the parsing order could be useful to those users, although this is of course a fringe use case. The file list might also be useful for the same reasons. |
yea ok, fair enough and it sounds like you can't get your use case without the includes list. I'll make a backwards-incompatible change and replace
|
yes, that is fine. The only thing is - is the includes path relative to the including file, or to the project? |
To the project |
Add the file a symbol is declared in to Reflection data.
If we make a code-generator to depend on Reflection,
it may need to know the name of .fbs files to name the
generated files properly.
#6428
(This should not be submitted until after the release)