Skip to content

Conversation

@edsko
Copy link
Collaborator

@edsko edsko commented Nov 15, 2025

Closes #1220.

@edsko
Copy link
Collaborator Author

edsko commented Nov 15, 2025

@TravisCardwell @dschrempf This PR works, but only for llvm-19.1.0 and up. The problem goes back to the detection of named vs anonymous declarations (#795, #856), something we've struggled quite a bit with. I'm not entirely sure how to proceed here; I can see two ways forward.

Option 1

First, we can skip the new test I add in this PR for llvm older than llvm-19.1. This also means that we'd then need to change qualPrelimDeclId so that for ordinary names, it ignores the UserProvided/ClangGenerated generated distinction, simply assuming that the name is user provided instead.

However, the way that we deal with macros in names is not actually great. The solution that we adopted previously is

-- If the expansion location and the spelling location /of the name/
-- are different, this means that the name is constructed using a
-- macro. This must therefore have been done by the user.

However, this is not quite accurate. If we have a macro that would construct the entire struct, for example, it may very well be that the struct anonymous.

Option 2

I'm wondering if there isn't an alternative solution. The heart of the problem is that later clang (not sure in which version this changed exactly) we cannot easily distinguish between these two declarations:

typedef struct     { .. } foo;
typedef struct foo { .. } foo;

because clang fills in that name foo for the struct. This is why we go through such lengths to figure out what the actual token in the file is, but that is awkward in the presence of macros, and even more awkward because we don't even get accurate location information about macro expansions in libclang < 19.1.

So here's my proposal: what if we don't distinguish either? In other words, we treat these two examples as exactly the same, just like clang does. For older clang, this would mean substituting foo for the empty string "" for the name of the struct, but that is easily done, I think. This does not mean we no longer have anonymous structs; for example, in

struct outer {
  struct {
    int inner_field;
  } outer_field;
}

that inner struct is still anonymous; clang does fill in a name here too, but we can easily recognize it (it's something like "struct (unnamed at ....)").

Proposal

So concretely, I'm proposing: an unnamed struct defined inside a typedef is not considered anonymous; instead, we consider its tag to be derived from the typedef name.

I think this can work, and it means we no longer have to jump through so many hoops (and then still fail) to make things work with libclang. It's a bit inconsistent, perhaps, but I feel like it's the better option given the constraints.

Thoughts?

@TravisCardwell
Copy link
Collaborator

TravisCardwell commented Nov 16, 2025

Consider the following example, where struct foo and the anonymous struct are different types.

struct foo { char a; char b; };
typedef struct { double x; double y; } foo;

Such terrible (yet valid) C code is an example of code that we cannot handle well by default: we generate conflicting data Foo types. Users may resolve the issue with a prescriptive binding specification that specifies different names. This is currently possible because we distinguish the following names.

  • struct foo
  • struct @foo
  • foo

With this proposal, we would no longer distinguish the first two; they would both be struct foo. I suspect that this may cause major issues internally. (Even the PrelimDeclIds would conflict, making it impossible to construct a consistent UseDeclGraph, DeclUseGraph, and DeclIndex like we currently do.) As for binding specifications, cname: struct foo would be ambiguous. How would we specify different names for each of these types?

@TravisCardwell
Copy link
Collaborator

I have been thinking about possibilities... While I have not yet thought of any particularly good ones, my best candidate so far is something that we previously decided to not do: global renaming. After parsing declarations, where anonymous declarations declared inside of a typedef are given the typedef name, we could rename declarations to resolve conflicts, based on the set of all declarations.

For example, we could rename by appending _# where # is a natural number. With the above C, there are two declarations named struct foo. After sorting, the first of these could keep the name struct foo, while the second must be renamed. We could select the first name in a list of candidates (["struct foo_" ++ show i | i <- [1..]]) that is not already used.

Note that using source locations would be a bad idea, because that would be extremely fragile.

This strategy is not very elegant, but users would only be exposed to it when dealing with particularly terrible C code. It would allow users to distinguish all declarations in binding specifications as well as selection predicates.

We could provide an hs-bindgen-cli info cnames command to (concisely) show which C names map to which source locations, to assist users in cases where the naming is confusing. (We might want to implement this anyway!)

Thoughts?

@TravisCardwell
Copy link
Collaborator

Note that test golden/typenames.h generates conflicting types with the same name. It is a minimal example, without using anonymous declarations, but I am mentioning it here because it relates to the global renaming proposal.

(I ran across this while reviewing changes for the hstypes work. The binding specification for this test has issues that are expected given the name collision.)

@dschrempf
Copy link
Collaborator

Is this only affecting structs and unions? Could we separate the situations

typedef struct foo {..} foo;
typedef struct     {..} foo;

using a hand-rolled parser on clang_tokenize? It seems like an easy task (but there may be many complications).

If not, globally checking for conflicting declarations may be the only way to go. I am a bit skeptical though, because, even though the bindings will be deterministic, there may be unexpected renamings. It also means that the names of the translated declarations of a module will depend on other modules in the closure which is a bit of a red flag.

@TravisCardwell
Copy link
Collaborator

Parsing tokens is indeed straightforward if there is no CPP. Macros significantly complicate things, and use of macros to generate type declarations is not uncommon practice. (Sometimes macros are used to implement naming conventions, with the body outside the macro. Sometimes macros are used to implement templating of similar types, with the whole declaration generated by the macro.)

This is what we are doing now. My understanding is that we are discovering that differences across different versions of LLVM/Clang (due to improvements as well as bugs) make this difficult/impossible to get correct across all supported versions.

Indeed, we avoided global renaming for good reasons. I just wonder if it is less complicated than the alternatives, especially given that users will only see it when translating particularly poorly written C.

It also means that the names of the translated declarations of a module will depend on other modules in the closure which is a bit of a red flag.

I am not sure I understand what you mean. Imports in a module are all qualified (aside from a small set of standard Prelude imports), so we only need to rename based on the set of names for the current main files being processed ([C.Decl]).

@dschrempf
Copy link
Collaborator

I am not sure I understand what you mean. Imports in a module are all qualified (aside from a small set of standard Prelude imports), so we only need to rename based on the set of names for the current main files being processed ([C.Decl]).

I think I miscomunicated, using module (which is related to Haskell mdoules) instead of "C file".

Is it true that global renaming means that other "C files" in the translation unit can influence the name of a declaration? That is, can the name of a declaration at hand change depending on other files we include, or do not include in the translation unit?

@TravisCardwell
Copy link
Collaborator

Yes: we would need to consider names for all of the declarations that we "parse," which (may) include declarations across transitive includes (depending on the parse predicate).

As a minimal example of when this matters, consider the following foo.h header.

struct foo { char a; char b; };

That declaration is translated to Haskell type Foo. Perhaps you later add #include "bar.h" to the top of that header, where bar.h is as follows.

typedef struct { double x; double y; } foo;

These types are valid C. There are no conflicts because struct foo is in the tagged namespace, while typedef ... foo is in the ordinary namespace. There is a name collision in Haskell, however. With my example strategy, we would name the typedef as Foo in Haskell and struct foo would get renamed to Foo_1. Therefore, the Haskell name for struct foo depends on the existence of that #include.

Note, however, that this case only arises when there is a name collision, caused by different C types where the tag of one type is the same as the ordinary name of another type. Currently, we generate Haskell types with name collisions, and it is up to users to provide a prescriptive binding specification in order to even generate valid Haskell. With this change, we would generate valid Haskell by default, and users may optionally use a prescriptive binding specification to configure better names.

@dschrempf
Copy link
Collaborator

Thank you, this clarifies things. This means we should fall back to global renaming only if we don't find another solution. On the other hand, it seems like we have tried hard, and didn't find one so far.

@edsko
Copy link
Collaborator Author

edsko commented Nov 18, 2025

Interestingly, for @TravisCardwell ' example of

struct foo { char a; char b; };
typedef struct { double x; double y; } foo;

clang reports

clang_getCursorSpelling: ("./test.h:1:8","foo")
clang_getCursorSpelling: ("./test.h:2:9","foo")
clang_getCursorSpelling: ("./test.h:2:40","foo")

So we indeed have two structs both called foo, according to clang. If that's good enough for clang, it should be good enough for us...?

For me numbering declarations is a no; we've worked really hard not do to that. For now I'll go with "Option 1" above as a stop-gap measure and I'll open a ticket for us to reconsider this.

@TravisCardwell
Copy link
Collaborator

So we indeed have two structs both called foo, according to clang. If that's good enough for clang, it should be good enough for us...?

IMHO no, but it should not matter much if such C code is rare. 😄

Perhaps we should document that hs-bindgen is unable to handle such code, but users can work around issues by changing the C source. A wrapper header could be used as a workaround (only?) if typedefs around the problematic types are made opaque.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Investigate CI issue with LLVM 15 and anonymous declarations of ordinary type

4 participants