load_program() performance (with large include hierarchies)

When I include the main [glm](https://github.com/g-truc/glm) header within an RTC source, the runtime of `load_program()` increases to over 60 seconds (from what was already ~10 seconds), so I started investigating the problem.

I found that, to discover include files, Jitify calls `nvrtcCompileProgram()`, catches the failure, parses the name of the missing include from the error log, and then repeats the process until compilation succeeds or fails for a different reason.
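The retry loop described above can be sketched as follows. This is a minimal illustration of the pattern, not Jitify's actual code: `CompileFn` is a hypothetical stand-in for a call to `nvrtcCompileProgram()` plus error-log parsing, and `LoadFn` is a hypothetical stand-in for reading a header from disk.

```cpp
#include <functional>
#include <map>
#include <optional>
#include <string>

// Result of one (mock) compile attempt.
struct CompileResult {
    bool ok;                      // true if compilation succeeded
    std::string missing_include;  // name of the first missing header, else empty
};

// Hypothetical stand-in for nvrtcCompileProgram() + parsing the error log.
using CompileFn =
    std::function<CompileResult(const std::map<std::string, std::string>&)>;
// Hypothetical stand-in for loading a header's source from disk.
using LoadFn = std::function<std::optional<std::string>(const std::string&)>;

// Repeatedly compiles, adding one missing header per failed attempt.
// Returns the number of failed attempts before success, or -1 if the
// compile failed for another reason or a header could not be loaded.
int discover_includes(const CompileFn& compile, const LoadFn& load,
                      std::map<std::string, std::string>& headers) {
    int failed_attempts = 0;
    for (;;) {
        const CompileResult result = compile(headers);
        if (result.ok) return failed_attempts;
        if (result.missing_include.empty()) return -1;  // unrelated error
        ++failed_attempts;
        const std::optional<std::string> source = load(result.missing_include);
        if (!source) return -1;  // header not found on disk
        headers[result.missing_include] = *source;
    }
}
```

Every failed attempt recompiles everything found so far, so a hierarchy with N undiscovered headers costs N failed compiles before the one that succeeds.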

GLM's include hierarchy adds ~109 separate glm header files, and we have ~20 internal library headers which are included too (although we recently started dynamically pruning these from 20 down to ~14 for a small improvement). *(I calculated these values from the [output](https://gist.github.com/Robadob/53701a5217dc9089800f5a37716fc69b) of [`pcpp`](https://github.com/ned14/pcpp), so they might be a little off, as I haven't tested the flattened file it produced.)*

The problem is that each call to `nvrtcCompileProgram()` reparses the include hierarchy discovered so far, so the cost of a single call grows from an initial ~100ms to ~600ms as headers are added. I logged 198 failed calls to `nvrtcCompileProgram()`. This is highly inefficient: `load_program()` takes 60+ seconds, while the final successful NVRTC call takes only around 1 second.
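As a rough sanity check on these figures, assume the per-call cost grows approximately linearly between the first and last failed call (linear growth is an assumption here, not something that was measured). The total time spent in failed calls is then about the number of calls times the average cost:

$$
T_{\text{failed}} \approx N \cdot \frac{t_{\text{first}} + t_{\text{last}}}{2} = 198 \cdot \frac{0.1\,\text{s} + 0.6\,\text{s}}{2} \approx 69\,\text{s}
$$

That is in line with the observed 60+ seconds, and it shows why the total scales roughly quadratically with the number of headers: both the number of calls and the cost per call grow with the size of the hierarchy.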

In comparison, `pcpp` was able to pre-process the full include hierarchy in 3 seconds. So it's fair to assume the source could theoretically be compiled in ~4 seconds with appropriate include pre-processing (3 seconds of pre-processing plus the ~1 second compile), 15x faster.

In our use-case, we perform multiple individual RTC compilations with the same include hierarchy, so we pay this unnecessary cost repeatedly. The worst case is our test suite: 85 minutes total with `glm` included in all agent functions, 25 minutes with `glm` only included where used, and 11 minutes with the `glm` tests and include disabled. Even in that last case, probably 10 minutes are spent doing RTC for the small number of RTC tests (most of our RTC tests are in our Python test suite).

I'd be interested to know your thoughts on whether Jitify can address this, or even whether NVRTC could be extended by the team who develops it to load include files from disk itself (given that implementing a full pre-processor within Jitify is impractical), and perhaps also to mimic Jitify's behaviour of ignoring missing includes.

The optimal performance in our case ([FLAMEGPU2](https://github.com/FLAMEGPU/FLAMEGPU2)) would probably be achieved by pre-processing and flattening the header hierarchy that we pass to all RTC sources (reducing even the ~3s of header parsing/loading each time). However, the high compile costs due to include processing might be scaring other users away from RTC if they find it via Jitify and build things with larger include hierarchies. In our case, we've always had the ~20 library headers and thought the ~10s expense was just an unfortunate byproduct of RTC for our non-simple use-case.

Irrespective of NVRTC, Jitify might be able to reduce the cost of multiple compiles by caching loaded headers and always passing them to NVRTC, regardless of whether they're required. This would still leave the first compilation costly, but would make any subsequent compilations with the same headers much cheaper.
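To make the caching idea concrete, here is a minimal sketch under stated assumptions. `HeaderCache` and its `Loader` callback are hypothetical names for illustration, not part of Jitify. The cache loads each header at most once and exposes the parallel name/source arrays in the form that `nvrtcCreateProgram()` accepts for its `includeNames` and `headers` arguments.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical cache of header sources shared across compilations.
class HeaderCache {
public:
    // Hypothetical callback that reads a header's source (e.g. from disk).
    using Loader = std::function<std::string(const std::string&)>;

    explicit HeaderCache(Loader loader) : loader_(std::move(loader)) {}

    // Returns the cached source, invoking the loader only on first request.
    const std::string& get(const std::string& name) {
        auto it = sources_.find(name);
        if (it == sources_.end()) {
            ++loads_;
            it = sources_.emplace(name, loader_(name)).first;
        }
        return it->second;
    }

    // Fills parallel arrays of include names and sources. The pointers stay
    // valid while the cache is alive, since std::map never relocates nodes.
    void as_arrays(std::vector<const char*>& names,
                   std::vector<const char*>& sources) const {
        names.clear();
        sources.clear();
        for (const auto& entry : sources_) {
            names.push_back(entry.first.c_str());
            sources.push_back(entry.second.c_str());
        }
    }

    int loads() const { return loads_; }  // number of loader invocations

private:
    Loader loader_;
    std::map<std::string, std::string> sources_;
    int loads_ = 0;
};
```

With something like this, the first compilation would still pay for discovery, but every later compilation could pass the full cached set up front and skip the failed-compile loop. The trade-off is that NVRTC is handed headers a given source may not need.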

------------

As a quick test, I was able to flatten the majority of our includes using `pcpp`, and this reduced the number of failed calls to `nvrtcCompileProgram()` from 198 down to 32. I believe the remaining 32 are due to our dynamic header, which has a delayed include that can't be flattened in the same way, and the various system includes which I haven't flattened yet (there are around 60, but I think they're mostly repeats). Already, though, the RTC time was down to ~8 seconds.

