load_program() performance (with large include hierarchies) #90

@Robadob

When I include the main glm header within an RTC source, the runtime of load_program() increases to over 60 seconds (from what was already ~10 seconds), so I started investigating the problem.

I found that, to discover include files, Jitify calls nvrtcCompileProgram(), catches the failure, parses the missing include from the error log, and then repeats the process until compilation succeeds or fails for a different reason.
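To make the loop concrete, here is a minimal sketch that simulates it without calling NVRTC. Everything in it is hypothetical: `HeaderGraph`, `mock_compile`, `parse_missing_include` and `count_failed_compiles` are illustrative names, the mock compiler only walks an in-memory include graph, and the `cannot open source file "..."` log format is an assumption about what the error looks like, not a quote from Jitify's source.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for headers on disk: header name -> headers it includes.
using HeaderGraph = std::map<std::string, std::vector<std::string>>;

// Extracts the header name from a log of the assumed form
//   ... cannot open source file "name"
// Returns an empty string if the log does not match.
std::string parse_missing_include(const std::string& log) {
  const std::string marker = "cannot open source file \"";
  const std::size_t start = log.find(marker);
  if (start == std::string::npos) return "";
  const std::size_t begin = start + marker.size();
  const std::size_t end = log.find('"', begin);
  if (end == std::string::npos) return "";
  return log.substr(begin, end - begin);
}

// Mock compiler: walks the include graph depth-first and "fails" on the first
// header that has not been supplied yet. Returns an empty log on success.
std::string mock_compile(const HeaderGraph& disk,
                         const std::vector<std::string>& root_includes,
                         const std::set<std::string>& provided) {
  std::vector<std::string> stack(root_includes.rbegin(), root_includes.rend());
  std::set<std::string> visited;
  while (!stack.empty()) {
    const std::string name = stack.back();
    stack.pop_back();
    if (!visited.insert(name).second) continue;  // already processed
    if (provided.count(name) == 0) {
      return "kernel.cu(1): catastrophic error: cannot open source file \"" +
             name + "\"";
    }
    const auto it = disk.find(name);
    if (it != disk.end()) {
      for (auto r = it->second.rbegin(); r != it->second.rend(); ++r) {
        stack.push_back(*r);
      }
    }
  }
  return "";
}

// Recompiles after every failure, supplying one more header each time.
// Returns the number of failed compile calls, or -1 if compilation fails for
// a reason other than a header that can be loaded from `disk`.
int count_failed_compiles(const HeaderGraph& disk,
                          const std::vector<std::string>& root_includes) {
  std::set<std::string> provided;
  int failed = 0;
  for (;;) {
    const std::string log = mock_compile(disk, root_includes, provided);
    if (log.empty()) return failed;
    ++failed;
    const std::string missing = parse_missing_include(log);
    if (missing.empty() || disk.count(missing) == 0) return -1;
    const bool is_new = provided.insert(missing).second;
    assert(is_new);  // a failure always names a header not yet supplied
    (void)is_new;
  }
}
```

In this model every distinct header in the hierarchy costs exactly one failed compile, so the number of wasted calls grows with the number of headers rather than with the number of sources being compiled.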

GLM's include hierarchy adds ~109 separate glm header files, and we also include ~20 internal library headers (we recently started pruning these dynamically, from 20 down to ~14, for a small improvement). I calculated these values from pcpp's output, so they might be slightly off, as I haven't tested the flattened file it produced.

The problem is that each call to nvrtcCompileProgram() reparses the include hierarchy discovered so far, so the cost per call grows from an initial ~100 ms to ~600 ms as headers are added. I logged 198 failed calls to nvrtcCompileProgram(). This is highly inefficient: load_program() takes 60+ seconds, of which the final successful NVRTC call accounts for only around 1 second.
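As a rough sanity check on those numbers, the sketch below totals the cost of the failed calls. It assumes the per-call cost rises linearly between the two measured endpoints, which is my simplification (only the endpoints were measured), and `total_failed_compile_ms` is a hypothetical helper name.

```cpp
#include <cassert>

// Total time in milliseconds for `calls` failed compilations whose individual
// cost rises from first_ms (first call) to last_ms (last call).
// ASSUMPTION: the growth is linear; only the two endpoints were measured.
double total_failed_compile_ms(int calls, double first_ms, double last_ms) {
  assert(calls > 0);
  double total = 0.0;
  for (int i = 0; i < calls; ++i) {
    // Fraction of the way from the first call to the last call.
    const double t =
        (calls == 1) ? 0.0 : static_cast<double>(i) / (calls - 1);
    total += first_ms + t * (last_ms - first_ms);
  }
  return total;
}
```

Under that assumption, `total_failed_compile_ms(198, 100.0, 600.0)` comes to about 69,300 ms (198 calls at an average of 350 ms), which is in line with the 60+ seconds observed for load_program().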

In comparison, pcpp was able to pre-process the full include hierarchy in 3 seconds. So it's fair to assume the source could theoretically be compiled in about 4 seconds with appropriate include pre-processing, roughly 15x faster.

In our use case, we perform multiple individual RTC compilations with the same include hierarchy, so we pay this unnecessary cost repeatedly. The worst case is our test suite: 85 minutes in total with glm included in all agent functions, 25 minutes with glm included only where it is used, and 11 minutes with the glm tests and include disabled. Even in that last case, probably 10 minutes are spent on RTC for the small number of RTC tests (most of our RTC tests are in our Python test suite).

I'd be interested to know your thoughts on whether Jitify can address this, or even whether NVRTC could be extended by the team that develops it to load include files from disk itself (given that implementing a full pre-processor within Jitify is impractical), and perhaps also to mimic Jitify's behaviour of ignoring missing includes.

The optimal performance in our case (FLAMEGPU2) would probably be achieved by pre-processing and flattening the header hierarchy that we pass to all RTC sources, which would also reduce the ~3 s of header loading and parsing each time. However, the high compile cost caused by include processing might be scaring other users away from RTC if they find it via Jitify and build things with larger include hierarchies. In our case, we've always had the ~20 library headers and assumed the ~10 s expense was just an unfortunate byproduct of RTC for our non-trivial use case.

Irrespective of NVRTC, Jitify might be able to reduce the cost of multiple compiles by caching loaded headers and always passing them to NVRTC, regardless of whether they're required? This would still leave the first compilation costly, but would make any subsequent compilations with the same headers much cheaper.
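A minimal sketch of what I mean by caching, not a description of anything Jitify currently does: `HeaderCache`, its `Loader` callback and `as_arrays` are hypothetical names. The parallel arrays it produces are shaped like the `headers` and `includeNames` arguments of nvrtcCreateProgram(), but the sketch itself makes no NVRTC calls.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical cache of header sources shared across RTC compilations.
class HeaderCache {
 public:
  // Loads a header's source by include name (e.g. reads it from disk).
  using Loader = std::function<std::string(const std::string&)>;

  explicit HeaderCache(Loader loader) : loader_(std::move(loader)) {
    assert(static_cast<bool>(loader_));
  }

  // Returns the cached source, invoking the loader only on the first request.
  const std::string& get(const std::string& name) {
    auto it = sources_.find(name);
    if (it == sources_.end()) {
      ++loads_;
      it = sources_.emplace(name, loader_(name)).first;
    }
    return it->second;
  }

  // Fills parallel arrays of include names and sources for every cached
  // header, so all of them can be handed to the compiler up front.
  // The pointers stay valid while the cache is alive, because std::map does
  // not move its elements when new entries are inserted.
  void as_arrays(std::vector<const char*>& names,
                 std::vector<const char*>& sources) const {
    names.clear();
    sources.clear();
    for (const auto& entry : sources_) {
      names.push_back(entry.first.c_str());
      sources.push_back(entry.second.c_str());
    }
  }

  int loads() const { return loads_; }
  std::size_t size() const { return sources_.size(); }

 private:
  Loader loader_;
  std::map<std::string, std::string> sources_;
  int loads_ = 0;
};
```

With something like this, the first compilation would still pay for header discovery, but later compilations could pass every cached header straight away and skip the fail-and-retry loop for headers that are already known.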


As a quick test, I was able to flatten the majority of our includes using pcpp, which reduced the number of failed calls to nvrtcCompileProgram() from 198 down to 32. I believe the remaining 32 are due to our dynamic header, which has a delayed include that can't be flattened in the same way, and the various system includes that I haven't flattened yet (there are around 60, but I think they're mostly repeats). Even so, the RTC time was already down to ~8 seconds.
