-
Notifications
You must be signed in to change notification settings - Fork 30
Description
Currently cugraph-gnn builds packages using the same approach as the rest of RAPIDS, reusing all of the shared infrastructure including Docker images and build matrices. RAPIDS builds are predicated on leveraging CUDA's minor version compatibility rules, which states that libraries built with any version of the toolkit within a given major release can be run on any system with the oldest driver from that release. There are some caveats, most importantly that occasionally the runtime will introduce new APIs that strictly require new driver support, but: 1) in such cases the runtime guarantees to fail gracefully and provide meaningful user errors to request the driver upgrade; 2) such APIs are introduced infrequently, with the last critical one for RAPIDS being the asynchronous memory allocations introduced in CUDA 11.5; and 3) RAPIDS libraries have historically taken great care to guard usage of such APIs with runtime checks. The end result is that RAPIDS libraries can all be built using the latest version of the runtime and then run on systems with the minimum driver required within the major CUDA family (aside: we could make the guarantee even stronger by statically linking the runtime, but right now we achieve the same outcome since our primary distribution mechanisms of pip and conda both support installing the runtime so we can guarantee getting a sufficiently new libcudart shared library at install time; if we wanted to support installing tarballs or something we would want to move to static linking to provide the same guarantees).
Unlike every other RAPIDS library wholegraph makes direct usage of the CUDA driver APIs. That means that the current build infrastructure is actually set up incorrectly for wholegraph because the compatibility guarantees above do not apply. If a library using the driver API is built against the latest version of the driver, there is no guarantee that it will not embed symbols that are not present on older versions of the driver. Even if only older public driver APIs are used, their implementation could have changed internally to leverage newer internal functions/structs/etc that are not present in older driver binaries that break the build. The driver library has never broken ABI, though, so the correct way to handle the necessary compatibility is for wholegraph binaries to instead be built on runners using the oldest driver, not the newest.
In practice, I think wholegraph is probably saved by the fact that we do run at least one test job using the oldest rather than newest driver, so if wholegraph libraries were in fact using newer symbols we would almost certainly see breakage. That being the case, I don't think it is crucial for us to fix this problem. We have two choices that I see:
- The more correct option would be to modify wholegraph's CI matrices to build against the oldest version of the driver. That would require either a change to shared-workflows, or maintaining the full matrix and build logic in wholegraph. I would recommend the former since then 1) we could reuse the matrix if needed elsewhere and 2) we could continue reusing the other infrastructure of shared-workflows by adding a single option to select the matrix.
- The other option would be to accept the status quo where we just rely on the test runs to find missing symbols. Aside from the fact that it is not a "pure" solution, the main practical problem with this approach is that it is imperfect. This approach only works if the wholegraph test suite has 100% coverage of all code paths that leverage driver APIs. If it does not, then depending on how our binaries are compiled CI may not fail if we never leverage the problematic symbols in tests, and we would not know that anything is wrong until an end-user calls the function in question.
I would advocate for option 1, but I do not think it is of the highest priority given that we already do option 2 and it gives us "pretty high" confidence that our binaries are OK.