
Conversation

raayandhar

In theory this should fix #4343, but I can't reproduce the failure locally yet, so I'm not sure. Working on reproducing it locally first.

raayandhar marked this pull request as ready for review October 15, 2025 20:03
@raayandhar
Author

raayandhar commented Oct 15, 2025

Managed to reproduce locally and did some digging into what's happening. This now builds locally; hopefully it will work in CI as well.

@zjgarvey
Collaborator

So I'm not sure it is sufficient to grep for the CXX ABI version in the shared object file.

For example, if I install the nightly build we have pinned at 8/20, torch._C._PYBIND_BUILD_ABI returns _cxxabi1018, but the grep only turns up versions up to CXXABI_1.3.11:

CXXABI_1.3.2
CXXABI_1.3.3
CXXABI_1.3.5
CXXABI_1.3.7
CXXABI_1.3.8
CXXABI_1.3.9
CXXABI_1.3.11

We might need to figure out a different approach.
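
For reference, a minimal sketch of the two ABI signals being compared here, assuming a standard pip install of torch (the torch/lib path and the attribute name are taken from the discussion above, so treat this as an illustration rather than a definitive check):

```python
import os
import re
import torch

# pybind11's build-ABI tag as recorded by PyTorch at build time,
# e.g. "_cxxabi1018" on the pinned 8/20 nightly.
print("pybind build ABI:", getattr(torch._C, "_PYBIND_BUILD_ABI", "<not present>"))

# CXXABI_* version strings embedded in libtorch_python.so -- roughly what a
# plain `grep CXXABI` of the shared object surfaces.
so_path = os.path.join(os.path.dirname(torch.__file__), "lib", "libtorch_python.so")
with open(so_path, "rb") as f:
    versions = sorted(set(re.findall(rb"CXXABI_1\.3\.\d+", f.read())))
print("CXXABI strings:", [v.decode() for v in versions])
```

If that's what the current grep is matching, it is likely picking up libstdc++'s versioned-symbol names rather than pybind's internals tag, which would explain why the two lists don't line up.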

@zjgarvey
Collaborator

However, I do find _cxxabi1018 inside libtorch_python.so. I'm not sure whether this will be present in the newer release. I'll double-check now.

@raayandhar
Author

> However, I do find _cxxabi1018 inside libtorch_python.so. I'm not sure whether this will be present in the newer release. I'll double-check now.

If I remember correctly, I tried looking yesterday and did not find any _cxxabi10... in the newer release (torch stable), but it's worth double-checking.
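
One quick way to check (a sketch, assuming a pip-installed torch and the usual torch/lib layout) is to scan every shared object shipped with the wheel for a `_cxxabi...` tag, once against the pinned nightly and once against stable 2.9.0:

```python
import glob
import os
import re
import torch

# Scan each shared library bundled with the torch wheel for a pybind
# "_cxxabi<NNNN>" tag. Reads each file fully, which is fine as a one-off check.
lib_dir = os.path.join(os.path.dirname(torch.__file__), "lib")
for so in sorted(glob.glob(os.path.join(lib_dir, "*.so*"))):
    with open(so, "rb") as f:
        tags = sorted(set(re.findall(rb"_cxxabi\d+", f.read())))
    if tags:
        print(os.path.basename(so), [t.decode() for t in tags])
```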

@zjgarvey
Collaborator

Yeah, honestly, I have no idea what the right fix is. I've tried modifying a few things, e.g., updating pybind to 3.0.1 to match PyTorch and removing the explicit CXX_ABI flags. No matter what, if it compiles, it fails tests with an opaque error about types (which likely means something went wrong with ABI compatibility).

I'm really not sure if it is tenable to fix this in a short amount of time, and considering this is blocking all work on this repo, I'm going to work on pulling out the e2e testing from projects/pt1 and reworking all essential dev tools to not rely on jit_ir_importer.

@raayandhar
Author

The most recent CI run has these failures:

  Failed Tests (9):
    TORCH_MLIR_PYTHON :: annotations-sugar.py
    TORCH_MLIR_PYTHON :: compile_api/already_scripted.py
    TORCH_MLIR_PYTHON :: compile_api/already_traced.py
    TORCH_MLIR_PYTHON :: compile_api/backend_legal_ops.py
    TORCH_MLIR_PYTHON :: compile_api/basic.py
    TORCH_MLIR_PYTHON :: compile_api/make_fx.py
    TORCH_MLIR_PYTHON :: compile_api/multiple_methods.py
    TORCH_MLIR_PYTHON :: compile_api/output_type_spec.py
    TORCH_MLIR_PYTHON :: compile_api/tracing.py
  
  
  Testing Time: 6.20s
  
  Total Discovered Tests: 17
    Passed: 8 (47.06%)
    Failed: 9 (52.94%)

I can't reproduce this. I wasn't able to reproduce on stable before; after updating the nightly commit to something newer, I now pass all of these tests locally as well (they were previously failing, and the last CI run hit these same errors on the nightly side)...

@raayandhar
Author

raayandhar commented Oct 17, 2025

I really don't understand how the CI error is being triggered in the Python regression tests. It's obviously related to the compiled C++ bindings, but I cannot reproduce it locally on stable or nightly (with the nightly version used here). It might be related to caching; no idea.
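
If it helps narrow down the CI/local difference, a small diagnostic like this could be printed in both environments (a sketch; the _PYBIND_BUILD_ABI attribute name is taken from earlier in this thread, hence the getattr guard):

```python
import torch

# ABI-relevant details of the installed torch build, for comparing the CI
# environment against a local one.
print("torch version:", torch.__version__)
print("git version:", torch.version.git_version)
print("built with cxx11 ABI:", torch.compiled_with_cxx11_abi())
print("pybind build ABI:", getattr(torch._C, "_PYBIND_BUILD_ABI", "<not present>"))
```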



Successfully merging this pull request may close these issues: Build failures for new torch stable 2.9.0
