**Describe the bug**
A pip-installed version of Transformer Engine (installed as in the instructions) does not find the cuDNN libraries installed via pip (in `site-packages/nvidia/*/lib/*.so`).
**Steps/Code to reproduce bug**

In a PyTorch env (so the usual CUDA libs are installed via pip):

```shell
# For PyTorch integration
pip install --no-build-isolation transformer_engine[pytorch]
python3 -c "import transformer_engine"
```
**Expected behavior**

`import transformer_engine` succeeds, with Transformer Engine locating the pip-installed cuDNN libraries on its own, without `LD_LIBRARY_PATH` having to be set manually.
**Environment overview**
- Environment location: Lightning.AI Studio
- Method of Transformer Engine install: pip, see above.
**Environment details**
- OS version: Ubuntu 24.04

```shell
$ pip list | grep -E 'torch|nvidia'
nvfuser_cu128_torch27    0.2.27.dev20250601
nvidia-cublas-cu12       12.8.3.14
nvidia-cuda-cupti-cu12   12.8.57
nvidia-cuda-nvrtc-cu12   12.8.61
nvidia-cuda-runtime-cu12 12.8.57
nvidia-cudnn-cu12        9.7.1.26
nvidia-cudnn-frontend    1.12.0
nvidia-cufft-cu12        11.3.3.41
nvidia-cufile-cu12       1.13.0.11
nvidia-curand-cu12       10.3.9.55
nvidia-cusolver-cu12     11.7.2.55
nvidia-cusparse-cu12     12.5.7.53
nvidia-cusparselt-cu12   0.6.3
nvidia-nccl-cu12         2.26.2
nvidia-nvjitlink-cu12    12.8.61
nvidia-nvtx-cu12         12.8.55
pytorch-lightning        2.5.1.post0
torch                    2.7.0+cu128
torchmetrics             1.3.1
torchvision              0.22.0+cu128
transformer_engine_torch 2.3.0
```
**Device details**
- H100 from GCP
**Additional context**

The `nvidia-*` packages listed above do contain the required libraries, so TE should probably look for them there. A workaround is setting `LD_LIBRARY_PATH`, but it's tedious.
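For reference, a sketch of the `LD_LIBRARY_PATH` workaround mentioned above. The glob pattern assumes the pip wheel layout shown in the environment details (`site-packages/nvidia/<pkg>/lib/*.so`); adjust if your layout differs:

```shell
# Sketch of the workaround: prepend every site-packages/nvidia/*/lib
# directory to LD_LIBRARY_PATH before launching Python.
SITE_PACKAGES="$(python3 -c 'import sysconfig; print(sysconfig.get_paths()["purelib"])')"
NVIDIA_LIBS="$(find "$SITE_PACKAGES/nvidia" -maxdepth 2 -type d -name lib 2>/dev/null | paste -sd: -)"
export LD_LIBRARY_PATH="${NVIDIA_LIBS}${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
# Then re-run: python3 -c "import transformer_engine"
```

Having every user do this by hand (or in every launcher script) is the tedious part; picking the directories up automatically at import time would be nicer.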