
Reduce ONNX Runtime GPU wheel size using fatbin compression #26282

@XXXXRT666

Describe the issue

Following PR #26002, certain GPU architectures were removed because the overall wheel size exceeded GitHub and PyPI size limits.

However, newer versions of nvcc (starting with CUDA 12.8) support selecting a fatbin compression mode (-Xfatbin=-compress-all -compress-mode=MODE), which can significantly reduce binary size without affecting functionality.
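For reference, here is how these flags combine on a standalone nvcc invocation. This is only a minimal sketch, not part of the ONNX Runtime build: kernel.cu and the architecture list are placeholders, and I use the documented double-dash spelling --compress-mode.

nvcc -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_90,code=sm_90 \
     -Xfatbin=-compress-all --compress-mode=size \
     -c kernel.cu -o kernel.o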

Below are my test results comparing different compression modes (size, balance, speed) under both CUDA 12.8 and CUDA 13.0.

"Default" indicates the compression mode that the corresponding nvcc version applies automatically.

Note: enabling --compress-mode requires a driver version at least as new as the one shipped with CUDA 12.4, which is why PyTorch only enables it for wheels built with CUDA 13.0 and later.

| CUDA Version | Mode | Wheel Size (MB) | Compile Time |
| --- | --- | --- | --- |
| 13.0 | Speed | 923.9 | 187m51.531s |
| 13.0 | Balance (Default) | 516.1 | 179m6.947s |
| 13.0 | Size | 360.0 | 191m42.614s |
| 12.8 | Speed (Default) | 689.5 | 159m1.095s |
| 12.8 | Balance | 435.5 | 185m53.233s |
| 12.8 | Size | 309.3 | 182m14.164s |
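Regarding the driver note above: on a target machine, the installed driver version can be read with nvidia-smi (CUDA 12.4 shipped with the 550-series driver, so anything at least that new should be able to load the compressed fatbins). This is just a convenience check, not part of the build:

nvidia-smi --query-gpu=driver_version --format=csv,noheader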

Build Script

First, replace

set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xfatbin=-compress-all")

with

set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xfatbin=-compress-all -compress-mode=YOUR_MODE")
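One way to script that replacement is sketched below. It assumes the existing flag is set in cmake/CMakeLists.txt (treat the path as an assumption and adjust if the flag lives elsewhere) and uses size as the example mode, written with the documented double-dash spelling:

sed -i 's/-Xfatbin=-compress-all"/-Xfatbin=-compress-all --compress-mode=size"/' cmake/CMakeLists.txt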

# Start an interactive container; all remaining commands run inside it
docker run -it -v ~/onnxruntime:/root/onnxruntime --name ort continuumio/miniconda3

cd ~/onnxruntime
conda create -n onnx python=3.12 -y && conda activate onnx
pip install -r requirements.txt

# "13.0/12.8" means pick one toolkit version: cuda-nvcc=13.0 or cuda-nvcc=12.8
conda install cuda-nvcc=13.0/12.8 cuda-toolkit cudnn -c nvidia -y
conda install gcc gxx cmake ninja -c conda-forge

# Symlink the CCCL-provided cuda headers to the include path the build expects (conda sbsa-linux/ARM64 toolkit layout)
ln -s /opt/conda/envs/onnx/targets/sbsa-linux/include/cccl/cuda /opt/conda/envs/onnx/targets/sbsa-linux/include/cuda
apt update && apt-get install -y patch

export CC=gcc CXX=g++

# As with the toolkit install above, set --cuda_version to 13.0 or 12.8
bash build.sh \
  --config Release \
  --build_shared_lib \
  --cmake_generator Ninja \
  --parallel 6 \
  --nvcc_threads 1 \
  --use_cuda \
  --cuda_version 13.0/12.8 \
  --cuda_home $CONDA_PREFIX \
  --cudnn_home $CONDA_PREFIX \
  --build_wheel \
  --skip_tests \
  --cmake_extra_defines onnxruntime_BUILD_UNIT_TESTS=OFF \
  --allow_running_as_root \
  --compile_no_warning_as_error
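Once the build completes, the wheel size can be read off directly. The dist path below is where build.sh normally places the wheel for a Linux Release build, so treat it as an assumption:

ls -lh build/Linux/Release/dist/*.whl
du -m build/Linux/Release/dist/*.whl   # size in MB, comparable to the table above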

Platform

Linux

ONNX Runtime Installation

Built from Source

ONNX Runtime API

Python

Architecture

ARM64

Execution Provider

CUDA
