
Reduce ONNX Runtime GPU wheel size using fatbin compression #26282

@XXXXRT666

Describe the issue

Following PR #26002, certain GPU architectures were removed because the overall wheel size exceeded GitHub and PyPI size limits.

However, newer versions of nvcc (starting with CUDA 12.8) support selecting a fatbin compression mode (-Xfatbin=-compress-all -compress-mode=MODE), which can significantly reduce binary size without affecting functionality.
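For reference, here is how these flags combine on a standalone nvcc invocation. This is only a minimal sketch, not part of the ONNX Runtime build: kernel.cu and the architecture list are placeholders, and I use the documented double-dash spelling --compress-mode.

nvcc -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_90,code=sm_90 \
     -Xfatbin=-compress-all --compress-mode=size \
     -c kernel.cu -o kernel.o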

Below are my test results comparing different compression modes (size, balance, speed) under both CUDA 12.8 and CUDA 13.0.

"Default" indicates the compression mode that the corresponding nvcc version applies automatically.

Note: enabling --compress-mode requires a driver version at least as new as the one shipped with CUDA 12.4, which is why PyTorch only enables it for wheels built with CUDA 13.0 and later.

| CUDA Version | Mode | Wheel Size (MB) | Compile Time |
| --- | --- | --- | --- |
| 13.0 | Speed | 923.9 | 187m51.531s |
| 13.0 | Balance (Default) | 516.1 | 179m6.947s |
| 13.0 | Size | 360.0 | 191m42.614s |
| 12.8 | Speed (Default) | 689.5 | 159m1.095s |
| 12.8 | Balance | 435.5 | 185m53.233s |
| 12.8 | Size | 309.3 | 182m14.164s |
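Regarding the driver note above: on a target machine, the installed driver version can be read with nvidia-smi (CUDA 12.4 shipped with the 550-series driver, so anything at least that new should be able to load the compressed fatbins). This is just a convenience check, not part of the build:

nvidia-smi --query-gpu=driver_version --format=csv,noheader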

Build Script

First, replace

set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xfatbin=-compress-all")

with

set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xfatbin=-compress-all -compress-mode=YOUR_MODE")
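One way to script that replacement is sketched below. It assumes the existing flag is set in cmake/CMakeLists.txt (treat the path as an assumption and adjust if the flag lives elsewhere) and uses size as the example mode, written with the documented double-dash spelling:

sed -i 's/-Xfatbin=-compress-all"/-Xfatbin=-compress-all --compress-mode=size"/' cmake/CMakeLists.txt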

# Start an interactive container; all remaining commands run inside it
docker run -it -v ~/onnxruntime:/root/onnxruntime --name ort continuumio/miniconda3

cd ~/onnxruntime
conda create -n onnx python=3.12 -y && conda activate onnx
pip install -r requirements.txt

# "13.0/12.8" means pick one toolkit version: cuda-nvcc=13.0 or cuda-nvcc=12.8
conda install cuda-nvcc=13.0/12.8 cuda-toolkit cudnn -c nvidia -y
conda install gcc gxx cmake ninja -c conda-forge

# Symlink the CCCL-provided cuda headers to the include path the build expects (conda sbsa-linux/ARM64 toolkit layout)
ln -s /opt/conda/envs/onnx/targets/sbsa-linux/include/cccl/cuda /opt/conda/envs/onnx/targets/sbsa-linux/include/cuda
apt update && apt-get install -y patch

export CC=gcc CXX=g++

# As with the toolkit install above, set --cuda_version to 13.0 or 12.8
bash build.sh \
  --config Release \
  --build_shared_lib \
  --cmake_generator Ninja \
  --parallel 6 \
  --nvcc_threads 1 \
  --use_cuda \
  --cuda_version 13.0/12.8 \
  --cuda_home $CONDA_PREFIX \
  --cudnn_home $CONDA_PREFIX \
  --build_wheel \
  --skip_tests \
  --cmake_extra_defines onnxruntime_BUILD_UNIT_TESTS=OFF \
  --allow_running_as_root \
  --compile_no_warning_as_error
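Once the build completes, the wheel size can be read off directly. The dist path below is where build.sh normally places the wheel for a Linux Release build, so treat it as an assumption:

ls -lh build/Linux/Release/dist/*.whl
du -m build/Linux/Release/dist/*.whl   # size in MB, comparable to the table above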

Platform

Linux

ONNX Runtime Installation

Built from Source

ONNX Runtime API

Python

Architecture

ARM64

Execution Provider

CUDA
