Follow-up of #1747. Here is what we plan to do to make torchao ABI-compatible and closer to Python-only. Once this is done, torchao will be compatible with all PyTorch versions and we won't need to worry about #2919.
Please feel free to pick up a task by adding your name to the Status column.
| Status | file | description | Plan |
|---|---|---|---|
| @howardzhang-cv | torchao/csrc/cuda/fp6_llm/fp6_linear.cu | FP6 linear layer | delete |
| | torchao/csrc/cuda/marlin_qqq/marlin_qqq_kernel.cu | Marlin QQQ quantization | delete |
| | torchao/csrc/cuda/activation24/sparse_gemm.cu | 2:4 sparse GEMM | delete |
| | torchao/csrc/cuda/activation24/sparsify24.cu | 2:4 sparsification | delete |
| | torchao/csrc/cuda/sparse_marlin/marlin_kernel_nm.cu | N:M sparse Marlin | delete |
| | torchao/csrc/cuda/tensor_core_tiled_layout/tensor_core_tiled_layout.cu | Tensor core tiled layout | delete |
| @danielvegamyhre | torchao/csrc/cuda/mx_kernels/mxfp8_cuda.cu | MXFP8 CUDA kernels | Make ABI compatible |
| | torchao/csrc/cuda/mx_kernels/mx_fp_cutlass_kernels.cu | MX format CUTLASS MXFP4 | delete |
| | torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s4s4.cu | S4S4 row-wise scaled linear | delete |
| | torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s8s4.cu | S8S4 row-wise scaled linear | delete |
| | torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e4m3.cu | Sparse E4M3xE4M3 | Make ABI compatible, not built by default |
| | torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e5m2.cu | Sparse E4M3xE5M2 | Make ABI compatible, not built by default |
| | torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e4m3.cu | Sparse E5M2xE4M3 | Make ABI compatible, not built by default |
| | torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e5m2.cu | Sparse E5M2xE5M2 | Make ABI compatible, not built by default |
| | torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_f8f8.cu | Sparse FP8xFP8 | Make ABI compatible, not built by default |
| | torchao/csrc/cuda/to_sparse_semi_structured_cutlass_sm9x/to_sparse_semi_structured_cutlass_sm9x_f8.cu | Semi-structured sparse FP8 | Make ABI compatible, not built by default |
| | torchao/_models/sam2/csrc/connected_components.cu | Connected components (SAM2) | Move sam2 somewhere else? |

And the corresponding header files:

| Status | file | description | Plan |
|---|---|---|---|
| @howardzhang-cv | torchao/csrc/cuda/fp6_llm/utils_core.cuh | FP6 core utilities | delete |
| @howardzhang-cv | torchao/csrc/cuda/fp6_llm/kernel_reduction.cuh | FP6 reduction kernel | delete |
| @howardzhang-cv | torchao/csrc/cuda/fp6_llm/ptx_mma.cuh | FP6 PTX MMA | delete |
| @howardzhang-cv | torchao/csrc/cuda/fp6_llm/kernel_matmul.cuh | FP6 matmul kernel | delete |
| @howardzhang-cv | torchao/csrc/cuda/fp6_llm/utils_gmem.cuh | FP6 global memory utils | delete |
| @howardzhang-cv | torchao/csrc/cuda/fp6_llm/utils_parallel_dequant.cuh | FP6 parallel dequant utils | delete |
| @howardzhang-cv | torchao/csrc/cuda/fp6_llm/ptx_cp.async.cuh | FP6 PTX async copy | delete |
| | torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass.cuh | Sparse CUTLASS header | Make ABI compatible |
| | torchao/csrc/cuda/to_sparse_semi_structured_cutlass_sm9x/to_sparse_semi_structured_cutlass_sm9x.cuh | Semi-structured sparse header | Make ABI compatible |
| | torchao/csrc/cuda/mx_kernels/mxfp8_quantize.cuh | MXFP8 quantize header | Make ABI compatible (has no torch C++ anyway) |
| | torchao/csrc/cuda/mx_kernels/ptx.cuh | MX PTX header | Make ABI compatible |
| | torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass.cuh | CUTLASS header | delete |
After the above is done, we can explore making torchao Python-only through:
- compiling on demand, as in https://github.com/NVIDIA/Megatron-LM/blob/v2.0/megatron/fused_kernels/__init__.py (see the sketch below)
- moving kernels to PyTorch core or mslk
- using the CuTe DSL for CUTLASS kernels, as mentioned by @tonyf
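
For reference, here is a minimal sketch of the compile-on-demand option using `torch.utils.cpp_extension.load`, in the spirit of the linked Megatron code. The extension name, source path, and flags below are illustrative assumptions, not torchao's actual build configuration:

```python
# Compile-on-demand sketch (assumption: mirrors the approach in Megatron's
# fused_kernels/__init__.py, adapted to torchao). The extension name and
# source path are illustrative, not torchao's real build config.
from pathlib import Path

from torch.utils.cpp_extension import load

_CSRC = Path(__file__).parent / "csrc" / "cuda"

_mxfp8 = None


def mxfp8_kernels():
    """JIT-build the MXFP8 extension on first use and cache the module."""
    global _mxfp8
    if _mxfp8 is None:
        # load() compiles against whatever PyTorch is installed at runtime
        # and caches the build under TORCH_EXTENSIONS_DIR, so there is no
        # prebuilt binary whose ABI must match a specific PyTorch version.
        _mxfp8 = load(
            name="torchao_mxfp8",  # illustrative extension name
            sources=[str(_CSRC / "mx_kernels" / "mxfp8_cuda.cu")],
            extra_cuda_cflags=["-O3"],
            verbose=False,
        )
    return _mxfp8
```

The trade-off is that users need a working CUDA toolchain and pay a one-time build cost, in exchange for never shipping binaries tied to one PyTorch ABI.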