@yurekami commented Jan 1, 2026

Summary

This PR addresses #199, where the cuBLASLt baseline in test_fp8.py applies no scaling at all (its tensorwise scale pointers are never set) while DeepGEMM applies blockwise scaling, making the performance comparison misleading.

Changes

  • Add scale_a and scale_b optional parameters to call_cublaslt_api()
  • Set CUBLASLT_MATMUL_DESC_A_SCALE_POINTER and CUBLASLT_MATMUL_DESC_B_SCALE_POINTER when provided (see the semantics sketch after this list)
  • Add new cublaslt_fp8_gemm_nt() function that accepts scale tensors
  • Export the new function via pybind11 and Python __init__.py
  • Update test_fp8.py to use per-tensor scaling (max of blockwise scales)
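For reference, each scale pointer references a single float32 dequantization factor that cuBLASLt folds into the matmul. A minimal torch sketch of the resulting semantics (illustrative only; cublaslt_fp8_gemm_nt_ref is a hypothetical name, and the real scaling happens inside the cuBLASLt kernel):

import torch

def cublaslt_fp8_gemm_nt_ref(a_fp8, b_fp8, scale_a, scale_b):
    # NT layout: a_fp8 is [m, k], b_fp8 is [n, k]; scales are 1-element float32 tensors.
    # cuBLASLt applies D = scale_a * scale_b * (A @ B^T) with high-precision accumulation.
    return (a_fp8.float() @ b_fp8.float().t()) * (scale_a * scale_b)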

How it works

The test now computes the maximum value from the blockwise scales and uses it as a single per-tensor scale for cuBLASLt:

# a[1] and b[1] hold the blockwise FP8 scales; collapse each to a single
# per-tensor scale so cuBLASLt performs a comparable dequantization step
scale_a_max = a[1].max().reshape(1).to(torch.float32).contiguous()
scale_b_max = b[1].max().reshape(1).to(torch.float32).contiguous()
cublas_t, split_k_t = bench_kineto(
    lambda: deep_gemm.cublaslt_fp8_gemm_nt(a[0], b[0], d, c=c, scale_a=scale_a_max, scale_b=scale_b_max),
    ('nvjet', 'reduce'), suppress_kineto_output=True
)

This makes the comparison fairer: both implementations now apply some form of FP8 dequantization, even though the scale values (and thus the numerics) differ.

Limitations

This is an approximation: cuBLASLt uses per-tensor scaling, while DeepGEMM uses more precise blockwise (128x128) scaling. For a fully aligned comparison, cuBLASLt would need blockwise scaling support using CUBLASLT_MATMUL_SCALE_MODE_BLOCK_2D (available in CUDA 12.0+); that could be a future enhancement.
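For intuition, a minimal torch sketch of the two dequantization granularities (illustrative only; DeepGEMM fuses the blockwise scales into its kernel rather than materializing a dequantized matrix, and both helper names are made up for this example):

import torch

def dequant_per_tensor(x_fp8, scale):
    # One scalar scale for the entire tensor (what the updated benchmark feeds cuBLASLt).
    return x_fp8.float() * scale

def dequant_blockwise_128(x_fp8, block_scales):
    # One float32 scale per 128x128 tile: block_scales[i, j] covers
    # x_fp8[128*i : 128*(i+1), 128*j : 128*(j+1)].
    m, n = x_fp8.shape
    full = block_scales.repeat_interleave(128, dim=0).repeat_interleave(128, dim=1)
    return x_fp8.float() * full[:m, :n]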

Test Plan

  • Verify test_fp8.py runs without errors
  • Verify cuBLASLt now uses scaling in the benchmark comparison
  • Compare performance numbers before and after

Fixes #199

🤖 Generated with Claude Code
