@yurekami commented Jan 1, 2026

Summary

This PR addresses #199, where the cuBLASLt baseline in test_fp8.py applies no scaling at all (its tensorwise scale pointers are never set) while DeepGEMM applies blockwise scaling, making the performance comparison misleading.

Changes

  • Add scale_a and scale_b optional parameters to call_cublaslt_api()
  • Set CUBLASLT_MATMUL_DESC_A_SCALE_POINTER and CUBLASLT_MATMUL_DESC_B_SCALE_POINTER when provided (see the semantics sketch after this list)
  • Add new cublaslt_fp8_gemm_nt() function that accepts scale tensors
  • Export the new function via pybind11 and Python __init__.py
  • Update test_fp8.py to use per-tensor scaling (max of blockwise scales)
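For reference, each scale pointer references a single float32 dequantization factor that cuBLASLt folds into the matmul. A minimal torch sketch of the resulting semantics (illustrative only; cublaslt_fp8_gemm_nt_ref is a hypothetical name, and the real scaling happens inside the cuBLASLt kernel):

import torch

def cublaslt_fp8_gemm_nt_ref(a_fp8, b_fp8, scale_a, scale_b):
    # NT layout: a_fp8 is [m, k], b_fp8 is [n, k]; scales are 1-element float32 tensors.
    # cuBLASLt applies D = scale_a * scale_b * (A @ B^T) with high-precision accumulation.
    return (a_fp8.float() @ b_fp8.float().t()) * (scale_a * scale_b)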

How it works

The test now computes the maximum value from the blockwise scales and uses it as a single per-tensor scale for cuBLASLt:

# a[1] and b[1] hold the blockwise FP8 scales; collapse each to a single
# per-tensor scale so cuBLASLt performs a comparable dequantization step
scale_a_max = a[1].max().reshape(1).to(torch.float32).contiguous()
scale_b_max = b[1].max().reshape(1).to(torch.float32).contiguous()
cublas_t, split_k_t = bench_kineto(
    lambda: deep_gemm.cublaslt_fp8_gemm_nt(a[0], b[0], d, c=c, scale_a=scale_a_max, scale_b=scale_b_max),
    ('nvjet', 'reduce'), suppress_kineto_output=True
)

This makes the comparison fairer: both implementations now apply some form of FP8 dequantization, even though the scale values (and thus the numerics) differ.

Limitations

This is an approximation: cuBLASLt uses per-tensor scaling, while DeepGEMM uses more precise blockwise (128x128) scaling. For a fully aligned comparison, cuBLASLt would need blockwise scaling support using CUBLASLT_MATMUL_SCALE_MODE_BLOCK_2D (available in CUDA 12.0+); that could be a future enhancement.
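For intuition, a minimal torch sketch of the two dequantization granularities (illustrative only; DeepGEMM fuses the blockwise scales into its kernel rather than materializing a dequantized matrix, and both helper names are made up for this example):

import torch

def dequant_per_tensor(x_fp8, scale):
    # One scalar scale for the entire tensor (what the updated benchmark feeds cuBLASLt).
    return x_fp8.float() * scale

def dequant_blockwise_128(x_fp8, block_scales):
    # One float32 scale per 128x128 tile: block_scales[i, j] covers
    # x_fp8[128*i : 128*(i+1), 128*j : 128*(j+1)].
    m, n = x_fp8.shape
    full = block_scales.repeat_interleave(128, dim=0).repeat_interleave(128, dim=1)
    return x_fp8.float() * full[:m, :n]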

Test Plan

  • Verify test_fp8.py runs without errors
  • Verify cuBLASLt now uses scaling in the benchmark comparison
  • Compare performance numbers before and after

Fixes #199

🤖 Generated with Claude Code
