Add FP8 per-tensor scaling to cuBLASLt for fair comparison #272
Closed
yurekami wants to merge 1 commit into deepseek-ai:main
Conversation
Summary
This PR addresses #199, where the cuBLASLt comparison in test_fp8.py uses tensorwise scaling (i.e., no scaling) while DeepGEMM uses blockwise scaling, making the performance comparison misleading.
Changes
- Add scale_a and scale_b optional parameters to call_cublaslt_api()
- Set CUBLASLT_MATMUL_DESC_A_SCALE_POINTER and CUBLASLT_MATMUL_DESC_B_SCALE_POINTER when provided
- Add a new cublaslt_fp8_gemm_nt() function that accepts scale tensors (usage sketch below)
- Export the new function via pybind11 and the Python __init__.py
- Update test_fp8.py to use per-tensor scaling (the max of the blockwise scales)
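For illustration, a minimal sketch of how the new entry point might be driven from Python; the deep_gemm import path and the exact signature are assumptions modeled on the description above, not copied from the patch:

```python
import torch
import deep_gemm  # assumed import path for the bindings exported via __init__.py

# Hypothetical driver for the new binding; shapes and names are illustrative.
m, n, k = 4096, 4096, 7168
a = torch.randn(m, k, device='cuda').to(torch.float8_e4m3fn)  # A, row-major
b = torch.randn(n, k, device='cuda').to(torch.float8_e4m3fn)  # B, row-major (NT layout)
d = torch.empty(m, n, device='cuda', dtype=torch.bfloat16)    # output

# Single-element FP32 scales resident on the GPU; the C++ side passes their
# data pointers via CUBLASLT_MATMUL_DESC_A_SCALE_POINTER / B_SCALE_POINTER.
scale_a = torch.ones(1, device='cuda', dtype=torch.float32)
scale_b = torch.ones(1, device='cuda', dtype=torch.float32)

deep_gemm.cublaslt_fp8_gemm_nt(a, b, d, scale_a, scale_b)
```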
How it works
The test now computes the maximum value from the blockwise scales and uses it as a single per-tensor scale for cuBLASLt.
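A minimal sketch of that computation; the tensor names are assumptions, not taken verbatim from test_fp8.py:

```python
# a_scales / b_scales hold the blockwise FP32 scale factors produced by the
# test's quantization helpers; collapse each into one per-tensor scale.
scale_a = a_scales.max().reshape(1)  # single per-tensor scale for A
scale_b = b_scales.max().reshape(1)  # single per-tensor scale for B
cublaslt_fp8_gemm_nt(a_fp8, b_fp8, d, scale_a, scale_b)
```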
This provides a fairer comparison where both implementations apply some form of FP8 dequantization.
Limitations
This is an approximation: cuBLASLt uses per-tensor scaling while DeepGEMM uses more precise blockwise (128x128) scaling. For a fully aligned comparison, cuBLASLt would need blockwise scaling support using CUBLASLT_MATMUL_SCALE_MODE_BLOCK_2D (available in CUDA 12.0+). This could be a future enhancement.
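To make the gap concrete, a reference-level sketch of the two dequantization granularities; this is illustrative only, with the blockwise scales assumed to be laid out one per 128x128 tile:

```python
import torch

def dequant_per_tensor(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # One FP32 scale for the whole tensor (what the cuBLASLt path now does).
    return x_fp8.float() * scale

def dequant_blockwise(x_fp8: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    # One FP32 scale per 128x128 block (what DeepGEMM does); `scales` is
    # assumed to have shape (ceil(m / 128), ceil(n / 128)).
    x = x_fp8.float()
    m, n = x.shape
    for i in range(0, m, 128):
        for j in range(0, n, 128):
            x[i:i + 128, j:j + 128] *= scales[i // 128, j // 128]
    return x
```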
Test Plan
- test_fp8.py runs without errors

Fixes #199
🤖 Generated with Claude Code