Add FP8 per-tensor scaling to cuBLASLt for fair comparison #272
Summary
This PR addresses #199, where the cuBLASLt comparison in test_fp8.py uses tensorwise scaling (effectively no scaling) while DeepGEMM uses blockwise scaling, making the performance comparison misleading.

Changes

- Add optional scale_a and scale_b parameters to call_cublaslt_api()
- Set CUBLASLT_MATMUL_DESC_A_SCALE_POINTER and B_SCALE_POINTER when provided
- Add a cublaslt_fp8_gemm_nt() function that accepts scale tensors
- Export it from __init__.py
- Update test_fp8.py to use per-tensor scaling (max of blockwise scales)

How it works
The test now computes the maximum value from the blockwise scales and uses it as a single per-tensor scale for cuBLASLt.
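A minimal sketch of that computation. The tensor shapes, the random stand-in data, and the exact cublaslt_fp8_gemm_nt() signature below are illustrative assumptions, not copied from the code; see the actual test for its casting helpers:

```python
import torch

from deep_gemm import cublaslt_fp8_gemm_nt  # wrapper added by this PR (assumed import path)

m, k, n = 4096, 7168, 4096

# FP8 operands with blockwise (128x128) float32 scales, standing in for the
# tensors the test produces with its casting helpers (shapes illustrative).
x_fp8 = torch.randn(m, k, device='cuda').to(torch.float8_e4m3fn)
y_fp8 = torch.randn(n, k, device='cuda').to(torch.float8_e4m3fn)
x_scales = torch.rand(m // 128, k // 128, device='cuda')
y_scales = torch.rand(n // 128, k // 128, device='cuda')

# Collapse the blockwise scales into one per-tensor scale each by taking the
# maximum over all blocks (the approximation discussed under Limitations).
a_scale = x_scales.max().reshape(1)
b_scale = y_scales.max().reshape(1)

# Pass the scalars to the new wrapper, which sets them on the matmul
# descriptor via CUBLASLT_MATMUL_DESC_A_SCALE_POINTER / B_SCALE_POINTER.
out = cublaslt_fp8_gemm_nt(x_fp8, y_fp8, a_scale, b_scale)
```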
This provides a fairer comparison where both implementations apply some form of FP8 dequantization.
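For intuition, with the two scale pointers set, the scaled NT GEMM cuBLASLt performs is roughly the following dequantized product (continuing the sketch above):

```python
# Approximately what cuBLASLt computes when A/B scale pointers are set:
# out ~= (a_scale * A) @ (b_scale * B)^T, accumulated in higher precision.
ref = (x_fp8.float() * a_scale) @ (y_fp8.float() * b_scale).t()
```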
Limitations
This is an approximation: cuBLASLt uses per-tensor scaling while DeepGEMM uses more precise blockwise (128x128) scaling. For a fully aligned comparison, cuBLASLt would need blockwise scaling support using CUBLASLT_MATMUL_SCALE_MODE_BLOCK_2D (available in CUDA 12.0+); this could be a future enhancement.

Test Plan
- test_fp8.py runs without errors

Fixes #199
🤖 Generated with Claude Code