[Not for land] remove workaround for slow rowwise cutlass gemm #2185

danielvegamyhre · 2025-05-08T01:21:01Z

Microbenchmarking float8 linear

The fp8 speedup is nearly identical between the CUTLASS rowwise GEMM and the tensorwise cuBLAS GEMM with rescale.
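For context, here is a minimal sketch of why the two paths are interchangeable numerically. It assumes the workaround runs the GEMM with unit scales and then multiplies the output by the outer product of the per-row and per-column inverse scales; the PR description does not spell this out. It is pure Python with no fp8 or GPU involved, and all function names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rowwise_scaled_matmul(a, b, a_inv_scale, b_inv_scale):
    """Apply one scale per row of `a` and one per column of `b`, then multiply."""
    a_scaled = [[v * s for v in row] for row, s in zip(a, a_inv_scale)]
    b_scaled = [[v * s for v, s in zip(row, b_inv_scale)] for row in b]
    return matmul(a_scaled, b_scaled)


def unit_scale_matmul_then_rescale(a, b, a_inv_scale, b_inv_scale):
    """Multiply with no scaling, then rescale each output element afterwards
    (the assumed shape of the workaround)."""
    out = matmul(a, b)
    return [
        [v * sa * sb for v, sb in zip(row, b_inv_scale)]
        for row, sa in zip(out, a_inv_scale)
    ]


# Power-of-two scales keep the float arithmetic exact in this toy example.
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
a_inv_scale = [0.5, 2.0]   # one per row of a
b_inv_scale = [4.0, 0.25]  # one per column of b

fused = rowwise_scaled_matmul(a, b, a_inv_scale, b_inv_scale)
rescaled = unit_scale_matmul_then_rescale(a, b, a_inv_scale, b_inv_scale)
print(fused == rescaled)
```

Both paths produce the same matrix, so the choice between them is purely a performance question, which is what the benchmarks here measure.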

tensorwise w/ rescale:
        name                shape                                       scaling_repr  compiled  use_fast_accum  ref_time_sec  pt_fp8_time_sec  pt_fp8_speedup
0  attn.wqkv  (16384, 8192, 1280)  in_features=8192, out_features=1280, bias=Fals...      True            True      0.001954         0.001757        1.111733
1    attn.w0  (16384, 1024, 8192)  in_features=1024, out_features=8192, bias=Fals...      True            True      0.001701         0.001193        1.426141
2    ffn.w13  (16384, 8192, 7168)  in_features=8192, out_features=7168, bias=Fals...      True            True      0.009871         0.005707        1.729502
3     ffn.w2  (16384, 3584, 8192)  in_features=3584, out_features=8192, bias=Fals...      True            True      0.005192         0.003034        1.711090
4  attn.wqkv  (16384, 8192, 1280)  in_features=8192, out_features=1280, bias=Fals...      True           False      0.001973         0.001773        1.112681
5    attn.w0  (16384, 1024, 8192)  in_features=1024, out_features=8192, bias=Fals...      True           False      0.001688         0.001196        1.410966
6    ffn.w13  (16384, 8192, 7168)  in_features=8192, out_features=7168, bias=Fals...      True           False      0.009926         0.005763        1.722493
7     ffn.w2  (16384, 3584, 8192)  in_features=3584, out_features=8192, bias=Fals...      True           False      0.005197         0.003040        1.709187

rowwise:
        name                shape                                       scaling_repr  compiled  use_fast_accum  ref_time_sec  pt_fp8_time_sec  pt_fp8_speedup
0  attn.wqkv  (16384, 8192, 1280)  in_features=8192, out_features=1280, bias=Fals...      True            True      0.001953         0.001763        1.107712
1    attn.w0  (16384, 1024, 8192)  in_features=1024, out_features=8192, bias=Fals...      True            True      0.001697         0.001198        1.416346
2    ffn.w13  (16384, 8192, 7168)  in_features=8192, out_features=7168, bias=Fals...      True            True      0.009902         0.005706        1.735248
3     ffn.w2  (16384, 3584, 8192)  in_features=3584, out_features=8192, bias=Fals...      True            True      0.005167         0.003045        1.697165
4  attn.wqkv  (16384, 8192, 1280)  in_features=8192, out_features=1280, bias=Fals...      True           False      0.001993         0.001781        1.118983
5    attn.w0  (16384, 1024, 8192)  in_features=1024, out_features=8192, bias=Fals...      True           False      0.001712         0.001206        1.420238
6    ffn.w13  (16384, 8192, 7168)  in_features=8192, out_features=7168, bias=Fals...      True           False      0.009927         0.005783        1.716519
7     ffn.w2  (16384, 3584, 8192)  in_features=3584, out_features=8192, bias=Fals...      True           False      0.005211         0.003052        1.707381
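
To quantify "nearly identical", the following sketch compares the `pt_fp8_speedup` column of the two tables row by row. The values are copied from the tables above; the helper `rel_diff_pct` is a hypothetical name introduced for illustration.

```python
# pt_fp8_speedup values copied from the two tables above, in row order 0..7.
tensorwise = [1.111733, 1.426141, 1.729502, 1.711090,
              1.112681, 1.410966, 1.722493, 1.709187]
rowwise = [1.107712, 1.416346, 1.735248, 1.697165,
           1.118983, 1.420238, 1.716519, 1.707381]
names = ["attn.wqkv", "attn.w0", "ffn.w13", "ffn.w2"] * 2


def rel_diff_pct(new, base):
    """Relative difference of `new` versus `base`, in percent."""
    return 100.0 * (new - base) / base


diffs = [rel_diff_pct(r, t) for r, t in zip(rowwise, tensorwise)]
for i, (name, d) in enumerate(zip(names, diffs)):
    print(f"{i}  {name:<10s} rowwise vs tensorwise: {d:+.2f}%")

max_abs_diff = max(abs(d) for d in diffs)
print(f"largest absolute difference: {max_abs_diff:.2f}%")
```

Every row differs by less than 1% in either direction, with no consistent sign, which is in line with run-to-run noise rather than a systematic gap.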

E2E training benchmarks with torchtitan

tensorwise GEMM + rescaling:

Median Tokens/Second (excluding step 1): 7064.0
Max Memory Usage: 38.16 GiB

rowwise GEMM:

Median Tokens/Second (excluding step 1): 7056.0
Max Memory Usage: 38.16 GiB
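
The same comparison for the end-to-end numbers, as a small sketch using only the figures reported above (variable names are illustrative):

```python
# Median tokens/second and peak memory copied from the torchtitan runs above.
tensorwise_tps, rowwise_tps = 7064.0, 7056.0
tensorwise_mem_gib, rowwise_mem_gib = 38.16, 38.16

# Throughput change of the rowwise GEMM relative to tensorwise + rescaling.
tps_diff_pct = 100.0 * (rowwise_tps - tensorwise_tps) / tensorwise_tps
print(f"throughput difference: {tps_diff_pct:+.2f}%")
print(f"same peak memory: {tensorwise_mem_gib == rowwise_mem_gib}")
```

The throughput gap is roughly a tenth of a percent and peak memory is unchanged.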

pytorch-bot · 2025-05-08T01:21:05Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2185

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit e1d087d with merge base cdced21:

NEW FAILURES - The following jobs have failed:

PR Label Check / Check PR Labels (gh)
Process completed with exit code 1.
Run Regression Tests / test-nightly (CUDA Nightly, linux.g5.12xlarge.nvidia.gpu, --pre torch --index-url https://downloa... / linux-job (gh)
test/integration/test_integration.py::TestUtils::test_get_model_size_autoquant_5_cuda

This comment was automatically generated by Dr. CI and updates every 15 minutes.

remove workaround for slow rowwise gemm

e1d087d

facebook-github-bot added the CLA Signed label May 8, 2025

danielvegamyhre marked this pull request as draft May 8, 2025 01:21
