
Conversation


@vincentzed vincentzed commented Dec 31, 2025

πŸ“Œ Description

πŸ” Related Issues

πŸš€ Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

βœ… Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

πŸ§ͺ Tests

flashinfer ❯ FLASHINFER_DISABLE_VERSION_CHECK=1 python benchmarks/bench_tgv_gemm.py
Starting BF16 TGV GEMM SM100 Tests
==================================================

=== Testing correctness ===
Cosine similarity: 1.000000
Max difference: 1.000000
Mean difference: 0.036133
βœ“ Correctness test PASSED

=== Testing tgv_gemm_bf16_sm100 with different sizes ===

--- deepseekv3, o_proj, tp=8: M=1, N=7168, K=2048, has_bias=False ---
CUBLAS average time: 0.006327 ms, 4.640 TFLOPS
2025-12-31 20:13:54,908 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:13:54,938 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.005773 ms, 5.086 TFLOPS, speedup: 1.10x

Testing with PDL...
PDL average time: 0.004935 ms, 5.949 TFLOPS, speedup: 1.28x

--- deepseekv3, o_proj, tp=8: M=4, N=7168, K=2048, has_bias=False ---
CUBLAS average time: 0.005939 ms, 19.773 TFLOPS
2025-12-31 20:13:56,853 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:13:56,882 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.005521 ms, 21.273 TFLOPS, speedup: 1.08x

Testing with PDL...
PDL average time: 0.004974 ms, 23.609 TFLOPS, speedup: 1.19x

--- deepseekv3, o_proj, tp=8: M=8, N=7168, K=2048, has_bias=False ---
CUBLAS average time: 0.005265 ms, 44.609 TFLOPS
2025-12-31 20:13:58,645 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:13:58,661 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.005214 ms, 45.047 TFLOPS, speedup: 1.01x

Testing with PDL...
PDL average time: 0.004740 ms, 49.548 TFLOPS, speedup: 1.11x

--- deepseekv3, o_proj, tp=8: M=16, N=7168, K=2048, has_bias=False ---
CUBLAS average time: 0.005853 ms, 80.256 TFLOPS
2025-12-31 20:14:00,615 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:00,631 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.005446 ms, 86.265 TFLOPS, speedup: 1.07x

Testing with PDL...
PDL average time: 0.004678 ms, 100.424 TFLOPS, speedup: 1.25x

--- deepseekv3, o_proj, tp=8: M=32, N=7168, K=2048, has_bias=False ---
CUBLAS average time: 0.006024 ms, 155.956 TFLOPS
2025-12-31 20:14:02,470 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:02,488 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.005762 ms, 163.056 TFLOPS, speedup: 1.05x

Testing with PDL...
PDL average time: 0.004952 ms, 189.740 TFLOPS, speedup: 1.22x

--- deepseekv3, o_proj, tp=8: M=64, N=7168, K=2048, has_bias=False ---
CUBLAS average time: 0.006218 ms, 302.190 TFLOPS
2025-12-31 20:14:04,326 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:04,345 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.006346 ms, 296.077 TFLOPS, speedup: 0.98x

Testing with PDL...
PDL average time: 0.005899 ms, 318.527 TFLOPS, speedup: 1.05x

--- deepseekv3, q_b_proj, tp=8: M=1, N=3072, K=1536, has_bias=False ---
CUBLAS average time: 0.006919 ms, 1.364 TFLOPS
2025-12-31 20:14:06,119 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:06,134 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.004602 ms, 2.051 TFLOPS, speedup: 1.50x

Testing with PDL...
PDL average time: 0.004133 ms, 2.283 TFLOPS, speedup: 1.67x

--- deepseekv3, q_b_proj, tp=8: M=4, N=3072, K=1536, has_bias=False ---
CUBLAS average time: 0.005842 ms, 6.461 TFLOPS
2025-12-31 20:14:08,004 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:08,032 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.004786 ms, 7.887 TFLOPS, speedup: 1.22x

Testing with PDL...
PDL average time: 0.003971 ms, 9.506 TFLOPS, speedup: 1.47x

--- deepseekv3, q_b_proj, tp=8: M=8, N=3072, K=1536, has_bias=False ---
CUBLAS average time: 0.005846 ms, 12.915 TFLOPS
2025-12-31 20:14:09,741 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:09,757 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.004584 ms, 16.471 TFLOPS, speedup: 1.28x

Testing with PDL...
PDL average time: 0.004153 ms, 18.178 TFLOPS, speedup: 1.41x

--- deepseekv3, q_b_proj, tp=8: M=16, N=3072, K=1536, has_bias=False ---
CUBLAS average time: 0.004388 ms, 34.412 TFLOPS
2025-12-31 20:14:11,529 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:11,545 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.004785 ms, 31.557 TFLOPS, speedup: 0.92x

Testing with PDL...
PDL average time: 0.003980 ms, 37.934 TFLOPS, speedup: 1.10x

--- deepseekv3, q_b_proj, tp=8: M=32, N=3072, K=1536, has_bias=False ---
CUBLAS average time: 0.004577 ms, 65.983 TFLOPS
2025-12-31 20:14:13,403 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:13,419 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.004798 ms, 62.940 TFLOPS, speedup: 0.95x

Testing with PDL...
PDL average time: 0.004133 ms, 73.075 TFLOPS, speedup: 1.11x

--- deepseekv3, q_b_proj, tp=8: M=64, N=3072, K=1536, has_bias=False ---
CUBLAS average time: 0.004911 ms, 122.992 TFLOPS
2025-12-31 20:14:15,211 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:15,228 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.004883 ms, 123.690 TFLOPS, speedup: 1.01x

Testing with PDL...
PDL average time: 0.004382 ms, 137.827 TFLOPS, speedup: 1.12x

--- gpt-oss-120b, qkv_proj, tp=4: M=1, N=1280, K=2880, has_bias=True ---
CUBLAS average time: 0.007843 ms, 0.940 TFLOPS
2025-12-31 20:14:17,051 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:17,067 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.006012 ms, 1.226 TFLOPS, speedup: 1.30x

Testing with PDL...
PDL average time: 0.005520 ms, 1.336 TFLOPS, speedup: 1.42x

--- gpt-oss-120b, qkv_proj, tp=4: M=4, N=1280, K=2880, has_bias=True ---
CUBLAS average time: 0.007246 ms, 4.070 TFLOPS
2025-12-31 20:14:18,836 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:18,865 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.006200 ms, 4.757 TFLOPS, speedup: 1.17x

Testing with PDL...
PDL average time: 0.005407 ms, 5.455 TFLOPS, speedup: 1.34x

--- gpt-oss-120b, qkv_proj, tp=4: M=8, N=1280, K=2880, has_bias=True ---
CUBLAS average time: 0.007241 ms, 8.145 TFLOPS
2025-12-31 20:14:20,748 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:20,764 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.006087 ms, 9.690 TFLOPS, speedup: 1.19x

Testing with PDL...
PDL average time: 0.005399 ms, 10.925 TFLOPS, speedup: 1.34x

--- gpt-oss-120b, qkv_proj, tp=4: M=16, N=1280, K=2880, has_bias=True ---
CUBLAS average time: 0.006055 ms, 19.483 TFLOPS
2025-12-31 20:14:22,591 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:22,607 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.005997 ms, 19.670 TFLOPS, speedup: 1.01x

Testing with PDL...
PDL average time: 0.005415 ms, 21.786 TFLOPS, speedup: 1.12x

--- gpt-oss-120b, qkv_proj, tp=4: M=32, N=1280, K=2880, has_bias=True ---
CUBLAS average time: 0.006202 ms, 38.039 TFLOPS
2025-12-31 20:14:24,459 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:24,476 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.006008 ms, 39.268 TFLOPS, speedup: 1.03x

Testing with PDL...
PDL average time: 0.005514 ms, 42.785 TFLOPS, speedup: 1.12x

--- gpt-oss-120b, qkv_proj, tp=4: M=64, N=1280, K=2880, has_bias=True ---
CUBLAS average time: 0.006204 ms, 76.059 TFLOPS
2025-12-31 20:14:26,203 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:26,221 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.006421 ms, 73.485 TFLOPS, speedup: 0.97x

Testing with PDL...
PDL average time: 0.005615 ms, 84.033 TFLOPS, speedup: 1.10x

--- gpt-oss-120b, qkv_proj, tp=4: M=128, N=1280, K=2880, has_bias=True ---
CUBLAS average time: 0.006359 ms, 148.408 TFLOPS
2025-12-31 20:14:28,057 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:28,075 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.006428 ms, 146.816 TFLOPS, speedup: 0.99x

Testing with PDL...
PDL average time: 0.005672 ms, 166.388 TFLOPS, speedup: 1.12x

--- gpt-oss-120b, o_proj, tp=4: M=1, N=2880, K=1024, has_bias=True ---
CUBLAS average time: 0.005422 ms, 1.088 TFLOPS
2025-12-31 20:14:29,911 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:29,925 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.004073 ms, 1.448 TFLOPS, speedup: 1.33x

Testing with PDL...
PDL average time: 0.003560 ms, 1.657 TFLOPS, speedup: 1.52x

--- gpt-oss-120b, o_proj, tp=4: M=4, N=2880, K=1024, has_bias=True ---
CUBLAS average time: 0.004809 ms, 4.906 TFLOPS
2025-12-31 20:14:31,788 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:31,816 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.003976 ms, 5.934 TFLOPS, speedup: 1.21x

Testing with PDL...
PDL average time: 0.003555 ms, 6.636 TFLOPS, speedup: 1.35x

--- gpt-oss-120b, o_proj, tp=4: M=8, N=2880, K=1024, has_bias=True ---
CUBLAS average time: 0.004791 ms, 9.848 TFLOPS
2025-12-31 20:14:33,632 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:33,648 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.004183 ms, 11.280 TFLOPS, speedup: 1.15x

Testing with PDL...
PDL average time: 0.003485 ms, 13.541 TFLOPS, speedup: 1.37x

--- gpt-oss-120b, o_proj, tp=4: M=16, N=2880, K=1024, has_bias=True ---
CUBLAS average time: 0.003918 ms, 24.084 TFLOPS
2025-12-31 20:14:35,393 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:35,409 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.004205 ms, 22.442 TFLOPS, speedup: 0.93x

Testing with PDL...
PDL average time: 0.003552 ms, 26.572 TFLOPS, speedup: 1.10x

--- gpt-oss-120b, o_proj, tp=4: M=32, N=2880, K=1024, has_bias=True ---
CUBLAS average time: 0.004081 ms, 46.244 TFLOPS
2025-12-31 20:14:37,317 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:37,333 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.004162 ms, 45.344 TFLOPS, speedup: 0.98x

Testing with PDL...
PDL average time: 0.003576 ms, 52.782 TFLOPS, speedup: 1.14x

--- gpt-oss-120b, o_proj, tp=4: M=64, N=2880, K=1024, has_bias=True ---
CUBLAS average time: 0.004178 ms, 90.358 TFLOPS
2025-12-31 20:14:39,160 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:39,178 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.004539 ms, 83.172 TFLOPS, speedup: 0.92x

Testing with PDL...
PDL average time: 0.003768 ms, 100.177 TFLOPS, speedup: 1.11x

--- gpt-oss-120b, o_proj, tp=4: M=128, N=2880, K=1024, has_bias=True ---
CUBLAS average time: 0.004576 ms, 164.985 TFLOPS
2025-12-31 20:14:40,987 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:41,006 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.004792 ms, 157.552 TFLOPS, speedup: 0.95x

Testing with PDL...
PDL average time: 0.004384 ms, 172.222 TFLOPS, speedup: 1.04x

=== Writing results to bf16_tgv_gemm_benchmark_results.csv ===
Benchmark results saved to bf16_tgv_gemm_benchmark_results.csv
Total test cases: 26

==================================================
All BF16 TGV GEMM SM100 tests completed successfully!
  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).
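
For readers of the numbers above: the TFLOPS and speedup columns are presumably derived from the standard 2*M*N*K FLOP count for a GEMM. The helper below is an illustrative reconstruction (not code from bench_tgv_gemm.py) that reproduces the first o_proj row:

def gemm_tflops(m: int, n: int, k: int, time_ms: float) -> float:
    flops = 2.0 * m * n * k                 # one multiply-accumulate counted as 2 FLOPs
    return flops / (time_ms * 1e-3) / 1e12  # ms -> s, FLOPS -> TFLOPS

cublas_ms, tgv_ms = 0.006327, 0.005773      # from the M=1, N=7168, K=2048 row above
print(f"{gemm_tflops(1, 7168, 2048, tgv_ms):.3f} TFLOPS, speedup: {cublas_ms / tgv_ms:.2f}x")
# -> 5.086 TFLOPS, speedup: 1.10x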

Reviewer Notes

Summary by CodeRabbit

  • Chores
    • Refactored benchmark infrastructure to improve consistency and maintainability across benchmark variants.

Signed-off-by: vincentzed <[email protected]>

coderabbitai bot commented Dec 31, 2025

πŸ“ Walkthrough

The benchmark script refactors its timing implementation to use a centralized bench_gpu_time_with_cudagraph utility function instead of manual CUDA graph capture and timing logic. This consolidation replaces explicit timing blocks across CUBLAS, TGV, and PDL code paths with consistent utility calls.
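
For reference, a minimal sketch of what one consolidated timing call looks like. It assumes the argument names visible in the review comments below (dry_run_time_ms, repeat_time_ms, cold_l2_cache), infers the import path from flashinfer/testing/utils.py listed in the review, and assumes the utility returns a list of per-measurement times in milliseconds; the tensor shapes and the averaging step are illustrative, not taken from the diff:

import numpy as np
import torch
import torch.nn.functional as F

from flashinfer.testing.utils import bench_gpu_time_with_cudagraph  # path inferred from the review

# Illustrative shapes (M=8, N=7168, K=2048); the benchmark sweeps these per model/layer.
A = torch.randn(8, 2048, device="cuda", dtype=torch.bfloat16)
B = torch.randn(2048, 7168, device="cuda", dtype=torch.bfloat16)

# One utility call replaces the hand-rolled CUDA graph capture, warmup, and event timing.
cublas_times = bench_gpu_time_with_cudagraph(
    lambda: F.linear(A, B.T),          # same call shape as the benchmark's cuBLAS path (bias omitted here)
    dry_run_time_ms=100,
    repeat_time_ms=500,
    cold_l2_cache=False,
)
avg_ms = float(np.mean(cublas_times))  # assumption: the utility returns per-iteration times in ms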

Changes

Cohort / File(s): Benchmark Timing Refactoring (benchmarks/bench_tgv_gemm.py)
Summary: Replaces manual CUDA graph capture and timing with bench_gpu_time_with_cudagraph utility calls. Updates import statements, removes explicit time-measurement logic, and recalculates speedups based on the aggregated timing results from the utility function. Applies the unified benchmarking approach to the CUBLAS, TGV, and PDL variants.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 With graphs now captured neat and clean,
Our benchmarks streamlined, lean and keen!
No more hand-timed delays to fight;
One helper function shines so bright. ✨
Speedups calculated, straight and true,
Old timing code we bid adieu! πŸ‘‹

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)

  • Description check (⚠️ Warning): The PR description contains only the empty template with unchecked checklist items and benchmark output, lacking any meaningful explanation of the changes, their purpose, or rationale. Resolution: fill in the Description section to explain what the PR does (refactoring the benchmark timing logic), why it is needed, and add context about the changes; provide reviewer notes if applicable.
  • Title check (❓ Inconclusive): The title 'Tiny fix bench tgv gemm' is vague and obscures the actual change. Resolution: use a more descriptive title, such as 'Refactor bench_tgv_gemm to use bench_gpu_time_with_cudagraph utility' or 'Replace manual CUDA graph timing with bench_gpu_time_with_cudagraph in TGV benchmark'.

βœ… Passed checks (1 passed)

  • Docstring Coverage (βœ… Passed): Docstring coverage is 100.00%, which meets the required threshold of 80.00%.

✨ Finishing touches

  • πŸ“ Generate docstrings


@gemini-code-assist
Contributor

Summary of Changes

Hello @vincentzed, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the TGV GEMM benchmarking script by integrating a new, centralized utility function for GPU time measurement. The change aims to standardize the benchmarking process, improve code readability, and ensure consistent performance evaluation across CUBLAS, TGV, and PDL GEMM implementations.

Highlights

  • Benchmarking Utility Integration: Replaced manual CUDA graph benchmarking logic with a dedicated utility function, bench_gpu_time_with_cudagraph, to standardize and streamline performance measurements for GEMM operations.
  • Code Simplification and Maintainability: Removed redundant manual CUDA graph setup, warmup, and timing code, resulting in a cleaner, more concise, and easier-to-maintain benchmark script.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the bench_tgv_gemm.py benchmark script to use the bench_gpu_time_with_cudagraph utility function, which simplifies the code and improves maintainability by removing boilerplate CUDA graph benchmarking logic.

While the refactoring is a good improvement, I've noticed a change in the benchmarking methodology. The number of iterations captured within the CUDA graph has been implicitly changed from 100 to the default of 10. This can affect the benchmark results by changing how kernel launch overhead is amortized. I've added comments with suggestions to restore the original number of iterations to ensure benchmark consistency.

Comment on lines +83 to +88
cublas_times = bench_gpu_time_with_cudagraph(
    lambda: F.linear(A, B.T, bias),
    dry_run_time_ms=100,
    repeat_time_ms=500,
    cold_l2_cache=False,
)

Severity: high

The previous implementation captured 100 iterations within the CUDA graph to amortize launch overhead. The bench_gpu_time_with_cudagraph function defaults to num_iters_within_graph=10. To maintain consistency with the previous benchmarking methodology and ensure better amortization of kernel launch overhead, it's recommended to explicitly set num_iters_within_graph=100.

Suggested change
-cublas_times = bench_gpu_time_with_cudagraph(
-    lambda: F.linear(A, B.T, bias),
-    dry_run_time_ms=100,
-    repeat_time_ms=500,
-    cold_l2_cache=False,
-)
+cublas_times = bench_gpu_time_with_cudagraph(
+    lambda: F.linear(A, B.T, bias),
+    dry_run_time_ms=100,
+    repeat_time_ms=500,
+    cold_l2_cache=False,
+    num_iters_within_graph=100,
+)

Comment on lines +101 to +106
tgv_times = bench_gpu_time_with_cudagraph(
    lambda: tgv_gemm_sm100(A, B, bias),
    dry_run_time_ms=100,
    repeat_time_ms=500,
    cold_l2_cache=False,
)

Severity: high

The previous implementation captured 100 iterations within the CUDA graph to amortize launch overhead. The bench_gpu_time_with_cudagraph function defaults to num_iters_within_graph=10. To maintain consistency with the previous benchmarking methodology and ensure better amortization of kernel launch overhead, it's recommended to explicitly set num_iters_within_graph=100.

Suggested change
-tgv_times = bench_gpu_time_with_cudagraph(
-    lambda: tgv_gemm_sm100(A, B, bias),
-    dry_run_time_ms=100,
-    repeat_time_ms=500,
-    cold_l2_cache=False,
-)
+tgv_times = bench_gpu_time_with_cudagraph(
+    lambda: tgv_gemm_sm100(A, B, bias),
+    dry_run_time_ms=100,
+    repeat_time_ms=500,
+    cold_l2_cache=False,
+    num_iters_within_graph=100,
+)

Comment on lines +114 to +119
pdl_times = bench_gpu_time_with_cudagraph(
    lambda: tgv_gemm_sm100(A, B, bias, pdl=True),
    dry_run_time_ms=100,
    repeat_time_ms=500,
    cold_l2_cache=False,
)

Severity: high

The previous implementation captured 100 iterations within the CUDA graph to amortize launch overhead. The bench_gpu_time_with_cudagraph function defaults to num_iters_within_graph=10. To maintain consistency with the previous benchmarking methodology and ensure better amortization of kernel launch overhead, it's recommended to explicitly set num_iters_within_graph=100.

Suggested change
-pdl_times = bench_gpu_time_with_cudagraph(
-    lambda: tgv_gemm_sm100(A, B, bias, pdl=True),
-    dry_run_time_ms=100,
-    repeat_time_ms=500,
-    cold_l2_cache=False,
-)
+pdl_times = bench_gpu_time_with_cudagraph(
+    lambda: tgv_gemm_sm100(A, B, bias, pdl=True),
+    dry_run_time_ms=100,
+    repeat_time_ms=500,
+    cold_l2_cache=False,
+    num_iters_within_graph=100,
+)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
benchmarks/bench_tgv_gemm.py (3)

83-89: Consider using input_args to avoid capturing loop variables in lambda.

The lambda captures A, B, and bias from the loop scope. While this works because bench_gpu_time_with_cudagraph executes immediately, using input_args would be more explicit and eliminate the static analysis warning.

πŸ”Ž Suggested refactor using input_args

Per the bench_gpu_time_with_cudagraph docstring, you can pass arguments explicitly:

-cublas_times = bench_gpu_time_with_cudagraph(
-    lambda: F.linear(A, B.T, bias),
-    dry_run_time_ms=100,
-    repeat_time_ms=500,
-    cold_l2_cache=False,
-)
+cublas_times = bench_gpu_time_with_cudagraph(
+    F.linear,
+    dry_run_time_ms=100,
+    repeat_time_ms=500,
+    cold_l2_cache=False,
+    input_args=(A, B.T, bias),
+)

101-107: Consider using input_args to avoid capturing loop variables in lambda.

Same pattern as the CUBLAS benchmark: the lambda captures loop-scoped variables. Using input_args would eliminate the static analysis warning.

πŸ”Ž Suggested refactor using input_args
-tgv_times = bench_gpu_time_with_cudagraph(
-    lambda: tgv_gemm_sm100(A, B, bias),
-    dry_run_time_ms=100,
-    repeat_time_ms=500,
-    cold_l2_cache=False,
-)
+tgv_times = bench_gpu_time_with_cudagraph(
+    tgv_gemm_sm100,
+    dry_run_time_ms=100,
+    repeat_time_ms=500,
+    cold_l2_cache=False,
+    input_args=(A, B, bias),
+)

114-120: Consider using input_args and input_kwargs to avoid capturing loop variables in lambda.

Same lambda closure pattern, but with a keyword argument. Using input_args and input_kwargs would eliminate the static analysis warning.

πŸ”Ž Suggested refactor using input_args and input_kwargs
-pdl_times = bench_gpu_time_with_cudagraph(
-    lambda: tgv_gemm_sm100(A, B, bias, pdl=True),
-    dry_run_time_ms=100,
-    repeat_time_ms=500,
-    cold_l2_cache=False,
-)
+pdl_times = bench_gpu_time_with_cudagraph(
+    tgv_gemm_sm100,
+    dry_run_time_ms=100,
+    repeat_time_ms=500,
+    cold_l2_cache=False,
+    input_args=(A, B, bias),
+    input_kwargs={"pdl": True},
+)

πŸ“œ Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 747b0cb and 46f87f8.

πŸ“’ Files selected for processing (1)
  • benchmarks/bench_tgv_gemm.py
🧰 Additional context used
🧬 Code graph analysis (1)
benchmarks/bench_tgv_gemm.py (1)
flashinfer/testing/utils.py (1)
  • bench_gpu_time_with_cudagraph (1259-1481)
πŸͺ› Ruff (0.14.10)
benchmarks/bench_tgv_gemm.py

84-84: Function definition does not bind loop variable A (B023)
84-84: Function definition does not bind loop variable B (B023)
84-84: Function definition does not bind loop variable bias (B023)
102-102: Function definition does not bind loop variable A (B023)
102-102: Function definition does not bind loop variable B (B023)
102-102: Function definition does not bind loop variable bias (B023)
115-115: Function definition does not bind loop variable A (B023)
115-115: Function definition does not bind loop variable B (B023)
115-115: Function definition does not bind loop variable bias (B023)

πŸ”‡ Additional comments (1)
benchmarks/bench_tgv_gemm.py (1)

10-10: LGTM!

The import of bench_gpu_time_with_cudagraph enables cleaner timing logic by replacing manual CUDA graph capture and replay.
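
For contrast, a generic sketch of the kind of hand-rolled capture-and-replay timing this refactor removes. It is an illustrative PyTorch pattern (torch.cuda.CUDAGraph plus CUDA events), not the exact code that was deleted from bench_tgv_gemm.py, and the iteration counts are placeholders:

import torch

def manual_graph_time_ms(fn, num_iters_within_graph=10, repeats=20):
    # Warm up on a side stream so graph capture sees initialized kernels/workspaces.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            fn()
    torch.cuda.current_stream().wait_stream(s)

    # Capture several iterations inside one graph to amortize launch overhead.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        for _ in range(num_iters_within_graph):
            fn()

    # Time repeated replays with CUDA events and return the average per-iteration time in ms.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(repeats):
        g.replay()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / (repeats * num_iters_within_graph)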
