[benchmarks][vllm] Unified Attention benchmark (paged attention) #5348
Merged
Egor-Krivov merged 20 commits into main on Nov 14, 2025
anmyachev approved these changes on Nov 14, 2025
Contributor
Pull Request Overview
This PR adds a paged attention benchmark for the vLLM library, implementing both 2D and 3D unified attention kernels with tensor descriptor optimizations. The benchmark compares performance against PyTorch reference implementations and reports both memory bandwidth (GB/s) and compute throughput (TFlops) metrics.
Key changes:
- Implementation of unified attention benchmark with paged KV cache support
- Enhanced memory bandwidth calculations accounting for actual token usage
- Extended result transformation to report GB/s metrics alongside TFlops
- CI/CD workflow updates to run the new benchmark
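As a rough illustration of the two reported metrics, the sketch below converts a measured kernel time into GB/s and TFlops. The function names and the millisecond time unit are assumptions made for this example; the benchmark in this PR may compute and name these values differently.

```python
def to_gbps(total_bytes: float, time_ms: float) -> float:
    """Effective memory bandwidth in GB/s (1 GB = 1e9 bytes).

    Hypothetical helper, not taken from the PR.
    """
    seconds = time_ms / 1e3
    return total_bytes / seconds / 1e9


def to_tflops(total_flops: float, time_ms: float) -> float:
    """Achieved compute rate in TFlops (1 TFlop = 1e12 floating-point ops).

    Hypothetical helper, not taken from the PR.
    """
    seconds = time_ms / 1e3
    return total_flops / seconds / 1e12


if __name__ == "__main__":
    # Illustrative numbers only: 4 GB moved and 2e12 flops done in 10 ms.
    print(to_gbps(4e9, 10.0), to_tflops(2e12, 10.0))
```

Reporting both values for the same run makes it easier to see which resource a kernel is closer to saturating.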
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| benchmarks/third_party/vllm/unified_attention_benchmark.py | New comprehensive benchmark for vLLM's unified attention with 2D/3D kernels, supporting various model configurations and attention features |
| benchmarks/third_party/vllm/transform_results.py | Enhanced to handle non-integer parameter values and report both TFlops and GB/s metrics |
| benchmarks/third_party/vllm/batched_moe_benchmark.py | Improved memory bandwidth calculation to account for actual activated experts and token usage |
| .github/workflows/third-party-benchmarks.yml | Added unified attention benchmark to CI workflow and improved command formatting |
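One way to read "accounting for actual token usage" for paged attention is to count only the KV-cache entries that each sequence really touches, rather than the full allocated cache. The sketch below is an assumption about how such an estimate could look, not the formula used in `unified_attention_benchmark.py`; the function name and all parameters are hypothetical, and it ignores block tables, scales, and block-size rounding.

```python
from typing import Sequence


def estimate_attention_bytes(
    query_lens: Sequence[int],
    kv_lens: Sequence[int],
    num_query_heads: int,
    num_kv_heads: int,
    head_size: int,
    bytes_per_elem: int = 2,  # assumption: 16-bit activations and cache
) -> int:
    """Estimate bytes moved by one attention call (hypothetical model)."""
    q_tokens = sum(query_lens)
    kv_tokens = sum(kv_lens)  # only tokens actually present in the cache
    # Queries are read once and an output of the same shape is written once.
    q_bytes = q_tokens * num_query_heads * head_size * bytes_per_elem
    out_bytes = q_bytes
    # Keys and values are each read once per cached token.
    kv_bytes = 2 * kv_tokens * num_kv_heads * head_size * bytes_per_elem
    return q_bytes + out_bytes + kv_bytes


if __name__ == "__main__":
    # Two decode requests with 100 and 50 cached tokens respectively.
    print(estimate_attention_bytes([1, 1], [100, 50], 8, 2, 64))
```

With short queries and long contexts the KV term dominates, so sizing the estimate by the allocated cache instead of the used tokens would overstate the achieved bandwidth.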
chengjunlu pushed a commit that referenced this pull request on Nov 27, 2025
* src/main: [benchmarks][vllm] Paged attention benchmark (#5348)
Closes #5257
I also started reporting GB/s (gbps) to the database, because many of the benchmarks are memory-bound.
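A common way to judge whether a kernel is memory-bound is to compare its arithmetic intensity (flops per byte moved) with the hardware's ratio of peak compute to peak bandwidth. The sketch below shows that check; the function name is hypothetical and the peak figures in the example are placeholders, not measurements of any device used in this PR.

```python
def is_memory_bound(
    total_flops: float,
    total_bytes: float,
    peak_tflops: float,
    peak_gbps: float,
) -> bool:
    """Roofline-style check (hypothetical helper, not from the PR).

    A kernel is treated as memory-bound when its arithmetic intensity
    is below the machine balance, both measured in flops per byte.
    """
    intensity = total_flops / total_bytes
    machine_balance = (peak_tflops * 1e12) / (peak_gbps * 1e9)
    return intensity < machine_balance


if __name__ == "__main__":
    # Placeholder peaks: 100 TFlops and 1000 GB/s give a balance of 100 flops/byte.
    print(is_memory_bound(1e9, 1e9, peak_tflops=100.0, peak_gbps=1000.0))
```

For kernels that fall below the machine balance, GB/s is the more informative number to track, which matches the motivation given above.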