Pull the Llama4 inference benchmark from lightning-thunder #5578
Conversation
Greptile Summary

Pulls the latest Llama4 inference benchmark script from the lightning-thunder repository into the fuser repo. This is part of migrating the script from lightning-thunder (where it will be deleted) to fuser for ongoing maintenance.

Confidence Score: 3/5
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant main
    participant InferenceBenchmark
    participant Model
    participant Thunder
    participant nvFuser
    User->>main: Run benchmark script
    main->>main: parse_args()
    main->>main: _register_nvfp4_ops()
    Note over main: Register nvFP4 custom ops<br/>with Thunder/nvFuser
    main->>InferenceBenchmark: __init__(config)
    InferenceBenchmark->>Model: _load_model()
    Note over Model: Load on meta device
    InferenceBenchmark->>Model: _replace_llama4_moe()
    Note over Model: Replace HF MoE with custom<br/>Llama4MoE using GroupedSwiGLU
    InferenceBenchmark->>Model: parallelize_module()
    Note over Model: Apply tensor parallelism
    InferenceBenchmark->>Model: to_empty(device)
    Note over Model: Materialize on GPU
    InferenceBenchmark->>Model: _quantize_llama4()
    Note over Model: Replace GroupedSwiGLU with<br/>NVFP4InferenceGroupedSwiGLU
    InferenceBenchmark->>Thunder: _compile_model()
    Thunder->>nvFuser: Apply transforms
    Note over Thunder,nvFuser: thunderfx/thunder.jit compilation
    InferenceBenchmark->>InferenceBenchmark: run_benchmark()
    loop warmup_iterations
        InferenceBenchmark->>Model: generate()
        Model->>Thunder: forward()
        Thunder->>nvFuser: Execute fused kernels
    end
    loop num_iterations
        InferenceBenchmark->>Model: measure_inference_step()
        Model->>Model: prefill()
        Model->>Model: decode_one_token() x N
        InferenceBenchmark->>InferenceBenchmark: Track metrics
    end
    InferenceBenchmark->>User: print_results()
```
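The measurement phase of the diagram can be sketched as a plain-Python skeleton. The class and method names mirror the diagram, but the bodies below are stand-ins for illustration, not the benchmark's actual implementation:

```python
import time

class InferenceBenchmarkSketch:
    """Skeleton of the benchmark loop from the sequence diagram (hypothetical bodies)."""

    def __init__(self, model, warmup_iterations=2, num_iterations=5, decode_tokens=8):
        self.model = model
        self.warmup_iterations = warmup_iterations
        self.num_iterations = num_iterations
        self.decode_tokens = decode_tokens

    def measure_inference_step(self, prompt):
        start = time.perf_counter()
        state = self.model.prefill(prompt)        # process the whole prompt once
        for _ in range(self.decode_tokens):       # then generate token by token
            state = self.model.decode_one_token(state)
        elapsed = time.perf_counter() - start
        # guard against a zero reading on coarse clocks
        return self.decode_tokens / max(elapsed, 1e-9)  # decoded tokens per second

    def run_benchmark(self, prompt):
        for _ in range(self.warmup_iterations):   # warmup: triggers compilation/caching
            self.measure_inference_step(prompt)
        return [self.measure_inference_step(prompt)
                for _ in range(self.num_iterations)]
```

Warmup iterations are excluded from the reported metrics because the first calls absorb one-time costs such as thunder.jit compilation and nvFuser kernel generation.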
Review updated until commit f5d05ae. Auto-merge status: ✅ PR is approved.
PR type: Enhancement
PR Reviewer Guide
Here are some key observations to aid the review process:
- 🧪 No relevant tests
- 🔒 No security concerns identified
- ⚡ Recommended focus areas for review: Missing Tests
2 files reviewed, no comments
wujingyue left a comment:
I didn't review the nvfp4 stuff. Other changes LGTM!
Quoted code:

```python
group_outs = []
for group_a, group_b in zip(a.split(group_sizes), b.unbind()):
    group_outs.append(group_a @ group_b)
```

(the diff also touches the line `for idx, group_a in enumerate(a.split(group_sizes)):`)
I don't think this fallback implementation is necessary any more. Lightning-AI/lightning-thunder#2721
But this can come as a different PR.
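For context, the fallback under discussion implements a grouped matmul by looping over groups: each contiguous row-block of the activations is multiplied by its own weight matrix. A self-contained sketch of that pattern (function name and shapes are illustrative, not taken from the script):

```python
import torch

def grouped_mm_fallback(a, b, group_sizes):
    # a: (sum(group_sizes), K) activations, rows grouped by expert
    # b: (num_groups, K, N), one weight matrix per group/expert
    group_outs = []
    # split() carves `a` into per-group row blocks; unbind() yields
    # the per-group weight matrices along dim 0
    for group_a, group_b in zip(a.split(group_sizes), b.unbind()):
        group_outs.append(group_a @ group_b)  # (rows_in_group, N)
    return torch.cat(group_outs)  # (sum(group_sizes), N)
```

This is the straightforward eager-mode reference; a fused grouped-GEMM kernel computes the same result without materializing the per-group slices.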
We only build against the latest stable and nightly.
Force-pushed 790d7d7 to f5d05ae
2 files reviewed, no comments
!test
Additional Comments (1)
- benchmarks/python/layers_for_inference_benchmark.py, line 550: style: imports from the test module `thunder.tests.llama4_moe`. Verify that this is intended to be part of thunder's public API when installed as a dependency, or whether a different public module should be used.
2 files reviewed, 1 comment
The latest internal pipeline run actually had two failures: jit_binary_distributed_tests_20_GB200 and jit_python_distributed_tests_20_GB200. Those were not detected by the auto-merge workflow because I missed the pagination of the CI status checks and only looked at the latest 30 statuses, which were all successful. I'm working on a fix for that.
This fixes a severe bug where the auto-merge workflow only checked the first 30 commit statuses, causing it to miss failures and incorrectly merge PRs with failing checks.

Root cause analysis:
- PR #5578 had 2 failed GB200 tests at 23:23-23:25 UTC
- By 03:29 UTC, 27+ new successful statuses had pushed the failures past position 30
- The workflow only fetched the first page (30 items), saw 0 failures, and merged

Fixed 4 critical pagination issues:
1. listCommitStatusesForRef (line 140) - CRITICAL: only saw 30 of 57 statuses
2. checks.listForRef (line 173) - could miss failed checks if >30 exist
3. issues.listComments (line 349) - wouldn't find the status comment if >30 comments
4. pulls.list (line 64) - could miss the PR if >30 open PRs on the branch

All API calls now use github.paginate() to retrieve complete results.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
## Summary

Fixes a critical bug where the auto-merge workflow only fetched the first 30 results from GitHub API list operations, causing it to miss failed checks and incorrectly merge PRs.

## Root Cause

PR #5578 had 2 failed GB200 tests that occurred early in the CI run. By the time the auto-merge action ran 4+ hours later, 27 newer successful statuses had been created. Since the workflow used unpaginated API calls (default limit: 30 items), the failed statuses were pushed beyond the first page and never detected.

## Changes

Fixed 4 GitHub API calls to use `github.paginate()`:
1. `listCommitStatusesForRef` - was only checking 30 of 57+ statuses
2. `checks.listForRef` - could miss failed checks if >30 exist
3. `issues.listComments` - could miss the status comment if >30 comments
4. `pulls.list` - could miss the PR if >30 open PRs on the branch

Also simplified the `pr_approved` check logic, which was deriving approval status from `mergeable_state` in a confusing way. The workflow now shows the actual `mergeable_state` value in status comments for transparency.

## Impact

The auto-merge workflow will now correctly detect ALL failures regardless of how many statuses exist, preventing incorrect merges like #5578.

---------

Co-authored-by: Claude <[email protected]>
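The workflow itself is an actions/github-script step, where the fix is `github.paginate(...)`; the failure mode is generic, though. A minimal Python sketch of the difference between reading only the first page and exhausting all pages (the `paginate` helper and `fetch_page` callback are hypothetical, not code from the workflow):

```python
def paginate(fetch_page, per_page=100):
    """Collect results across every page.

    fetch_page(page, per_page) must return the list of items for that
    1-indexed page. Calling fetch_page(1, ...) alone is the buggy
    behavior: it silently caps the result at per_page items
    (GitHub's default page size is 30).
    """
    items, page = [], 1
    while True:
        batch = fetch_page(page, per_page)
        items.extend(batch)
        if len(batch) < per_page:  # a short page means we reached the end
            return items
        page += 1
```

Against the GitHub REST API, `fetch_page` would issue a GET to the commit-statuses endpoint (`/repos/{owner}/{repo}/commits/{ref}/statuses`) with `per_page` and `page` query parameters; statuses are returned newest first, which is exactly why the two old failures ended up beyond position 30.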
Pull the latest version of this script after substantial changes in the lightning repo. The script is being "moved" into the fuser repo - it will be deleted from lightning-thunder, and subsequent changes will be merged into the fuser repo.
The script does not have any nvFuser changes at this moment. No configs or model code are pulled in yet.