
Conversation

@tbqh
Collaborator

@tbqh tbqh commented Nov 21, 2025

Pull the latest version of this script after substantial changes in the lightning repo. The script is being "moved" into the fuser repo - it will be deleted from lightning-thunder, and subsequent changes will be merged into the fuser repo.

The script does not include any nvFuser-specific changes at this point. No configs or model code are pulled in yet.

@tbqh tbqh requested review from crcrpar and wujingyue November 21, 2025 20:40
@greptile-apps
Contributor

greptile-apps bot commented Nov 21, 2025

Greptile Overview

Greptile Summary

Pulls the latest Llama4 inference benchmark script from the lightning-thunder repository into the fuser repo. This is part of migrating the script from lightning-thunder (where it will be deleted) to fuser for ongoing maintenance.

Key Changes:

  • Added comprehensive inference benchmark with support for Thunder/nvFuser compilation modes
  • Implemented custom MoE layers (Llama4MoE, GroupedSwiGLU, GroupedLinear) to use grouped matrix multiplication (a minimal sketch follows this list)
  • Added nvFP4 quantization support for inference optimization
  • Includes tensor parallelism support via PyTorch distributed
  • Contains extensive imports from thunder package that must be available as an external dependency
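
A minimal, illustrative sketch of the grouped-expert SwiGLU idea mentioned in the second bullet above; the names, shapes, and the loop-based eager formulation are assumptions for illustration, not the PR's actual GroupedSwiGLU implementation:

    import torch
    import torch.nn.functional as F

    def grouped_swiglu_reference(x, group_sizes, w_gate, w_up, w_down):
        # x: [total_tokens, hidden], with tokens already sorted so that rows for
        # expert i are contiguous; w_*: [g, out_features, in_features]
        outs = []
        for i, xg in enumerate(x.split(group_sizes)):
            gate = xg @ w_gate[i].t()
            up = xg @ w_up[i].t()
            outs.append((F.silu(gate) * up) @ w_down[i].t())
        return torch.cat(outs, dim=0)

In the benchmark, this per-expert loop is expressed with grouped matrix multiplications instead of a Python loop, as the bullet above describes.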

Dependency Concern:
The script heavily depends on the thunder package (lines 47-60 in benchmark_inference.py and line 550 in layers_for_inference_benchmark.py), importing from modules like thunder.dynamo, thunder.benchmarks, thunder.tests.distributed, and thunder.transforms. Since this is being "moved" into the fuser repo, verify that:

  1. The thunder package will be available as an installed dependency when running these benchmarks
  2. The import paths match the actual structure when thunder is installed (e.g., thunder.benchmarks.layers_for_inference_benchmark and thunder.tests.distributed.test_moe)
  3. Dependencies are documented in requirements or installation instructions
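
As one way to address item 1 above, a hedged sketch of an import guard that fails fast with an actionable message if thunder is missing; the wording and the `pip install lightning-thunder` hint are assumptions, not part of the PR:

    try:
        import thunder  # noqa: F401
        from thunder.benchmarks.layers_for_inference_benchmark import GroupedSwiGLU  # noqa: F401
    except ImportError as err:
        raise SystemExit(
            "This benchmark requires the thunder package (lightning-thunder) to be "
            "installed, e.g. `pip install lightning-thunder`."
        ) from err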

Confidence Score: 3/5

  • Safe to merge with dependency verification - code is well-structured but relies on external thunder package that needs to be available at runtime
  • Score reflects that while the code itself appears well-written and properly pulled from the source repository, there's a critical dependency on the thunder package that must be resolved. The script cannot run without thunder installed, and the import paths need verification. The PR description mentions this is a "move" from lightning-thunder but doesn't address how the dependency will be handled. Additionally, there's an import from thunder.tests.llama4_moe which references test code that may not be part of the public API.
  • Pay close attention to benchmarks/python/benchmark_inference.py - verify all thunder imports resolve correctly when thunder is installed as a dependency

Important Files Changed

File Analysis

benchmarks/python/benchmark_inference.py (Score: 3/5)
Added comprehensive Llama4 inference benchmark with Thunder/nvFuser integration. Contains numerous imports from the thunder package, which must be available as an external dependency. Import paths reference a thunder package structure that may not match the fuser repo layout.

benchmarks/python/layers_for_inference_benchmark.py (Score: 4/5)
Added custom layer implementations for inference benchmarking, including GroupedLinear, GroupedSwiGLU, Llama4MoE, and nvFP4 quantization support. Includes a reference to the thunder.tests.llama4_moe.Config import that needs verification.

Sequence Diagram

sequenceDiagram
    participant User
    participant main
    participant InferenceBenchmark
    participant Model
    participant Thunder
    participant nvFuser

    User->>main: Run benchmark script
    main->>main: parse_args()
    main->>main: _register_nvfp4_ops()
    Note over main: Register nvFP4 custom ops<br/>with Thunder/nvFuser
    main->>InferenceBenchmark: __init__(config)
    InferenceBenchmark->>Model: _load_model()
    Note over Model: Load on meta device
    InferenceBenchmark->>Model: _replace_llama4_moe()
    Note over Model: Replace HF MoE with custom<br/>Llama4MoE using GroupedSwiGLU
    InferenceBenchmark->>Model: parallelize_module()
    Note over Model: Apply tensor parallelism
    InferenceBenchmark->>Model: to_empty(device)
    Note over Model: Materialize on GPU
    InferenceBenchmark->>Model: _quantize_llama4()
    Note over Model: Replace GroupedSwiGLU with<br/>NVFP4InferenceGroupedSwiGLU
    InferenceBenchmark->>Thunder: _compile_model()
    Thunder->>nvFuser: Apply transforms
    Note over Thunder,nvFuser: thunderfx/thunder.jit compilation
    InferenceBenchmark->>InferenceBenchmark: run_benchmark()
    loop warmup_iterations
        InferenceBenchmark->>Model: generate()
        Model->>Thunder: forward()
        Thunder->>nvFuser: Execute fused kernels
    end
    loop num_iterations
        InferenceBenchmark->>Model: measure_inference_step()
        Model->>Model: prefill()
        Model->>Model: decode_one_token() x N
        InferenceBenchmark->>InferenceBenchmark: Track metrics
    end
    InferenceBenchmark->>User: print_results()

greptile-apps[bot]

This comment was marked as off-topic.

@github-actions

github-actions bot commented Nov 21, 2025

Review updated until commit f5d05ae

Auto-merge Status

✅ PR is approved
✅ Internal CI is finished
✅ No failed checks
✅ PR is mergeable

Description

  • Pull latest inference benchmark from lightning-thunder with substantial updates

  • Add NVFP4 quantization support for GroupedSwiGLU layers in MoE architectures

  • Implement tensor parallel support for both custom and HuggingFace MoE implementations (a minimal sketch follows this list)

  • Add Thunder-specific optimizations including CUDA graphs and cache support

  • Update weight layouts and quantization functions for improved performance
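
A minimal sketch of the tensor-parallel idea from the bullet above, using the PyTorch DTensor APIs that the script imports; the module names in the plan ("gate_proj", "up_proj", "down_proj") are illustrative assumptions, not the PR's actual parallelization plan:

    import torch.distributed as dist
    import torch.nn as nn
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel, parallelize_module

    def apply_tp(mlp: nn.Module) -> nn.Module:
        # Requires torch.distributed to be initialized (e.g. launched via torchrun).
        mesh = init_device_mesh("cuda", (dist.get_world_size(),))
        plan = {
            "gate_proj": ColwiseParallel(),
            "up_proj": ColwiseParallel(),
            "down_proj": RowwiseParallel(),
        }
        return parallelize_module(mlp, mesh, plan)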

Changes walkthrough

Relevant files
Enhancement
benchmark_inference.py
Main benchmark script with NVFP4 and tensor parallel updates

benchmarks/python/benchmark_inference.py

  • Added lightning-thunder repo reference in docstring
  • Implemented _register_nvfp4_ops() for nvfp4 custom operation
    registration
  • Updated model loading to use AutoConfig.from_pretrained() instead of
    hardcoded config
  • Added support for StaticCache vs HybridChunkedCache based on
    transformers version
  • Added new config options: attn_implementation, thunder_cache,
    enable_thunder_cudagraph
  • Updated tensor parallel plan for both custom and HF MoE
    implementations
  • Added CUDA graph support and Thunder-specific optimizations
  • Removed --profile option, as per thunder PR #2715 ("Microbenchmarks the Transformer block")
  • Added torch._grouped_mm support in eager mode, as per thunder PR #2721 ("MarkAliasPrepare does not preserve shardings")
  • +292/-250

layers_for_inference_benchmark.py
Supporting layers with GroupedSwiGLU and updated weight layouts

benchmarks/python/layers_for_inference_benchmark.py

  • Added lightning-thunder repo reference in docstring
  • Added GroupedSwiGLU and NVFP4InferenceGroupedSwiGLU classes
  • Removed NVFP4InferenceLinear class (replaced by the GroupedSwiGLU approach)
  • Updated GroupedLinear weight layout from [g, n, k] to [g, out_features, in_features]
  • Updated quantization functions for the new weight layout
  • Added compute_auxiliary_tensors method for performance optimization
  • Updated Llama4MoE to handle new weight layouts with proper transposes
  • Added proper offset handling with a prepended zero for grouped operations (see the sketch after this list)
  • +207/-262
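
A small sketch of the prepended-zero offset handling referenced in the last bullet above; this is an assumed formulation for illustration, and the PR's compute_auxiliary_tensors and grouped kernels are not reproduced here:

    import torch

    def group_sizes_to_offsets(group_sizes: torch.Tensor) -> torch.Tensor:
        # [g] per-expert token counts -> [g + 1] offsets with a leading zero,
        # so rows for group i live in offsets[i]:offsets[i + 1].
        return torch.cat([group_sizes.new_zeros(1), group_sizes.cumsum(0)])

    sizes = torch.tensor([3, 0, 5])
    print(group_sizes_to_offsets(sizes))  # tensor([0, 3, 3, 8])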

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Missing Tests

This PR adds substantial new functionality including NVFP4 quantization, distributed tensor parallel support, CUDAGraph integration, and enhanced benchmarking capabilities. However, no new tests were added to validate that these features work correctly. The complexity of the changes (especially around distributed setup, custom op registration, and quantization) warrants comprehensive test coverage to prevent regressions.

    # SPDX-FileCopyrightText: Copyright (c) 2025-present NVIDIA CORPORATION & AFFILIATES.
    # All rights reserved.
    # SPDX-License-Identifier: BSD-3-Clause
    
    """Inference benchmark focusing on throughput and latency metrics of prefill and decode phases.
    
    AutoModelForCausalLM from Hugging Face transformers is used for model implementation.
    
    Key metrics:
    - Throughput (tokens/second)
    - Latency (ms/token)
    - Time to First Token (TTFT)
    - Time Between Output Tokens (TBOT)
    
    Pulled from the lightning-thunder repo. Reference:
    https://github.com/Lightning-AI/lightning-thunder/blob/4d3a3c3a7481efdc6a23cdeea99c3ffd31af5e78/thunder/benchmarks/benchmark_inference.py
    """
    
    # fmt: off
    
    from __future__ import annotations
    from contextlib import contextmanager
    from dataclasses import dataclass, field
    import argparse
    import json
    import os
    import statistics
    import time
    import warnings
    from typing import Any
    from collections.abc import Callable
    from looseversion import LooseVersion
    
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor.parallel import parallelize_module, RowwiseParallel, ColwiseParallel
    from tqdm import tqdm
    import transformers
    from transformers import AutoConfig, AutoModelForCausalLM
    from transformers.cache_utils import HybridChunkedCache, StaticCache
    from transformers.models.llama4.modeling_llama4 import Llama4TextMoe
    from torch.distributed.tensor.placement_types import Shard
    from torch.distributed.tensor import DTensor
    
    import thunder
    from thunder.dynamo.compiler import thunderfx
    from thunder.benchmarks.layers_for_inference_benchmark import (
        GroupedSwiGLU,

Error Handling Robustness

The new NVFP4 custom op registration in _register_nvfp4_ops() catches exceptions and only warns, which could hide critical failures during benchmarking. Additionally, the distributed setup with torchelastic detection and device mesh initialization lacks comprehensive error handling for scenarios like failed process group creation or device mesh initialization failures.

    def _register_nvfp4_ops():
        """Register nvfp4 custom operations with Thunder."""
        # Register f16a_nvfp4weight_scaled_grouped_mm with nvfuser translator
        _nvfp4_grouped_mm_symbol = _register_custom_op(nvfuser_f16a_nvfp4weight_scaled_grouped_mm)
    
        def nvfp4_grouped_mm_translator(
            activation,
            fp4_weight,
            weight_scaling_factor,
            global_scale,
            offsets,
            blockscale_offsets,
            problem_sizes,
            *,
            fd,
            lc_to_nv_map,
        ):
            from nvfuser_direct import DataType
            from thunder.executors.nvfuserex_impl import getnv
    
            nv_act = getnv(activation, fd, lc_to_nv_map)
            nv_fp4_w = getnv(fp4_weight, fd, lc_to_nv_map)
            nv_sf_w = getnv(weight_scaling_factor, fd, lc_to_nv_map)
            nv_alpha = getnv(global_scale, fd, lc_to_nv_map)
            nv_offsets = getnv(offsets, fd, lc_to_nv_map)
            nv_blocksf_offsets = getnv(blockscale_offsets, fd, lc_to_nv_map)
            nv_problem_sizes = getnv(problem_sizes, fd, lc_to_nv_map)
            # dynamic shape support has some concretization issue
            m_size = activation.shape[0]
            k_size = activation.shape[1]
            k_tile_size = k_size // 16
    
            reshaped_mat1 = fd.ops.reshape(nv_act, [m_size, k_tile_size, 16])
            scale1 = fd.ops.abs(reshaped_mat1)
            scale1 = fd.ops.max(scale1, 2)
            scale1 = fd.ops.div(scale1, FLOAT4_E2M1_MAX)
            scale1 = fd.ops.clamp(scale1, FLOAT8_E4M3_EPS, FLOAT8_E4M3_MAX)
    
            broadcast_scale1 = fd.ops.broadcast(scale1, [False, False, True])
            reshaped_scaled_mat1 = fd.ops.div(reshaped_mat1, broadcast_scale1)
            reshaped_scaled_mat1 = fd.ops.clamp(reshaped_scaled_mat1, -FLOAT8_E4M3_MAX, FLOAT8_E4M3_MAX)
    
            scaled_mat1 = fd.ops.reshape(reshaped_scaled_mat1, [m_size, k_size])
            fp4_mat1 = fd.ops.cast(scaled_mat1, DataType.Float4_e2m1fn)
            fp8_scale1 = fd.ops.cast(scale1, DataType.Float8_e4m3fn)
            layout_fp8_scale1 = fd.ops.preprocess_grouped_matmul_input_sf(fp8_scale1, nv_offsets, nv_blocksf_offsets)
            out = fd.ops.cutlass_nvfp4_grouped_mm(
                fp4_mat1,
                nv_fp4_w,
                layout_fp8_scale1,
                nv_sf_w,
                nv_alpha,
                # NOTE: we might need to call contiguous on problem_sizes
                nv_problem_sizes,
                nv_offsets,
                nv_blocksf_offsets,
                DataType.BFloat16,
            )
            return out
    
        _register_nvfuser_translator(_nvfp4_grouped_mm_symbol, nvfp4_grouped_mm_translator)
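
For reference, an eager PyTorch sketch of the per-16-element block scaling that the translator above expresses with nvFuser ops; the constants mirror the names used above, but their exact values and the use of float8_e4m3fn scales are assumptions in this sketch:

    import torch

    FLOAT4_E2M1_MAX = 6.0
    FLOAT8_E4M3_MAX = 448.0
    FLOAT8_E4M3_EPS = torch.finfo(torch.float8_e4m3fn).tiny  # avoid zero scales

    def blockscale_activation(activation: torch.Tensor):
        # activation: [m, k] with k divisible by 16
        m, k = activation.shape
        blocks = activation.reshape(m, k // 16, 16)
        scale = blocks.abs().amax(dim=2) / FLOAT4_E2M1_MAX
        scale = scale.clamp(FLOAT8_E4M3_EPS, FLOAT8_E4M3_MAX)
        scaled = (blocks / scale.unsqueeze(-1)).clamp(-FLOAT8_E4M3_MAX, FLOAT8_E4M3_MAX)
        # The final cast to FP4 and the scale-layout preprocessing done by the
        # translator have no plain eager equivalent here and are omitted.
        return scaled.reshape(m, k), scale.to(torch.float8_e4m3fn)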

Potential Memory Issues

The new NVFP4InferenceGroupedSwiGLU class computes auxiliary tensors (blockscale_offsets, problem_sizes) multiple times during forward passes. While there's an optimization to compute them once, the memory allocation patterns and tensor creation could lead to memory fragmentation or excessive memory usage during large-scale inference workloads.

    class NVFP4InferenceGroupedSwiGLU(nn.Module):
        """NVFP4 GroupedSwiGLU that efficiently reuses auxiliary tensor computations."""
    
        def __init__(
            self,
            gate_proj: NVFP4InferenceGroupedLinear,
            up_proj: NVFP4InferenceGroupedLinear,
            down_proj: NVFP4InferenceGroupedLinear,
        ):
            super().__init__()
            self.gate_proj = gate_proj
            self.up_proj = up_proj
            self.down_proj = down_proj
    
        def forward(self, hidden_states: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
            # Compute auxiliary tensors once for all three operations
            intermediate_features = self.gate_proj.out_features
            blockscale_offsets_gate, problem_sizes_gate = NVFP4InferenceGroupedLinear.compute_auxiliary_tensors(
                hidden_states, offsets, intermediate_features
            )
    
            gate_out = self.gate_proj(hidden_states, offsets, blockscale_offsets_gate, problem_sizes_gate)
            up_out = self.up_proj(hidden_states, offsets, blockscale_offsets_gate, problem_sizes_gate)
    
            intermediate = torch.nn.functional.silu(gate_out) * up_out
    
            # For down_proj, we need different problem_sizes (different output features)
            hidden_features = self.down_proj.out_features
            blockscale_offsets_down, problem_sizes_down = NVFP4InferenceGroupedLinear.compute_auxiliary_tensors(
                intermediate, offsets, hidden_features
            )
    
            return self.down_proj(intermediate, offsets, blockscale_offsets_down, problem_sizes_down)
    
        @staticmethod
        def from_grouped_swiglu(grouped_swiglu: GroupedSwiGLU, fqn: str | None = None) -> NVFP4InferenceGroupedSwiGLU:
            """Create an NVFP4InferenceGroupedSwiGLU from a GroupedSwiGLU.
    
            Args:
                grouped_swiglu (GroupedSwiGLU): The source GroupedSwiGLU.
                fqn (str or None): Fully qualified name. Currently unused; reserved for future use or compatibility.
            """
            gate_proj = NVFP4InferenceGroupedLinear.from_grouped_linear(grouped_swiglu.gate_proj)
            up_proj = NVFP4InferenceGroupedLinear.from_grouped_linear(grouped_swiglu.up_proj)
            down_proj = NVFP4InferenceGroupedLinear.from_grouped_linear(grouped_swiglu.down_proj)
            return NVFP4InferenceGroupedSwiGLU(gate_proj, up_proj, down_proj)

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments

@wujingyue wujingyue (Collaborator) left a comment

I didn't review the nvfp4 stuff. Other changes LGTM!

    group_outs = []
    for group_a, group_b in zip(a.split(group_sizes), b.unbind()):
        group_outs.append(group_a @ group_b)
    for idx, group_a in enumerate(a.split(group_sizes)):

    I don't think this fallback implementation is necessary any more. Lightning-AI/lightning-thunder#2721

    But this can come as a different PR.

@wujingyue wujingyue Nov 22, 2025

    Never mind -- this is still necessary for torch <2.8. OOC, what's the minimum torch version nvFuser supports? cc @xwang233 and @nWEIdia


    we only build against the latest stable and nightly
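
A hedged sketch of the dispatch discussed in this thread: use torch._grouped_mm when the installed torch exposes it (2.8+, per the comment above), and keep the per-group loop from the snippet as the fallback. The offs convention (int32 cumulative end offsets) and the availability check are assumptions, not taken from the PR diff:

    import torch

    def grouped_mm(a: torch.Tensor, b: torch.Tensor, group_sizes: list[int]) -> torch.Tensor:
        # a: [total_rows, k] with rows grouped contiguously by expert; b: [g, k, n]
        if hasattr(torch, "_grouped_mm"):
            offs = torch.tensor(group_sizes, device=a.device, dtype=torch.int32).cumsum(0).to(torch.int32)
            return torch._grouped_mm(a, b, offs=offs)
        group_outs = [group_a @ group_b for group_a, group_b in zip(a.split(group_sizes), b.unbind())]
        return torch.cat(group_outs, dim=0)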

@tbqh tbqh force-pushed the inference_benchmark_Nov21 branch from 790d7d7 to f5d05ae on November 21, 2025 23:09
@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments

@tbqh tbqh added the enable-auto-merge label (Auto-merge a PR when: 1) PR mergeable 2) Internal CI complete 3) No failures) on Nov 21, 2025
@tbqh
Collaborator Author

tbqh commented Nov 21, 2025

!test

@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

1. benchmarks/python/layers_for_inference_benchmark.py, line 550

   style: imports from test module thunder.tests.llama4_moe - verify this is intended to be part of thunder's public API when installed as a dependency, or if it should use a different public module

2 files reviewed, 1 comment


@github-actions github-actions bot merged commit abbbf4e into main on Nov 22, 2025
60 of 63 checks passed
@github-actions github-actions bot removed the enable-auto-merge label (Auto-merge a PR when: 1) PR mergeable 2) Internal CI complete 3) No failures) on Nov 22, 2025
@github-actions github-actions bot deleted the inference_benchmark_Nov21 branch on November 22, 2025 03:29
@xwang233
Collaborator

xwang233 commented Nov 22, 2025

The latest internal pipeline run actually had two failures, jit_binary_distributed_tests_20_GB200 and jit_python_distributed_tests_20_GB200. Those were not detected by the auto-merge workflow because I missed the pagination of the CI status checks and only checked the latest 30 statuses, which were all successful. I'm working on a fix for that.

I'm not sure whether this PR directly caused the two failures; if so, please help revert it. Sorry about the inconvenience. Update: the failures seem unrelated to this PR.

    xwang233 added a commit that referenced this pull request Nov 22, 2025
    This fixes a severe bug where the auto-merge workflow only checked the
    first 30 commit statuses, causing it to miss failures and incorrectly
    merge PRs with failing checks.
    
    Root cause analysis:
    - PR #5578 had 2 failed GB200 tests at 23:23-23:25 UTC
    - By 03:29 UTC, 27+ new successful statuses pushed failures past position 30
    - Workflow only fetched first page (30 items), saw 0 failures, and merged
    
    Fixed 4 critical pagination issues:
    1. listCommitStatusesForRef (line 140) - CRITICAL: Only saw 30 of 57 statuses
    2. checks.listForRef (line 173) - Could miss failed checks if >30 exist
    3. issues.listComments (line 349) - Wouldn't find status comment if >30 comments
    4. pulls.list (line 64) - Could miss PR if >30 open PRs on branch
    
    All API calls now use github.paginate() to retrieve complete results.
    
    🤖 Generated with [Claude Code](https://claude.com/claude-code)
    
    Co-Authored-By: Claude <[email protected]>
    xwang233 added a commit that referenced this pull request Nov 22, 2025
    ## Summary
    
    Fixes a critical bug where the auto-merge workflow only fetched the
    first 30 results from GitHub API list operations, causing it to miss
    failed checks and incorrectly merge PRs.
    
    ## Root Cause
    
    PR #5578 had 2 failed GB200 tests that occurred early in the CI run. By
    the time the auto-merge action ran 4+ hours later, 27 newer successful
    statuses had been created. Since the workflow used unpaginated API calls
    (default limit: 30 items), the failed statuses were pushed beyond the
    first page and never detected.
    
    ## Changes
    
    Fixed 4 GitHub API calls to use `github.paginate()`:
    1. `listCommitStatusesForRef` - Was only checking 30 of 57+ statuses
    2. `checks.listForRef` - Could miss failed checks if >30 exist  
    3. `issues.listComments` - Could miss status comment if >30 comments
    4. `pulls.list` - Could miss PR if >30 open PRs on branch
    
    Also simplified the `pr_approved` check logic which was deriving
    approval status from `mergeable_state` in a confusing way. The workflow
    now shows the actual `mergeable_state` value in status comments for
    transparency.
    
    ## Impact
    
    The auto-merge workflow will now correctly detect ALL failures
    regardless of how many statuses exist, preventing incorrect merges like
    #5578.
    
    ---------
    
    Co-authored-by: Claude <[email protected]>