Conversation

@Imbernoulli

What I Changed

I updated three parts to make gpt-oss training run faster and use less memory:

  1. Flash Attention 3
  2. Gradient Checkpointing
  3. Liger Kernel

After this change, you can train gpt-oss-120b with a 60k-token context length within ~40 GB of memory per GPU.

Environment

You need to do two things:

  1. Install Liger Kernel: You must install this manually from here:
    https://github.com/Comet0322/Liger-Kernel

  2. Download Flash Attention 3: on a machine with internet access, first run the code below to download and cache the Flash Attention 3 kernel.

    import torch
    from kernels import get_kernel

    # Download the vLLM Flash Attention 3 kernel from the Hugging Face Hub
    # and cache it locally for later training runs.
    vllm_flash_attn3 = get_kernel("kernels-community/vllm-flash-attn3")
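
If the training node itself has no internet access, the cached kernel can be reused offline. This is a minimal sketch, not part of the PR: it assumes the kernels library stores downloads in the standard Hugging Face Hub cache and honors HF_HUB_OFFLINE, which is not spelled out here.

    import os

    # Assumption: get_kernel caches into the regular Hugging Face Hub cache
    # (HF_HOME / ~/.cache/huggingface). Copy that cache to the offline node,
    # then disable network lookups. In practice, export HF_HUB_OFFLINE=1 in
    # the shell before launching training rather than setting it in-process.
    os.environ["HF_HUB_OFFLINE"] = "1"

    from kernels import get_kernel

    # With the cache in place, this should resolve locally instead of downloading.
    vllm_flash_attn3 = get_kernel("kernels-community/vllm-flash-attn3")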

@gemini-code-assist
Contributor

Summary of Changes

Hello @Imbernoulli, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing the training efficiency of gpt-oss models by incorporating several key optimizations. The changes aim to reduce memory footprint and accelerate training speed, making it possible to train larger models with longer context lengths on more constrained hardware resources. The integration of Flash Attention 3, Liger Kernel, and specific gradient checkpointing configurations are central to achieving these performance improvements.

Highlights

  • Flash Attention 3 Integration: Enabled Flash Attention 3 for gpt_oss models by dynamically loading and registering the vllm-flash-attn3 kernel, significantly improving attention mechanism efficiency.
  • Liger Kernel Support: Extended Liger Kernel support to gpt_oss models, allowing for specialized kernel optimizations to enhance training performance and memory usage; a hedged sketch follows this list.
  • Gradient Checkpointing for GPT-OSS: Integrated GptOssMLP as a Z3 leaf module for gpt_oss models, which is a common pattern used in conjunction with gradient checkpointing to reduce memory consumption during training, especially for large models.
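
As a rough illustration of what Liger patching does at the model level (independent of LLaMA-Factory's own enable_liger_kernel plumbing), here is a minimal sketch. It assumes the linked fork exposes gpt_oss support through upstream Liger Kernel's AutoLigerKernelForCausalLM entry point; that assumption is not verified in this PR.

    import torch
    from liger_kernel.transformers import AutoLigerKernelForCausalLM

    # Sketch only: AutoLigerKernelForCausalLM loads a causal LM and patches in
    # Liger's fused kernels (RMSNorm, SwiGLU, fused linear cross-entropy, ...)
    # for architectures it recognizes. Whether the installed fork recognizes
    # gpt_oss is an assumption.
    model = AutoLigerKernelForCausalLM.from_pretrained(
        "openai/gpt-oss-20b",   # smaller checkpoint used here purely for illustration
        torch_dtype=torch.bfloat16,
    )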

@Imbernoulli
Author

[Screenshot: training loss curve from an experiment]
An experiment. The loss curve is reasonable.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces several optimizations for gpt-oss model training, including support for Flash Attention 3, the Liger Kernel, and compatibility with DeepSpeed ZeRO-3. These changes are well-targeted and should improve performance and memory usage as described. I've provided a few suggestions to enhance code maintainability and address minor issues. A key point of concern is the modification of model_args.flash_attn with a value not present in the AttentionFunction enum, which could introduce type-related bugs.

load_and_register_kernel("kernels-community/vllm-flash-attn3")
setattr(config, "_attn_implementation", "kernels-community/vllm-flash-attn3")
setattr(config, "_attn_implementation_internal", "kernels-community/vllm-flash-attn3")
model_args.flash_attn = "kernels-community/vllm-flash-attn3"
Severity: high

The flash_attn attribute of model_args is of type AttentionFunction, which is an enum. You are assigning a string value "kernels-community/vllm-flash-attn3" which is not a member of this enum. While this works here because the function returns early, it can lead to unexpected behavior in other parts of the code that might expect a valid AttentionFunction member. For better type safety and maintainability, please consider adding this new attention type to the AttentionFunction enum in src/llamafactory/extras/constants.py.
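
A minimal sketch of that suggestion is shown below. The existing member names and the exact layout of src/llamafactory/extras/constants.py are assumptions made for illustration, not copied from the repository.

    # Hypothetical sketch: make the new backend a proper AttentionFunction
    # member so model_args.flash_attn always holds a valid enum value.
    from enum import Enum, unique

    @unique
    class AttentionFunction(str, Enum):
        AUTO = "auto"
        DISABLED = "disabled"
        SDPA = "sdpa"
        FA2 = "fa2"
        VLLM_FA3 = "kernels-community/vllm-flash-attn3"  # proposed new member

    # The assignment in the patch could then become:
    # model_args.flash_attn = AttentionFunction.VLLM_FA3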

Comment on lines 123 to +131
if model_type == "qwen3_omni_moe":
from transformers.models.qwen3_omni_moe.modeling_qwen3_omni_moe import Qwen3OmniMoeThinkerTextSparseMoeBlock

_set_z3_leaf_modules(model, [Qwen3OmniMoeThinkerTextSparseMoeBlock])


if model_type == "gpt_oss":
from transformers.models.gpt_oss.modeling_gpt_oss import GptOssMLP

_set_z3_leaf_modules(model, [GptOssMLP])
Severity: medium

There are some unnecessary empty lines and lines with only whitespace. These should be removed for better code formatting and consistency.

Suggested change

if model_type == "qwen3_omni_moe":
    from transformers.models.qwen3_omni_moe.modeling_qwen3_omni_moe import Qwen3OmniMoeThinkerTextSparseMoeBlock
    _set_z3_leaf_modules(model, [Qwen3OmniMoeThinkerTextSparseMoeBlock])

if model_type == "gpt_oss":
    from transformers.models.gpt_oss.modeling_gpt_oss import GptOssMLP
    _set_z3_leaf_modules(model, [GptOssMLP])
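
For context on why registering GptOssMLP matters, here is a hedged sketch of the DeepSpeed call that the _set_z3_leaf_modules helper presumably wraps; the helper's actual implementation is not shown in this PR.

    # Sketch only: marking the gpt_oss MoE MLP block as a ZeRO-3 "leaf" module
    # makes DeepSpeed gather and partition its parameters as one unit instead of
    # hooking each expert's weights separately, which plays better with gradient
    # checkpointing and sparse expert activation.
    from deepspeed.utils import set_z3_leaf_modules
    from transformers.models.gpt_oss.modeling_gpt_oss import GptOssMLP

    def mark_gpt_oss_leaf_modules(model) -> None:
        set_z3_leaf_modules(model, [GptOssMLP])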

Imbernoulli and others added 2 commits on October 27, 2025:
  • Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
  • Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@hiyouga added the "pending" label (This problem is yet to be addressed) on Nov 4, 2025.