
Conversation

@meichangsu1
Contributor

[feat] support activation cpu offload in fsdp and fsdp2 lint fix

feat(examples): update FSDP2 config and training script for activation CPU offload

  • Refactor FSDP2 JSON config to include detailed documentation and recommended settings
  • Add activation_cpu_offload parameter to FSDP config for memory optimization
  • Update training script to use new Swift CLI syntax and adjusted hyperparameters

feat(plugin): update FSDP version key and add gradient requirement for checkpointing

  • Change key from 'fsdp_version' to 'version' in fsdp_config for consistency
  • Add call to model.enable_input_require_grads() when activation checkpointing is enabled to ensure proper gradient computation during CPU offloading

feat(plugin): fix whitespace in activation CPU offload callback

docs: remove activation_cpu_offload parameter documentation

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Write the detailed information belonging to this PR.

Experiment results

Paste your experiment results here (if needed).

@gemini-code-assist
Contributor

Summary of Changes

Hello @meichangsu1, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a critical memory optimization feature: activation CPU offloading for FSDP and FSDP2 training strategies. By intelligently moving activation tensors to CPU memory, the system can train larger models or use larger batch sizes on GPUs with limited memory. The implementation includes a sophisticated asynchronous offloading mechanism and is seamlessly integrated into the existing training framework via a new callback, making it easy to enable and configure.

Highlights

  • Activation CPU Offload: Implemented support for offloading activation tensors to CPU memory for FSDP and FSDP2 models, significantly reducing GPU memory consumption during training.
  • Asynchronous Offloading Mechanism: Introduced an asynchronous double-buffer group offload handler that intelligently manages CPU-GPU data transfers, allowing for overlapping communication and computation to improve efficiency.
  • Integrated Training Callback: A new ActivationCpuOffloadCallBack is added to the training pipeline, which automatically enables and configures activation CPU offloading based on the FSDP settings provided in the training arguments.
  • FSDP Configuration and Examples: Provided a new example FSDP2 configuration file (fsdp2.json) with detailed documentation and activation_cpu_offload enabled, along with a corresponding training script (train.sh).
  • Gradient Requirement for Checkpointing: Ensured proper gradient computation by calling model.enable_input_require_grads() when activation checkpointing is enabled in conjunction with CPU offloading.
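
The Highlights above describe the mechanism at a high level. As a point of reference only (this is not the PR's ActivationCpuOffloadCallBack or its double-buffer handler), the core pack/unpack idea can be sketched with PyTorch's stock saved-tensors hooks, which move activations saved for backward into pinned host memory and copy them back to the GPU on demand:

import torch
import torch.nn as nn

# Toy stand-in for a transformer block; assumes a CUDA device is available.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
x = torch.randn(8, 1024, device='cuda', requires_grad=True)

# save_on_cpu registers pack/unpack hooks around autograd's saved tensors:
# the forward pass copies each saved activation to (pinned) CPU memory, and
# the backward pass copies it back to the GPU right before it is needed.
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    loss = model(x).sum()
loss.backward()

The PR's asynchronous double-buffer handler presumably layers grouped transfers and prefetching on top of this pack/unpack pattern so that host-device copies overlap with compute.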



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for activation CPU offload in FSDP and FSDP2, which is a great feature for memory optimization. The implementation adds a new ActivationCpuOffloadCallBack and the associated logic to hook into the model's forward pass.

My review has identified a few issues:

  • There is a critical inconsistency in the FSDP configuration key used for the FSDP version (version vs. fsdp_version), which will prevent the new feature from working.
  • The example configuration file has a confusing note.
  • The example training script has a typo.
  • The new activation_cpu_offload.py file has some minor issues with logging configuration and type hinting.

I have provided detailed comments and suggestions to address these points. Once these are resolved, the PR should be in good shape.

# Check if fsdp_config is a dictionary and has activation_cpu_offload enabled
if isinstance(fsdp_config, dict) and fsdp_config.get('activation_cpu_offload', False):
    # Get FSDP version from fsdp_config
    strategy = fsdp_config.get('version', None)
Contributor

critical

This line attempts to get the FSDP version using the key 'version', but other parts of the codebase, specifically swift/llm/argument/train_args.py, use the key 'fsdp_version' to configure FSDP. This inconsistency will prevent the activation offload feature from being enabled correctly because the FSDP version will not be found in the config.

While the PR description mentions changing the key to 'version', the change has not been applied to train_args.py. To ensure correctness, this PR should either:

  1. Also include the change in swift/llm/argument/train_args.py to use 'version'.
  2. Use 'fsdp_version' here to maintain consistency with the existing codebase.

Given that train_args.py is not part of this PR, I recommend using fsdp_version here to fix the bug.

Suggested change
strategy = fsdp_config.get('version', None)
strategy = fsdp_config.get('fsdp_version', None)
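
As a hedged aside (not part of the bot's suggestion), the callback could also read both spellings so it keeps working regardless of which key train_args.py ends up using; the key names below come from this thread:

# Accept either key name discussed above: prefer the existing 'fsdp_version'
# convention and fall back to the 'version' key introduced in this PR.
strategy = fsdp_config.get('fsdp_version', fsdp_config.get('version', None))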

{
"_description": "FSDP2 configuration for distributed training (PyTorch native FSDP v2)",
"_requires": "torch>=2.4.0",
"_note": "This is the recommended configuration for multi-GPU training without CPU offloading. NOTE: When using FSDP2, do NOT use --gradient_checkpointing, use activation_checkpointing in fsdp_config instead.",
Contributor

medium

The _note field contains a statement that is contradictory to the purpose of this configuration file. It says, "This is the recommended configuration for multi-GPU training without CPU offloading," but this file is specifically for demonstrating activation CPU offloading. This could be confusing for users. Please update the note to reflect the file's actual purpose.

Suggested change
"_note": "This is the recommended configuration for multi-GPU training without CPU offloading. NOTE: When using FSDP2, do NOT use --gradient_checkpointing, use activation_checkpointing in fsdp_config instead.",
"_note": "This is a configuration for multi-GPU training with activation CPU offloading. NOTE: When using FSDP2, do NOT use --gradient_checkpointing, use activation_checkpointing in fsdp_config instead.",

CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
--model 'Qwen/Qwen3-0.6B' \
--dataset 'swift/self-cognition#1000' \ \
Contributor

medium

There is an extra backslash here (\\). A single backslash is sufficient for line continuation in shell scripts. This could cause unexpected behavior in some shells.

Suggested change
--dataset 'swift/self-cognition#1000' \ \
--dataset 'swift/self-cognition#1000' \

from swift.utils import get_logger

logger = get_logger()
logger.setLevel(logging.WARNING)
Contributor

medium

The logger level is hardcoded to logging.WARNING. This overrides the global log level configuration and may suppress useful INFO level logs from this module, making debugging more difficult. It's recommended to remove this line to allow the logger to inherit its level from the global configuration.

self.model_parameters_storage = new_storage


def get_torch_device() -> any:
Contributor

medium

The type hint any is used for the return value. any is a built-in function in Python. The correct type hint for an arbitrary type is Any from the typing module, which is already imported in this file.

Suggested change
def get_torch_device() -> any:
def get_torch_device() -> Any:
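
For illustration only, here is the corrected hint in context; the body is a hypothetical placeholder, not the PR's actual helper:

from typing import Any

import torch

def get_torch_device() -> Any:
    # Hypothetical body: return whichever accelerator namespace is present.
    return torch.npu if hasattr(torch, 'npu') else torch.cuda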

@Jintao-Huang
Collaborator

May I ask what problem this PR solves?

@@ -0,0 +1,27 @@
#!/bin/bash
Collaborator

Please add a before/after comparison of GPU memory usage.

And use an 8B model.

Contributor Author

done

from swift.utils import get_logger

logger = get_logger()
logger.setLevel(logging.WARNING)
Collaborator

Please remove this; it has a fairly broad impact.

Contributor Author

done

@Jintao-Huang
Collaborator

Is there a link to any reference code?

- Change model from Qwen3-0.6B to Qwen3-8B in training script
- Remove logger level setting to use default logging configuration
- Add training logs demonstrating memory savings with activation offload
- Show OOM error when activation offload is disabled for comparison

The update demonstrates the effectiveness of activation CPU offload for larger models, showing successful training with Qwen3-8B where it previously would have OOM'd without offloading.
@meichangsu1
Contributor Author

Is there a link to any reference code?

InternLM/InternEvo#391
volcengine/verl#1220


# activation_cpu_offload=false
# OOM
# {'loss': 1.13790035, 'grad_norm': 1.41472316, 'learning_rate': 5e-05, 'token_acc': 0.83174487, 'epoch': 0.04, 'global_step/max_steps': '1/27', 'percentage': '3.70%', 'elapsed_time': '46s', 'remaining_time': '20m 1s', 'memory(GiB)': 61.79, 'train_speed(iter/s)': 0.021641}
Collaborator

61 GiB with LoRA? Is this the example that was actually run?

--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--freeze_vit true \
Collaborator

Why are there ViT parameters here?

Contributor Author

Sorry about that. The script displayed here was copied from another demo, and the log was not produced by the training script shown here; I will update it later.
