[feat] support activation cpu offload in fsdp and fsdp2 #7201

base: main

Conversation
Summary of Changes

Hello @meichangsu1, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a memory optimization feature: activation CPU offloading for the FSDP and FSDP2 training strategies. By moving activation tensors to CPU memory, the system can train larger models or use larger batch sizes on GPUs with limited memory. The implementation includes an asynchronous offloading mechanism and is integrated into the existing training framework via a new callback, making it easy to enable and configure.
Code Review
This pull request introduces support for activation CPU offload in FSDP and FSDP2, which is a great feature for memory optimization. The implementation adds a new ActivationCpuOffloadCallBack and the associated logic to hook into the model's forward pass.
My review has identified a few issues:
- There is a critical inconsistency in the FSDP configuration key used for the FSDP version (`version` vs. `fsdp_version`), which will prevent the new feature from working.
- The example configuration file has a confusing note.
- The example training script has a typo.
- The new `activation_cpu_offload.py` file has some minor issues with logging configuration and type hinting.
I have provided detailed comments and suggestions to address these points. Once these are resolved, the PR should be in good shape.
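For background, the core idea can be reproduced with PyTorch's built-in offload wrapper. The sketch below is illustrative only and is not the PR's implementation: the callback name, the `DecoderLayer` name heuristic, and the use of `offload_wrapper` are assumptions; the PR ships its own grouped asynchronous handlers.

```python
from transformers import TrainerCallback
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    offload_wrapper,
)


class ActivationCpuOffloadSketch(TrainerCallback):
    """Illustrative sketch only: at train start, wrap each transformer block so
    the activations it saves for backward live in pinned CPU memory. This is
    not the PR's ActivationCpuOffloadCallBack."""

    def on_train_begin(self, args, state, control, model=None, **kwargs):
        if model is None:
            return
        apply_activation_checkpointing(
            model,
            checkpoint_wrapper_fn=offload_wrapper,
            # Heuristic: wrap only decoder blocks; leave Embedding/lm_head alone.
            check_fn=lambda m: m.__class__.__name__.endswith('DecoderLayer'),
        )
```

In recent PyTorch releases, `offload_wrapper` runs the wrapped forward under `torch.autograd.graph.save_on_cpu(pin_memory=True)`, i.e. a synchronous variant of the offloading this PR implements with grouped, stream-overlapped copies.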
| # Check if fsdp_config is a dictionary and has activation_cpu_offload enabled
| if isinstance(fsdp_config, dict) and fsdp_config.get('activation_cpu_offload', False):
|     # Get FSDP version from fsdp_config
|     strategy = fsdp_config.get('version', None)
This line attempts to get the FSDP version using the key 'version', but other parts of the codebase, specifically swift/llm/argument/train_args.py, use the key 'fsdp_version' to configure FSDP. This inconsistency will prevent the activation offload feature from being enabled correctly because the FSDP version will not be found in the config.
While the PR description mentions changing the key to 'version', the change has not been applied to train_args.py. To ensure correctness, this PR should either:
- Also include the change in swift/llm/argument/train_args.py to use 'version'.
- Use 'fsdp_version' here to maintain consistency with the existing codebase.
Given that train_args.py is not part of this PR, I recommend using fsdp_version here to fix the bug.
| - strategy = fsdp_config.get('version', None)
| + strategy = fsdp_config.get('fsdp_version', None)
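If configs in the wild may use either key, a tolerant lookup is another option (a sketch, not part of this PR):

```python
# Prefer the key used elsewhere in the codebase, but fall back to 'version'.
strategy = fsdp_config.get('fsdp_version', fsdp_config.get('version', None))
```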
| {
|   "_description": "FSDP2 configuration for distributed training (PyTorch native FSDP v2)",
|   "_requires": "torch>=2.4.0",
|   "_note": "This is the recommended configuration for multi-GPU training without CPU offloading. NOTE: When using FSDP2, do NOT use --gradient_checkpointing, use activation_checkpointing in fsdp_config instead.",
The _note field contains a statement that is contradictory to the purpose of this configuration file. It says, "This is the recommended configuration for multi-GPU training without CPU offloading," but this file is specifically for demonstrating activation CPU offloading. This could be confusing for users. Please update the note to reflect the file's actual purpose.
| "_note": "This is the recommended configuration for multi-GPU training without CPU offloading. NOTE: When using FSDP2, do NOT use --gradient_checkpointing, use activation_checkpointing in fsdp_config instead.", | |
| "_note": "This is a configuration for multi-GPU training with activation CPU offloading. NOTE: When using FSDP2, do NOT use --gradient_checkpointing, use activation_checkpointing in fsdp_config instead.", |
| CUDA_VISIBLE_DEVICES=0,1 \
| swift sft \
| --model 'Qwen/Qwen3-0.6B' \
| --dataset 'swift/self-cognition#1000' \ \
| self.model_parameters_storage = new_storage
|
|
| def get_torch_device() -> any:
Quick question: what problem does this PR solve?
| @@ -0,0 +1,27 @@
| #!/bin/bash
Please add a before/after comparison of GPU memory usage, and use an 8B model.
done
| from swift.utils import get_logger
|
| logger = get_logger()
| logger.setLevel(logging.WARNING)
Please remove this; it has a fairly broad impact.
done
Is there a link to any reference code?
| # activation_cpu_offload=false
| # OOM
| # {'loss': 1.13790035, 'grad_norm': 1.41472316, 'learning_rate': 5e-05, 'token_acc': 0.83174487, 'epoch': 0.04, 'global_step/max_steps': '1/27', 'percentage': '3.70%', 'elapsed_time': '46s', 'remaining_time': '20m 1s', 'memory(GiB)': 61.79, 'train_speed(iter/s)': 0.021641}
61 GiB for LoRA? Was this produced by running this example?
No, it was not produced by this example; it was run with an internal dataset whose token lengths are relatively large.
| --lora_rank 8 \
| --lora_alpha 32 \
| --target_modules all-linear \
| --freeze_vit true \
Why are there ViT parameters here?
Sorry about that. The script shown here was copied from another demo, and the log was not produced by this training script; I will update it later.
The example and the corresponding GPU memory usage have been updated.
- Add ActivationCpuOffloadCallBack import and registration in callbacks mapping
- Automatically append activation_cpu_offload callback when FSDP config has activation_cpu_offload enabled
- Enables memory-efficient training by offloading activations to CPU during FSDP forward pass
…offload
- Add `fsdp2.json` configuration file for PyTorch native FSDP v2 with activation CPU offloading
- Include detailed parameter documentation and usage notes for FSDP2
- Provide example training script (`train.sh`) demonstrating multi-GPU training with LoRA
- Disable gradient checkpointing in favor of FSDP's native activation checkpointing
- Enable CPU RAM efficient loading and sharded state dicts for memory optimization
86979cb to 20680af
- Add __init__ method to ActivationCpuOffloadCallBack to properly initialize parent class
- Update import to use local base TrainerCallback instead of transformers version
- Ensure callback follows consistent initialization pattern with other callbacks
- Remove activation_cpu_offload parameter from fsdp2.json
- Set activation_checkpointing to true for improved memory efficiency
- Maintain existing auto_wrap_policy and state_dict_type settings
[feat] support activation cpu offload in fsdp and fsdp2
PR type
PR information
Introduce activation CPU offloading built on autograd saved_tensors hooks and grouped async offload/reload to reduce GPU activation memory.
Add synchronous and async double-buffer offload handlers with stream-based D2H/H2D overlap and group window scheduling.
Wrap FSDP/FSDP2 layer forward to insert group commit boundaries and manage offload context; skip Embedding layers.
Provide activation checkpointing compatibility by replacing transformers’ checkpointing with internal checkpoint wrapper when enabled.
Add training callback to read fsdp_config flags and enable activation offload for FSDP v1/v2 at train start.
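To make the mechanism above concrete, here is a minimal, synchronous sketch of saved-tensors-hook based activation CPU offload. It is an illustration under stated assumptions, not this PR's code: the PR additionally batches activations into groups and overlaps the D2H/H2D copies on dedicated streams with a double buffer, all of which is omitted here; `cpu_offload_hooks` and the usage snippet are hypothetical names.

```python
import torch
from torch.autograd.graph import saved_tensors_hooks


def cpu_offload_hooks():
    """Return a context manager that moves activations saved for backward into
    pinned CPU memory (pack) and copies them back to the GPU when backward
    needs them (unpack). Synchronous sketch only: no grouping, no stream
    overlap, and it assumes forward and backward run on the same CUDA stream."""

    def pack(t: torch.Tensor):
        if not t.is_cuda:
            return t  # leave CPU tensors (e.g. saved RNG state) untouched
        cpu_t = torch.empty(t.size(), dtype=t.dtype, device='cpu', pin_memory=True)
        cpu_t.copy_(t, non_blocking=True)  # D2H copy enqueued on the current stream
        return (t.device, cpu_t)

    def unpack(packed):
        if isinstance(packed, torch.Tensor):
            return packed
        device, cpu_t = packed
        return cpu_t.to(device, non_blocking=True)  # H2D copy during backward

    return saved_tensors_hooks(pack, unpack)


# Hypothetical usage inside a wrapped (non-Embedding) layer forward:
# with cpu_offload_hooks():
#     hidden_states = decoder_layer(hidden_states)
```

PyTorch's built-in `torch.autograd.graph.save_on_cpu(pin_memory=True)` context manager provides essentially this synchronous behavior out of the box; the value of the PR is the grouped, asynchronous scheduling on top of it.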
Experiment results
A simple test.
For model-specific results, please see the examples in the code.