OOM with Gemma3-27B-IT on DPO (8×A100, Deepspeed) #2555
Unanswered · Kshitiz-Khandel asked this question in Q&A
Replies: 1 comment · 7 replies
Comment:

Hey! I noticed you used adapter: lora. Try this change:

-adapter: lora
+adapter: qlora

Let me know what happens after you set this.
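In config terms, that suggestion amounts to a one-line change (a minimal sketch; every other setting from the config quoted in the question below is assumed unchanged). The idea behind it is the standard QLoRA setup: train the LoRA adapters on top of the 4-bit quantized base weights that load_in_4bit already requests.

```yaml
# Sketch of the suggested change only; all other settings from the
# question's config are assumed unchanged.
load_in_4bit: True   # already set in the question's config
adapter: qlora       # previously: lora
```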
Original question:
I'm encountering out-of-memory (OOM) errors when fine-tuning google/gemma-3-27b-it with DPO on 8×A100 (40 GB) GPUs, using DeepSpeed ZeRO-3. Here's my config:
base_model: google/gemma-3-27b-it
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: False
load_in_4bit: True
chat_template: gemma3
rl: dpo
datasets:
  - type: chat_template.default
    field_messages: "prompt"  ## messages
    field_chosen: "chosen"
    field_rejected: "rejected"
    message_property_mappings:
      role: role
      content: content
    roles:
      user: ["user"]
      assistant: ["assistant"]
      system: ["system"]
dataset_prepared_path:
val_set_size: 0.05
output_dir: ./outputs/dpo2/gemma3/experiment2/

sequence_len: 1024
sample_packing: false
pad_to_sequence_len: true

adapter: lora
lora_model_dir:
lora_r: 16
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 2
micro_batch_size: 2
num_epochs: 10
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

bf16: auto
tf32: false

#gradient_checkpointing: true
resume_from_checkpoint:
logging_steps: 1
flash_attention: false
deepspeed: ./deepspeed_configs/zero3.json

warmup_steps: 10
evals_per_epoch: 4
saves_per_epoch: 1
weight_decay: 0.0
shuffle: True

plugins:
cut_cross_entropy: true
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_layer_norm: true

gradient_checkpointing: "offload"
lora_mlp_kernel: true
lora_qkv_kernel: true
lora_o_kernel: true
However, enabling the lora_*_kernel options results in a model-compatibility error: these flags expect the model to be a PeftModelForCausalLM, but Gemma3 is instantiated as an AutoModelForCausalLM.
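For reference, these are the kernel-patch flags from the config above that trigger that error. As an untested assumption (not something confirmed in this thread), leaving them off sidesteps the PeftModelForCausalLM requirement at the cost of the fused LoRA kernels:

```yaml
# The three flags that trigger the PeftModelForCausalLM check.
# Assumption (untested here): setting them to false skips the LoRA kernel
# patching entirely, trading the fused kernels for compatibility.
lora_mlp_kernel: false   # was: true
lora_qkv_kernel: false   # was: true
lora_o_kernel: false     # was: true
```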
Any help on this would be appreciated, @NanoCode012.