Getting AttributeError: 'Gemma3ForConditionalGeneration' object has no attribute 'vocab_size' when using CCE #2874

@sanchit-ahuja

Description

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Training begins normally.

Current behaviour

I have 4 L4 GPUs (24 GB of VRAM each) on the same node. I am trying to use CCE because without it I hit OOM when using DeepSpeed ZeRO-1. My transformers version is 4.52.3, and I have installed the CCE upstream for axolotl. However, I am getting this error:

AttributeError: 'Gemma3ForConditionalGeneration' object has no attribute 'vocab_size'
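
For context, on Gemma3 multimodal checkpoints the vocabulary size seems to live on the nested text config rather than on the wrapper model itself, which is presumably what the CCE patch's attribute lookup trips over. A minimal sketch (my assumption about the config layout in transformers 4.52.x, not the plugin's actual code) of where the value can be read from:

# Sketch only: where vocab_size is exposed for a Gemma3 multimodal checkpoint
# (assumes the transformers 4.52.x composite-config layout).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("sarvamai/sarvam-translate")

# The composite Gemma3Config may not expose vocab_size at the top level...
vocab_size = getattr(config, "vocab_size", None)
if vocab_size is None and hasattr(config, "text_config"):
    # ...but the text sub-config (the language-model part) does.
    vocab_size = config.text_config.vocab_size

print("vocab_size:", vocab_size)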

Steps to reproduce

I am sharing the config file below; running it reproduces this error with a Gemma-based fine-tuned model.

Config yaml

base_model: sarvamai/sarvam-translate

plugins:
- axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin

strict: true
chat_template: gemma3
# A list of one or more datasets to finetune the model with
datasets:
- path: /home/random/sanchit/data/train/train_en_to_kn_samples.jsonl
  type: chat_template
  field_messages: messages
  message_property_mappings:
    role: role
    content: value
- path: /home/random/sanchit/data/train/train_kn_to_en_samples.jsonl
  type: chat_template
  field_messages: messages
  message_property_mappings:
    role: role
    content: value


device: cuda
# Seed for reproducibility
seed: 42

bf16: true
dataset_processes: 45
val_set_size: 0.02
sequence_len: 8192
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true
# Whether to use gradient checkpointing. Available options are: true, false, 'offload',
# 'offload_disk'.
# https://huggingface.co/docs/transformers/v4.18.0/en/performance#gradient-checkpointing
gradient_checkpointing: true

# The maximum length of an input to train with, this should typically be less than 2048
# as most models have a token/context limit of 2048

# Whether to use flash attention patch https://github.com/Dao-AILab/flash-attention
flash_attention: true

# How much of the dataset to set aside as evaluation. 1 = 100%, 0.50 = 50%, etc. 0 for
# no eval.
# val_set_size: 0.02


# Add or change special tokens. If you add tokens here, you don't need to add them to
# the `tokens` list.
eot_tokens:
  - <end_of_turn>

train_on_inputs: false
# Maximum number of iterations to train for. It precedes num_epochs which means that if
# both are set, num_epochs will not be guaranteed. e.g., when 1 epoch is 1000 steps =>
# `num_epochs: 2` and `max_steps: 100` will train for 100 steps
# --- Core training settings ---
num_epochs: 1
gradient_accumulation_steps: 2
micro_batch_size: 1
# --- Warmup settings ---
warmup_ratio: 0.03  # 3% of total steps (small dataset, faster convergence)

# --- Evaluation settings ---
eval_strategy: steps  

# --- Saving settings ---
save_strategy: epoch
saves_per_epoch: 5

# --- Logging ---
logging_steps: 50  # Log every 50 steps


# --- Save format ---
save_safetensors: True

# --- Logging tools ---
use_tensorboard: true
# low_cpu_mem_usage: true
wandb_project: "translation-en-kn-ift"
wandb_entity: "translation-adalat-ai"
wandb_name: "lora_linear_run_l4"

# --- Adapter & LoRA settings ---
adapter: lora
lora_r: 16
lora_alpha: 32
lora_target_linear: true

# --- Learning rate & batch size ---
learning_rate: 0.0002
weight_decay: 0.01

# --- Optimizer & scheduler ---
optimizer: adamw_torch_fused
lr_scheduler: linear

# --- Output & Hub settings ---
output_dir: "./model-out/exp1"
dataset_prepared_path: last_run_prepared
trust_remote_code: True
# deepspeed: deepspeed_configs/zero3_bf16_cpuoffload_all.json
deepspeed: deepspeed_configs/zero1.json
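
For what it's worth, the failure should be reproducible outside the axolotl training loop with transformers alone; the snippet below is a hedged stand-alone sketch that mirrors the attribute access in the traceback (the nested-config fallback at the end is my assumption):

# Stand-alone repro sketch, independent of axolotl/DeepSpeed.
# Assumes transformers 4.52.3 and enough memory to load the checkpoint.
import torch
from transformers import Gemma3ForConditionalGeneration

model = Gemma3ForConditionalGeneration.from_pretrained(
    "sarvamai/sarvam-translate",
    torch_dtype=torch.bfloat16,
)

# Access pattern the CCE patch appears to rely on (raises in my setup):
try:
    print(model.vocab_size)
except AttributeError as e:
    print("Reproduced:", e)

# The value is reachable via the nested text config instead (assumption):
print(model.config.text_config.vocab_size)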

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.11

axolotl branch-commit

main/5a961ec

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
