`RuntimeError: self and mat2 must have the same dtype, but got Float and BFloat16` when training with `torch_compile`

### System Info

- `transformers` version: 4.48.0.dev0
- Platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.35
- Python version: 3.12.5
- Huggingface_hub version: 0.25.1
- Safetensors version: 0.4.5
- Accelerate version: 0.34.2
- Accelerate config:    not found
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: <fill in>
- Using GPU in script?: <fill in>
- GPU type: NVIDIA GeForce RTX 4090

### Who can help?

@ArthurZucker @muellerzr @SunMarc

### Information

- [ ] The official example scripts
- [X] My own modified scripts

### Tasks

- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction

When I try continual pretraining ModernBERT for a MLM objective with the `torch_compile` flag of my `TrainingArguments` set to `True`, I get the below error:
```python
  0%|                                                                                   | 0/1223301 [00:00<?, ?it/s]
/home/dev/.venv/lib/python3.12/site-packages/onnxscript/converter.py:820: FutureWarning: 'onnxscript.values.Op.param_schemas' is deprecated in version 0.1 and will be removed in the future. Please use '.op_signature' instead.
  param_schemas = callee.param_schemas()
/home/dev/.venv/lib/python3.12/site-packages/onnxscript/converter.py:820: FutureWarning: 'onnxscript.values.OnnxFunction.param_schemas' is deprecated in version 0.1 and will be removed in the future. Please use '.op_signature' instead.
  param_schemas = callee.param_schemas()
W1221 15:33:48.046000 14779 torch/_dynamo/variables/tensor.py:776] [4/0] Graph break from `Tensor.item()`, consider setting:
W1221 15:33:48.046000 14779 torch/_dynamo/variables/tensor.py:776] [4/0]     torch._dynamo.config.capture_scalar_outputs = True
W1221 15:33:48.046000 14779 torch/_dynamo/variables/tensor.py:776] [4/0] or:
W1221 15:33:48.046000 14779 torch/_dynamo/variables/tensor.py:776] [4/0]     env TORCHDYNAMO_CAPTURE_SCALAR_OUTPUTS=1
W1221 15:33:48.046000 14779 torch/_dynamo/variables/tensor.py:776] [4/0] to include these operations in the captured graph.
W1221 15:33:48.046000 14779 torch/_dynamo/variables/tensor.py:776] [4/0]
W1221 15:33:48.046000 14779 torch/_dynamo/variables/tensor.py:776] [4/0] Graph break: from user code at:
W1221 15:33:48.046000 14779 torch/_dynamo/variables/tensor.py:776] [4/0]   File "/home/dev/.venv/lib/python3.12/site-packages/transformers/models/modernbert/modeling_modernbert.py", line 711, in torch_dynamo_resume_in__unpad_modernbert_input_at_710
W1221 15:33:48.046000 14779 torch/_dynamo/variables/tensor.py:776] [4/0]     max_seqlen_in_batch = int(seqlens_in_batch.max().item())
W1221 15:33:48.046000 14779 torch/_dynamo/variables/tensor.py:776] [4/0]
W1221 15:33:48.046000 14779 torch/_dynamo/variables/tensor.py:776] [4/0]
Traceback (most recent call last):
  File "/home/dev/encoder/scripts/train/train_modernbert.py", line 206, in <module>
    trainer.train(
  File "/home/dev/.venv/lib/python3.12/site-packages/transformers/trainer.py", line 2163, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/transformers/trainer.py", line 2523, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/transformers/trainer.py", line 3668, in training_step
    loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/transformers/trainer.py", line 3722, in compute_loss
    outputs = model(**inputs)
              ^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 465, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/accelerate/utils/operations.py", line 820, in forward
    return model_forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/accelerate/utils/operations.py", line 808, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/transformers/models/modernbert/modeling_modernbert.py", line 1023, in forward
    @add_start_docstrings_to_model_forward(MODERNBERT_INPUTS_DOCSTRING)
  File "/home/dev/.venv/lib/python3.12/site-packages/transformers/models/modernbert/modeling_modernbert.py", line 1055, in torch_dynamo_resume_in_forward_at_1055
    input_ids, indices, cu_seqlens, max_seqlen, position_ids, labels = _unpad_modernbert_input(
  File "/home/dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/transformers/models/modernbert/modeling_modernbert.py", line 913, in forward
    layer_outputs = encoder_layer(
                    ^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/transformers/models/modernbert/modeling_modernbert.py", line 519, in forward
    def forward(
  File "/home/dev/.venv/lib/python3.12/site-packages/transformers/models/modernbert/modeling_modernbert.py", line 529, in torch_dynamo_resume_in_forward_at_529
    attn_outputs = self.attn(
  File "/home/dev/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1100, in forward
    return compiled_fn(full_args)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 308, in runtime_wrapper
    all_outs = call_func_at_runtime_with_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 124, in call_func_at_runtime_with_args
    out = normalize_as_list(f(args))
                            ^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 98, in g
    return f(*args)
           ^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 575, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1525, in forward
    fw_outs = call_func_at_runtime_with_args(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 124, in call_func_at_runtime_with_args
    out = normalize_as_list(f(args))
                            ^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 488, in wrapper
    return compiled_fn(runtime_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 667, in inner_fn
    outs = compiled_fn(args)
           ^^^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/torch/_inductor/codecache.py", line 1478, in __call__
    return self.current_callable(inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 1977, in run
    return model(new_inputs)
           ^^^^^^^^^^^^^^^^^
  File "/tmp/torchinductor_/y7/cy7xv2rrzhznbq3e2wurnq5pmygfytvnpovxlh5bugtoa3ebwy6f.py", line 277, in call
    extern_kernels.addmm(buf9, buf7, reinterpret_tensor(buf8, (1152, 768), (1, 1152), 0), alpha=1, beta=1, out=buf10)
RuntimeError: self and mat2 must have the same dtype, but got Float and BFloat16
```

This does not occur when finetuning for a classification task.

I am using `bfloat16` mixed precision.

### Expected behavior

The training works.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

`RuntimeError: self and mat2 must have the same dtype, but got Float and BFloat16` when training with `torch_compile` #35382

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

RuntimeError: self and mat2 must have the same dtype, but got Float and BFloat16 when training with torch_compile #35382

Description

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

`RuntimeError: self and mat2 must have the same dtype, but got Float and BFloat16` when training with `torch_compile` #35382