[Qwen3VL] fix device mismatch error for FSDP2 training #41536

HollowMan6 · 2025-10-12T18:46:38Z

What does this PR do?

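Under FSDP2, parameters might be on a meta device, so the weight.device attribute may not accurately reflect where the computation actually happens during the forward pass. Tensors created on weight.device can therefore end up on a different device from the embedding weight when the layer is called, as in this traceback: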

  File "transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 776, in forward
    pos_embeds = self.fast_pos_embed_interpolate(grid_thw)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 745, in fast_pos_embed_interpolate
    pos_embeds = self.pos_embed(idx_tensor) * weight_tensor[:, :, None]
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/nn/modules/module.py", line 1879, in _call_impl
    return inner()
           ^^^^^^^
  File "torch/nn/modules/module.py", line 1827, in inner
    result = forward_call(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/nn/modules/sparse.py", line 192, in forward
    return F.embedding(
           ^^^^^^^^^^^^
  File "torch/nn/functional.py", line 2546, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got index is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA__index_select)

volcengine/verl#3686 (comment)
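
As a minimal illustration (not code from this PR, and not a reproduction of FSDP2 itself), the sketch below shows only the general point: a parameter's reported device need not match the device of the tensors passed into forward. The `pos_embed` layer and `grid_thw` tensor here are hypothetical stand-ins with made-up sizes.

```python
import torch
from torch import nn

# Hypothetical stand-in for the model's positional embedding, created on the
# meta device so it has shape and dtype but no real storage.
pos_embed = nn.Embedding(16, 4, device="meta")

# Hypothetical stand-in for the user-supplied grid; it lives wherever the
# caller created it (the default device, normally CPU).
grid_thw = torch.tensor([[1, 2, 2]])

# The two devices disagree, so deriving the device for new index tensors from
# pos_embed.weight.device would not match the inputs in this situation.
param_device = pos_embed.weight.device
input_device = grid_thw.device
print(f"parameter device: {param_device}, input device: {input_device}")
```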

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@yonigozlan @molbap @ArthurZucker @Cyrilvallez @zucchini-nlp

github-actions · 2025-10-12T18:47:41Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: qwen3_omni_moe, qwen3_vl, qwen3_vl_moe


zucchini-nlp · 2025-10-13T08:49:56Z

src/transformers/models/qwen3_omni_moe/modeling_qwen3_omni_moe.py

-        )
+        idx_tensor = torch.tensor(idx_list, dtype=torch.long, device=device)
+        weight_tensor = torch.tensor(weight_list, dtype=self.pos_embed.weight.dtype, device=device)
        pos_embeds = self.pos_embed(idx_tensor) * weight_tensor[:, :, None]


I think it is possible for the embedded positions and the weight tensors to end up on different devices if the input grid is not on the same device as the positional embedding weight.

Yeah. The device of grid_thw depends on the user-side implementation, since it is passed as a parameter to the forward method. I think it's better to take the device of grid_thw and use it for both idx_tensor and weight_tensor: that keeps the two tensors on one device and gives user-side code more control over where this computation runs. User-side code can make sure the input grid is on the same device as the positional embedding weight; grid_thw is small, so moving it shouldn't add much overhead. This was tested on verl and worked fine.
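
The approach described in this reply can be sketched roughly as follows. This is a simplified, hedged illustration rather than the actual model code: `build_pos_embeds` is a hypothetical helper, and the embedding size, index lists, and weights are made-up values. Only the two `torch.tensor(...)` lines and the final multiplication mirror the diff above.

```python
import torch
from torch import nn


def build_pos_embeds(pos_embed: nn.Embedding, grid_thw: torch.Tensor,
                     idx_list: list, weight_list: list) -> torch.Tensor:
    """Hypothetical helper: create both tensors on the input grid's device."""
    # Take the device from the user-supplied grid instead of pos_embed.weight.device.
    device = grid_thw.device
    idx_tensor = torch.tensor(idx_list, dtype=torch.long, device=device)
    weight_tensor = torch.tensor(weight_list, dtype=pos_embed.weight.dtype, device=device)
    # Look up the embeddings and scale each one by its interpolation weight.
    return pos_embed(idx_tensor) * weight_tensor[:, :, None]


# Made-up example data: 4 corner-index rows for 4 positions, equal weights.
pos_embed = nn.Embedding(16, 8)
grid_thw = torch.tensor([[1, 2, 2]])
idx_list = [[0, 1, 4, 5], [1, 2, 5, 6], [4, 5, 8, 9], [5, 6, 9, 10]]
weight_list = [[0.25] * 4 for _ in range(4)]

pos_embeds = build_pos_embeds(pos_embed, grid_thw, idx_list, weight_list)
```

With this arrangement the caller decides where the work happens: moving `grid_thw` with `grid_thw.to(some_device)` before the call is enough to place `idx_tensor` and `weight_tensor` on that device, provided the embedding weight is there as well.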

HollowMan6 force-pushed the qwen3vl branch from 950a05b to 6776a40 Compare October 12, 2025 18:51

This was referenced Oct 12, 2025

When enabling SP, the qwen3_vl_moe model training throws an error volcengine/verl#3721

Closed

[model] fix: qwen3vl patch volcengine/verl#3686

Merged

zucchini-nlp reviewed Oct 13, 2025
