HollowMan6
Contributor

What does this PR do?

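Under FSDP2, parameters may be on a meta device, so the weight.device attribute may not reflect where the computation actually runs during the forward pass. Tensors created on weight.device can then end up on a different device from the embedding weights, which fails with: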

  File "transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 776, in forward
    pos_embeds = self.fast_pos_embed_interpolate(grid_thw)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 745, in fast_pos_embed_interpolate
    pos_embeds = self.pos_embed(idx_tensor) * weight_tensor[:, :, None]
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/nn/modules/module.py", line 1879, in _call_impl
    return inner()
           ^^^^^^^
  File "torch/nn/modules/module.py", line 1827, in inner
    result = forward_call(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/nn/modules/sparse.py", line 192, in forward
    return F.embedding(
           ^^^^^^^^^^^^
  File "torch/nn/functional.py", line 2546, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got index is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA__index_select)

volcengine/verl#3686 (comment)
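
To illustrate the mismatch described above, here is a minimal sketch that runs on CPU only. It does not use FSDP2 itself: a plain `nn.Embedding` created on the meta device stands in for an FSDP2-managed module (an assumption for illustration), and the `grid_thw` values are made up.

```python
import torch
import torch.nn as nn

# Stand-in for a module whose parameters have not been materialized yet
# (assumption: this mimics what the model can look like under FSDP2).
pos_embed = nn.Embedding(16, 4, device="meta")

# User-supplied grid metadata lives on a real device (CPU in this sketch).
grid_thw = torch.tensor([[1, 4, 4]])

weight_device = pos_embed.weight.device
input_device = grid_thw.device

# The parameter reports "meta", which says nothing about where the
# forward pass will actually run.
print(weight_device.type, input_device.type)  # meta cpu
```

Any tensor built with `device=pos_embed.weight.device` in this state would not be placed on the device the inputs actually use.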

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@yonigozlan @molbap @ArthurZucker @Cyrilvallez @zucchini-nlp

Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: qwen3_omni_moe, qwen3_vl, qwen3_vl_moe

Signed-off-by: Hollow Man <[email protected]>
)
idx_tensor = torch.tensor(idx_list, dtype=torch.long, device=device)
weight_tensor = torch.tensor(weight_list, dtype=self.pos_embed.weight.dtype, device=device)
pos_embeds = self.pos_embed(idx_tensor) * weight_tensor[:, :, None]
Member

I think the embedded positions and the weight tensor could end up on different devices if the input grid is not on the same device as the positional embedding weight.

Contributor Author

Yeah. The device of grid_thw depends on the user-side implementation, since it is passed as a parameter to the forward method. I think it's better to take the device from grid_thw and use it for both idx_tensor and weight_tensor, so the two always share a device and user-side code has more control over placement. User-side code can ensure the input grid is on the same device as the positional embedding weight; grid_thw is small, so moving it shouldn't add much overhead. This was tested on verl and worked fine.
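
The approach described in this reply can be sketched roughly as follows. This is a simplified, hypothetical helper, not the actual `fast_pos_embed_interpolate` from transformers: the function name, its signature, and the toy `idx_list` / `weight_list` values are assumptions for illustration, and the real method computes those lists from the grid itself.

```python
import torch
import torch.nn as nn


def weighted_pos_embed(pos_embed: nn.Embedding, grid_thw: torch.Tensor,
                       idx_list, weight_list) -> torch.Tensor:
    """Hypothetical helper: build both tensors on the input grid's device."""
    # Take the device from the user-supplied grid rather than from
    # pos_embed.weight.device.
    device = grid_thw.device
    idx_tensor = torch.tensor(idx_list, dtype=torch.long, device=device)
    weight_tensor = torch.tensor(
        weight_list, dtype=pos_embed.weight.dtype, device=device
    )
    # Same expression as in the diff context above.
    return pos_embed(idx_tensor) * weight_tensor[:, :, None]


# Usage on CPU; the caller keeps grid_thw on the same device as the weights.
pos_embed = nn.Embedding(16, 4)
grid_thw = torch.tensor([[1, 2, 2]])
idx_list = [[0, 1], [2, 3]]               # toy indices
weight_list = [[0.5, 0.5], [0.5, 0.5]]    # toy interpolation weights
out = weighted_pos_embed(pos_embed, grid_thw, idx_list, weight_list)
print(out.shape, out.device)
```

As the reply notes, this leaves it to the caller to keep `grid_thw` on the same device as the embedding weights; the sketch only guarantees that `idx_tensor` and `weight_tensor` agree with each other.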
