
[Feature Request]:[Qwen3-VL] feature-extraction export fails: unordered_map::at crash due to torch.vmap in masking logic #33912

@OntosAI

Description

I am attempting to export the Qwen3-VL-Embedding-2B model to OpenVINO; the failure blocks deployment of the new multimodal embedding capabilities.
I have encountered two blockers:
1. **Optimum-CLI limitation:** the `feature-extraction` task is not yet supported for the `qwen3_vl` architecture.
2. **OpenVINO conversion crash:** manual conversion via `ov.convert_model` fails with `RuntimeError: unordered_map::at`. The traceback points to tracing of `torch.vmap` operations inside `transformers.masking_utils` (specifically `_vmap_for_bhqkv`), even when `attn_implementation="eager"` is explicitly set.
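For context, the crashing pattern can be reproduced in isolation. The sketch below is a hypothetical re-implementation of the nested-vmap mask construction named in the traceback (`vmap_for_bhqkv` here is my own stand-in, not the library code): a scalar predicate over (batch, head, query, key) indices is vectorized one dimension at a time with `torch.vmap` to materialize a 4-D boolean mask. It is this functorch machinery that the TorchScript tracer chokes on.

```python
import torch

# Scalar causal predicate over (batch, head, query, key) index positions.
def causal(batch_idx, head_idx, q_idx, kv_idx):
    return kv_idx <= q_idx

# Hypothetical re-implementation mirroring transformers' _vmap_for_bhqkv:
# vectorize the scalar predicate over kv, then q, then head, then batch.
def vmap_for_bhqkv(mask_fn):
    for in_dims in [(None, None, None, 0), (None, None, 0, None),
                    (None, 0, None, None), (0, None, None, None)]:
        mask_fn = torch.vmap(mask_fn, in_dims=in_dims, out_dims=0)
    return mask_fn

# Builds a bool tensor of shape (1, 1, 4, 4), lower-triangular in the
# last two dims -- the 4-D attention mask shape (batch, head, q, kv).
mask = vmap_for_bhqkv(causal)(
    torch.arange(1), torch.arange(1), torch.arange(4), torch.arange(4)
)
```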
To Reproduce
Environment:
openvino==2025.3
torch==2.5.1+cpu
transformers (Qwen3-VL branch/latest)
optimum-intel (latest source)
Minimal Reproduction Script:
```python
import torch
import openvino as ov
from transformers import AutoModel, AutoProcessor
from PIL import Image

model_id = "Qwen/Qwen3-VL-Embedding-2B"

# Load with eager attention to attempt disabling FlashAttn/SDPA optimization
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    attn_implementation="eager",
    device_map="cpu",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Prepare multimodal dummy input
dummy_image = Image.new("RGB", (28, 28), color="black")
dummy_text = "<|image_pad|>Describe this image."
inputs = processor(text=[dummy_text], images=[dummy_image], return_tensors="pt")

# Wrapper to align with OpenVINO input expectations
class Wrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask, pixel_values, image_grid_thw):
        return self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            pixel_values=pixel_values,
            image_grid_thw=image_grid_thw,
            output_hidden_states=True,
        ).last_hidden_state

# Crash happens here
ov_model = ov.convert_model(
    Wrapper(model),
    example_input=(
        inputs.input_ids,
        inputs.attention_mask,
        inputs.pixel_values,
        inputs.image_grid_thw,
    ),
)
```
Relevant Traceback
The error occurs deep within the PyTorch frontend when handling the vectorized masking logic:
```text
Traceback (most recent call last):
  ...
  File ".../transformers/masking_utils.py", line 392, in sdpa_mask_recent_torch
    causal_mask = _vmap_for_bhqkv(mask_function)(batch_arange, head_arange, cache_position, kv_arange)
  ...
  File ".../torch/_functorch/vmap.py", line 484, in _flat_vmap
    batched_outputs = func(*batched_inputs, **kwargs)
  ...
  File ".../openvino/frontend/pytorch/ts_decoder.py", line 84, in __init__
    raise RuntimeError(
RuntimeError: Couldn't get TorchScript module by tracing.
Exception:
unordered_map::at
```
Request
1. Add support for `task="feature-extraction"` for `qwen3_vl` in Optimum Intel.
2. Fix the OpenVINO PyTorch frontend to handle (or correctly bypass) the `torch.vmap` / functorch constructs used in the new Transformers masking implementation, or provide a workaround to strictly disable these paths during tracing.
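As a possible direction for such a workaround: the same causal mask can be computed with plain broadcasting instead of `torch.vmap`, and ordinary indexing ops like these trace cleanly. This is only a sketch of the mathematically equivalent computation, not a drop-in patch for `transformers.masking_utils`.

```python
import torch

# Broadcast-based equivalent of the vmap-built causal mask. No functorch
# constructs are involved, so the TorchScript tracer can handle it.
def causal_mask_broadcast(batch: int, heads: int, q_len: int, kv_len: int) -> torch.Tensor:
    q_idx = torch.arange(q_len).view(-1, 1)    # shape (q_len, 1)
    kv_idx = torch.arange(kv_len).view(1, -1)  # shape (1, kv_len)
    mask = kv_idx <= q_idx                     # (q_len, kv_len) via broadcasting
    return mask.expand(batch, heads, q_len, kv_len)

mask = causal_mask_broadcast(1, 1, 4, 4)
```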
