
[Feature Request]:[Qwen3-VL] feature-extraction export fails: unordered_map::at crash due to torch.vmap in masking logic #33912

@OntosAI

Description

I am attempting to export the Qwen3-VL-Embedding-2B model to OpenVINO; the failure blocks deployment of the new multimodal embedding capabilities.
I have encountered two blockers:
1. **Optimum-CLI limitation:** the `feature-extraction` task is not yet supported for the `qwen3_vl` architecture.
2. **OpenVINO conversion crash:** manual conversion via `ov.convert_model` fails with `RuntimeError: unordered_map::at`. The traceback points to tracing of `torch.vmap` operations inside `transformers.masking_utils` (specifically `_vmap_for_bhqkv`), even when `attn_implementation="eager"` is explicitly set.
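For context, the crashing pattern can be reproduced in isolation. The sketch below is a hypothetical re-implementation of the nested-vmap mask construction named in the traceback (`vmap_for_bhqkv` here is my own stand-in, not the library code): a scalar predicate over (batch, head, query, key) indices is vectorized one dimension at a time with `torch.vmap` to materialize a 4-D boolean mask. It is this functorch machinery that the TorchScript tracer chokes on.

```python
import torch

# Scalar causal predicate over (batch, head, query, key) index positions.
def causal(batch_idx, head_idx, q_idx, kv_idx):
    return kv_idx <= q_idx

# Hypothetical re-implementation mirroring transformers' _vmap_for_bhqkv:
# vectorize the scalar predicate over kv, then q, then head, then batch.
def vmap_for_bhqkv(mask_fn):
    for in_dims in [(None, None, None, 0), (None, None, 0, None),
                    (None, 0, None, None), (0, None, None, None)]:
        mask_fn = torch.vmap(mask_fn, in_dims=in_dims, out_dims=0)
    return mask_fn

# Builds a bool tensor of shape (1, 1, 4, 4), lower-triangular in the
# last two dims -- the 4-D attention mask shape (batch, head, q, kv).
mask = vmap_for_bhqkv(causal)(
    torch.arange(1), torch.arange(1), torch.arange(4), torch.arange(4)
)
```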
To Reproduce
Environment:
openvino==2025.3
torch==2.5.1+cpu
transformers (Qwen3-VL branch/latest)
optimum-intel (latest source)
Minimal Reproduction Script:
```python
import torch
import openvino as ov
from transformers import AutoModel, AutoProcessor
from PIL import Image

model_id = "Qwen/Qwen3-VL-Embedding-2B"

# Load with eager attention to attempt disabling FlashAttn/SDPA optimization
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    attn_implementation="eager",
    device_map="cpu",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Prepare multimodal dummy input
dummy_image = Image.new("RGB", (28, 28), color="black")
dummy_text = "<|image_pad|>Describe this image."
inputs = processor(text=[dummy_text], images=[dummy_image], return_tensors="pt")

# Wrapper to align with OpenVINO input expectations
class Wrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask, pixel_values, image_grid_thw):
        return self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            pixel_values=pixel_values,
            image_grid_thw=image_grid_thw,
            output_hidden_states=True,
        ).last_hidden_state

# Crash happens here
ov_model = ov.convert_model(
    Wrapper(model),
    example_input=(
        inputs.input_ids,
        inputs.attention_mask,
        inputs.pixel_values,
        inputs.image_grid_thw,
    ),
)
```
Relevant Traceback
The error occurs deep within the PyTorch frontend when handling the vectorized masking logic:
```text
Traceback (most recent call last):
  ...
  File ".../transformers/masking_utils.py", line 392, in sdpa_mask_recent_torch
    causal_mask = _vmap_for_bhqkv(mask_function)(batch_arange, head_arange, cache_position, kv_arange)
  ...
  File ".../torch/_functorch/vmap.py", line 484, in _flat_vmap
    batched_outputs = func(*batched_inputs, **kwargs)
  ...
  File ".../openvino/frontend/pytorch/ts_decoder.py", line 84, in __init__
    raise RuntimeError(
RuntimeError: Couldn't get TorchScript module by tracing.
Exception:
unordered_map::at
```
Request
1. Add support for `task="feature-extraction"` for `qwen3_vl` in Optimum Intel.
2. Fix the OpenVINO PyTorch frontend to handle (or correctly bypass) the `torch.vmap` / functorch constructs used in the new Transformers masking implementation, or provide a workaround to strictly disable these paths during tracing.
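As a possible direction for such a workaround: the same causal mask can be computed with plain broadcasting instead of `torch.vmap`, and ordinary indexing ops like these trace cleanly. This is only a sketch of the mathematically equivalent computation, not a drop-in patch for `transformers.masking_utils`.

```python
import torch

# Broadcast-based equivalent of the vmap-built causal mask. No functorch
# constructs are involved, so the TorchScript tracer can handle it.
def causal_mask_broadcast(batch: int, heads: int, q_len: int, kv_len: int) -> torch.Tensor:
    q_idx = torch.arange(q_len).view(-1, 1)    # shape (q_len, 1)
    kv_idx = torch.arange(kv_len).view(1, -1)  # shape (1, kv_len)
    mask = kv_idx <= q_idx                     # (q_len, kv_len) via broadcasting
    return mask.expand(batch, heads, q_len, kv_len)

mask = causal_mask_broadcast(1, 1, 4, 4)
```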
