
[PyTorch Conversion] SmolVLM model fails due to unsupported 'unfold' op in Core ML #2599

@jefferyby

Description

🧠 Summary
I'm attempting to convert a Hugging Face multi-modal model (SmolVLM-256M-Instruct) from PyTorch to Core ML using coremltools.convert(). The conversion fails on the unfold operation, which is currently not implemented in the PyTorch-to-MIL conversion path.

💻 Environment
macOS: 14.0 (Sonoma) — internal version 26.x

Device: Apple Silicon (M1/M2)

Python: 3.10

coremltools: 8.0.0

torch: 2.1.0

transformers: 4.34.0

Model: SmolVLM-256M-Instruct (downloaded locally)

📦 Conversion Code

import torch
import coremltools as ct
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

class SmolVLMWrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, pixel_values, input_ids):
        return self.model(pixel_values=pixel_values, input_ids=input_ids).logits

model = AutoModelForVision2Seq.from_pretrained("path/to/local/model", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("path/to/local/model", trust_remote_code=True)
wrapped_model = SmolVLMWrapper(model).eval()

dummy_image = Image.new('RGB', (224, 224))
dummy_text = "<image>\ndescribe this image"
inputs = processor(text=dummy_text, images=dummy_image, return_tensors="pt")
example_input = (inputs['pixel_values'], inputs['input_ids'])

traced_model = torch.jit.trace(wrapped_model, example_input)

coreml_model = ct.convert(
    model=traced_model,
    source="pytorch",
    inputs=[
        ct.TensorType(name="pixel_values", shape=example_input[0].shape),
        ct.TensorType(name="input_ids", shape=example_input[1].shape)
    ],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.ALL,
    minimum_deployment_target=ct.target.iOS16,
    debug=True
)

❌ Error Message

ERROR - converting 'unfold' op (located at: 'model/model/patches_subgrid.1'):
PyTorch convert function for op 'unfold' not implemented.

Also observed:

Core ML embedding (gather) layer does not support any inputs besides the weights and indices. Those given will be ignored.
📌 Notes
The model uses unfold internally for patch extraction in the vision encoder (a minimal standalone repro of this pattern is sketched after these notes).

The conversion fails early in the MIL graph construction phase.

I’ve confirmed the traced model produces logits with a static shape and does not contain dynamic control flow.

I’m not using TensorFlow/Keras in this environment.
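
For isolation, here is a minimal standalone module that uses the same Tensor.unfold patch-extraction pattern; I'd expect it to hit the same error on my setup (the 16-pixel patch size and input shape are placeholders, not SmolVLM's actual values):

import torch
import coremltools as ct

class PatchUnfold(torch.nn.Module):
    def forward(self, x):
        # (B, C, H, W) -> (B, C, H//16, W//16, 16, 16) via Tensor.unfold,
        # i.e. the same aten::unfold op the converter rejects
        return x.unfold(2, 16, 16).unfold(3, 16, 16).contiguous()

example = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(PatchUnfold().eval(), example)

# Expected to fail with:
# PyTorch convert function for op 'unfold' not implemented.
ct.convert(
    traced,
    inputs=[ct.TensorType(name="x", shape=example.shape)],
    convert_to="mlprogram",
    minimum_deployment_target=ct.target.iOS16,
)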

🙏 Feature Request
Please consider adding support for the unfold operation in Core ML’s PyTorch conversion path. This op is commonly used in vision models for patch embedding and is increasingly relevant for lightweight multi-modal architectures.

Alternatively, if there’s a recommended workaround or rewrite pattern for unfold, I’d be happy to adapt the model.
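
In case it helps, one rewrite I'm experimenting with: when the step equals the window size (non-overlapping patches, as in typical patch extraction), the double unfold can be expressed with reshape + permute, both of which convert cleanly. This is a sketch against the generic pattern above, not SmolVLM's verified source; the patch size and dimension order are assumptions:

import torch

def patches_via_unfold(x, p):
    # Original pattern: (B, C, H, W) -> (B, C, H//p, W//p, p, p)
    return x.unfold(2, p, p).unfold(3, p, p)

def patches_via_reshape(x, p):
    # Equivalent for the non-overlapping case (step == size),
    # built only from reshape and permute, which Core ML supports.
    b, c, h, w = x.shape
    x = x.reshape(b, c, h // p, p, w // p, p)
    return x.permute(0, 1, 2, 4, 3, 5)

x = torch.rand(1, 3, 224, 224)
assert torch.equal(patches_via_unfold(x, 16), patches_via_reshape(x, 16))

If the real model uses overlapping windows, this equivalence doesn't hold and a different rewrite would be needed.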

Thanks for your work on Core ML — it’s a critical tool for bringing advanced AI models to Apple platforms!


Labels: PyTorch (traced), missing layer type, triaged
