
[PyTorch Conversion] SmolVLM model fails due to unsupported 'unfold' op in Core ML #2599

@jefferyby

Description

🧠 Summary
I'm attempting to convert a Hugging Face multi-modal model (SmolVLM-256M-Instruct) from PyTorch to Core ML using coremltools.convert(). The conversion fails on the unfold operation, which is currently not implemented in the PyTorch-to-MIL conversion path.

💻 Environment
macOS: 14.0 (Sonoma) — internal version 26.x

Device: Apple Silicon (M1/M2)

Python: 3.10

coremltools: 8.0.0

torch: 2.1.0

transformers: 4.34.0

Model: SmolVLM-256M-Instruct (downloaded locally)

📦 Conversion Code

import torch
import coremltools as ct
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

class SmolVLMWrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, pixel_values, input_ids):
        return self.model(pixel_values=pixel_values, input_ids=input_ids).logits

model = AutoModelForVision2Seq.from_pretrained("path/to/local/model", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("path/to/local/model", trust_remote_code=True)
wrapped_model = SmolVLMWrapper(model).eval()

dummy_image = Image.new('RGB', (224, 224))
dummy_text = "<image>\ndescribe this image"
inputs = processor(text=dummy_text, images=dummy_image, return_tensors="pt")
example_input = (inputs['pixel_values'], inputs['input_ids'])

traced_model = torch.jit.trace(wrapped_model, example_input)

coreml_model = ct.convert(
    model=traced_model,
    source="pytorch",
    inputs=[
        ct.TensorType(name="pixel_values", shape=example_input[0].shape),
        ct.TensorType(name="input_ids", shape=example_input[1].shape)
    ],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.ALL,
    minimum_deployment_target=ct.target.iOS16,
    debug=True
)

❌ Error Message

ERROR - converting 'unfold' op (located at: 'model/model/patches_subgrid.1'):
PyTorch convert function for op 'unfold' not implemented.

Also observed:

Core ML embedding (gather) layer does not support any inputs besides the weights and indices. Those given will be ignored.
📌 Notes
The model uses unfold internally for patch extraction in the vision encoder (a minimal standalone repro of this pattern is sketched after these notes).

The conversion fails early in the MIL graph construction phase.

I’ve confirmed the traced model produces logits with a static shape and does not contain dynamic control flow.

I’m not using TensorFlow/Keras in this environment.
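
For isolation, here is a minimal standalone module that uses the same Tensor.unfold patch-extraction pattern; I'd expect it to hit the same error on my setup (the 16-pixel patch size and input shape are placeholders, not SmolVLM's actual values):

import torch
import coremltools as ct

class PatchUnfold(torch.nn.Module):
    def forward(self, x):
        # (B, C, H, W) -> (B, C, H//16, W//16, 16, 16) via Tensor.unfold,
        # i.e. the same aten::unfold op the converter rejects
        return x.unfold(2, 16, 16).unfold(3, 16, 16).contiguous()

example = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(PatchUnfold().eval(), example)

# Expected to fail with:
# PyTorch convert function for op 'unfold' not implemented.
ct.convert(
    traced,
    inputs=[ct.TensorType(name="x", shape=example.shape)],
    convert_to="mlprogram",
    minimum_deployment_target=ct.target.iOS16,
)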

🙏 Feature Request
Please consider adding support for the unfold operation in Core ML’s PyTorch conversion path. This op is commonly used in vision models for patch embedding and is increasingly relevant for lightweight multi-modal architectures.

Alternatively, if there’s a recommended workaround or rewrite pattern for unfold, I’d be happy to adapt the model.
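
In case it helps, one rewrite I'm experimenting with: when the step equals the window size (non-overlapping patches, as in typical patch extraction), the double unfold can be expressed with reshape + permute, both of which convert cleanly. This is a sketch against the generic pattern above, not SmolVLM's verified source; the patch size and dimension order are assumptions:

import torch

def patches_via_unfold(x, p):
    # Original pattern: (B, C, H, W) -> (B, C, H//p, W//p, p, p)
    return x.unfold(2, p, p).unfold(3, p, p)

def patches_via_reshape(x, p):
    # Equivalent for the non-overlapping case (step == size),
    # built only from reshape and permute, which Core ML supports.
    b, c, h, w = x.shape
    x = x.reshape(b, c, h // p, p, w // p, p)
    return x.permute(0, 1, 2, 4, 3, 5)

x = torch.rand(1, 3, 224, 224)
assert torch.equal(patches_via_unfold(x, 16), patches_via_reshape(x, 16))

If the real model uses overlapping windows, this equivalence doesn't hold and a different rewrite would be needed.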

Thanks for your work on Core ML — it’s a critical tool for bringing advanced AI models to Apple platforms!


Labels: PyTorch (traced), missing layer type, triaged
