🧠 Summary
I'm attempting to convert a Hugging Face multi-modal model (SmolVLM-256M-Instruct) from PyTorch to Core ML using coremltools.convert(). The conversion fails on the unfold operation, for which no PyTorch-to-MIL translation is currently implemented in coremltools.
💻 Environment
macOS: 14.0 (Sonoma) — internal version 26.x
Device: Apple Silicon (M1/M2)
Python: 3.10
coremltools: 8.0.0
torch: 2.1.0
transformers: 4.34.0
Model: SmolVLM-256M-Instruct (downloaded locally)
📦 Conversion Code
import torch
import coremltools as ct
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image


class SmolVLMWrapper(torch.nn.Module):
    """Thin wrapper that exposes only the logits, so the traced graph has a single tensor output."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, pixel_values, input_ids):
        return self.model(pixel_values=pixel_values, input_ids=input_ids).logits


model = AutoModelForVision2Seq.from_pretrained("path/to/local/model", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("path/to/local/model", trust_remote_code=True)
wrapped_model = SmolVLMWrapper(model).eval()

# Build dummy inputs for tracing
dummy_image = Image.new('RGB', (224, 224))
dummy_text = "<image>\ndescribe this image"
inputs = processor(text=dummy_text, images=dummy_image, return_tensors="pt")
example_input = (inputs['pixel_values'], inputs['input_ids'])

traced_model = torch.jit.trace(wrapped_model, example_input)

coreml_model = ct.convert(
    model=traced_model,
    source="pytorch",
    inputs=[
        ct.TensorType(name="pixel_values", shape=example_input[0].shape),
        ct.TensorType(name="input_ids", shape=example_input[1].shape),
    ],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.ALL,
    minimum_deployment_target=ct.target.iOS16,
    debug=True,
)
❌ Error Message
ERROR - converting 'unfold' op (located at: 'model/model/patches_subgrid.1'):
PyTorch convert function for op 'unfold' not implemented.
Also observed:
Core ML embedding (gather) layer does not support any inputs besides the weights and indices. Those given will be ignored.
📌 Notes
The model uses unfold internally for patch extraction in the vision encoder (a possible reshape-based rewrite is sketched after these notes).
The conversion fails early in the MIL graph construction phase.
I've confirmed the traced model returns logits with a fixed shape and does not contain dynamic control flow.
I’m not using TensorFlow/Keras in this environment.
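If it helps, here is a minimal sketch of the model-side rewrite I could apply before tracing. The extract_patches helper below is hypothetical (not part of the model), and it only reproduces unfold when the step equals the window size, i.e. non-overlapping patches, which is the usual ViT-style double-unfold pattern; whether SmolVLM's patches_subgrid uses exactly this pattern would need to be confirmed against the model source.

import torch

def extract_patches(x: torch.Tensor, patch_size: int) -> torch.Tensor:
    # Hypothetical replacement for x.unfold(2, p, p).unfold(3, p, p) with step == size.
    # x: (B, C, H, W), with H and W divisible by patch_size.
    # Returns (B, C, H // p, W // p, p, p), matching the double-unfold layout.
    b, c, h, w = x.shape
    p = patch_size
    x = x.reshape(b, c, h // p, p, w // p, p)
    return x.permute(0, 1, 2, 4, 3, 5).contiguous()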
🙏 Feature Request
Please consider adding support for the unfold operation in Core ML’s PyTorch conversion path. This op is commonly used in vision models for patch embedding and is increasingly relevant for lightweight multi-modal architectures.
Alternatively, if there’s a recommended workaround or rewrite pattern for unfold, I’d be happy to adapt the model.
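One possible stopgap I was considering (an untested sketch, not a verified solution) is registering a composite op for unfold through coremltools' torch op registry, mapping it onto MIL's sliding_windows. Two caveats: _get_inputs is a private helper in the torch frontend, and sliding_windows places the window dimension immediately after the sliced axis while PyTorch's unfold appends it as the trailing dimension, so this only lines up when unfold is applied to the last axis; other axes would need an extra transpose.

from coremltools.converters.mil import Builder as mb
from coremltools.converters.mil.frontend.torch.ops import _get_inputs
from coremltools.converters.mil.frontend.torch.torch_op_registry import register_torch_op

@register_torch_op
def unfold(context, node):
    # torch.Tensor.unfold(dimension, size, step) -> MIL sliding_windows(x, axis, size, stride).
    # Only matches PyTorch's output layout when `dimension` is the last axis;
    # other axes would need an extra mb.transpose to move the window dim to the end.
    x, dimension, size, step = _get_inputs(context, node, expected=4)
    windows = mb.sliding_windows(
        x=x,
        axis=dimension.val,
        size=size.val,
        stride=step.val,
        name=node.name,
    )
    context.add(windows)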
Thanks for your work on Core ML — it’s a critical tool for bringing advanced AI models to Apple platforms!