How to fine-tune a bi-encoder embedding model with multimodal input #3601

I want to cluster e-commerce products using a bi-encoder. Each product has a name (text) and an image. Can I use sentence-transformers to fine-tune a bi-encoder model? The training dataset contains product clusters, like:

```
product1_name, product1_img, cluster_id1
product2_name, product2_img, cluster_id1
product3_name, product3_img, cluster_id2

...
productm_name, productm_img, cluster_idn
```

As a first attempt, I want to frame this as a classification problem over the cluster IDs (cluster_id1, ..., cluster_idn) and train with ArcFace loss. If other losses are suitable, those are fine too.
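To make concrete what I mean by ArcFace loss, here is a minimal sketch of the computation for a single product embedding. It is plain Python so the arithmetic is visible, not a sentence-transformers API. The function names, the toy 2-d vectors, and the `scale=30.0` / `margin=0.5` values are assumptions for illustration only, and in real training the class weights would be learned parameters in a framework such as PyTorch.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def arcface_logits(embedding, class_weights, target, scale=30.0, margin=0.5):
    """Scaled cosine logits with an additive angular margin on the target class.

    Hypothetical helper: one weight vector per cluster ID.
    """
    e = l2_normalize(embedding)
    logits = []
    for idx, w in enumerate(class_weights):
        w = l2_normalize(w)
        cos = sum(a * b for a, b in zip(e, w))
        cos = max(-1.0, min(1.0, cos))  # guard against rounding outside [-1, 1]
        if idx == target:
            # Plain additive margin; fuller implementations also handle
            # the case where theta + margin exceeds pi.
            cos = math.cos(math.acos(cos) + margin)
        logits.append(scale * cos)
    return logits

def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def arcface_loss(embedding, class_weights, target, scale=30.0, margin=0.5):
    logits = arcface_logits(embedding, class_weights, target, scale, margin)
    return cross_entropy(logits, target)

# Toy example: one product embedding and two cluster "centers".
embedding = [1.0, 0.2]
class_weights = [[1.0, 0.0], [0.0, 1.0]]
loss_with_margin = arcface_loss(embedding, class_weights, target=0)
loss_no_margin = arcface_loss(embedding, class_weights, target=0, margin=0.0)
print(loss_with_margin, loss_no_margin)
```

The margin makes the target class harder to satisfy, so the loss with a margin is larger than the plain scaled-softmax loss on the same input. That pressure is what should pull products from the same cluster closer together in embedding space.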

Is sentence-transformers suitable for my use case? I have found that SigLIP (a CLIP-like model) produces good embeddings. However, it is trained on image/text pairs, whereas my data consists of products (each with a name and an image) grouped by cluster ID.

