I want to cluster e-commerce products using a bi-encoder. Each product has a name (text) and an image. Can I use sentence-transformers to fine-tune a bi-encoder model? The training dataset contains product clusters, like:
product1_name, product1_img, cluster_id1
product2_name, product2_img, cluster_id1
product3_name, product3_img, cluster_id2
productm_name, productm_img, cluster_idn
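To make the data layout concrete, here is a minimal sketch of how rows like these might be turned into integer class labels for a classification-style loss, plus a per-cluster index that could be used to sample positives. The function name `encode_clusters` and the sample rows are illustrative assumptions, not part of any library.

```python
from collections import defaultdict

# Hypothetical sample rows in the (name, image, cluster_id) layout shown above.
rows = [
    ("product1_name", "product1_img", "cluster_id1"),
    ("product2_name", "product2_img", "cluster_id1"),
    ("product3_name", "product3_img", "cluster_id2"),
]

def encode_clusters(rows):
    """Map each cluster id to a contiguous integer label.

    Returns (labels, members): labels[i] is the class index of row i,
    and members[label] lists the row indices that share that cluster.
    """
    label_of = {}
    members = defaultdict(list)
    for idx, (_name, _img, cluster_id) in enumerate(rows):
        # Assign the next free integer the first time a cluster id is seen.
        label = label_of.setdefault(cluster_id, len(label_of))
        members[label].append(idx)
    labels = [label_of[cluster_id] for _, _, cluster_id in rows]
    return labels, dict(members)

labels, members = encode_clusters(rows)
```

The `labels` list is what a classification head would consume, while `members` groups rows from the same cluster, which is the information a pair-based or triplet-based loss would need instead.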
As a first attempt, I want to frame this as a classification problem (cluster_id1, ..., cluster_idn) and use ArcFace loss. If there are other suitable losses, those are fine too.
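For reference, this is roughly what I mean by ArcFace loss: the embedding and one weight vector per cluster are L2-normalized, an angular margin is added to the angle of the true class, and the scaled cosines go through softmax cross-entropy. The sketch below is plain Python for a single example, only to pin down the formula; the function names and the default `scale` and `margin` values are my assumptions, and real training would use a tensor library with autograd.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def arcface_loss(embedding, class_weights, target, scale=30.0, margin=0.5):
    """ArcFace loss for one example.

    embedding:     list of floats (the product embedding)
    class_weights: one weight vector per cluster id
    target:        integer index of the true cluster
    """
    e = l2_normalize(embedding)
    logits = []
    for j, w in enumerate(class_weights):
        cos = sum(a * b for a, b in zip(e, l2_normalize(w)))
        cos = max(-1.0, min(1.0, cos))  # guard acos against rounding error
        if j == target:
            # Add the angular margin to the true class only.
            # Note: this simple form ignores the theta + margin > pi case,
            # which practical implementations handle separately.
            cos = math.cos(math.acos(cos) + margin)
        logits.append(scale * cos)
    # Numerically stable softmax cross-entropy.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

# The margin makes the same example harder, so the loss goes up.
loss_plain = arcface_loss([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0, scale=1.0, margin=0.0)
loss_margin = arcface_loss([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0, scale=1.0, margin=0.5)
```

With `margin=0.0` this reduces to softmax cross-entropy over scaled cosine similarities, so the margin is the only thing that distinguishes it from a normalized classifier.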
Is sentence-transformers suitable for my use case? I find that SigLIP (a model similar to CLIP) produces good embeddings. However, it is trained on image/text pairs, whereas my data consists of products (each with a name and an image) grouped by cluster ID.