Is it possible to describe the training data in Chinese? #1063

ctgushiwei · 2025-03-31T08:39:37Z

I would like to fine-tune a CLIP model on my own dataset for an application in a Chinese-language context. Is it appropriate to write the image captions in Chinese for fine-tuning?

rwightman · 2025-04-16T21:31:31Z

Maintainer

@ctgushiwei it could work, but you'd need a model whose tokenizer supports Chinese (or a wide variety of languages) and that was trained with at least some Chinese captions. The SigLIP i18n and SigLIP2 models should be compatible, and both had Chinese in the language mix.
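To illustrate why tokenizer support matters, here is a rough sketch (not part of the original reply). It assumes a byte-level BPE tokenizer with a 77-token context, as in the default CLIP text tower, and computes only the worst case in which no bytes are merged. A tokenizer trained mostly on English merges few CJK byte sequences, so Chinese captions tend to land nearer this bound than English ones do. The function names are hypothetical.

```python
# Assumption: 77-token text context, as in the default CLIP text tower.
CONTEXT_LENGTH = 77


def max_byte_level_tokens(caption: str) -> int:
    """Worst-case token count for a byte-level BPE tokenizer.

    Every UTF-8 byte becomes its own token (no merges apply),
    plus one start token and one end token.
    """
    return len(caption.encode("utf-8")) + 2


def fits_context(caption: str, context_length: int = CONTEXT_LENGTH) -> bool:
    """True if the caption fits even in the worst case."""
    return max_byte_level_tokens(caption) <= context_length


english = "a cat sitting on the sofa"
chinese = "一只猫坐在沙发上"  # "a cat sitting on the sofa"

# Each Chinese character here occupies 3 bytes in UTF-8, so 8 characters
# can cost as many tokens as 24 ASCII characters.
print(len(english), max_byte_level_tokens(english))
print(len(chinese), max_byte_level_tokens(chinese))
```

This is only a bound. The actual count depends on the merges the tokenizer learned, so it is worth encoding a sample of real captions with the tokenizer you plan to use.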

4 replies

rwightman Apr 16, 2025
Maintainer

If you train from scratch, you'd still need a compatible tokenizer, so you would have to use the custom HF tokenizer feature and load an existing tokenizer that supports Chinese.

ctgushiwei Apr 17, 2025
Author

Thank you! I just want to fine-tune the model using our Chinese data.

fables008 Jul 3, 2025

> If you train from scratch, you'd still need a compatible tokenizer, so you would have to use the custom HF tokenizer feature and load an existing tokenizer that supports Chinese.

Hi, I have a question about fine-tuning the laion/mscoco_finetuned_CoCa-ViT-L-14-laion2B-s13B-b90k model.

This model was pretrained and finetuned using English captions. Now I’d like to fine-tune it further using Chinese image-caption pairs.

Is it possible to keep the original text encoder and image encoder unchanged, but replace the tokenizer with one that supports Chinese (e.g., a multilingual tokenizer)?
Would this work?

Or would it break the alignment between the tokenizer and the text encoder, since the original text encoder was trained with a different tokenization scheme?

What’s the correct way to fine-tune this model on Chinese data?

Thank you!
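The alignment concern raised in this question can be illustrated with a toy sketch (not from the thread). A text encoder's token embedding table is indexed by the IDs its tokenizer emits, so a different tokenizer sends the same text to different rows, or to rows that do not exist. Both vocabularies and all names below are hypothetical and far smaller than any real tokenizer.

```python
# Hypothetical toy vocabularies, for illustration only.
old_vocab = {"<unk>": 0, "a": 1, "cat": 2, "dog": 3}
new_vocab = {"<unk>": 0, "猫": 1, "狗": 2, "a": 3, "cat": 4, "dog": 5}

# Stand-in for the pretrained embedding table: row i holds what the
# encoder learned for old token ID i (labelled by meaning here).
pretrained_rows = ["<unk>", "a", "cat", "dog"]


def encode(words, vocab):
    """Map words to token IDs, falling back to the <unk> ID."""
    return [vocab.get(word, vocab["<unk>"]) for word in words]


def lookup(token_ids, rows):
    """Return the row each ID selects, or None if the ID is out of range."""
    return [rows[i] if i < len(rows) else None for i in token_ids]


words = ["a", "cat"]
old_ids = encode(words, old_vocab)
new_ids = encode(words, new_vocab)

print(lookup(old_ids, pretrained_rows))  # rows match the words
print(lookup(new_ids, pretrained_rows))  # wrong row, then out of range
print(lookup(encode(["猫"], new_vocab), pretrained_rows))  # unrelated row
```

In this toy setup, the new tokenizer's IDs select rows learned for other tokens, and IDs beyond the old vocabulary size have no row at all. Under that assumption, swapping only the tokenizer would require at least resizing and retraining the token embeddings. Whether a few hundred thousand pairs are enough for that is not answered in the thread.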

fables008 Jul 3, 2025

> Thank you! I just want to fine-tune the model using our Chinese data.

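How did your Chinese fine-tuning turn out? Our pretrained model was trained on hundreds of millions of English image-text pairs, and our training data is a few hundred thousand image-text pairs. Changing only the tokenizer without changing the text encoder probably won't work well, right?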
