Hello,
We have been attempting to train SigLIP from scratch with this codebase, but so far we cannot reproduce the reported results.
Even on a cluster of H100 GPUs, we run into the following issues:
- Out-of-memory errors prevent training with the sigmoid loss; the same global batch size (64k) works fine without it.
- The standard softmax contrastive loss trains more stably than the sigmoid loss.
- With the sigmoid loss, loss spikes are noticeably more likely.
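For reference, our understanding of the pairwise sigmoid loss is roughly the sketch below (plain NumPy, not taken from this codebase). The full n × n logit matrix it materializes is, as far as we can tell, what blows up memory at a 64k global batch:

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t, b):
    """Pairwise sigmoid loss as we understand it from the SigLIP paper.

    img_emb, txt_emb: L2-normalized embeddings, shape (n, d).
    t: log-temperature, b: bias (both learnable scalars in training).
    """
    n = img_emb.shape[0]
    # (n, n) logits -- at n = 64k this single matrix is ~16 GB in fp32.
    logits = img_emb @ txt_emb.T * np.exp(t) + b
    # +1 on the diagonal (matching pairs), -1 elsewhere.
    labels = 2.0 * np.eye(n) - 1.0
    # -log sigmoid(z) == log(1 + exp(-z)), computed stably via logaddexp.
    return np.sum(np.logaddexp(0.0, -labels * logits)) / n
```

Please correct us if this differs from what the codebase actually computes.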
Is reducing the batch size the only way to make training work? We want to train from scratch on a roughly 5B-example dataset, rather than fine-tune.
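One workaround we have been considering, instead of shrinking the batch, is computing the loss in row chunks so the full n × n logit matrix is never materialized at once. A minimal single-device sketch of that idea (the `chunk` parameter is hypothetical, and this ignores the cross-device chunking described in the SigLIP paper):

```python
import numpy as np

def siglip_loss_chunked(img_emb, txt_emb, t, b, chunk=1024):
    """Same pairwise sigmoid loss, but only a (chunk, n) logit slice
    is held in memory at a time instead of the full (n, n) matrix."""
    n = img_emb.shape[0]
    total = 0.0
    for i in range(0, n, chunk):
        # Logits for rows i .. i+chunk-1 against all n text embeddings.
        logits = img_emb[i:i + chunk] @ txt_emb.T * np.exp(t) + b
        labels = -np.ones_like(logits)
        rows = np.arange(logits.shape[0])
        labels[rows, i + rows] = 1.0  # matching pairs on the global diagonal
        # Stable -log sigmoid(labels * logits).
        total += np.sum(np.logaddexp(0.0, -labels * logits))
    return total / n
```

Would something along these lines (or gradient checkpointing of the loss) be the intended way to reach a 64k batch, or is there a supported option we are missing?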
Could you share any advice on training SigLIP from scratch at this scale?
Thank you.