Hello,
We have been attempting to train SigLIP from scratch with this codebase, but so far we cannot reproduce the reported results.
Even on a cluster of H100 GPUs, we run into the following issues:
- Out-of-memory errors prevent training with the sigmoid loss; the same global batch size (64k) works fine without it.
- The standard softmax contrastive loss trains more stably than the sigmoid loss.
- With the sigmoid loss, loss spikes are noticeably more likely.
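For reference, our understanding of the pairwise sigmoid loss is roughly the sketch below (plain NumPy, not taken from this codebase). The full n × n logit matrix it materializes is, as far as we can tell, what blows up memory at a 64k global batch:

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t, b):
    """Pairwise sigmoid loss as we understand it from the SigLIP paper.

    img_emb, txt_emb: L2-normalized embeddings, shape (n, d).
    t: log-temperature, b: bias (both learnable scalars in training).
    """
    n = img_emb.shape[0]
    # (n, n) logits -- at n = 64k this single matrix is ~16 GB in fp32.
    logits = img_emb @ txt_emb.T * np.exp(t) + b
    # +1 on the diagonal (matching pairs), -1 elsewhere.
    labels = 2.0 * np.eye(n) - 1.0
    # -log sigmoid(z) == log(1 + exp(-z)), computed stably via logaddexp.
    return np.sum(np.logaddexp(0.0, -labels * logits)) / n
```

Please correct us if this differs from what the codebase actually computes.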
Is reducing the batch size the only way to make training work? We want to train from scratch on a roughly 5B-example dataset, rather than fine-tune.
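One workaround we have been considering, instead of shrinking the batch, is computing the loss in row chunks so the full n × n logit matrix is never materialized at once. A minimal single-device sketch of that idea (the `chunk` parameter is hypothetical, and this ignores the cross-device chunking described in the SigLIP paper):

```python
import numpy as np

def siglip_loss_chunked(img_emb, txt_emb, t, b, chunk=1024):
    """Same pairwise sigmoid loss, but only a (chunk, n) logit slice
    is held in memory at a time instead of the full (n, n) matrix."""
    n = img_emb.shape[0]
    total = 0.0
    for i in range(0, n, chunk):
        # Logits for rows i .. i+chunk-1 against all n text embeddings.
        logits = img_emb[i:i + chunk] @ txt_emb.T * np.exp(t) + b
        labels = -np.ones_like(logits)
        rows = np.arange(logits.shape[0])
        labels[rows, i + rows] = 1.0  # matching pairs on the global diagonal
        # Stable -log sigmoid(labels * logits).
        total += np.sum(np.logaddexp(0.0, -labels * logits))
    return total / n
```

Would something along these lines (or gradient checkpointing of the loss) be the intended way to reach a 64k batch, or is there a supported option we are missing?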
Could you share any advice on training SigLIP from scratch at this scale?
Thank you.