Replies: 7 comments 13 replies
-
Yeah, I also can't replicate it; I might try the original JAX code and see if they're doing something different. I'm training like this:
-
Never mind, with --grad-checkpointing I can now run batch size 1200, whereas without it I'm stuck at 220.
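For context, gradient checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them. A minimal PyTorch sketch of the idea using torch.utils.checkpoint (just an illustration, not how open_clip wires it up internally):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    """Toy model that recomputes each block's activations in backward."""
    def __init__(self, dim=1024, depth=8):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)
        )

    def forward(self, x):
        for block in self.blocks:
            # Activations inside `block` are not kept; they are recomputed
            # during backward, cutting peak memory at the cost of extra compute.
            x = checkpoint(block, x, use_reentrant=False)
        return x

x = torch.randn(220, 1024, requires_grad=True)
loss = CheckpointedMLP()(x).sum()
loss.backward()
```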
-
The performance gains claimed in the SigLIP paper have yet to be reproduced here; the CLIP and SigLIP losses are at best roughly equal, and in many situations SigLIP seems a bit worse in practice.

Since the OP I've added some variations of the SigLIP loss that can be switched via the "--loss-dist-impl" arg. See: https://github.com/mlfoundations/open_clip/blob/b2f1403605aade5a004434076246b6bc741aa47d/src/open_clip/loss.py#L314-L448

'gather' is probably the best balance; 'reduce' might be the best for memory but slower. The original bidir/shift impls here were supposed to mimic what was described in the paper, but they're probably not great given the number of send/recv calls needed to implement that in torch. The official codebase never appears to have actually added the exact impl described in the paper. Not sure why. I don't imagine it being any faster than what's here, except maybe for some JAX-specific reasons.

I also feel some of the comparisons made in the paper were against a CLIP loss impl that may not have been as efficient as the defaults we use here ... the 'local loss + gather w/ grad' combo.
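For readers comparing the two objectives: SigLIP replaces the softmax contrastive loss with an independent sigmoid term per image-text pair. A rough single-process sketch of that pairwise sigmoid loss (an illustration of the idea from the paper, not the distributed variants in loss.py; the scale/bias values below are just toy numbers):

```python
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(image_feats, text_feats, logit_scale, logit_bias):
    """Pairwise sigmoid (SigLIP-style) loss for a single local batch.

    Each image-text pair gets an independent binary label: +1 on the
    diagonal (matching pairs), -1 everywhere else.
    """
    # Assumes features are already L2-normalized.
    logits = logit_scale * image_feats @ text_feats.T + logit_bias
    labels = 2.0 * torch.eye(logits.shape[0], device=logits.device) - 1.0
    # -log(sigmoid(labels * logits)) summed over all pairs, averaged per image.
    return -F.logsigmoid(labels * logits).sum() / logits.shape[0]

# Toy usage
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
loss = sigmoid_contrastive_loss(img, txt, logit_scale=10.0, logit_bias=-10.0)
```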
-
Interesting. The state of ML research is really lacking in reproducibility!

If I could have your advice: I have a rich dataset of 3.5 million archeological items, many with detailed descriptions. I'm looking to train an object-retrieval model so that, based on an image, some text, or both, I can retrieve likely candidate archeological objects.

CLIP-style models seem best suited for this. However, I hear they're a nightmare to fine-tune. I was going to fine-tune pretrained SigLIP with weight decay off and the same settings as the paper. Are there other alternatives I should try? Maybe DINOv2 SSL?
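As a concrete picture of the retrieval setup being described, here is a rough sketch of image/text embedding retrieval with an open_clip model. The model name and pretrained tag are just the usual README example, and the "index" is simply a matrix of normalized embeddings searched by cosine similarity:

```python
import torch
import open_clip

# Placeholder checkpoint; any CLIP/SigLIP model from open_clip would work.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def embed_text(queries):
    feats = model.encode_text(tokenizer(queries))
    return torch.nn.functional.normalize(feats, dim=-1)

@torch.no_grad()
def embed_images(pil_images):
    batch = torch.stack([preprocess(im) for im in pil_images])
    feats = model.encode_image(batch)
    return torch.nn.functional.normalize(feats, dim=-1)

def retrieve(query_emb, index_emb, k=10):
    # Cosine similarity == dot product on normalized embeddings.
    scores = query_emb @ index_emb.T
    return scores.topk(k, dim=-1)
```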
-
Regarding augmentations: the good thing is that, because the text is long, I am using LLMs to produce multiple versions of it. So essentially one long description can become five slightly different short descriptions focusing on key details. I haven't seen this done and I'm hoping it improves performance.

Image augmentations are important too, since a lot of the database is also black and white. I also want augmentations that add noise similar to older photographs (see the sketch after this comment); some of the database dates back to 1880.

On Fri, Jan 3, 2025, 18:56, Ross Wightman wrote:
@alexisdrakopoulos I think you should get decent mileage fine-tuning SigLIP or CLIP models. The SigLIP and DFN (CLIP) models are at the top of most rankings by zero-shot and other downstream eval tasks.

If the captions are very long and tokens will be truncated, you might get better results deploying masking/prioritization à la CLIPA: https://github.com/mlfoundations/open_clip/blob/main/docs/clipa.md#text-token-length-reduction

Using the same pretrain settings as the originals is usually not the best strategy. Use an LR that's at least one order of magnitude smaller. I don't know if I'd fully disable weight decay; maybe try 1/2 to 1/4 of the original to start.

3.5M is a decent number of samples, but not 'a lot' by CLIP / SigLIP standards, so enabling some image augmentations could help.

I use layer-wise LR decay in a lot of timm-based fine-tunes; that can be used here too, but it's not added to the codebase. I think the EVA people used it for their models, so you could try borrowing that (open to a PR if it can be cleanly added here): https://github.com/baaivision/EVA/blob/master/EVA-CLIP/rei/training/optim.py
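For reference, a minimal sketch of what layer-wise LR decay looks like as optimizer parameter groups for a ViT-style tower. This is an illustration of the idea, not the EVA implementation linked above; the `model.blocks` naming and the decay value are assumptions:

```python
import torch

def layerwise_lr_param_groups(model, base_lr=1e-5, decay=0.75):
    """Assign smaller LRs to earlier transformer blocks.

    Assumes a timm-style ViT with `model.blocks`; the patch/positional
    embeddings get the most-decayed LR, the head gets the full base LR.
    """
    num_layers = len(model.blocks) + 1  # +1 for the embedding "layer 0"
    groups = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if name.startswith("blocks."):
            layer_id = int(name.split(".")[1]) + 1
        elif name.startswith(("patch_embed", "pos_embed", "cls_token")):
            layer_id = 0
        else:
            layer_id = num_layers  # head / final norm
        scale = decay ** (num_layers - layer_id)
        groups.append({"params": [param], "lr": base_lr * scale})
    return groups

# optimizer = torch.optim.AdamW(layerwise_lr_param_groups(vit), weight_decay=0.05)
```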
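And a rough sketch of the kind of "old photograph" image augmentations mentioned above, using torchvision transforms; the specific transforms and probabilities are just illustrative choices:

```python
import torch
from torchvision import transforms

class AddGaussianNoise:
    """Add mild film-grain style noise to a tensor image in [0, 1]."""
    def __init__(self, std=0.03):
        self.std = std

    def __call__(self, img):
        return (img + torch.randn_like(img) * self.std).clamp(0.0, 1.0)

old_photo_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomGrayscale(p=0.5),            # much of the archive is B&W
    transforms.ColorJitter(brightness=0.2, contrast=0.3),
    transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0)),
    transforms.ToTensor(),
    AddGaussianNoise(std=0.03),
])
```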
-
I am GPU poor, so large batch sizes aren't ideal. I can get access to TPUs for a short time, but I haven't tried PyTorch on TPUs, so I'm not sure what kind of issues I'll run into. Otherwise I'm mainly stuck on a max of 8×48GB of VRAM.

On Fri, Jan 3, 2025, 18:59, Ross Wightman wrote:
Also, I see no reason why you can't fine-tune a SigLIP model with the CLIP loss, or try a CLIP model with the SigLIP loss...

CLIP definitely performs better if you can fine-tune at larger batch sizes though, just like pretrain. Getting the global batch size up into the 8-32k range can really help, though you still get results without doing that. SigLIP appears to work a bit better at smaller batch sizes, though I haven't heard of anyone trying it at the 16k+ range who actually saw better results than with the CLIP loss at similar batch sizes. I think the SigLIP dataset was better cleaned/curated than others, which might be a big part of why those models are so good (vs. just the loss).
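For comparison with the sigmoid loss sketched earlier, here is a minimal single-process sketch of the softmax (InfoNCE) CLIP loss; every other sample in the batch serves as a negative, which is why larger global batch sizes tend to help this objective:

```python
import torch
import torch.nn.functional as F

def clip_softmax_loss(image_feats, text_feats, logit_scale):
    """Symmetric softmax/InfoNCE contrastive loss over one batch.

    Assumes L2-normalized features; the number of negatives per example
    grows with batch size, so the training signal gets richer as it scales.
    """
    logits = logit_scale * image_feats @ text_feats.T  # (B, B)
    targets = torch.arange(logits.shape[0], device=logits.device)
    return 0.5 * (
        F.cross_entropy(logits, targets)        # image -> text
        + F.cross_entropy(logits.T, targets)    # text -> image
    )
```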
-
The complete dataset is proprietary and ridiculously expensive. However, a subset of it (2M entries) I can likely publish down the line. I'm hoping to get a paper out of this!

What do you think of https://news.ycombinator.com/item?id=34970045? It's proprietary tech, so it could be fluff, but I'm curious about their claims.

On Fri, Jan 3, 2025, 19:03, Ross Wightman wrote:
Also, if this dataset happens to be openly licensable, I'm sure others would love to explore it... I can help get it on the HF Hub if that's a possibility. Understand it may be proprietary, though :)
-
Hello,
The original SigLIP paper says they can fit a 2x batch size on TPU with the base SigLIP model compared with CLIP.

But in my experiment I used a 14400 batch size for both, on 48 A100-40GB GPUs, with both the SigLIP and CLIP models being standard base-sized architectures. During training, SigLIP takes 33.5GB per GPU while CLIP takes 37.0GB. Those are close, and I couldn't scale up to a 2x batch size as the paper said.

I am not using any FSDP/DeepSpeed techniques; is that the reason? Or does the GPU type matter a lot? I have no idea.

Can anyone who has trained a SigLIP model share their experience?

Thanks!
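One thing that helps when comparing the two setups is logging PyTorch's peak-memory counters per rank rather than relying on nvidia-smi, since the caching allocator inflates the nvidia-smi number. A small sketch (the training-step call is a placeholder):

```python
import torch

def report_peak_memory(tag, device=0):
    """Print the allocator's peak usage since the last reset (GiB)."""
    peak = torch.cuda.max_memory_allocated(device) / 1024**3
    reserved = torch.cuda.max_memory_reserved(device) / 1024**3
    print(f"{tag}: peak allocated {peak:.1f} GiB, peak reserved {reserved:.1f} GiB")

# Hypothetical usage around one training step:
# torch.cuda.reset_peak_memory_stats()
# loss = train_step(batch)   # placeholder for the actual forward/backward
# report_peak_memory("siglip step")
```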