Replies: 7 comments 13 replies
-
Yeah, I also can't replicate it; I might try the original JAX code and see if they're doing something different. I'm training like this:
-
Never mind, with --grad-checkpointing I can now run batch size 1200, whereas without it I'm stuck at 220.
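For context, gradient checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them. A minimal PyTorch sketch of the idea using torch.utils.checkpoint (just an illustration, not how open_clip wires it up internally):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    """Toy model that recomputes each block's activations in backward."""
    def __init__(self, dim=1024, depth=8):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)
        )

    def forward(self, x):
        for block in self.blocks:
            # Activations inside `block` are not kept; they are recomputed
            # during backward, cutting peak memory at the cost of extra compute.
            x = checkpoint(block, x, use_reentrant=False)
        return x

x = torch.randn(220, 1024, requires_grad=True)
loss = CheckpointedMLP()(x).sum()
loss.backward()
```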
-
The performance gains claimed in the SigLIP paper have yet to be reproduced here; the CLIP and SigLIP losses are at best roughly equal, and in many situations SigLIP seems a bit worse in practice.

Since the OP I've added some variations of the SigLIP loss that can be switched via the "--loss-dist-impl" arg. See: https://github.com/mlfoundations/open_clip/blob/b2f1403605aade5a004434076246b6bc741aa47d/src/open_clip/loss.py#L314-L448

'gather' is probably the best balance; 'reduce' might be the best for memory but slower. The original bidir/shift impls here were supposed to mimic what was described in the paper, but they're probably not great given the number of send/recv calls needed to implement that in torch. The official codebase never appears to have actually added the exact impl described in the paper. Not sure why. I don't imagine it being any faster than what's here, except maybe for some JAX-specific reasons.

I also feel some of the comparisons made in the paper were against a CLIP loss impl that may not have been as efficient as the defaults we use here ... the 'local loss + gather w/ grad' combo.
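For readers comparing the two objectives: SigLIP replaces the softmax contrastive loss with an independent sigmoid term per image-text pair. A rough single-process sketch of that pairwise sigmoid loss (an illustration of the idea from the paper, not the distributed variants in loss.py; the scale/bias values below are just toy numbers):

```python
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(image_feats, text_feats, logit_scale, logit_bias):
    """Pairwise sigmoid (SigLIP-style) loss for a single local batch.

    Each image-text pair gets an independent binary label: +1 on the
    diagonal (matching pairs), -1 everywhere else.
    """
    # Assumes features are already L2-normalized.
    logits = logit_scale * image_feats @ text_feats.T + logit_bias
    labels = 2.0 * torch.eye(logits.shape[0], device=logits.device) - 1.0
    # -log(sigmoid(labels * logits)) summed over all pairs, averaged per image.
    return -F.logsigmoid(labels * logits).sum() / logits.shape[0]

# Toy usage
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
loss = sigmoid_contrastive_loss(img, txt, logit_scale=10.0, logit_bias=-10.0)
```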
-
Interesting. The state of ML research is really lacking in reproducibility!

If I could have your advice: I have a rich dataset of 3.5 million archeological items, many with detailed descriptions. I'm looking to train an object-retrieval model so that, based on an image, some text, or both, I can retrieve likely candidate archeological objects.

CLIP-style models seem best suited for this. However, I hear they're a nightmare to fine-tune. I was going to fine-tune pretrained SigLIP with weight decay off and the same settings as the paper. Are there other alternatives I should try? Maybe DINOv2 SSL?
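As a concrete picture of the retrieval setup being described, here is a rough sketch of image/text embedding retrieval with an open_clip model. The model name and pretrained tag are just the usual README example, and the "index" is simply a matrix of normalized embeddings searched by cosine similarity:

```python
import torch
import open_clip

# Placeholder checkpoint; any CLIP/SigLIP model from open_clip would work.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def embed_text(queries):
    feats = model.encode_text(tokenizer(queries))
    return torch.nn.functional.normalize(feats, dim=-1)

@torch.no_grad()
def embed_images(pil_images):
    batch = torch.stack([preprocess(im) for im in pil_images])
    feats = model.encode_image(batch)
    return torch.nn.functional.normalize(feats, dim=-1)

def retrieve(query_emb, index_emb, k=10):
    # Cosine similarity == dot product on normalized embeddings.
    scores = query_emb @ index_emb.T
    return scores.topk(k, dim=-1)
```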
-
Regarding augmentations: the good thing is that, because the text is long, I am using LLMs to produce multiple versions of it. So essentially one long description can become five slightly different short descriptions focusing on key details. I haven't seen this done and I'm hoping it improves performance.

Image augmentations are important too, since a lot of the database is also black and white. I also want augmentations that add noise similar to older photographs (see the sketch after this comment); some of the database dates back to 1880.

On Fri, Jan 3, 2025, 18:56, Ross Wightman wrote:
@alexisdrakopoulos I think you should get decent mileage fine-tuning SigLIP or CLIP models. The SigLIP and DFN (CLIP) models are at the top of most rankings by zero-shot and other downstream eval tasks.

If the captions are very long and tokens will be truncated, you might get better results deploying masking/prioritization à la CLIPA: https://github.com/mlfoundations/open_clip/blob/main/docs/clipa.md#text-token-length-reduction

Using the same pretrain settings as the originals is usually not the best strategy. Use an LR that's at least one order of magnitude smaller. I don't know if I'd fully disable weight decay; maybe try 1/2 to 1/4 of the original to start.

3.5M is a decent number of samples, but not 'a lot' by CLIP / SigLIP standards, so enabling some image augmentations could help.

I use layer-wise LR decay in a lot of timm-based fine-tunes; that can be used here too, but it's not added to the codebase. I think the EVA people used it for their models, so you could try borrowing that (open to a PR if it can be cleanly added here): https://github.com/baaivision/EVA/blob/master/EVA-CLIP/rei/training/optim.py
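For reference, a minimal sketch of what layer-wise LR decay looks like as optimizer parameter groups for a ViT-style tower. This is an illustration of the idea, not the EVA implementation linked above; the `model.blocks` naming and the decay value are assumptions:

```python
import torch

def layerwise_lr_param_groups(model, base_lr=1e-5, decay=0.75):
    """Assign smaller LRs to earlier transformer blocks.

    Assumes a timm-style ViT with `model.blocks`; the patch/positional
    embeddings get the most-decayed LR, the head gets the full base LR.
    """
    num_layers = len(model.blocks) + 1  # +1 for the embedding "layer 0"
    groups = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if name.startswith("blocks."):
            layer_id = int(name.split(".")[1]) + 1
        elif name.startswith(("patch_embed", "pos_embed", "cls_token")):
            layer_id = 0
        else:
            layer_id = num_layers  # head / final norm
        scale = decay ** (num_layers - layer_id)
        groups.append({"params": [param], "lr": base_lr * scale})
    return groups

# optimizer = torch.optim.AdamW(layerwise_lr_param_groups(vit), weight_decay=0.05)
```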
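And a rough sketch of the kind of "old photograph" image augmentations mentioned above, using torchvision transforms; the specific transforms and probabilities are just illustrative choices:

```python
import torch
from torchvision import transforms

class AddGaussianNoise:
    """Add mild film-grain style noise to a tensor image in [0, 1]."""
    def __init__(self, std=0.03):
        self.std = std

    def __call__(self, img):
        return (img + torch.randn_like(img) * self.std).clamp(0.0, 1.0)

old_photo_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomGrayscale(p=0.5),            # much of the archive is B&W
    transforms.ColorJitter(brightness=0.2, contrast=0.3),
    transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0)),
    transforms.ToTensor(),
    AddGaussianNoise(std=0.03),
])
```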
-
I am GPU poor, so large batch sizes aren't ideal. I can get access to TPUs for a short time, but I haven't tried PyTorch on TPUs, so I'm not sure what kind of issues I'll run into. Otherwise I'm mainly stuck on a max of 8×48GB of VRAM.

On Fri, Jan 3, 2025, 18:59, Ross Wightman wrote:
Also, I see no reason why you can't fine-tune a SigLIP model with the CLIP loss, or try a CLIP model with the SigLIP loss...

CLIP definitely performs better if you can fine-tune at larger batch sizes though, just like pretrain. Getting the global batch size up into the 8-32k range can really help, though you still get results without doing that. SigLIP appears to work a bit better at smaller batch sizes, though I haven't heard of anyone trying it at the 16k+ range who actually saw better results than with the CLIP loss at similar batch sizes. I think the SigLIP dataset was better cleaned/curated than others, which might be a big part of why those models are so good (vs. just the loss).
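For comparison with the sigmoid loss sketched earlier, here is a minimal single-process sketch of the softmax (InfoNCE) CLIP loss; every other sample in the batch serves as a negative, which is why larger global batch sizes tend to help this objective:

```python
import torch
import torch.nn.functional as F

def clip_softmax_loss(image_feats, text_feats, logit_scale):
    """Symmetric softmax/InfoNCE contrastive loss over one batch.

    Assumes L2-normalized features; the number of negatives per example
    grows with batch size, so the training signal gets richer as it scales.
    """
    logits = logit_scale * image_feats @ text_feats.T  # (B, B)
    targets = torch.arange(logits.shape[0], device=logits.device)
    return 0.5 * (
        F.cross_entropy(logits, targets)        # image -> text
        + F.cross_entropy(logits.T, targets)    # text -> image
    )
```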
-
The complete dataset is proprietary and ridiculously expensive. However, a subset of it (2M entries) I can likely publish down the line. I'm hoping to get a paper out of this!

What do you think of https://news.ycombinator.com/item?id=34970045? It's proprietary tech, so it could be fluff, but I'm curious about their claims.

On Fri, Jan 3, 2025, 19:03, Ross Wightman wrote:
Also, if this dataset happens to be openly licensable, I'm sure others would love to explore it... I can help get it on the HF Hub if that's a possibility. Understand it may be proprietary, though :)
-
Hello,
The original SigLIP paper says they can fit a 2x batch size on TPU with the base SigLIP model compared with CLIP.

But in my experiment I used a 14400 batch size for both, on 48 A100-40GB GPUs, with both the SigLIP and CLIP models being standard base-sized architectures. During training, SigLIP takes 33.5GB per GPU while CLIP takes 37.0GB. Those are close, and I couldn't scale up to a 2x batch size as the paper said.

I am not using any FSDP/DeepSpeed techniques; is that the reason? Or does the GPU type matter a lot? I have no idea.

Can anyone who has trained a SigLIP model share their experience?

Thanks!
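One thing that helps when comparing the two setups is logging PyTorch's peak-memory counters per rank rather than relying on nvidia-smi, since the caching allocator inflates the nvidia-smi number. A small sketch (the training-step call is a placeholder):

```python
import torch

def report_peak_memory(tag, device=0):
    """Print the allocator's peak usage since the last reset (GiB)."""
    peak = torch.cuda.max_memory_allocated(device) / 1024**3
    reserved = torch.cuda.max_memory_reserved(device) / 1024**3
    print(f"{tag}: peak allocated {peak:.1f} GiB, peak reserved {reserved:.1f} GiB")

# Hypothetical usage around one training step:
# torch.cuda.reset_peak_memory_stats()
# loss = train_step(batch)   # placeholder for the actual forward/backward
# report_peak_memory("siglip step")
```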