Catastrophic performance loss in 1st epoch #1094
Replies: 8 comments
-
So interestingly, doing the exact same thing but with a lower warmup leads to better, though still very strange, performance:
-
@alexisdrakopoulos I don't think this is a bug of any sort. FYI, warmup is in steps, so if you have a small dataset that could be a lot of epochs; it means your LR is way too high and things collapse once the LR hits the breaking point. 1e-4 would already be a high fine-tuning LR for AdamW, and effectively even higher for LION, since I vaguely recall the recommendation for that optimizer being ~1/3 to 1/10 of the Adam equivalent?
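As a rough sanity check, it helps to convert the warmup into epochs for your dataset and batch size. The numbers below are placeholders pulled from this thread, not your actual run:

```python
# Rough sanity check: open_clip's warmup is counted in optimizer steps, not epochs.
# Placeholder numbers -- plug in your own run's values.
num_samples = 3_000_000   # image/text pairs (from the thread)
batch_size = 9_000        # per the original post
warmup_steps = 10_000     # hypothetical --warmup value

steps_per_epoch = num_samples // batch_size        # ~333 steps per epoch here
warmup_epochs = warmup_steps / steps_per_epoch     # ~30 epochs spent warming up

print(f"{steps_per_epoch} steps/epoch; warmup spans {warmup_epochs:.1f} epochs")
```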
-
Moving to discussions...
-
I have around 3 million image/text pairs. The warmup was indeed ridiculous, but I have tried many different settings with lower LRs, warmups, etc. I can't figure out why my performance massively improves over the first few epochs and then slowly degrades over the subsequent ones. It's wasting a lot of money and compute; I'm now just using the model from epoch 2 or 3, which is about 30 minutes into the fine-tuning process. Do you have any advice on how to explore this type of behavior and fix it? I am frustrated because I am comparing it to DinoV2 in terms of image-to-image retrieval, and DinoV2 gets these results out of the box: even ResNet fares well: A little frustrating.
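Roughly, the retrieval benchmark is nearest-neighbour search over embeddings; a minimal Recall@K sketch, assuming query/gallery embeddings and integer labels are already computed (tensor names are illustrative, not the actual evaluation code):

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_emb, gallery_emb, query_labels, gallery_labels, k=1):
    """Fraction of queries whose k nearest gallery items contain a matching label."""
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_emb, dim=-1)
    sims = q @ g.T                        # cosine similarities, shape [Nq, Ng]
    topk = sims.topk(k, dim=-1).indices   # [Nq, k] indices of nearest gallery items
    hits = (gallery_labels[topk] == query_labels.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()
```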
-
@alexisdrakopoulos can you share any information about the dataset? I think there are two likely causes here (2 being most likely):
-
Hi @jn2clark, I really appreciate you answering! The data shouldn't have too many duplicates, but the webdataset files I use are not pre-shuffled; I use the open_clip shuffling function, which I believe is two-stage. Each webdataset file has 3,000 image/text pairs and I have just under 1,000 of those. The dataset consists of images of antiquities, largely from museums and other places on the internet. Each image is 224 pixels on its longest axis and has one detailed piece of text. Image/text pairs are largely unique; there are some duplicates, but I don't think there are more than a few close ones. I guess I can crunch the data again by computing phashes of the images. You are right that not every text is unique, so I am now implementing an augmentation step where, every time the sampler fetches a label, it gets one random piece of text out of five: each image has five different valid captions, written slightly differently. I need to make sure this is actually running though.
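The augmentation itself is simple; a sketch of what I mean, assuming each sample carries its five captions as a JSON list under the "json" key (adjust the keys to match your shard layout):

```python
import json
import random

def pick_random_caption(sample):
    """Swap the text field for one of the stored captions, chosen at random.

    Assumes each webdataset sample stores its captions as a JSON list, e.g.
    {"captions": ["...", "...", "...", "...", "..."]} under the "json" key.
    """
    meta = sample["json"]
    if isinstance(meta, (bytes, str)):
        meta = json.loads(meta)
    sample["txt"] = random.choice(meta["captions"])
    return sample

# Hook it into the webdataset pipeline before tokenization, e.g.:
# dataset = dataset.map(pick_random_caption)
```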
-
I haven't been able to try your idea out yet since the text tower isn't trivial to lock; I saw there's an open PR for that, so I'll try to implement it. Here are my current evaluation metrics per epoch after adding some more image augmentations and using a learning rate of 1e-6, weight decay of 1e-7, and a short warmup: and here they are compared to the pretrained models:
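In the meantime I may just freeze the text parameters by hand; a rough sketch, with attribute names taken from open_clip's plain CLIP class (the custom-text variants keep the whole tower under `model.text` instead, so adjust accordingly):

```python
import torch

def lock_text_tower(model):
    """Freeze the text tower of an open_clip CLIP model (rough sketch)."""
    for module in (model.token_embedding, model.transformer, model.ln_final):
        for p in module.parameters():
            p.requires_grad = False
    model.positional_embedding.requires_grad = False
    if isinstance(model.text_projection, torch.nn.Parameter):
        model.text_projection.requires_grad = False
    else:  # some variants use an nn.Linear projection instead
        for p in model.text_projection.parameters():
            p.requires_grad = False
```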
-
@jn2clark I tried your second suggestion and sadly it didn't improve things. I'm going to go back to the drawing board and try cleaning up my dataset some more. What really confuses me is why the model's performance jumps from atrocious to 30% Recall@1 after it has seen only about 20% of the dataset, and then slowly gets worse and worse over the following epochs. My learning rates aren't even high according to the literature. I might even try a super slow learning rate like 1e-7 with Adam, I suppose, though that seems like a strange approach.
-
Hi, I'm using the following:
with 224x224 image + detailed text pairs, around 2 million samples in total.
Here is my image retrieval benchmark with the pre-trained model:
and here are my results after 1 test epoch:
and here after 2 epochs:
Should I be using a different optimizer? Should I set WD to 0? Any advice is appreciated.
I am training on a single H200 node, which allows a batch size of around 9000 given the args above.
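For reference, these are the kinds of knobs I've been varying, with illustrative placeholder values (not my exact args above):

```python
# Illustrative placeholders for the hyperparameters being varied -- not the exact args.
hparams = dict(
    model="ViT-B-32",    # whichever pretrained CLIP checkpoint is fine-tuned
    batch_size=9_000,    # single H200 node
    optimizer="adamw",   # vs. Lion, which reportedly wants a noticeably lower LR
    lr=1e-6,             # conservative fine-tuning LR
    wd=1e-7,             # or 0, per the question above
    warmup=200,          # in optimizer steps, i.e. under one epoch at this batch size
    epochs=10,
)
```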