Validation set issues #1071
Unanswered
SoumyajitStatsPerform asked this question in Q&A
Hi all,
I have a couple of related questions about the validation set when fine-tuning a CLIP model. I am fine-tuning with the following settings:
--model ViT-B-32
--pretrained laion2b_s34b_b79k
--gather-with-grad
--local-loss
--grad-checkpointing
My training dataset has over a million images (with their associated captions in .txt files), stored in webdataset format. I have tried a whole range of batch sizes (256, 512, 1024). When I use a "big" validation set of 300,000+ image/text pairs, evaluation on the val set crashes with a memory allocation error:
[rank0]: logits_per_image = (logit_scale * image_features @ text_features.t()).detach().cpu()
[rank0]: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
[rank0]: RuntimeError: [enforce fail at alloc_cpu.cpp:119] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 474649346704 bytes. Error code 12 (Cannot allocate memory).
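If I read the traceback right, this allocation is consistent with building the full N x N image-to-text logits matrix on the CPU in one go: for N validation samples in float32 it needs N^2 x 4 bytes. A quick back-of-the-envelope check (my own estimate, not taken from the open_clip code):

```python
# Rough estimate, assuming float32 logits and that evaluation scores all N
# validation samples against each other in a single (N, N) matrix.
def logits_matrix_bytes(num_samples: int, bytes_per_element: int = 4) -> int:
    return num_samples * num_samples * bytes_per_element

print(f"{logits_matrix_bytes(10_000) / 1e9:.1f} GB")   # ~0.4 GB  -> fits comfortably
print(f"{logits_matrix_bytes(300_000) / 1e9:.1f} GB")  # ~360 GB  -> same order as the ~474 GB in the error
```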
When I use a "small" validation set of around 10K image/text pairs, there are no memory issues. However, with these smaller val sets, the val CLIP loss seems to only increase.
So my questions are two-fold: 1) what is a typical val set size to work with so as not to run into memory issues? 2) has anyone else noticed this increase in val CLIP loss, and what steps did you take to deal with it?
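In case it is useful to anyone hitting the same wall, a workaround I have been sketching (my own rough code, based on the assumption above, not the open_clip evaluation path) is to score the validation set in chunks, so only a (chunk, N) block of logits ever exists instead of the full (N, N) matrix:

```python
import torch

def chunked_image_to_text_ranks(image_features: torch.Tensor,
                                text_features: torch.Tensor,
                                logit_scale: float,
                                chunk_size: int = 1024) -> torch.Tensor:
    """Rank of each image's matching caption, computed chunk by chunk.

    image_features, text_features: (N, D) L2-normalized feature tensors.
    Peak memory holds a (chunk_size, N) logits block rather than (N, N).
    """
    all_ranks = []
    num_samples = image_features.shape[0]
    for start in range(0, num_samples, chunk_size):
        img_chunk = image_features[start:start + chunk_size]         # (c, D)
        logits = logit_scale * img_chunk @ text_features.t()         # (c, N)
        # Index of the ground-truth caption for each image in this chunk.
        targets = torch.arange(start, start + img_chunk.shape[0],
                               device=logits.device)
        order = logits.argsort(dim=-1, descending=True)              # (c, N)
        ranks = (order == targets[:, None]).float().argmax(dim=-1)   # (c,)
        all_ranks.append(ranks.cpu())
    return torch.cat(all_ranks)
```

Recall@K then follows from the returned ranks, e.g. (ranks < 5).float().mean() for R@5.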
Here is the full training script I am using (on two NVIDIA Titan RTX 24 GB cards):
torchrun --nproc_per_node 2 -m open_clip_train.main \
    --batch-size 1024 \
    --precision amp \
    --workers 4 \
    --save-frequency 1 \
    --report-to tensorboard \
    --logs="path/to/logs/" \
    --dataset-type webdataset \
    --train-data="path/to/dataset-{000000..001099}.tar" \
    --val-data="/path/to/dataset-{000000..000299}.tar" \
    --train-num-samples 1100000 \
    --val-num-samples 300000 \
    --dataset-resampled \
    --force-image-size 128 64 \
    --warmup 10000 \
    --lr=5e-4 \
    --beta1 0.9 \
    --beta2 0.99 \
    --lr-scheduler cosine \
    --wd=0.2 \
    --epochs=50 \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --gather-with-grad \
    --local-loss \
    --grad-checkpointing
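For reference, the "small" val set runs mentioned above are the same command with --val-data pointed at fewer shards and --val-num-samples reduced to match, along the lines of (the shard range here is only an illustration):
--val-data="/path/to/dataset-{000000..000009}.tar"
--val-num-samples 10000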
Replies: 1 comment

I am going through the same. Have you by any chance figured it out?