Validation set issues #1071
Unanswered
SoumyajitStatsPerform asked this question in Q&A
Hi all,
I have a couple of related questions about the validation set when fine-tuning a CLIP model. I am fine-tuning with the following settings:
--model ViT-B-32
--pretrained laion2b_s34b_b79k
--gather-with-grad
--local-loss
--grad-checkpointing
My training dataset has over a million images (with their associated captions in .txt files), stored in webdataset format. I have tried a whole range of batch sizes (256, 512, 1024). When I use a "big" validation set of 300,000+ image/text pairs, evaluation on the val set crashes with a memory allocation error:
[rank0]: logits_per_image = (logit_scale * image_features @ text_features.t()).detach().cpu()
[rank0]: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
[rank0]: RuntimeError: [enforce fail at alloc_cpu.cpp:119] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 474649346704 bytes. Error code 12 (Cannot allocate memory).
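If I read the traceback right, this allocation is consistent with building the full N x N image-to-text logits matrix on the CPU in one go: for N validation samples in float32 it needs N^2 x 4 bytes. A quick back-of-the-envelope check (my own estimate, not taken from the open_clip code):

```python
# Rough estimate, assuming float32 logits and that evaluation scores all N
# validation samples against each other in a single (N, N) matrix.
def logits_matrix_bytes(num_samples: int, bytes_per_element: int = 4) -> int:
    return num_samples * num_samples * bytes_per_element

print(f"{logits_matrix_bytes(10_000) / 1e9:.1f} GB")   # ~0.4 GB  -> fits comfortably
print(f"{logits_matrix_bytes(300_000) / 1e9:.1f} GB")  # ~360 GB  -> same order as the ~474 GB in the error
```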
When I use a "small" validation set of around 10K image/text pairs, there are no memory issues. However, with these smaller val sets, the val CLIP loss seems to only increase.
So my questions are two-fold: 1) what is a typical val set size to work with so as not to run into memory issues? 2) has anyone else noticed this increase in val CLIP loss, and what steps did you take to deal with it?
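In case it is useful to anyone hitting the same wall, a workaround I have been sketching (my own rough code, based on the assumption above, not the open_clip evaluation path) is to score the validation set in chunks, so only a (chunk, N) block of logits ever exists instead of the full (N, N) matrix:

```python
import torch

def chunked_image_to_text_ranks(image_features: torch.Tensor,
                                text_features: torch.Tensor,
                                logit_scale: float,
                                chunk_size: int = 1024) -> torch.Tensor:
    """Rank of each image's matching caption, computed chunk by chunk.

    image_features, text_features: (N, D) L2-normalized feature tensors.
    Peak memory holds a (chunk_size, N) logits block rather than (N, N).
    """
    all_ranks = []
    num_samples = image_features.shape[0]
    for start in range(0, num_samples, chunk_size):
        img_chunk = image_features[start:start + chunk_size]         # (c, D)
        logits = logit_scale * img_chunk @ text_features.t()         # (c, N)
        # Index of the ground-truth caption for each image in this chunk.
        targets = torch.arange(start, start + img_chunk.shape[0],
                               device=logits.device)
        order = logits.argsort(dim=-1, descending=True)              # (c, N)
        ranks = (order == targets[:, None]).float().argmax(dim=-1)   # (c,)
        all_ranks.append(ranks.cpu())
    return torch.cat(all_ranks)
```

Recall@K then follows from the returned ranks, e.g. (ranks < 5).float().mean() for R@5.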
Here is the full training script I am using (on two NVIDIA Titan RTX 24 GB cards):
torchrun --nproc_per_node 2 -m open_clip_train.main \
    --batch-size 1024 \
    --precision amp \
    --workers 4 \
    --save-frequency 1 \
    --report-to tensorboard \
    --logs="path/to/logs/" \
    --dataset-type webdataset \
    --train-data="path/to/dataset-{000000..001099}.tar" \
    --val-data="/path/to/dataset-{000000..000299}.tar" \
    --train-num-samples 1100000 \
    --val-num-samples 300000 \
    --dataset-resampled \
    --force-image-size 128 64 \
    --warmup 10000 \
    --lr=5e-4 \
    --beta1 0.9 \
    --beta2 0.99 \
    --lr-scheduler cosine \
    --wd=0.2 \
    --epochs=50 \
    --model ViT-B-32 \
    --pretrained laion2b_s34b_b79k \
    --gather-with-grad \
    --local-loss \
    --grad-checkpointing
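For reference, the "small" val set runs mentioned above are the same command with --val-data pointed at fewer shards and --val-num-samples reduced to match, along the lines of (the shard range here is only an illustration):
--val-data="/path/to/dataset-{000000..000009}.tar"
--val-num-samples 10000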
Replies: 1 comment

I am going through the same. Have you by any chance figured it out?