Is it possible to use gradient accumulation to counter small GPU memory? #203
Description
Hi! I have a quick question I was hoping to pick your brain about:
I'm using SimCLR on very high-dimensional data, such that I max out at a batch size of 4. At that size it really isn't feasible to run SimCLR, since the contrastive loss depends on having many in-batch negatives. I was thinking about trying some sort of gradient accumulation, but my concern is that it might not mesh well with how the loss function works.

Say I want an effective batch size of 64 with a minibatch size of 4. Since the loss is built from dot products between the projections, instead of computing similarities across all 64 pairs as in normal SimCLR, accumulation would compute the loss over 8 separate groups of 4 pairs and average the resulting gradients. I'm not confident this has the same effect as a genuinely large batch, because the loss relies on contrasting each positive pair against a large number of negatives: with a minibatch of 4, each anchor sees only 2*4 - 2 = 6 negatives, instead of the 2*64 - 2 = 126 it would see in a true batch of 64 (see the sketch below).

Do you think there is a way to modify this framework to simulate large batch sizes under these memory constraints, or a way to get gradient accumulation to work the way I want?
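For concreteness, here is a minimal sketch of the naive accumulation scheme I mean (PyTorch; the encoder, augmentation, optimizer, and data below are toy placeholders I made up for illustration, not this repo's actual code):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent over one (micro-)batch of N pairs: each of the 2N projections
    is contrasted against the other 2N - 2 in-batch samples as negatives."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d)
    sim = z @ z.t() / temperature                        # pairwise cosine similarities
    n = z1.size(0)
    eye = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, float('-inf'))            # exclude self-similarity
    # Positives: row i of z1 pairs with row i of z2, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Toy stand-ins so the sketch runs end to end (hypothetical, not the repo's API).
encoder = torch.nn.Linear(32, 16)
optimizer = torch.optim.SGD(encoder.parameters(), lr=0.1)
augment = lambda x: x + 0.01 * torch.randn_like(x)       # placeholder augmentation
micro_batches = [torch.randn(4, 32) for _ in range(8)]   # 8 micro-batches of 4

# Naive accumulation: 8 micro-batches of 4 -> "effective" batch of 64.
# Each micro-batch's loss only sees 2*4 - 2 = 6 negatives, so the summed
# gradient is NOT the gradient of the true batch-64 NT-Xent loss, which
# would contrast each anchor against 2*64 - 2 = 126 negatives.
optimizer.zero_grad()
for x in micro_batches:
    z1, z2 = encoder(augment(x)), encoder(augment(x))
    loss = nt_xent(z1, z2) / len(micro_batches)
    loss.backward()          # gradients accumulate across micro-batches
optimizer.step()
```

This produces the update you'd get from averaging 8 independent batch-4 losses, which is exactly what I'm worried isn't equivalent to one batch-64 loss.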