Description
🐛 Bug
At the moment it seems that a significant portion of the data stored in the cache (~40-60%) is never removed over the course of training, and then remains in the cache dir upon completion of training. I suspect this is related to the file-locking behavior described below.
To Reproduce
Currently training a few models over 8 nodes with a large source dataset (only 1 epoch), and the cache dir size accumulates indefinitely. Over the course of the run, the lock counts for these files appear to be much greater than 0, so I wonder if this is related to `force_download` behavior?
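A minimal diagnostic sketch for watching the growth during a run (stdlib only; `CACHE_DIR` is a placeholder, and the `.lock` suffix is an assumption that the cache keeps `filelock`-style sidecar lock files):

```python
import os
import time
from pathlib import Path

# Hypothetical cache location; substitute the cache_dir configured for the run.
CACHE_DIR = Path(os.environ.get("CACHE_DIR", "/tmp/cache"))

def cache_stats(root: Path) -> tuple[int, int, int]:
    """Return (total bytes, file count, lock-file count) under root."""
    total = files = locks = 0
    for p in root.rglob("*"):
        if p.is_file():
            files += 1
            total += p.stat().st_size
            if p.suffix == ".lock":  # assumes filelock-style sidecar lock files
                locks += 1
    return total, files, locks

if __name__ == "__main__":
    # Poll once a minute; with proper eviction, the byte count should plateau
    # near the target cache size instead of growing monotonically.
    while True:
        total, files, locks = cache_stats(CACHE_DIR)
        print(f"{time.strftime('%H:%M:%S')} {total / 1e9:.2f} GB "
              f"across {files} files ({locks} lock files)")
        time.sleep(60)
```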
Expected behavior
The cache dir does not grow significantly above its target size, and files are not left with lock counts greater than 0.
Additional context
Environment details
- PyTorch Version (e.g., 1.0):
- OS (e.g., Linux):
- How you installed PyTorch (`conda`, `pip`, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version:
- GPU models and configuration:
- Any other relevant information: