Local cache dir not fully clearing in DDP multi-node training. #512

@JackUrb

🐛 Bug

At the moment, a significant portion of the data stored in the cache (~40-60%) is never removed over the course of training and remains in the cache dir after training completes. I suspect this is related to the lock-count behavior described under To Reproduce.
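
For context, here is the rough check I use to see how much data is left behind after a run; `/tmp/streaming_cache` is just a placeholder for the actual local cache dir:

```python
import os

def cache_dir_size(path: str) -> int:
    """Sum the sizes of all regular files under `path`."""
    total = 0
    for dirpath, _, filenames in os.walk(path):
        for name in filenames:
            fpath = os.path.join(dirpath, name)
            if os.path.isfile(fpath):
                total += os.path.getsize(fpath)
    return total

# Run after training completes; /tmp/streaming_cache is a placeholder path.
print(f"{cache_dir_size('/tmp/streaming_cache') / 1e9:.2f} GB left in cache")
```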

To Reproduce

I'm currently training a few models across 8 nodes with a large source dataset (only 1 epoch), and the cache dir size grows indefinitely. Throughout the run, the lock counts for these files are much greater than 0, so I wonder if this has to do with force_download-related behavior? A quick way to see which files linger is sketched below.
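
This listing shows the leftover cache entries by age and size, which makes it easy to spot shards that were never evicted (again, `/tmp/streaming_cache` is a placeholder):

```python
import os
import time

cache = "/tmp/streaming_cache"  # placeholder for the actual cache dir
now = time.time()
for dirpath, _, filenames in os.walk(cache):
    for name in filenames:
        fpath = os.path.join(dirpath, name)
        age_min = (now - os.path.getmtime(fpath)) / 60
        size = os.path.getsize(fpath)
        print(f"{age_min:8.1f} min old  {size:>12} B  {fpath}")
```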

Expected behavior

The cache dir should not grow significantly above its target size, and files should not be locked as many times as they currently are.

Additional context

Environment detail
  • PyTorch Version (e.g., 1.0):
  • OS (e.g., Linux):
  • How you installed PyTorch (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:


Labels: bug, help wanted
