Local cache dir not fully clearing in DDP multi-node training. #512

@JackUrb

🐛 Bug

At the moment, a significant portion of the data stored in the cache (~40-60%) is never removed over the course of training and remains in the cache dir after training completes. I suspect this is related to the lock-count behavior described under To Reproduce.
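
For context, here is the rough check I use to see how much data is left behind after a run; `/tmp/streaming_cache` is just a placeholder for the actual local cache dir:

```python
import os

def cache_dir_size(path: str) -> int:
    """Sum the sizes of all regular files under `path`."""
    total = 0
    for dirpath, _, filenames in os.walk(path):
        for name in filenames:
            fpath = os.path.join(dirpath, name)
            if os.path.isfile(fpath):
                total += os.path.getsize(fpath)
    return total

# Run after training completes; /tmp/streaming_cache is a placeholder path.
print(f"{cache_dir_size('/tmp/streaming_cache') / 1e9:.2f} GB left in cache")
```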

To Reproduce

I'm currently training a few models across 8 nodes with a large source dataset (only 1 epoch), and the cache dir size grows indefinitely. Throughout the run, the lock counts for these files are much greater than 0, so I wonder if this has to do with force_download-related behavior? A quick way to see which files linger is sketched below.
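
This listing shows the leftover cache entries by age and size, which makes it easy to spot shards that were never evicted (again, `/tmp/streaming_cache` is a placeholder):

```python
import os
import time

cache = "/tmp/streaming_cache"  # placeholder for the actual cache dir
now = time.time()
for dirpath, _, filenames in os.walk(cache):
    for name in filenames:
        fpath = os.path.join(dirpath, name)
        age_min = (now - os.path.getmtime(fpath)) / 60
        size = os.path.getsize(fpath)
        print(f"{age_min:8.1f} min old  {size:>12} B  {fpath}")
```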

Expected behavior

The cache dir should not grow significantly above its target size, and files should not be locked as many times as they currently are.

Additional context

Environment detail
  • PyTorch Version (e.g., 1.0):
  • OS (e.g., Linux):
  • How you installed PyTorch (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:


Labels: bug, help wanted
