Run big models with DDP/FSDP instead of torch.nn.DataParallel #683

@WenjieDu

Description

1. Feature description

Enable PyPOTS to train models on multiple GPUs with DDP (Distributed Data Parallel, https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) or FSDP (Fully Sharded Data Parallel, https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html). A rough sketch of what this could look like is given below.
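For reference, here is a minimal sketch of what per-process DDP training could look like underneath a PyPOTS trainer. This is not existing PyPOTS API: the toy `torch.nn.Linear` module, the `train` function, and the `MASTER_ADDR`/`MASTER_PORT` defaults are all illustrative placeholders.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def train(rank: int, world_size: int):
    # One process per GPU; NCCL is the standard backend for GPU collectives.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Toy module standing in for the torch.nn.Module inside a PyPOTS model.
    model = torch.nn.Linear(32, 32).to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

    for _ in range(10):
        optimizer.zero_grad()
        x = torch.randn(16, 32, device=rank)  # each rank sees its own data shard
        loss = ddp_model(x).sum()
        loss.backward()  # gradients are all-reduced across ranks here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # requires >= 1 CUDA device
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```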

2. Motivation

The current multi-GPU training in the PyPOTS framework, implemented with torch.nn.DataParallel, is not sufficient for training big models like Time-LLM (e.g. #675, where Time-LLM easily hits OOM even on short-length time-series samples). DataParallel replicates the full model on every GPU within a single process, so it cannot reduce the per-GPU memory footprint; we need a more advanced feature like DDP or FSDP. A sketch of the FSDP alternative follows below.
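For contrast, a minimal FSDP sketch under the same assumptions as the DDP sketch above (the stacked-Linear module is a hypothetical stand-in for a large backbone such as Time-LLM's LLM component, not PyPOTS code). FSDP is the variant that would actually relieve OOM pressure, since it shards the model instead of replicating it:

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def train(rank: int, world_size: int):
    # Same per-process setup as in the DDP sketch above.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Toy stack of layers standing in for a large backbone such as
    # Time-LLM's LLM component.
    model = torch.nn.Sequential(
        *[torch.nn.Linear(1024, 1024) for _ in range(8)]
    ).to(rank)

    # Unlike DataParallel/DDP, which keep a full model replica per GPU,
    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # so each GPU holds only a fraction of the model outside of the layers
    # it is currently computing.
    fsdp_model = FSDP(model)
    optimizer = torch.optim.Adam(fsdp_model.parameters(), lr=1e-4)

    x = torch.randn(4, 1024, device=rank)
    fsdp_model(x).sum().backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```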

3. Your contribution

I would like to lead or coordinate this development task. Please leave comments below to start a discussion if you're interested; more comments will help prioritize this feature.
