Run big models with DDP/FSDP instead of torch.nn.DataParallel #683

@WenjieDu

Description

1. Feature description

Enable PyPOTS to train models on multiple GPUs with DDP (Distributed Data Parallel, https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) or FSDP (Fully Sharded Data Parallel, https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html). A rough sketch of what this could look like is given below.
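For reference, here is a minimal sketch of what per-process DDP training could look like underneath a PyPOTS trainer. This is not existing PyPOTS API: the toy `torch.nn.Linear` module, the `train` function, and the `MASTER_ADDR`/`MASTER_PORT` defaults are all illustrative placeholders.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def train(rank: int, world_size: int):
    # One process per GPU; NCCL is the standard backend for GPU collectives.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Toy module standing in for the torch.nn.Module inside a PyPOTS model.
    model = torch.nn.Linear(32, 32).to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

    for _ in range(10):
        optimizer.zero_grad()
        x = torch.randn(16, 32, device=rank)  # each rank sees its own data shard
        loss = ddp_model(x).sum()
        loss.backward()  # gradients are all-reduced across ranks here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # requires >= 1 CUDA device
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```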

2. Motivation

The current multi-GPU training in the PyPOTS framework, implemented with torch.nn.DataParallel, is not sufficient for training big models like Time-LLM (e.g. #675, where Time-LLM easily hits OOM even on short-length time-series samples). DataParallel replicates the full model on every GPU within a single process, so it cannot reduce the per-GPU memory footprint; we need a more advanced feature like DDP or FSDP. A sketch of the FSDP alternative follows below.
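For contrast, a minimal FSDP sketch under the same assumptions as the DDP sketch above (the stacked-Linear module is a hypothetical stand-in for a large backbone such as Time-LLM's LLM component, not PyPOTS code). FSDP is the variant that would actually relieve OOM pressure, since it shards the model instead of replicating it:

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def train(rank: int, world_size: int):
    # Same per-process setup as in the DDP sketch above.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Toy stack of layers standing in for a large backbone such as
    # Time-LLM's LLM component.
    model = torch.nn.Sequential(
        *[torch.nn.Linear(1024, 1024) for _ in range(8)]
    ).to(rank)

    # Unlike DataParallel/DDP, which keep a full model replica per GPU,
    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # so each GPU holds only a fraction of the model outside of the layers
    # it is currently computing.
    fsdp_model = FSDP(model)
    optimizer = torch.optim.Adam(fsdp_model.parameters(), lr=1e-4)

    x = torch.randn(4, 1024, device=rank)
    fsdp_model(x).sum().backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```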

3. Your contribution

I would like to lead or coordinate this development task. Please leave comments below to start a discussion if you're interested; more comments will help prioritize this feature.
