Describe the bug
I've recently been updating package dependencies in my project (Python, PyTorch, Lightning). Without changing anything else in my code or hardware, aside from the Lightning import convention, I now get RuntimeErrors when training a TFT model with DDP.
Each rank immediately returns an error similar to the one below, but with different shapes.
[rank5]: RuntimeError: [5]: params[0] in this process with sizes [53, 15] appears not to match sizes of the same param in process 0.
I believe this is because the DDP strategy launches a separate process per device, and each process instantiates its own copy of the model on its subset of the data. However, the pytorch-forecasting implementation of TFT encodes categorical features internally, so if different subsets of the data contain different categorical values, the embedding shapes won't match across processes.
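To illustrate the mechanism, here is a minimal sketch with made-up data (the "store", "value" and "time_idx" columns are hypothetical): from_dataset sizes each categorical embedding from the classes that dataset's encoder has seen, so a dataset built from a subset missing some category values yields a smaller embedding matrix than one built from the full data.

import pandas as pd
from pytorch_forecasting import TimeSeriesDataSet, TemporalFusionTransformer

# Toy frame with a single categorical feature, "store".
df = pd.DataFrame(
    {
        "time_idx": list(range(10)) * 2,
        "store": ["A"] * 10 + ["B"] * 10,
        "value": [float(i) for i in range(20)],
    }
)

# A dataset built from a subset that never sees store "B" fits its label
# encoder on one class only ...
subset = TimeSeriesDataSet(
    df[df["store"] == "A"],
    time_idx="time_idx",
    target="value",
    group_ids=["store"],
    static_categoricals=["store"],
    time_varying_unknown_reals=["value"],
    max_encoder_length=4,
    max_prediction_length=2,
)

# ... so from_dataset builds a smaller "store" embedding than a process whose
# dataset saw both stores, and the parameter shapes no longer match under DDP.
model = TemporalFusionTransformer.from_dataset(subset)
print(model.hparams.embedding_sizes)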
Expected behavior
TFT with categorical variables should support the DDP training strategy.
Additional context
I'm training on a single EC2 node with 8 GPUs.
Trainer(accelerator="gpu", strategy="ddp", devices=1, ...) works but is slow.
Trainer(accelerator="gpu", strategy="ddp", devices=8, ...) fails with the error above.
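For reference, a rough sketch of the failing setup (training is assumed to be a TimeSeriesDataSet built earlier from my training frame):

import lightning.pytorch as pl
from pytorch_forecasting import TemporalFusionTransformer

# "training" is assumed to be an existing TimeSeriesDataSet; placeholder here.
model = TemporalFusionTransformer.from_dataset(training)
train_loader = training.to_dataloader(train=True, batch_size=64, num_workers=4)

# devices=1 trains correctly but slowly; devices=8 crashes immediately with the
# size-mismatch error shown above.
trainer = pl.Trainer(accelerator="gpu", strategy="ddp", devices=8, max_epochs=30)
trainer.fit(model, train_dataloaders=train_loader)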
Manually setting embedding_sizes when initialising the model with .from_dataset resolved the size mismatch, which points to this being the cause of the bug.
However, I think this still means the categorical values get different labels and vectors in the embedding space on each process, so when the models synchronise weights the embedding entries won't refer to the same categories, effectively rendering all categorical variables useless. embedding_labels should be pre-computed globally as well.
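Roughly what I have in mind, as a sketch (full_df, "store", "value" and "time_idx" are placeholder names for my data): fit the label encoder on the complete training frame before any splitting and pass it to the dataset, optionally pinning the embedding shape explicitly, so that every process builds the same embedding matrix with the same label-to-index mapping.

from pytorch_forecasting import TimeSeriesDataSet, TemporalFusionTransformer
from pytorch_forecasting.data.encoders import NaNLabelEncoder

# Fit the encoder on the *full* data once, before any per-process splitting,
# so every rank agrees on the category-to-index mapping.
store_encoder = NaNLabelEncoder(add_nan=False).fit(full_df["store"])

training = TimeSeriesDataSet(
    full_df,
    time_idx="time_idx",
    target="value",
    group_ids=["store"],
    static_categoricals=["store"],
    time_varying_unknown_reals=["value"],
    max_encoder_length=24,
    max_prediction_length=6,
    categorical_encoders={"store": store_encoder},
)

# Optionally also fix the embedding shape explicitly so from_dataset cannot
# infer a different size on a different rank.
model = TemporalFusionTransformer.from_dataset(
    training,
    embedding_sizes={"store": (len(store_encoder.classes_), 16)},
)

As far as I can tell, from_dataset also derives embedding_labels from the dataset's categorical encoders, so pre-fitting the encoder on the full data should keep the labels consistent across processes as well.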
Versions
doesn't work:
works: