[BUG] TFT + categorical features seems not to be compatible with DDP in some situations. #1825

Open
mkuiack opened this issue May 6, 2025 · 1 comment
Labels
bug Something isn't working

Comments


mkuiack commented May 6, 2025

Describe the bug

I've recently been updating the package dependencies in my project (Python, PyTorch, Lightning). Without changing anything else in my code or hardware, aside from the lightning import convention, I now get RuntimeErrors when training a TFT model with DDP.
Each rank immediately returns an error similar to this, but with different shapes.

[rank5]: RuntimeError: [5]: params[0] in this process with sizes [53, 15] appears not to match sizes of the same param in process 0.

I believe this is because the DDP strategy relaunches the script in each process, and each process then instantiates its own copy of the model from its subset of the data. However, the pytorch-forecasting implementation of TFT encodes categorical features internally, sizing the embeddings from the data it sees. If different subsets of the data contain different sets of categorical values, the embedding shapes won't match across ranks.
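
For illustration, here is a minimal sketch of the suspected mechanism (the vocabulary sizes are hypothetical, chosen to match the shapes in the error above):

    import torch.nn as nn

    # Hypothetical: each rank infers its embedding size from the categories
    # that happen to appear in its own shard of the data.
    rank0_vocab, rank5_vocab, emb_dim = 53, 48, 15

    emb_rank0 = nn.Embedding(rank0_vocab, emb_dim)  # weight shape [53, 15]
    emb_rank5 = nn.Embedding(rank5_vocab, emb_dim)  # weight shape [48, 15]

    # At startup, DDP verifies that every rank's parameters have identical
    # shapes; [53, 15] vs [48, 15] raises exactly the RuntimeError above.
    print(emb_rank0.weight.shape, emb_rank5.weight.shape)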

Expected behavior

TFT with categorical variables should support the DDP training strategy.

Additional context

I'm training on a single EC2 node with 8 GPUs.

Trainer(accelerator="gpu", strategy="ddp", devices=1, ...)
works but is slow:

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]

This doesn't work:

Trainer(accelerator="gpu", strategy="ddp", devices=8, ...)

----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]

Versions
doesn't work:

python = "~=3.11.0"
pytorch-forecasting = "~=1.2.0"
pytorch-lightning = "==2.0.0"
torch = [
  { version = "==2.5.1+cu118", source = "pytorch-cuda", markers = "sys_platform =='linux' and platform_machine== 'x86_64'" },
  { version = "==2.5.1", source = "picnic", markers = "sys_platform== 'darwin'" },
]

works:


[tool.poetry.dependencies]
python = "~=3.10.0"
pytorch-forecasting = "~=0.10.2"
pytorch-lightning = "~=1.8.0"
torch = [
  { version = "==1.13.1+cu117", source = "pytorch-cuda", markers = "sys_platform=='linux' and platform_machine == 'x86_64'" },
  { version = "==1.13.1", source = "picnic", markers = "sys_platform == 'darwin'" },
]

mkuiack commented May 7, 2025

Manually setting embedding_sizes when initialising the model with .from_dataset solved the size-mismatch issue, confirming that this is the cause of the bug.

i.e.:

    from pytorch_forecasting import TemporalFusionTransformer

    # embedding_sizes = {"category_column": (num_categories, embedding_dim)}
    embedding_sizes = {
        "store_id": (100, 50),
        "weekday_name": (7, 50),
        "month_name": (12, 50),
    }

    tft = TemporalFusionTransformer.from_dataset(
        dataset,
        embedding_sizes=embedding_sizes,
        **hyperparameters,
        ...
    )

However, I think this means that the categorical values will still get different labels, and therefore different vectors in the embedding space, on each rank: when the processes synchronize weights, the same embedding row won't refer to the same category, rendering all categorical variables useless. The embedding_labels should be pre-computed globally as well.
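
If that's right, a possible workaround (a sketch, not verified: the file path and column names are illustrative, and the usual TimeSeriesDataSet arguments are elided) is to fit the label encoders on the full dataframe before any sharding happens and pass them in via categorical_encoders:

    import pandas as pd
    from pytorch_forecasting import TimeSeriesDataSet
    from pytorch_forecasting.data import NaNLabelEncoder

    df = pd.read_parquet("training_data.parquet")  # full, un-sharded data (path hypothetical)
    categorical_cols = ["store_id", "weekday_name", "month_name"]

    # Fit one encoder per categorical column on the FULL dataframe so every
    # DDP rank shares the same vocabulary and index -> category mapping.
    categorical_encoders = {
        col: NaNLabelEncoder(add_nan=True).fit(df[col]) for col in categorical_cols
    }

    dataset = TimeSeriesDataSet(
        df,
        # ... the usual time_idx / target / group_ids arguments ...
        categorical_encoders=categorical_encoders,
    )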
