[BUG] TFT + categorical features seems not to be compatible with DDP in some situations. #1825

Open
mkuiack opened this issue May 6, 2025 · 1 comment
Labels
bug Something isn't working

Comments


mkuiack commented May 6, 2025

Describe the bug

I've recently been updating the package dependencies in my project (Python, PyTorch, Lightning). Without changing anything else in my code or hardware, aside from the lightning import convention, I now get RuntimeErrors when training a TFT model with DDP.
Each rank immediately returns an error similar to this, but with different shapes.

[rank5]: RuntimeError: [5]: params[0] in this process with sizes [53, 15] appears not to match sizes of the same param in process 0.

I believe this is because the DDP strategy relaunches the script in each process, and each process then instantiates its own copy of the model from its subset of the data. However, the pytorch-forecasting implementation of TFT encodes categorical features internally, sizing the embeddings from the data it sees. If different subsets of the data contain different sets of categorical values, the embedding shapes won't match across ranks.
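
For illustration, here is a minimal sketch of the suspected mechanism (the vocabulary sizes are hypothetical, chosen to match the shapes in the error above):

    import torch.nn as nn

    # Hypothetical: each rank infers its embedding size from the categories
    # that happen to appear in its own shard of the data.
    rank0_vocab, rank5_vocab, emb_dim = 53, 48, 15

    emb_rank0 = nn.Embedding(rank0_vocab, emb_dim)  # weight shape [53, 15]
    emb_rank5 = nn.Embedding(rank5_vocab, emb_dim)  # weight shape [48, 15]

    # At startup, DDP verifies that every rank's parameters have identical
    # shapes; [53, 15] vs [48, 15] raises exactly the RuntimeError above.
    print(emb_rank0.weight.shape, emb_rank5.weight.shape)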

Expected behavior

TFT with categorical variables should support the DDP training strategy.

Additional context

I'm training on a single EC2 node with 8 GPUs.

Trainer(accelerator="gpu", strategy="ddp", devices=1, ...)
works but is slow:

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]

This doesn't work:

Trainer(accelerator="gpu", strategy="ddp", devices=8, ...)

----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]

Versions
doesn't work:

python = "~=3.11.0"
pytorch-forecasting = "~=1.2.0"
pytorch-lightning = "==2.0.0"
torch = [
  { version = "==2.5.1+cu118", source = "pytorch-cuda", markers = "sys_platform =='linux' and platform_machine== 'x86_64'" },
  { version = "==2.5.1", source = "picnic", markers = "sys_platform== 'darwin'" },
]

works:


[tool.poetry.dependencies]
python = "~=3.10.0"
pytorch-forecasting = "~=0.10.2"
pytorch-lightning = "~=1.8.0"
torch = [
  { version = "==1.13.1+cu117", source = "pytorch-cuda", markers = "sys_platform=='linux' and platform_machine == 'x86_64'" },
  { version = "==1.13.1", source = "picnic", markers = "sys_platform == 'darwin'" },
]

mkuiack commented May 7, 2025

Manually setting embedding_sizes when initialising the model with .from_dataset solved the size-mismatch issue, confirming that this is the cause of the bug.

i.e.:

    from pytorch_forecasting import TemporalFusionTransformer

    # embedding_sizes = {"category_column": (num_categories, embedding_dim)}
    embedding_sizes = {
        "store_id": (100, 50),
        "weekday_name": (7, 50),
        "month_name": (12, 50),
    }

    tft = TemporalFusionTransformer.from_dataset(
        dataset,
        embedding_sizes=embedding_sizes,
        **hyperparameters,
        ...
    )

However, I think this means that the categorical values will still get different labels, and therefore different vectors in the embedding space, on each rank: when the processes synchronize weights, the same embedding row won't refer to the same category, rendering all categorical variables useless. The embedding_labels should be pre-computed globally as well.
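
If that's right, a possible workaround (a sketch, not verified: the file path and column names are illustrative, and the usual TimeSeriesDataSet arguments are elided) is to fit the label encoders on the full dataframe before any sharding happens and pass them in via categorical_encoders:

    import pandas as pd
    from pytorch_forecasting import TimeSeriesDataSet
    from pytorch_forecasting.data import NaNLabelEncoder

    df = pd.read_parquet("training_data.parquet")  # full, un-sharded data (path hypothetical)
    categorical_cols = ["store_id", "weekday_name", "month_name"]

    # Fit one encoder per categorical column on the FULL dataframe so every
    # DDP rank shares the same vocabulary and index -> category mapping.
    categorical_encoders = {
        col: NaNLabelEncoder(add_nan=True).fit(df[col]) for col in categorical_cols
    }

    dataset = TimeSeriesDataSet(
        df,
        # ... the usual time_idx / target / group_ids arguments ...
        categorical_encoders=categorical_encoders,
    )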
