re-training issues #184

Open
@Han00127

Description

Hello,

I'm currently re-training a Boltz model using the provided datasets. When I try to use DDP (Distributed Data Parallel) for multi-GPU training, I encounter an error in the Featurizer. Specifically, the error occurs when using devices=4 with a batch size of 1. The error message is as follows:

"Featurizer failed on 6y9b with error: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping."

Interestingly, when I use devices=1 with the same batch size of 1, training proceeds without any issues. I suspect this is related to DDP or the DataLoader, but I'm not certain. Could you please provide some insight into this?
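For what it's worth, the underlying exception is NumPy's standard error for using a multi-element array in a boolean context. I don't know the Featurizer internals, so this is only a guess, but one possibility is that some value that is a scalar in the devices=1 path arrives as an array under DDP sharding. A minimal, Boltz-independent reproduction of the error itself:

```python
import numpy as np

# Illustrative only: a multi-element array used where Python expects
# a single bool raises exactly the error from the traceback.
coords = np.array([1.0, 2.0, 3.0])

try:
    if coords == 1.0:  # element-wise comparison yields an array, so `if` is ambiguous
        print("match")
except ValueError as e:
    print(e)  # The truth value of an array with more than one element is ambiguous. ...
```

The usual fixes are `(coords == 1.0).any()` or `.all()`, depending on the intended semantics, or guarding with `np.isscalar` before the comparison.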

Thank you in advance for your help!
