re-training issues #184
Comments
Could you add a `raise` in the data processing code where that message is printed, so we can see the stack trace?
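A minimal sketch of what that could look like, assuming the featurizer call is wrapped in a try/except inside the dataset class (all names here are hypothetical, not the actual Boltz code):

```python
class TrainingDataset:
    """Toy stand-in for the real dataset; names are hypothetical."""

    def __init__(self, samples):
        self.samples = samples

    def featurize(self, sample):
        # Hypothetical featurizer that fails on bad input.
        if sample is None:
            raise ValueError("bad sample")
        return sample * 2

    def __getitem__(self, idx):
        try:
            return self.featurize(self.samples[idx])
        except Exception as exc:
            print(f"Featurizer failed on {idx} with error: {exc}.")
            # Re-raising instead of skipping surfaces the full stack
            # trace so the real failure site can be inspected.
            raise
```

With the `raise` in place, the next failure aborts with a full traceback instead of being silently skipped.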
GPU available: True (cuda), used: True
distributed_backend=nccl
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
| Name | Type | Params | Mode

From the attached image, it seems that training runs when set up with DDP for small-scale learning. However, a Featurizer error still occurs. Checking the code, I found that this is related to `data.module.training.py`, in the `TrainingDataset.__getitem__` method (around line 220). I also noticed that when a Featurizer error occurs, the code skips that sample and retrieves another one to continue training. If so, skipping dataset samples might affect the quality of training. Thank you so much for your contributions to the open source community.
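The skip-and-resample behavior described above roughly follows this pattern (a sketch with hypothetical names, not the actual implementation):

```python
import random

def get_item_with_retry(dataset, idx, max_retries=5):
    # On a featurizer failure, log the error, draw a different
    # random index, and try again instead of aborting training.
    for _ in range(max_retries):
        try:
            return dataset[idx]
        except Exception as exc:
            print(f"Featurizer failed on {idx} with error: {exc}. Skipping.")
            idx = random.randrange(len(dataset))
    raise RuntimeError("Too many consecutive featurizer failures.")
```

Because the failed sample is replaced by a random one, some samples are effectively dropped from the epoch, which is the training-quality concern noted above.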
Hello,
I'm currently re-training a Boltz model using the provided datasets. When I attempt to leverage DDP (Distributed Data Parallel) for multi-GPU training, I encounter an error related to the Featurizer. Specifically, the error occurs when using devices=4 with a batch size of 1. The error message is as follows:
"Featurizer failed on 6y9b with error: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping."
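For context, NumPy raises that error whenever a multi-element array is used where Python expects a single boolean, for example in an `if` condition on an elementwise comparison. A minimal reproduction, with the reductions the message suggests:

```python
import numpy as np

a = np.array([1, 2, 3])

# Using a multi-element comparison as a plain boolean raises the
# "truth value ... is ambiguous" ValueError seen in the log.
try:
    if a == 2:
        pass
except ValueError as exc:
    print(exc)

# Explicit reductions state the intent unambiguously:
any_match = (a == 2).any()   # does ANY element equal 2?
all_match = (a == 2).all()   # do ALL elements equal 2?
```

A plausible (unconfirmed) cause here is that some field compared inside the featurizer becomes an array under the multi-device data path where a scalar was expected with a single device.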
Interestingly, when I use devices=1 with the same batch size of 1, training proceeds without any issues. I suspect this issue might be related to DDP or the DataLoader, but I'm not certain. Could you please provide some insight into this?
Thank you in advance for your help!