re-training issues #184

Open
Han00127 opened this issue Feb 18, 2025 · 2 comments

Comments

@Han00127

Hello,

I'm currently re-training a Boltz model using the provided datasets. When I attempt to leverage DDP (Distributed Data Parallel) for multi-GPU training, I encounter an error related to the Featurizer. Specifically, the error occurs when using devices=4 with a batch size of 1. The error message is as follows:

"Featurizer failed on 6y9b with error: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping."

Interestingly, when I use devices=1 with the same batch size=1, the training proceeds without any issues. I suspect that this issue might be related to DDP or the DataLoader, but I'm not certain. Could you please provide some insights into this matter?
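
For context, the quoted message is the standard NumPy `ValueError` raised when an array with more than one element is used where Python expects a single boolean, for example in an `if` condition. The following is a minimal sketch that is independent of Boltz; the array `mask` is made up for illustration, and the sketch does not identify which line of the featurizer triggers the error.

```python
import numpy as np

# Hypothetical array, used only to reproduce the message.
mask = np.array([True, False, True])

# Using a multi-element array as a condition is ambiguous, so NumPy raises.
try:
    if mask:
        pass
except ValueError as exc:
    message = str(exc)
    print(message)

# Reducing the array explicitly removes the ambiguity.
any_set = bool(mask.any())  # at least one element is True
all_set = bool(mask.all())  # every element is True
print(any_set, all_set)
```

A common way to hit this in practice is comparing two arrays with `==` inside an `if`, which yields an element-wise array rather than a single `True` or `False`.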

Thank you in advance for your help!

@gcorso
Collaborator

gcorso commented Feb 18, 2025

Could you add a raise in the data processing code where that message is printed to see the stack trace?
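
One way to follow this suggestion is sketched below. It assumes the message is printed from an `except` block that wraps the featurizer call; `featurize` and `get_item` are hypothetical stand-ins, not the actual Boltz functions.

```python
import traceback


def featurize(sample_id):
    # Hypothetical stand-in for the real featurizer; it always fails here.
    raise ValueError(f"bad sample {sample_id}")


def get_item(sample_id, debug=True):
    try:
        return featurize(sample_id)
    except Exception as e:
        print(f"Featurizer failed on {sample_id} with error {e}. Skipping.")
        if debug:
            traceback.print_exc()  # write the full stack trace to stderr
            raise  # re-raise so the run stops at the first failure
        return None  # original behaviour: swallow the error and skip


# With debug=False the error is only reported, as in the log output.
print(get_item("6y9b", debug=False))
```

With `debug=True` the first failing sample aborts the run and the trace shows the exact line that evaluates an array as a boolean. Running with a single DataLoader worker (or `num_workers=0`) usually makes that trace easier to read.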

@Han00127
Author

Han00127 commented Feb 19, 2025

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
Training dataset size: 180540
Training dataset size: 180540
Training dataset size: 180540
Training dataset size: 180540
Training dataset size: 180540
Training dataset size: 180540
Training dataset size: 180540
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8

distributed_backend=nccl
All distributed processes registered. Starting with 8 processes

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
/root/mambaforge/envs/fold/lib/python3.11/site-packages/torch/optim/lr_scheduler.py:60: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
warnings.warn(
/root/mambaforge/envs/fold/lib/python3.11/site-packages/torch/optim/lr_scheduler.py:60: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
warnings.warn(
/root/mambaforge/envs/fold/lib/python3.11/site-packages/torch/optim/lr_scheduler.py:60: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
warnings.warn(
/root/mambaforge/envs/fold/lib/python3.11/site-packages/torch/optim/lr_scheduler.py:60: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
warnings.warn(
/root/mambaforge/envs/fold/lib/python3.11/site-packages/torch/optim/lr_scheduler.py:60: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
warnings.warn(
/root/mambaforge/envs/fold/lib/python3.11/site-packages/torch/optim/lr_scheduler.py:60: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
warnings.warn(
/root/mambaforge/envs/fold/lib/python3.11/site-packages/torch/optim/lr_scheduler.py:60: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
warnings.warn(
/root/mambaforge/envs/fold/lib/python3.11/site-packages/torch/optim/lr_scheduler.py:60: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
warnings.warn(

| Name | Type | Params | Mode
0 | lddt | ModuleDict | 0 | train
1 | disto_lddt | ModuleDict | 0 | train
2 | complex_lddt | ModuleDict | 0 | train
3 | rmsd | MeanMetric | 0 | train
4 | best_rmsd | MeanMetric | 0 | train
5 | train_confidence_loss_logger | MeanMetric | 0 | train
6 | train_confidence_loss_dict_logger | ModuleDict | 0 | train
7 | s_init | Linear | 174 K | train
8 | z_init_1 | Linear | 58.2 K | train
9 | z_init_2 | Linear | 58.2 K | train
10 | input_embedder | InputEmbedder | 1.0 M | train
11 | rel_pos | RelativePositionEncoder | 17.8 K | train
12 | token_bonds | Linear | 128 | train
13 | s_norm | LayerNorm | 768 | train
14 | z_norm | LayerNorm | 256 | train
15 | s_recycle | Linear | 147 K | train
16 | z_recycle | Linear | 16.4 K | train
17 | msa_module | MSAModule | 3.2 M | train
18 | pairformer_module | PairformerModule | 147 M | train
19 | structure_module | AtomDiffusion | 280 M | train
20 | distogram_module | DistogramModule | 8.3 K | train
432 M Trainable params
512 Non-trainable params
432 M Total params
1,728.830 Total estimated model params size (MB)
4280 Modules in train mode
0 Modules in eval mode
Sanity Checking: | | 0/? [00:00<?, ?it/s]/root/mambaforge/envs/fold/lib/python3.11/site-packages/pytorch_lightning/utilities/data.py:105: Total length of DataLoader across ranks is zero. Please make sure this was your intention.
/root/mambaforge/envs/fold/lib/python3.11/site-packages/pytorch_lightning/loops/fit_loop.py:298: The number of training batches (13) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
Epoch 0: 0%| | 0/13 [00:00<?, ?it/s]Featurizer failed on 1v08 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 3tk6 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 4ir0 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 5edh with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 5xz8 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 3zci with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 2woj with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 3f2w with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 4jyz with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 7nyh with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 3gn4 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6ucv with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6exh with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 5udd with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 1vbw with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6ogg with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Epoch 0: 15%|███████████████████▋ | 2/13 [00:27<02:30, 0.07it/s, v_num=t4xm]Featurizer failed on 295d with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6pzq with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 1e6x with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Epoch 0: 23%|█████████████████████████████▌ | 3/13 [00:35<01:58, 0.08it/s, v_num=t4xm]Featurizer failed on 6fkc with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6wr9 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 4n0h with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 1lw4 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6fbk with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Epoch 0: 31%|███████████████████████████████████████▍ | 4/13 [00:43<01:38, 0.09it/s, v_num=t4xm]Featurizer failed on 5ml7 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Epoch 0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [01:57<00:00, 0.11it/s, v_num=t4xm]/root/mambaforge/envs/fold/lib/python3.11/site-packages/torchmetrics/utilities/prints.py:43: UserWarning: The compute method of metric MeanMetric was called before the update method which may lead to errors, as metric states have not yet been updated.
warnings.warn(*args, **kwargs) # noqa: B028
/root/mambaforge/envs/fold/lib/python3.11/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:384: ModelCheckpoint(monitor='val/lddt') could not find the monitored key in the returned metrics: ['train/distogram_loss', 'train/diffusion_loss', 'train/mse_loss', 'train/smooth_lddt_loss', 'train/loss', 'train/grad_norm', 'train/param_norm', 'lr', 'train/grad_norm_msa_module', 'train/param_norm_msa_module', 'train/grad_norm_pairformer_module', 'train/param_norm_pairformer_module', 'train/grad_norm_structure_module', 'train/param_norm_structure_module', 'train/confidence_loss', 'train/plddt_loss', 'train/resolved_loss', 'train/pde_loss', 'train/pae_loss', 'epoch', 'step']. HINT: Did you call log('val/lddt', value) in the LightningModule?
Epoch 1: 0%| | 0/13 [00:00<?, ?it/s, v_num=t4xm]Featurizer failed on 1qpa with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 2qzq with error Cannot choose from an empty sequence. Skipping.
Featurizer failed on 5e3o with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6ovz with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6jdj with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 5lr3 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 5xrz with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 5kwy with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6swa with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 5l9t with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 7nhk with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 4oi8 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6zj3 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 4lum with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 4mnq with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 5oa3 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6ah0 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6ip5 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6vyg with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 5k3j with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 5j9q with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6bk8 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6sgb with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6fec with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Epoch 1: 8%|█████████▊ | 1/13 [00:19<03:58, 0.05it/s, v_num=t4xm]Featurizer failed on 1gpm with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Epoch 1: 15%|███████████████████▋ | 2/13 [00:28<02:34, 0.07it/s, v_num=t4xm]Featurizer failed on 6vsk with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Epoch 1: 23%|█████████████████████████████▌ | 3/13 [00:36<02:00, 0.08it/s, v_num=t4xm]Featurizer failed on 6sba with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 7ase with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Epoch 1: 31%|███████████████████████████████████████▍ | 4/13 [00:44<01:40, 0.09it/s, v_num=t4xm]Featurizer failed on 5xg4 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 1mci with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Epoch 2: 0%| | 0/13 [00:00<?, ?it/s, v_num=t4xm]Featurizer failed on 4ng2 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 4zym with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6n4o with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 5nl0 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 2sxl with error Cannot choose from an empty sequence. Skipping.
Featurizer failed on 2iy3 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 4g2p with error Cannot choose from an empty sequence. Skipping.
Featurizer failed on 5cg9 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 5v0a with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6uup with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 7cvx with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 7ary with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 2bjc with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 3icz with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 1ux7 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 3cc2 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6hd0 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 4fhw with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 2cl5 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 4zeb with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6rxx with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 5w0k with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 5y2z with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6zsd with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6tmg with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 2z3f with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 3gug with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 2p0v with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 7abf with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 4qom with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6kda with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6rxx with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 4oy2 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 5kcs with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6arv with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 2aoq with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6or5 with error Cannot choose from an empty sequence. Skipping.
Epoch 2: 15%|███████████████████▋ | 2/13 [00:27<02:31, 0.07it/s, v_num=t4xm]Featurizer failed on 6peu with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6z7o with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6kn7 with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 1l1o with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Epoch 2: 23%|█████████████████████████████▌ | 3/13 [00:35<01:59, 0.08it/s, v_num=t4xm]Featurizer failed on 4e5c with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 4ydp with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 4nvq with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 3j0p with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 6gsn with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Epoch 2: 31%|███████████████████████████████████████▍ | 4/13 [00:44<01:39, 0.09it/s, v_num=t4xm]Featurizer failed on 3bpj with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Epoch 2: 69%|████████████████████████████████████████████████████████████████████████████████████████▌ | 9/13 [01:25<00:37, 0.11it/s, v_num=t4xm]

From the attached image, it seems that training does run under DDP in a small-scale setup. However, the Featurizer error still occurs. Checking the code, I found that the message comes from data.module.training.py, in the TrainingDataset.__getitem__ method (around line 220).

Additionally, I noticed that when a Featurizer error occurs, the code skips that sample and retrieves another one to continue training. If so, skipping dataset samples might affect the quality of training.
Is there any further explanation of this issue or a way to solve it?
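
To gauge how many samples are actually being skipped, the log can be tallied by error type. This is a rough sketch that assumes the message format shown in the output above; `tally_failures` is a hypothetical helper, and the three sample lines are copied from the epoch 1 output.

```python
import re
from collections import Counter

# Matches "Featurizer failed on <id> with error <message>. Skipping."
PATTERN = re.compile(r"Featurizer failed on (\w+) with error (.+?)\. Skipping\.")


def tally_failures(log_text):
    """Count failures per error message and collect the affected sample IDs."""
    by_error = Counter()
    ids = set()
    for sample_id, error in PATTERN.findall(log_text):
        by_error[error] += 1
        ids.add(sample_id)
    return by_error, ids


sample_log = """\
Featurizer failed on 1qpa with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
Featurizer failed on 2qzq with error Cannot choose from an empty sequence. Skipping.
Featurizer failed on 5e3o with error The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Skipping.
"""

by_error, ids = tally_failures(sample_log)
for error, count in by_error.most_common():
    print(count, error)
print(sorted(ids))
```

Comparing the number of distinct failing IDs with the dataset size gives a rough skip rate. In the log above, 6rxx appears twice within epoch 2, which would be consistent with failures being tied to specific records rather than occurring at random, though the stack trace is needed to confirm that.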

Thank you so much for your contributions to the open source community.
