Description
I'm trying to run Neuralangelo on the "lego" test dataset, but training fails as soon as I invoke this command:
torchrun --nproc_per_node=${GPUS} train.py \
    --logdir=logs/${GROUP}/${NAME} \
    --config=${CONFIG} \
    --show_pbar
The command fails with an error I haven't been able to resolve. I've tried changing many of the parameters in the project's files, but nothing has fixed it. The full output is below, in case anyone recognizes the problem.
Thank you.
Error:
torchrun --nproc_per_node=${GPUS} train.py --logdir=logs/${GROUP}/${NAME} --config=${CONFIG} --show_pbar
(Setting affinity with NVML failed, skipping...)
[W Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
Training with 1 GPUs.
Using random seed 0
Make folder logs/example_group/example_name
- checkpoint:
- save_epoch: 9999999999
- save_iter: 5000
- save_latest_iter: 9999999999
- save_period: 9999999999
- strict_resume: True
- cudnn:
- benchmark: True
- deterministic: False
- data:
- name: dummy
- num_images: None
- num_workers: 4
- preload: True
- readjust:
- center: [0.0, 0.0, 0.0]
- scale: 1.0
- root: datasets/lego_ds2
- train:
- batch_size: 2
- image_size: [801, 801]
- subset: None
- type: projects.neuralangelo.data
- use_multi_epoch_loader: True
- val:
- batch_size: 2
- image_size: [300, 300]
- max_viz_samples: 16
- subset: 4
- image_save_iter: 9999999999
- inference_args:
- local_rank: 0
- logdir: logs/example_group/example_name
- logging_iter: 9999999999999
- max_epoch: 9999999999
- max_iter: 500000
- metrics_epoch: None
- metrics_iter: None
- model:
- appear_embed:
- dim: 4
- enabled: False
- background:
- enabled: True
- encoding:
- levels: 10
- type: fourier
- encoding_view:
- levels: 3
- type: spherical
- mlp:
- activ: relu
- activ_density: softplus
- activ_density_params:
- activ_params:
- hidden_dim: 256
- hidden_dim_rgb: 128
- num_layers: 8
- num_layers_rgb: 2
- skip: [4]
- skip_rgb: []
- view_dep: True
- white: False
- object:
- rgb:
- encoding_view:
- levels: 3
- type: spherical
- mlp:
- activ: relu_
- activ_params:
- hidden_dim: 256
- num_layers: 4
- skip: []
- weight_norm: True
- mode: idr
- encoding_view:
- s_var:
- anneal_end: 0.1
- init_val: 3.0
- sdf:
- encoding:
- coarse2fine:
- enabled: True
- init_active_level: 4
- step: 5000
- hashgrid:
- dict_size: 21
- dim: 4
- max_logres: 11
- min_logres: 5
- range: [-2, 2]
- levels: 16
- type: hashgrid
- coarse2fine:
- gradient:
- mode: numerical
- taps: 4
- mlp:
- activ: softplus
- activ_params:
- beta: 100
- geometric_init: True
- hidden_dim: 256
- inside_out: False
- num_layers: 1
- out_bias: 0.5
- skip: []
- weight_norm: True
- encoding:
- rgb:
- render:
- num_sample_hierarchy: 4
- num_samples:
- background: 32
- coarse: 64
- fine: 16
- rand_rays: 512
- stratified: True
- type: projects.neuralangelo.model
- appear_embed:
- nvtx_profile: False
- optim:
- fused_opt: False
- params:
- lr: 0.001
- weight_decay: 0.01
- sched:
- gamma: 10.0
- iteration_mode: True
- step_size: 9999999999
- two_steps: [300000, 400000]
- type: two_steps_with_warmup
- warm_up_end: 5000
- type: AdamW
- pretrained_weight: None
- source_filename: projects/neuralangelo/configs/custom/lego.yaml
- speed_benchmark: False
- test_data:
- name: dummy
- num_workers: 0
- test:
- batch_size: 1
- is_lmdb: False
- roots: None
- type: imaginaire.datasets.images
- timeout_period: 9999999
- trainer:
- amp_config:
- backoff_factor: 0.5
- enabled: False
- growth_factor: 2.0
- growth_interval: 2000
- init_scale: 65536.0
- ddp_config:
- find_unused_parameters: False
- static_graph: True
- depth_vis_scale: 0.5
- ema_config:
- beta: 0.9999
- enabled: False
- load_ema_checkpoint: False
- start_iteration: 0
- grad_accum_iter: 1
- image_to_tensorboard: False
- init:
- gain: None
- type: none
- loss_weight:
- curvature: 0.0005
- eikonal: 0.1
- render: 1.0
- type: projects.neuralangelo.trainer
- amp_config:
- validation_iter: 5000
- wandb_image_iter: 10000
- wandb_scalar_iter: 100
cudnn benchmark: True
cudnn deterministic: False
Setup trainer.
Using random seed 0
/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
model parameter count: 99,705,900
Initialize model weights using type: none, gain: None
Using random seed 0
[rank0]:[W Utils.hpp:108] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarString)
Allow TensorFloat32 operations on supported devices
Train dataset length: 100
Val dataset length: 4
Training from scratch.
Initialize wandb
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/d/Documents/neuralangelo/train.py", line 104, in <module>
[rank0]: main()
[rank0]: File "/mnt/d/Documents/neuralangelo/train.py", line 85, in main
[rank0]: trainer.init_wandb(cfg,
[rank0]: File "/mnt/d/Documents/neuralangelo/imaginaire/trainers/base.py", line 269, in init_wandb
[rank0]: wandb.watch(self.model_module)
[rank0]: File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/wandb_watch.py", line 49, in watch
[rank0]: tel.feature.watch = True
[rank0]: File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/lib/telemetry.py", line 42, in __exit__
[rank0]: self._run._telemetry_callback(self._obj)
[rank0]: File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 799, in _telemetry_callback
[rank0]: self._telemetry_obj.MergeFrom(telem_obj)
[rank0]: AttributeError: 'Run' object has no attribute '_telemetry_obj'
E0822 21:49:57.518840 139941045491520 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 25214) of binary: /home/miguel12/miniconda3/envs/neuralangelo/bin/python
Traceback (most recent call last):
File "/home/miguel12/miniconda3/envs/neuralangelo/bin/torchrun", line 10, in <module>
sys.exit(main())
File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
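
Reading the traceback, the first failure is the AttributeError raised inside wandb.watch(self.model_module), called from init_wandb in imaginaire/trainers/base.py; the ChildFailedError from torchrun only reports that rank 0 exited. One possible workaround, offered as a sketch rather than a confirmed fix, is to make the watch call tolerate that specific error so training can continue without gradient logging. The helper name watch_if_supported is hypothetical (it exists in neither Neuralangelo nor wandb), and whether skipping the watch is enough for the rest of training to run is an assumption.

```python
def watch_if_supported(watch_fn, model):
    """Call watch_fn(model), tolerating the missing-telemetry AttributeError.

    Hypothetical helper, not part of Neuralangelo or wandb. watch_fn is
    expected to be wandb.watch; it is passed in as an argument so the helper
    can be exercised without wandb installed.

    Returns True if the watch call succeeded, False if it was skipped.
    """
    try:
        watch_fn(model)
    except AttributeError as err:
        # Swallow only the failure seen in the traceback; re-raise anything else.
        if "_telemetry_obj" not in str(err):
            raise
        print(f"Skipping model watch: {err}")
        return False
    return True


# Assumed usage in imaginaire/trainers/base.py, replacing the call that
# the traceback shows on line 269:
#     watch_if_supported(wandb.watch, self.model_module)
```

Trying a different wandb release is another option, though which release avoids the error is not established by this log.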