Error wandb #209

I'm trying to run Neuralangelo on the "lego" test dataset, but training fails as soon as I run this command:

torchrun --nproc_per_node=${GPUS} train.py \
    --logdir=logs/${GROUP}/${NAME} \
    --config=${CONFIG} \
    --show_pbar

The command fails with AttributeError: 'Run' object has no attribute '_telemetry_obj', raised from wandb.watch inside init_wandb. I've tried changing many of the parameters in the project's files, but none of the changes fixed it. The full output is below, in case anyone has a solution.

Thank you.

Error:
 torchrun --nproc_per_node=${GPUS} train.py     --logdir=logs/${GROUP}/${NAME}     --config=${CONFIG}     --show_pbar
(Setting affinity with NVML failed, skipping...)
[W Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
Training with 1 GPUs.
Using random seed 0
Make folder logs/example_group/example_name
* checkpoint:
   * save_epoch: 9999999999
   * save_iter: 5000
   * save_latest_iter: 9999999999
   * save_period: 9999999999
   * strict_resume: True
* cudnn:
   * benchmark: True
   * deterministic: False
* data:
   * name: dummy
   * num_images: None
   * num_workers: 4
   * preload: True
   * readjust:
      * center: [0.0, 0.0, 0.0]
      * scale: 1.0
   * root: datasets/lego_ds2
   * train:
      * batch_size: 2
      * image_size: [801, 801]
      * subset: None
   * type: projects.neuralangelo.data
   * use_multi_epoch_loader: True
   * val:
      * batch_size: 2
      * image_size: [300, 300]
      * max_viz_samples: 16
      * subset: 4
* image_save_iter: 9999999999
* inference_args:
* local_rank: 0
* logdir: logs/example_group/example_name
* logging_iter: 9999999999999
* max_epoch: 9999999999
* max_iter: 500000
* metrics_epoch: None
* metrics_iter: None
* model:
   * appear_embed:
      * dim: 4
      * enabled: False
   * background:
      * enabled: True
      * encoding:
         * levels: 10
         * type: fourier
      * encoding_view:
         * levels: 3
         * type: spherical
      * mlp:
         * activ: relu
         * activ_density: softplus
         * activ_density_params:
         * activ_params:
         * hidden_dim: 256
         * hidden_dim_rgb: 128
         * num_layers: 8
         * num_layers_rgb: 2
         * skip: [4]
         * skip_rgb: []
      * view_dep: True
      * white: False
   * object:
      * rgb:
         * encoding_view:
            * levels: 3
            * type: spherical
         * mlp:
            * activ: relu_
            * activ_params:
            * hidden_dim: 256
            * num_layers: 4
            * skip: []
            * weight_norm: True
         * mode: idr
      * s_var:
         * anneal_end: 0.1
         * init_val: 3.0
      * sdf:
         * encoding:
            * coarse2fine:
               * enabled: True
               * init_active_level: 4
               * step: 5000
            * hashgrid:
               * dict_size: 21
               * dim: 4
               * max_logres: 11
               * min_logres: 5
               * range: [-2, 2]
            * levels: 16
            * type: hashgrid
         * gradient:
            * mode: numerical
            * taps: 4
         * mlp:
            * activ: softplus
            * activ_params:
               * beta: 100
            * geometric_init: True
            * hidden_dim: 256
            * inside_out: False
            * num_layers: 1
            * out_bias: 0.5
            * skip: []
            * weight_norm: True
   * render:
      * num_sample_hierarchy: 4
      * num_samples:
         * background: 32
         * coarse: 64
         * fine: 16
      * rand_rays: 512
      * stratified: True
   * type: projects.neuralangelo.model
* nvtx_profile: False
* optim:
   * fused_opt: False
   * params:
      * lr: 0.001
      * weight_decay: 0.01
   * sched:
      * gamma: 10.0
      * iteration_mode: True
      * step_size: 9999999999
      * two_steps: [300000, 400000]
      * type: two_steps_with_warmup
      * warm_up_end: 5000
   * type: AdamW
* pretrained_weight: None
* source_filename: projects/neuralangelo/configs/custom/lego.yaml
* speed_benchmark: False
* test_data:
   * name: dummy
   * num_workers: 0
   * test:
      * batch_size: 1
      * is_lmdb: False
      * roots: None
   * type: imaginaire.datasets.images
* timeout_period: 9999999
* trainer:
   * amp_config:
      * backoff_factor: 0.5
      * enabled: False
      * growth_factor: 2.0
      * growth_interval: 2000
      * init_scale: 65536.0
   * ddp_config:
      * find_unused_parameters: False
      * static_graph: True
   * depth_vis_scale: 0.5
   * ema_config:
      * beta: 0.9999
      * enabled: False
      * load_ema_checkpoint: False
      * start_iteration: 0
   * grad_accum_iter: 1
   * image_to_tensorboard: False
   * init:
      * gain: None
      * type: none
   * loss_weight:
      * curvature: 0.0005
      * eikonal: 0.1
      * render: 1.0
   * type: projects.neuralangelo.trainer
* validation_iter: 5000
* wandb_image_iter: 10000
* wandb_scalar_iter: 100
cudnn benchmark: True
cudnn deterministic: False
Setup trainer.
Using random seed 0
/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
model parameter count: 99,705,900
Initialize model weights using type: none, gain: None
Using random seed 0
[rank0]:[W Utils.hpp:108] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarString)
Allow TensorFloat32 operations on supported devices
Train dataset length: 100
Val dataset length: 4
Training from scratch.
Initialize wandb
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/d/Documents/neuralangelo/train.py", line 104, in <module>
[rank0]:     main()
[rank0]:   File "/mnt/d/Documents/neuralangelo/train.py", line 85, in main
[rank0]:     trainer.init_wandb(cfg,
[rank0]:   File "/mnt/d/Documents/neuralangelo/imaginaire/trainers/base.py", line 269, in init_wandb
[rank0]:     wandb.watch(self.model_module)
[rank0]:   File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/wandb_watch.py", line 49, in watch
[rank0]:     tel.feature.watch = True
[rank0]:   File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/lib/telemetry.py", line 42, in __exit__
[rank0]:     self._run._telemetry_callback(self._obj)
[rank0]:   File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 799, in _telemetry_callback
[rank0]:     self._telemetry_obj.MergeFrom(telem_obj)
[rank0]: AttributeError: 'Run' object has no attribute '_telemetry_obj'
E0822 21:49:57.518840 139941045491520 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 25214) of binary: /home/miguel12/miniconda3/envs/neuralangelo/bin/python
Traceback (most recent call last):
  File "/home/miguel12/miniconda3/envs/neuralangelo/bin/torchrun", line 10, in <module>
    sys.exit(main())
  File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-22_21:49:57
  host      : DESKTOP-Q0DS9I2.
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 25214)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
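
For anyone reading the log: the first traceback shows that the failure comes from wandb.watch(self.model_module) at imaginaire/trainers/base.py, line 269, and not from the model or the dataset. The log does not show why the Run object is missing _telemetry_obj, so the following is only a minimal workaround sketch, assuming the watch call can be skipped without affecting training. The helper name safe_watch is hypothetical and is not part of Neuralangelo or wandb.

```python
import warnings


def safe_watch(watch_fn, model):
    """Call watch_fn(model), tolerating the AttributeError seen in the traceback.

    Returns True if the call succeeded and False if it was skipped.
    safe_watch is a hypothetical helper, not part of Neuralangelo or wandb.
    """
    try:
        watch_fn(model)
    except AttributeError as exc:
        # Watching the model only adds extra logging, so warn and keep going.
        warnings.warn(f"Skipping model watch: {exc}")
        return False
    return True
```

Under that assumption, the line in init_wandb would become safe_watch(wandb.watch, self.model_module). This hides the symptom only; if wandb logging is actually needed, the wandb installation or its initialization still has to be fixed.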
