Error wandb #209

@P-UnKnow08

I'm trying to run Neuralangelo on the "lego" test set, but I can't get past this command:

torchrun --nproc_per_node=${GPUS} train.py \
    --logdir=logs/${GROUP}/${NAME} \
    --config=${CONFIG} \
    --show_pbar

The command throws an error that I haven't been able to resolve. I've tried changing many of the parameters in the project's files, but nothing has fixed it. The full output is below, in case anyone has a solution.

Thank you.

Error:
torchrun --nproc_per_node=${GPUS} train.py --logdir=logs/${GROUP}/${NAME} --config=${CONFIG} --show_pbar
(Setting affinity with NVML failed, skipping...)
[W Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
Training with 1 GPUs.
Using random seed 0
Make folder logs/example_group/example_name

  • checkpoint:
    • save_epoch: 9999999999
    • save_iter: 5000
    • save_latest_iter: 9999999999
    • save_period: 9999999999
    • strict_resume: True
  • cudnn:
    • benchmark: True
    • deterministic: False
  • data:
    • name: dummy
    • num_images: None
    • num_workers: 4
    • preload: True
    • readjust:
      • center: [0.0, 0.0, 0.0]
      • scale: 1.0
    • root: datasets/lego_ds2
    • train:
      • batch_size: 2
      • image_size: [801, 801]
      • subset: None
    • type: projects.neuralangelo.data
    • use_multi_epoch_loader: True
    • val:
      • batch_size: 2
      • image_size: [300, 300]
      • max_viz_samples: 16
      • subset: 4
  • image_save_iter: 9999999999
  • inference_args:
  • local_rank: 0
  • logdir: logs/example_group/example_name
  • logging_iter: 9999999999999
  • max_epoch: 9999999999
  • max_iter: 500000
  • metrics_epoch: None
  • metrics_iter: None
  • model:
    • appear_embed:
      • dim: 4
      • enabled: False
    • background:
      • enabled: True
      • encoding:
        • levels: 10
        • type: fourier
      • encoding_view:
        • levels: 3
        • type: spherical
      • mlp:
        • activ: relu
        • activ_density: softplus
        • activ_density_params:
        • activ_params:
        • hidden_dim: 256
        • hidden_dim_rgb: 128
        • num_layers: 8
        • num_layers_rgb: 2
        • skip: [4]
        • skip_rgb: []
      • view_dep: True
      • white: False
    • object:
      • rgb:
        • encoding_view:
          • levels: 3
          • type: spherical
        • mlp:
          • activ: relu_
          • activ_params:
          • hidden_dim: 256
          • num_layers: 4
          • skip: []
          • weight_norm: True
        • mode: idr
      • s_var:
        • anneal_end: 0.1
        • init_val: 3.0
      • sdf:
        • encoding:
          • coarse2fine:
            • enabled: True
            • init_active_level: 4
            • step: 5000
          • hashgrid:
            • dict_size: 21
            • dim: 4
            • max_logres: 11
            • min_logres: 5
            • range: [-2, 2]
          • levels: 16
          • type: hashgrid
        • gradient:
          • mode: numerical
          • taps: 4
        • mlp:
          • activ: softplus
          • activ_params:
            • beta: 100
          • geometric_init: True
          • hidden_dim: 256
          • inside_out: False
          • num_layers: 1
          • out_bias: 0.5
          • skip: []
          • weight_norm: True
    • render:
      • num_sample_hierarchy: 4
      • num_samples:
        • background: 32
        • coarse: 64
        • fine: 16
      • rand_rays: 512
      • stratified: True
    • type: projects.neuralangelo.model
  • nvtx_profile: False
  • optim:
    • fused_opt: False
    • params:
      • lr: 0.001
      • weight_decay: 0.01
    • sched:
      • gamma: 10.0
      • iteration_mode: True
      • step_size: 9999999999
      • two_steps: [300000, 400000]
      • type: two_steps_with_warmup
      • warm_up_end: 5000
    • type: AdamW
  • pretrained_weight: None
  • source_filename: projects/neuralangelo/configs/custom/lego.yaml
  • speed_benchmark: False
  • test_data:
    • name: dummy
    • num_workers: 0
    • test:
      • batch_size: 1
      • is_lmdb: False
      • roots: None
    • type: imaginaire.datasets.images
  • timeout_period: 9999999
  • trainer:
    • amp_config:
      • backoff_factor: 0.5
      • enabled: False
      • growth_factor: 2.0
      • growth_interval: 2000
      • init_scale: 65536.0
    • ddp_config:
      • find_unused_parameters: False
      • static_graph: True
    • depth_vis_scale: 0.5
    • ema_config:
      • beta: 0.9999
      • enabled: False
      • load_ema_checkpoint: False
      • start_iteration: 0
    • grad_accum_iter: 1
    • image_to_tensorboard: False
    • init:
      • gain: None
      • type: none
    • loss_weight:
      • curvature: 0.0005
      • eikonal: 0.1
      • render: 1.0
    • type: projects.neuralangelo.trainer
  • validation_iter: 5000
  • wandb_image_iter: 10000
  • wandb_scalar_iter: 100
    cudnn benchmark: True
    cudnn deterministic: False
    Setup trainer.
    Using random seed 0
    /home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
    warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
    model parameter count: 99,705,900
    Initialize model weights using type: none, gain: None
    Using random seed 0
    [rank0]:[W Utils.hpp:108] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarString)
    Allow TensorFloat32 operations on supported devices
    Train dataset length: 100
    Val dataset length: 4
    Training from scratch.
    Initialize wandb
    [rank0]: Traceback (most recent call last):
    [rank0]: File "/mnt/d/Documents/neuralangelo/train.py", line 104, in <module>
    [rank0]: main()
    [rank0]: File "/mnt/d/Documents/neuralangelo/train.py", line 85, in main
    [rank0]: trainer.init_wandb(cfg,
    [rank0]: File "/mnt/d/Documents/neuralangelo/imaginaire/trainers/base.py", line 269, in init_wandb
    [rank0]: wandb.watch(self.model_module)
    [rank0]: File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/wandb_watch.py", line 49, in watch
    [rank0]: tel.feature.watch = True
    [rank0]: File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/lib/telemetry.py", line 42, in __exit__
    [rank0]: self._run._telemetry_callback(self._obj)
    [rank0]: File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 799, in _telemetry_callback
    [rank0]: self._telemetry_obj.MergeFrom(telem_obj)
    [rank0]: AttributeError: 'Run' object has no attribute '_telemetry_obj'
    E0822 21:49:57.518840 139941045491520 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 25214) of binary: /home/miguel12/miniconda3/envs/neuralangelo/bin/python
    Traceback (most recent call last):
    File "/home/miguel12/miniconda3/envs/neuralangelo/bin/torchrun", line 10, in <module>
    sys.exit(main())
    File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
    File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
    File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
    File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
    File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
    ============================================================
    train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-08-22_21:49:57
host : DESKTOP-Q0DS9I2.
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 25214)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
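
The traceback above shows the run dying inside wandb.watch(self.model_module), an optional logging call, before any training step runs. A possible stopgap, untested on this setup and not a confirmed fix, is to make that one call non-fatal. The sketch below shows the idea in isolation: call_optional and fake_watch are hypothetical names introduced here, and fake_watch only mimics the AttributeError from the log rather than calling wandb.

```python
import warnings


def call_optional(fn, *args, **kwargs):
    """Call fn; on AttributeError, warn and return False instead of raising.

    Hypothetical helper for illustration. It does not fix the underlying
    wandb problem, it only keeps an optional logging call from aborting.
    """
    try:
        fn(*args, **kwargs)
    except AttributeError as exc:
        warnings.warn(f"optional logging call skipped: {exc}")
        return False
    return True


def fake_watch(model):
    """Stand-in that raises the same error as the traceback (assumption)."""
    raise AttributeError("'Run' object has no attribute '_telemetry_obj'")


if __name__ == "__main__":
    # The failing call is skipped with a warning; execution continues.
    skipped_ok = call_optional(fake_watch, object())
    print("watch succeeded:", skipped_ok)
```

Applied to the code in the traceback, the analogous change would be wrapping the wandb.watch(self.model_module) line in imaginaire/trainers/base.py the same way. This hides the symptom only; the cause of the missing _telemetry_obj attribute would still need to be tracked down.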
