Colab demo doesn't work #210

@Rajat-Vishwa

Running the Colab example initially fails with #205 (COLMAP fails to execute).
#205 can be worked around by adding the following before installing COLMAP:

!wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
!mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
!wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-ubuntu2204-11-7-local_11.7.0-515.43.04-1_amd64.deb
!dpkg -i cuda-repo-ubuntu2204-11-7-local_11.7.0-515.43.04-1_amd64.deb
!cp /var/cuda-repo-ubuntu2204-11-7-local/cuda-*-keyring.gpg /usr/share/keyrings/
!apt-get update
!apt-get -y install cuda-11-7
!update-alternatives --set cuda /usr/local/cuda-11.7

This fixes COLMAP, and the preprocessing now runs, but the training step then fails with an error:

# @title { vertical-output: true }
%cd /content/neuralangelo
GROUP = "test_exp"
NAME = "lego"
!torchrun --nproc_per_node=1 train.py \
    --logdir=logs/{GROUP}/{NAME} \
    --show_pbar \
    --config=projects/neuralangelo/configs/custom/lego.yaml \
    --data.readjust.scale=0.5 \
    --max_iter=20000 \
    --validation_iter=99999999 \
    --model.object.sdf.encoding.coarse2fine.step=200 \
    --model.object.sdf.encoding.hashgrid.dict_size=19 \
    --optim.sched.warm_up_end=200 \
    --optim.sched.two_steps=[12000,16000]

Error:

[W829 13:19:57.835930806 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
Training with 1 GPUs.
Using random seed 0
Make folder logs/test_exp/lego
* checkpoint:
   * save_epoch: 9999999999
   * save_iter: 20000
   * save_latest_iter: 9999999999
   * save_period: 9999999999
   * strict_resume: True
* cudnn:
   * benchmark: True
   * deterministic: False
* data:
   * name: dummy
   * num_images: None
   * num_workers: 4
   * preload: True
   * readjust:
      * center: [0.0, 0.0, 0.0]
      * scale: 0.5
   * root: datasets/lego_ds2
   * train:
      * batch_size: 2
      * image_size: [801, 801]
      * subset: None
   * type: projects.neuralangelo.data
   * use_multi_epoch_loader: True
   * val:
      * batch_size: 2
      * image_size: [300, 300]
      * max_viz_samples: 16
      * subset: 4
* image_save_iter: 9999999999
* inference_args:
* local_rank: 0
* logdir: logs/test_exp/lego
* logging_iter: 9999999999999
* max_epoch: 9999999999
* max_iter: 20000
* metrics_epoch: None
* metrics_iter: None
* model:
   * appear_embed:
      * dim: 8
      * enabled: False
   * background:
      * enabled: True
      * encoding:
         * levels: 10
         * type: fourier
      * encoding_view:
         * levels: 3
         * type: spherical
      * mlp:
         * activ: relu
         * activ_density: softplus
         * activ_density_params:
         * activ_params:
         * hidden_dim: 256
         * hidden_dim_rgb: 128
         * num_layers: 8
         * num_layers_rgb: 2
         * skip: [4]
         * skip_rgb: []
      * view_dep: True
      * white: False
   * object:
      * rgb:
         * encoding_view:
            * levels: 3
            * type: spherical
         * mlp:
            * activ: relu_
            * activ_params:
            * hidden_dim: 256
            * num_layers: 4
            * skip: []
            * weight_norm: True
         * mode: idr
      * s_var:
         * anneal_end: 0.1
         * init_val: 3.0
      * sdf:
         * encoding:
            * coarse2fine:
               * enabled: True
               * init_active_level: 4
               * step: 200
            * hashgrid:
               * dict_size: 19
               * dim: 8
               * max_logres: 11
               * min_logres: 5
               * range: [-2, 2]
            * levels: 16
            * type: hashgrid
         * gradient:
            * mode: numerical
            * taps: 4
         * mlp:
            * activ: softplus
            * activ_params:
               * beta: 100
            * geometric_init: True
            * hidden_dim: 256
            * inside_out: False
            * num_layers: 1
            * out_bias: 0.5
            * skip: []
            * weight_norm: True
   * render:
      * num_sample_hierarchy: 4
      * num_samples:
         * background: 32
         * coarse: 64
         * fine: 16
      * rand_rays: 512
      * stratified: True
   * type: projects.neuralangelo.model
* nvtx_profile: False
* optim:
   * fused_opt: False
   * params:
      * lr: 0.001
      * weight_decay: 0.01
   * sched:
      * gamma: 10.0
      * iteration_mode: True
      * step_size: 9999999999
      * two_steps: [12000, 16000]
      * type: two_steps_with_warmup
      * warm_up_end: 200
   * type: AdamW
* pretrained_weight: None
* source_filename: projects/neuralangelo/configs/custom/lego.yaml
* speed_benchmark: False
* test_data:
   * name: dummy
   * num_workers: 0
   * test:
      * batch_size: 1
      * is_lmdb: False
      * roots: None
   * type: imaginaire.datasets.images
* timeout_period: 9999999
* trainer:
   * amp_config:
      * backoff_factor: 0.5
      * enabled: False
      * growth_factor: 2.0
      * growth_interval: 2000
      * init_scale: 65536.0
   * ddp_config:
      * find_unused_parameters: False
      * static_graph: True
   * depth_vis_scale: 0.5
   * ema_config:
      * beta: 0.9999
      * enabled: False
      * load_ema_checkpoint: False
      * start_iteration: 0
   * grad_accum_iter: 1
   * image_to_tensorboard: False
   * init:
      * gain: None
      * type: none
   * loss_weight:
      * curvature: 0.0005
      * eikonal: 0.1
      * render: 1.0
   * type: projects.neuralangelo.trainer
* validation_iter: 99999999
* wandb_image_iter: 10000
* wandb_scalar_iter: 100
cudnn benchmark: True
cudnn deterministic: False
Setup trainer.
Using random seed 0
[rank0]: Traceback (most recent call last):
[rank0]:   File "/content/neuralangelo/train.py", line 104, in <module>
[rank0]:     main()
[rank0]:   File "/content/neuralangelo/train.py", line 79, in main
[rank0]:     trainer = get_trainer(cfg, is_inference=False, seed=args.seed)
[rank0]:   File "/content/neuralangelo/imaginaire/trainers/utils/get_trainer.py", line 32, in get_trainer
[rank0]:     trainer = trainer_lib.Trainer(cfg, is_inference=is_inference, seed=seed)
[rank0]:   File "/content/neuralangelo/projects/neuralangelo/trainer.py", line 26, in __init__
[rank0]:     super().__init__(cfg, is_inference=is_inference, seed=seed)
[rank0]:   File "/content/neuralangelo/projects/nerf/trainers/base.py", line 28, in __init__
[rank0]:     super().__init__(cfg, is_inference=is_inference, seed=seed)
[rank0]:   File "/content/neuralangelo/imaginaire/trainers/base.py", line 50, in __init__
[rank0]:     self.model = self.setup_model(cfg, seed=seed)
[rank0]:   File "/content/neuralangelo/imaginaire/trainers/base.py", line 116, in setup_model
[rank0]:     lib_model = importlib.import_module(cfg.model.type)
[rank0]:   File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank0]:     return _bootstrap._gcd_import(name[level:], package, level)
[rank0]:   File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
[rank0]:   File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
[rank0]:   File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
[rank0]:   File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
[rank0]:   File "<frozen importlib._bootstrap_external>", line 883, in exec_module
[rank0]:   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
[rank0]:   File "/content/neuralangelo/projects/neuralangelo/model.py", line 21, in <module>
[rank0]:     from projects.neuralangelo.utils.modules import NeuralSDF, NeuralRGB, BackgroundNeRF
[rank0]:   File "/content/neuralangelo/projects/neuralangelo/utils/modules.py", line 16, in <module>
[rank0]:     import tinycudann as tcnn
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/tinycudann/__init__.py", line 9, in <module>
[rank0]:     from tinycudann.modules import free_temporary_memory, NetworkWithInputEncoding, Network, Encoding
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/tinycudann/modules.py", line 51, in <module>
[rank0]:     _C = importlib.import_module(f"tinycudann_bindings._{cc}_C")
[rank0]:   File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank0]:     return _bootstrap._gcd_import(name[level:], package, level)
[rank0]: ImportError: /usr/local/lib/python3.10/dist-packages/tinycudann_bindings/_75_C.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE
E0829 13:20:04.141000 139155706921600 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 31457) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-29_13:20:04
  host      : a8e8c22c1e57
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 31457)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
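The final `ImportError` in the traceback above names a shared library and an unresolved C++ symbol. As a minimal sketch for reading that line, the snippet below extracts the library path, the compute-capability tag in the file name (the `cc` value that `tinycudann/modules.py` interpolates into `_{cc}_C`), and the leading nested name of the mangled symbol. The helper names `leading_nested_name` and `parse_import_error` are hypothetical, and the decoder handles only the simple `_ZN<len><id>...E` prefix, not the full Itanium mangling scheme.

```python
import os
import re


def leading_nested_name(symbol):
    """Decode only the leading nested name of an Itanium-mangled symbol.

    Handles the simple form _ZN<len><id><len><id>...E and ignores the
    argument types that follow. Returns None for anything else.
    """
    if not symbol.startswith("_ZN"):
        return None
    pos, parts = 3, []
    while pos < len(symbol) and symbol[pos].isdigit():
        end = pos
        while end < len(symbol) and symbol[end].isdigit():
            end += 1
        length = int(symbol[pos:end])
        parts.append(symbol[end:end + length])
        pos = end + length
    return "::".join(parts) if parts else None


def parse_import_error(message):
    """Pull library path, compute-capability tag and symbol name from the log line."""
    match = re.search(r"(\S+\.so): undefined symbol: (\S+)", message)
    if match is None:
        return None
    library, symbol = match.groups()
    # File names look like _75_C.cpython-310-x86_64-linux-gnu.so
    cc = re.search(r"_(\d+)_C\.", os.path.basename(library))
    return {
        "library": library,
        "compute_capability": cc.group(1) if cc else None,
        "symbol": symbol,
        "nested_name": leading_nested_name(symbol),
    }


if __name__ == "__main__":
    line = (
        "[rank0]: ImportError: /usr/local/lib/python3.10/dist-packages/"
        "tinycudann_bindings/_75_C.cpython-310-x86_64-linux-gnu.so: "
        "undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEE"
        "NS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE"
    )
    info = parse_import_error(line)
    print(info["compute_capability"])  # 75
    print(info["nested_name"])         # at::_ops::zeros::call
```

The decoded name lives in the `at::_ops` namespace, so the missing symbol is one the compiled `tinycudann` bindings expect the loaded PyTorch libraries to provide. A likely reading, though not confirmed in this thread, is that the bindings were built against a different PyTorch build than the one installed in the Colab runtime.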

I tried switching the CUDA version using:

!sudo update-alternatives --config cuda 
There are 3 choices for the alternative cuda (providing /usr/local/cuda).

  Selection    Path                  Priority   Status
------------------------------------------------------------
  0            /usr/local/cuda-12.2   122       auto mode
  1            /usr/local/cuda-11.7   117       manual mode
* 2            /usr/local/cuda-11.8   118       manual mode
  3            /usr/local/cuda-12.2   122       manual mode

Press <enter> to keep the current choice[*], or type selection number: 2

But it still doesn't work.
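Changing the `cuda` alternative only changes which toolkit directory `/usr/local/cuda` resolves to; it does not rebuild an already compiled `tinycudann_bindings` library. As a rough diagnostic sketch, the snippet below reports the toolkit the symlink resolves to alongside the CUDA version the installed PyTorch was built with. The helper names are hypothetical, and it assumes the toolkit directories follow the `/usr/local/cuda-X.Y` naming shown in the `update-alternatives` output.

```python
import importlib.util
import os
import re


def toolkit_version_from_path(path):
    """Return 'X.Y' from a path ending in cuda-X.Y, else None."""
    match = re.search(r"cuda-(\d+\.\d+)/?$", path)
    return match.group(1) if match else None


def selected_toolkit_version(cuda_home="/usr/local/cuda"):
    """Follow the symlink chain (update-alternatives) to the real toolkit directory."""
    return toolkit_version_from_path(os.path.realpath(cuda_home))


def torch_build():
    """Return (torch version, CUDA version torch was built with), or (None, None)."""
    if importlib.util.find_spec("torch") is None:
        return None, None
    import torch
    return torch.__version__, torch.version.cuda


def versions_match(toolkit, torch_cuda):
    """True only when both versions are known and identical."""
    return toolkit is not None and toolkit == torch_cuda


if __name__ == "__main__":
    toolkit = selected_toolkit_version()
    torch_version, torch_cuda = torch_build()
    print("CUDA toolkit selected :", toolkit)
    print("PyTorch version       :", torch_version)
    print("PyTorch built for CUDA:", torch_cuda)
    print("Versions match        :", versions_match(toolkit, torch_cuda))
```

If the reported versions differ, or if PyTorch has changed since the bindings were compiled, reinstalling the `tinycudann` bindings so that they are compiled against the currently installed PyTorch is a common remedy for undefined-symbol errors of this kind. Whether that resolves this particular issue is untested here.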
