Running the Colab example initially fails with #205 (COLMAP fails to execute).
#205 is solved by adding the following before installing COLMAP:
!wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
!mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
!wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-ubuntu2204-11-7-local_11.7.0-515.43.04-1_amd64.deb
!dpkg -i cuda-repo-ubuntu2204-11-7-local_11.7.0-515.43.04-1_amd64.deb
!cp /var/cuda-repo-ubuntu2204-11-7-local/cuda-*-keyring.gpg /usr/share/keyrings/
!apt-get update
!apt-get -y install cuda-11-7
!update-alternatives --set cuda /usr/local/cuda-11.7
This fixes COLMAP, and preprocessing now runs, but the training step then throws an error:
# @title { vertical-output: true }
%cd /content/neuralangelo
GROUP = "test_exp"
NAME = "lego"
!torchrun --nproc_per_node=1 train.py \
--logdir=logs/{GROUP}/{NAME} \
--show_pbar \
--config=projects/neuralangelo/configs/custom/lego.yaml \
--data.readjust.scale=0.5 \
--max_iter=20000 \
--validation_iter=99999999 \
--model.object.sdf.encoding.coarse2fine.step=200 \
--model.object.sdf.encoding.hashgrid.dict_size=19 \
--optim.sched.warm_up_end=200 \
--optim.sched.two_steps=[12000,16000]
Error:
[W829 13:19:57.835930806 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
Training with 1 GPUs.
Using random seed 0
Make folder logs/test_exp/lego
* checkpoint:
* save_epoch: 9999999999
* save_iter: 20000
* save_latest_iter: 9999999999
* save_period: 9999999999
* strict_resume: True
* cudnn:
* benchmark: True
* deterministic: False
* data:
* name: dummy
* num_images: None
* num_workers: 4
* preload: True
* readjust:
* center: [0.0, 0.0, 0.0]
* scale: 0.5
* root: datasets/lego_ds2
* train:
* batch_size: 2
* image_size: [801, 801]
* subset: None
* type: projects.neuralangelo.data
* use_multi_epoch_loader: True
* val:
* batch_size: 2
* image_size: [300, 300]
* max_viz_samples: 16
* subset: 4
* image_save_iter: 9999999999
* inference_args:
* local_rank: 0
* logdir: logs/test_exp/lego
* logging_iter: 9999999999999
* max_epoch: 9999999999
* max_iter: 20000
* metrics_epoch: None
* metrics_iter: None
* model:
* appear_embed:
* dim: 8
* enabled: False
* background:
* enabled: True
* encoding:
* levels: 10
* type: fourier
* encoding_view:
* levels: 3
* type: spherical
* mlp:
* activ: relu
* activ_density: softplus
* activ_density_params:
* activ_params:
* hidden_dim: 256
* hidden_dim_rgb: 128
* num_layers: 8
* num_layers_rgb: 2
* skip: [4]
* skip_rgb: []
* view_dep: True
* white: False
* object:
* rgb:
* encoding_view:
* levels: 3
* type: spherical
* mlp:
* activ: relu_
* activ_params:
* hidden_dim: 256
* num_layers: 4
* skip: []
* weight_norm: True
* mode: idr
* s_var:
* anneal_end: 0.1
* init_val: 3.0
* sdf:
* encoding:
* coarse2fine:
* enabled: True
* init_active_level: 4
* step: 200
* hashgrid:
* dict_size: 19
* dim: 8
* max_logres: 11
* min_logres: 5
* range: [-2, 2]
* levels: 16
* type: hashgrid
* gradient:
* mode: numerical
* taps: 4
* mlp:
* activ: softplus
* activ_params:
* beta: 100
* geometric_init: True
* hidden_dim: 256
* inside_out: False
* num_layers: 1
* out_bias: 0.5
* skip: []
* weight_norm: True
* render:
* num_sample_hierarchy: 4
* num_samples:
* background: 32
* coarse: 64
* fine: 16
* rand_rays: 512
* stratified: True
* type: projects.neuralangelo.model
* nvtx_profile: False
* optim:
* fused_opt: False
* params:
* lr: 0.001
* weight_decay: 0.01
* sched:
* gamma: 10.0
* iteration_mode: True
* step_size: 9999999999
* two_steps: [12000, 16000]
* type: two_steps_with_warmup
* warm_up_end: 200
* type: AdamW
* pretrained_weight: None
* source_filename: projects/neuralangelo/configs/custom/lego.yaml
* speed_benchmark: False
* test_data:
* name: dummy
* num_workers: 0
* test:
* batch_size: 1
* is_lmdb: False
* roots: None
* type: imaginaire.datasets.images
* timeout_period: 9999999
* trainer:
* amp_config:
* backoff_factor: 0.5
* enabled: False
* growth_factor: 2.0
* growth_interval: 2000
* init_scale: 65536.0
* ddp_config:
* find_unused_parameters: False
* static_graph: True
* depth_vis_scale: 0.5
* ema_config:
* beta: 0.9999
* enabled: False
* load_ema_checkpoint: False
* start_iteration: 0
* grad_accum_iter: 1
* image_to_tensorboard: False
* init:
* gain: None
* type: none
* loss_weight:
* curvature: 0.0005
* eikonal: 0.1
* render: 1.0
* type: projects.neuralangelo.trainer
* validation_iter: 99999999
* wandb_image_iter: 10000
* wandb_scalar_iter: 100
cudnn benchmark: True
cudnn deterministic: False
Setup trainer.
Using random seed 0
[rank0]: Traceback (most recent call last):
[rank0]: File "/content/neuralangelo/train.py", line 104, in <module>
[rank0]: main()
[rank0]: File "/content/neuralangelo/train.py", line 79, in main
[rank0]: trainer = get_trainer(cfg, is_inference=False, seed=args.seed)
[rank0]: File "/content/neuralangelo/imaginaire/trainers/utils/get_trainer.py", line 32, in get_trainer
[rank0]: trainer = trainer_lib.Trainer(cfg, is_inference=is_inference, seed=seed)
[rank0]: File "/content/neuralangelo/projects/neuralangelo/trainer.py", line 26, in __init__
[rank0]: super().__init__(cfg, is_inference=is_inference, seed=seed)
[rank0]: File "/content/neuralangelo/projects/nerf/trainers/base.py", line 28, in __init__
[rank0]: super().__init__(cfg, is_inference=is_inference, seed=seed)
[rank0]: File "/content/neuralangelo/imaginaire/trainers/base.py", line 50, in __init__
[rank0]: self.model = self.setup_model(cfg, seed=seed)
[rank0]: File "/content/neuralangelo/imaginaire/trainers/base.py", line 116, in setup_model
[rank0]: lib_model = importlib.import_module(cfg.model.type)
[rank0]: File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank0]: return _bootstrap._gcd_import(name[level:], package, level)
[rank0]: File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
[rank0]: File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
[rank0]: File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
[rank0]: File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
[rank0]: File "<frozen importlib._bootstrap_external>", line 883, in exec_module
[rank0]: File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
[rank0]: File "/content/neuralangelo/projects/neuralangelo/model.py", line 21, in <module>
[rank0]: from projects.neuralangelo.utils.modules import NeuralSDF, NeuralRGB, BackgroundNeRF
[rank0]: File "/content/neuralangelo/projects/neuralangelo/utils/modules.py", line 16, in <module>
[rank0]: import tinycudann as tcnn
[rank0]: File "/usr/local/lib/python3.10/dist-packages/tinycudann/__init__.py", line 9, in <module>
[rank0]: from tinycudann.modules import free_temporary_memory, NetworkWithInputEncoding, Network, Encoding
[rank0]: File "/usr/local/lib/python3.10/dist-packages/tinycudann/modules.py", line 51, in <module>
[rank0]: _C = importlib.import_module(f"tinycudann_bindings._{cc}_C")
[rank0]: File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank0]: return _bootstrap._gcd_import(name[level:], package, level)
[rank0]: ImportError: /usr/local/lib/python3.10/dist-packages/tinycudann_bindings/_75_C.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE
E0829 13:20:04.141000 139155706921600 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 31457) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-08-29_13:20:04
host : a8e8c22c1e57
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 31457)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
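The undefined symbol in the `ImportError` above is a mangled C++ name. As a minimal sketch (the helper `decode_nested_name` is hypothetical and only handles the leading length-prefixed components of an Itanium-style mangled name, not the full argument list), it can be decoded far enough to see which library the symbol belongs to:

```python
def decode_nested_name(symbol: str) -> list[str]:
    """Return the leading nested-name components of an Itanium-mangled symbol.

    Only the length-prefixed identifiers that follow "_ZN" are decoded;
    template arguments and parameter types are deliberately ignored.
    """
    prefix = "_ZN"
    if not symbol.startswith(prefix):
        raise ValueError("not a mangled nested name")
    parts = []
    i = len(prefix)
    while i < len(symbol) and symbol[i].isdigit():
        # Read the decimal length, then take that many characters as the name.
        j = i
        while j < len(symbol) and symbol[j].isdigit():
            j += 1
        length = int(symbol[i:j])
        parts.append(symbol[j:j + length])
        i = j + length
    return parts


# Symbol copied verbatim from the ImportError in the traceback.
SYMBOL = (
    "_ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_"
    "10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE"
)

qualified_name = "::".join(decode_nested_name(SYMBOL))
print(qualified_name)
```

The decoded name is in the `at::` (ATen) namespace, which belongs to PyTorch rather than to the CUDA toolkit. That suggests, though it does not prove, that the `tinycudann` binding was compiled against a different PyTorch build than the one currently installed in the Colab runtime.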
I tried switching the CUDA version using:
!sudo update-alternatives --config cuda
There are 3 choices for the alternative cuda (providing /usr/local/cuda).
Selection Path Priority Status
------------------------------------------------------------
0 /usr/local/cuda-12.2 122 auto mode
1 /usr/local/cuda-11.7 117 manual mode
* 2 /usr/local/cuda-11.8 118 manual mode
3 /usr/local/cuda-12.2 122 manual mode
Press <enter> to keep the current choice[*], or type selection number: 2
But it still doesn't work.
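One likely reason switching has no effect: `update-alternatives` only changes which toolkit `/usr/local/cuda` points to, and an extension that is already compiled is not rebuilt by it. The undefined symbol in the traceback also appears to be a PyTorch (ATen) symbol, so the toolkit selection alone may not be the cause. A minimal sketch for comparing the selected toolkit with the CUDA version PyTorch was built for is shown here. The helper names are hypothetical, and the sample strings are illustrative values, not output captured from this runtime:

```python
import re
from typing import Optional, Tuple

Version = Tuple[int, int]


def parse_nvcc_release(nvcc_output: str) -> Optional[Version]:
    """Extract (major, minor) from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def parse_cuda_version(version: Optional[str]) -> Optional[Version]:
    """Parse a string such as torch.version.cuda ("12.1") into (major, minor)."""
    if not version:
        return None
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def toolkit_matches_torch(nvcc_output: str, torch_cuda: Optional[str]) -> bool:
    """True when the selected toolkit equals the CUDA version PyTorch targets."""
    toolkit = parse_nvcc_release(nvcc_output)
    built_for = parse_cuda_version(torch_cuda)
    return toolkit is not None and toolkit == built_for


# Illustrative inputs only (assumed, not taken from the failing runtime).
SAMPLE_NVCC_OUTPUT = "Cuda compilation tools, release 11.8, V11.8.89"
SAMPLE_TORCH_CUDA = "12.1"

print(toolkit_matches_torch(SAMPLE_NVCC_OUTPUT, SAMPLE_TORCH_CUDA))
```

In the notebook, the real inputs would come from `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout` and `torch.version.cuda`. If the two disagree, or if PyTorch was upgraded after `tinycudann` was installed, rebuilding `tinycudann` against the current PyTorch is a reasonable next thing to try.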