CPUAdam does not find CUDA #1619
javier-alvarez asked this question in Q&A
```
2021-12-08T15:12:02Z INFO Switching optimizer to DeepSpeedCPUAdam
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Traceback (most recent call last):
  File "InnerEyePrivate/ML/runner.py", line 57, in <module>
    main()
  File "InnerEyePrivate/ML/runner.py", line 53, in main
    post_cross_validation_hook=runner.default_post_cross_validation_hook)
  File "/mnt/azureml/cr/j/cfa5340abb4d4a3abec1a3ec4d8e39a6/exe/wd/innereye-deeplearning/InnerEye/ML/runner.py", line 442, in run
    return runner.run()
  File "/mnt/azureml/cr/j/cfa5340abb4d4a3abec1a3ec4d8e39a6/exe/wd/innereye-deeplearning/InnerEye/ML/runner.py", line 219, in run
    self.run_in_situ(azure_run_info)
  File "/mnt/azureml/cr/j/cfa5340abb4d4a3abec1a3ec4d8e39a6/exe/wd/innereye-deeplearning/InnerEye/ML/runner.py", line 398, in run_in_situ
    self.ml_runner.run()
  File "/mnt/azureml/cr/j/cfa5340abb4d4a3abec1a3ec4d8e39a6/exe/wd/innereye-deeplearning/InnerEye/ML/run_ml.py", line 327, in run
    num_nodes=self.azure_config.num_nodes)
  File "/mnt/azureml/cr/j/cfa5340abb4d4a3abec1a3ec4d8e39a6/exe/wd/innereye-deeplearning/InnerEye/ML/model_training.py", line 263, in model_train
    trainer.fit(lightning_model, datamodule=data_module)
  File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
    self._run(model)
  File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 717, in _run
    self.accelerator.setup(self, model)  # note: this sets up self.lightning_module
  File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/pytorch_lightning/accelerators/cpu.py", line 39, in setup
    return super().setup(trainer, model)
  File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in setup
    self.setup_optimizers(trainer)
  File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 375, in setup_optimizers
    trainer=trainer, model=self.lightning_module
  File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 190, in init_optimizers
    return trainer.init_optimizers(model)
  File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/pytorch_lightning/trainer/optimizers.py", line 34, in init_optimizers
    optim_conf = model.configure_optimizers()
  File "/mnt/azureml/cr/j/cfa5340abb4d4a3abec1a3ec4d8e39a6/exe/wd/innereye-deeplearning/InnerEye/ML/SSL/lightning_modules/simclr_module.py", line 68, in configure_optimizers
    deepspeed_optim = DeepSpeedCPUAdam(params, lr=self.learning_rate, weight_decay=self.weight_decay)
  File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/deepspeed/ops/adam/cpu_adam.py", line 83, in __init__
    self.ds_opt_adam = CPUAdamBuilder().load()
  File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/deepspeed/ops/op_builder/builder.py", line 370, in load
    return self.jit_load(verbose)
  File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/deepspeed/ops/op_builder/builder.py", line 385, in jit_load
    assert_no_cuda_mismatch()
  File "/azureml-envs/azureml_5602df82e8a46f1160ede9218ecc0c87/lib/python3.7/site-packages/deepspeed/ops/op_builder/builder.py", line 97, in assert_no_cuda_mismatch
    f"Installed CUDA version {sys_cuda_version} does not match the "
Exception: Installed CUDA version 10.2 does not match the version torch was compiled with 11.1, unable to compile cuda/cpp extensions without a matching cuda version.
```
https://github.com/microsoft/InnerEye-DeepLearning/pull/611/files
Any ideas why DeepSpeed does not find CUDA 11? The environment installs PyTorch 1.8 and CUDA 11 with conda.
Thanks!
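For reference, the exception is raised while DeepSpeed JIT-compiles the CPU Adam extension: it compares the CUDA version reported by `nvcc` under `CUDA_HOME` (`/usr/local/cuda` per the log, apparently 10.2) with the version PyTorch was built against (`torch.version.cuda`, 11.1). The sketch below is a simplified, hypothetical re-creation of that comparison, useful for checking an environment by hand; it is not DeepSpeed's actual code and assumes the check compares only the major.minor version. All helper names (`parse_nvcc_release`, `parse_torch_cuda`, `versions_match`, `read_nvcc_output`) are made up for illustration. One possible cause, not confirmed in this thread, is that conda's `cudatoolkit` package ships the CUDA runtime libraries but not `nvcc`, so the build falls back to the system toolkit in `/usr/local/cuda`.

```python
import os
import re
import subprocess
from typing import Tuple


def parse_nvcc_release(nvcc_output: str) -> Tuple[int, int]:
    """Extract (major, minor) from `nvcc -V` output, e.g. '... release 10.2, V10.2.89'."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' string found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def parse_torch_cuda(torch_cuda: str) -> Tuple[int, int]:
    """Turn a torch.version.cuda string such as '11.1' into (major, minor)."""
    major, minor = torch_cuda.split(".")[:2]
    return int(major), int(minor)


def versions_match(nvcc_output: str, torch_cuda: str) -> bool:
    """Hypothetical, simplified mismatch check: compare major.minor only."""
    return parse_nvcc_release(nvcc_output) == parse_torch_cuda(torch_cuda)


def read_nvcc_output(cuda_home: str) -> str:
    """Run <cuda_home>/bin/nvcc -V and return its stdout (nvcc must exist there)."""
    nvcc = os.path.join(cuda_home, "bin", "nvcc")
    result = subprocess.run([nvcc, "-V"], capture_output=True, text=True, check=True)
    return result.stdout
```

Inside the failing environment, a call such as `versions_match(read_nvcc_output(torch.utils.cpp_extension.CUDA_HOME), torch.version.cuda)` would be expected to return `False` for the setup in the log (nvcc 10.2 vs. torch 11.1). `CUDA_HOME` can be `None` when PyTorch finds no toolkit, so it is worth checking that first.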