Open
Description
I ran the following command:
python scripts/train_pql.py task=FrankaCubeStack algo.num_gpus=1 algo.p_learner_gpu=0 algo.v_learner_gpu=0 algo.distl=True algo.cri_class=DistributionalDoubleQ
However, I get the following error:
(PQLVLearner pid=88771) CUDA error: device-side assert triggered
(PQLVLearner pid=88771) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(PQLVLearner pid=88771) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(PQLVLearner pid=88771) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(PQLVLearner pid=88771) Traceback (most recent call last):
(PQLVLearner pid=88771) File "/home/stao/miniforge3/envs/pql/lib/python3.8/site-packages/ray/_private/serialization.py", line 404, in deserialize_objects
(PQLVLearner pid=88771) obj = self._deserialize_object(data, metadata, object_ref)
(PQLVLearner pid=88771) File "/home/stao/miniforge3/envs/pql/lib/python3.8/site-packages/ray/_private/serialization.py", line 270, in _deserialize_object
(PQLVLearner pid=88771) return self._deserialize_msgpack_data(data, metadata_fields)
(PQLVLearner pid=88771) File "/home/stao/miniforge3/envs/pql/lib/python3.8/site-packages/ray/_private/serialization.py", line 225, in _deserialize_msgpack_data
(PQLVLearner pid=88771) python_objects = self._deserialize_pickle5_data(pickle5_data)
(PQLVLearner pid=88771) File "/home/stao/miniforge3/envs/pql/lib/python3.8/site-packages/ray/_private/serialization.py", line 215, in _deserialize_pickle5_data
(PQLVLearner pid=88771) obj = pickle.loads(in_band)
(PQLVLearner pid=88771) File "/home/stao/miniforge3/envs/pql/lib/python3.8/site-packages/torch/storage.py", line 414, in _load_from_bytes
(PQLVLearner pid=88771) return torch.load(io.BytesIO(b), weights_only=False)
(PQLVLearner pid=88771) File "/home/stao/miniforge3/envs/pql/lib/python3.8/site-packages/torch/serialization.py", line 1114, in load
(PQLVLearner pid=88771) return _legacy_load(
(PQLVLearner pid=88771) File "/home/stao/miniforge3/envs/pql/lib/python3.8/site-packages/torch/serialization.py", line 1348, in _legacy_load
(PQLVLearner pid=88771) result = unpickler.load()
(PQLVLearner pid=88771) File "/home/stao/miniforge3/envs/pql/lib/python3.8/site-packages/torch/serialization.py", line 1281, in persistent_load
(PQLVLearner pid=88771) obj = restore_location(obj, location)
(PQLVLearner pid=88771) File "/home/stao/miniforge3/envs/pql/lib/python3.8/site-packages/torch/serialization.py", line 414, in default_restore_location
(PQLVLearner pid=88771) result = fn(storage, location)
(PQLVLearner pid=88771) File "/home/stao/miniforge3/envs/pql/lib/python3.8/site-packages/torch/serialization.py", line 392, in _deserialize
(PQLVLearner pid=88771) return obj.to(device=device)
(PQLVLearner pid=88771) File "/home/stao/miniforge3/envs/pql/lib/python3.8/site-packages/torch/storage.py", line 187, in to
(PQLVLearner pid=88771) return _to(self, device, non_blocking)
(PQLVLearner pid=88771) File "/home/stao/miniforge3/envs/pql/lib/python3.8/site-packages/torch/_utils.py", line 90, in _to
(PQLVLearner pid=88771) untyped_storage.copy_(self, non_blocking)
(PQLVLearner pid=88771) RuntimeError: CUDA error: device-side assert triggered
(PQLVLearner pid=88771) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(PQLVLearner pid=88771) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(PQLVLearner pid=88771) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(PQLVLearner pid=88771)
I'm aware the instructions use the default number of GPUs, so maybe setting algo.num_gpus=1 with both learners on GPU 0 is the cause? Any help on this is appreciated!
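Since the log itself notes that the stack trace may be misleading and suggests CUDA_LAUNCH_BLOCKING=1, a rerun with that variable set might show where the assert really fires. Below is a minimal sketch of such a rerun. The helper names `make_debug_env` and `run_debug` are hypothetical, `sys.executable` stands in for `python`, and it assumes the Ray actors started by the script inherit the launching process's environment, which I have not verified for this repo.

```python
import os
import subprocess
import sys

# Same arguments as the command in the report above.
TRAIN_CMD = [
    sys.executable, "scripts/train_pql.py",
    "task=FrankaCubeStack",
    "algo.num_gpus=1",
    "algo.p_learner_gpu=0",
    "algo.v_learner_gpu=0",
    "algo.distl=True",
    "algo.cri_class=DistributionalDoubleQ",
]


def make_debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # Makes kernel launches blocking so the error is reported at the failing call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_debug():
    """Hypothetical helper: rerun training with the debug environment."""
    # Assumption: worker processes spawned by the script inherit this environment.
    return subprocess.run(TRAIN_CMD, env=make_debug_env(), check=False).returncode


# Call run_debug() from the repository root to launch the run.
```

This will likely be much slower than a normal run, so it is only meant for locating the failing kernel, not for training.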