
Can PQL-D be run on one GPU? #5

@StoneT2000

Description

I ran the following command:

python scripts/train_pql.py task=FrankaCubeStack algo.num_gpus=1 algo.p_learner_gpu=0 algo.v_learner_gpu=0 algo.distl=True algo.cri_class=DistributionalDoubleQ

However, I get the following error:

(PQLVLearner pid=88771) CUDA error: device-side assert triggered
(PQLVLearner pid=88771) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(PQLVLearner pid=88771) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(PQLVLearner pid=88771) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(PQLVLearner pid=88771) Traceback (most recent call last):
(PQLVLearner pid=88771)   File "/home/stao/miniforge3/envs/pql/lib/python3.8/site-packages/ray/_private/serialization.py", line 404, in deserialize_objects
(PQLVLearner pid=88771)     obj = self._deserialize_object(data, metadata, object_ref)
(PQLVLearner pid=88771)   File "/home/stao/miniforge3/envs/pql/lib/python3.8/site-packages/ray/_private/serialization.py", line 270, in _deserialize_object
(PQLVLearner pid=88771)     return self._deserialize_msgpack_data(data, metadata_fields)
(PQLVLearner pid=88771)   File "/home/stao/miniforge3/envs/pql/lib/python3.8/site-packages/ray/_private/serialization.py", line 225, in _deserialize_msgpack_data
(PQLVLearner pid=88771)     python_objects = self._deserialize_pickle5_data(pickle5_data)
(PQLVLearner pid=88771)   File "/home/stao/miniforge3/envs/pql/lib/python3.8/site-packages/ray/_private/serialization.py", line 215, in _deserialize_pickle5_data
(PQLVLearner pid=88771)     obj = pickle.loads(in_band)
(PQLVLearner pid=88771)   File "/home/stao/miniforge3/envs/pql/lib/python3.8/site-packages/torch/storage.py", line 414, in _load_from_bytes
(PQLVLearner pid=88771)     return torch.load(io.BytesIO(b), weights_only=False)
(PQLVLearner pid=88771)   File "/home/stao/miniforge3/envs/pql/lib/python3.8/site-packages/torch/serialization.py", line 1114, in load
(PQLVLearner pid=88771)     return _legacy_load(
(PQLVLearner pid=88771)   File "/home/stao/miniforge3/envs/pql/lib/python3.8/site-packages/torch/serialization.py", line 1348, in _legacy_load
(PQLVLearner pid=88771)     result = unpickler.load()
(PQLVLearner pid=88771)   File "/home/stao/miniforge3/envs/pql/lib/python3.8/site-packages/torch/serialization.py", line 1281, in persistent_load
(PQLVLearner pid=88771)     obj = restore_location(obj, location)
(PQLVLearner pid=88771)   File "/home/stao/miniforge3/envs/pql/lib/python3.8/site-packages/torch/serialization.py", line 414, in default_restore_location
(PQLVLearner pid=88771)     result = fn(storage, location)
(PQLVLearner pid=88771)   File "/home/stao/miniforge3/envs/pql/lib/python3.8/site-packages/torch/serialization.py", line 392, in _deserialize
(PQLVLearner pid=88771)     return obj.to(device=device)
(PQLVLearner pid=88771)   File "/home/stao/miniforge3/envs/pql/lib/python3.8/site-packages/torch/storage.py", line 187, in to
(PQLVLearner pid=88771)     return _to(self, device, non_blocking)
(PQLVLearner pid=88771)   File "/home/stao/miniforge3/envs/pql/lib/python3.8/site-packages/torch/_utils.py", line 90, in _to
(PQLVLearner pid=88771)     untyped_storage.copy_(self, non_blocking)
(PQLVLearner pid=88771) RuntimeError: CUDA error: device-side assert triggered
(PQLVLearner pid=88771) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(PQLVLearner pid=88771) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(PQLVLearner pid=88771) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(PQLVLearner pid=88771)

I'm aware the instructions show the command with the default number of GPUs, so maybe the error comes from running everything on a single GPU instead? Any help on this would be appreciated!
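
In case it helps with reproducing, below is a small, self-contained sanity check I can run on this machine before launching scripts/train_pql.py. It only uses plain PyTorch calls; the comments about algo.p_learner_gpu / algo.v_learner_gpu are just my reading of the flags in the command above, not something I have verified in the PQL source.

# gpu_sanity_check.py -- quick single-GPU sanity check before launching scripts/train_pql.py
import os

# Surface CUDA errors at the failing kernel instead of a later API call,
# as the error message above suggests. Must be set before CUDA is initialized.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

import torch

print("CUDA available:", torch.cuda.is_available())
print("Visible devices:", torch.cuda.device_count())

# With algo.p_learner_gpu=0 and algo.v_learner_gpu=0, both learners should map
# to the same physical card on a one-GPU machine (my reading of the flags, unverified).
for idx in range(torch.cuda.device_count()):
    print(f"cuda:{idx} ->", torch.cuda.get_device_name(idx))

# Round-trip a tensor through device 0, loosely mimicking the copy that fails
# inside Ray's deserialization (torch.load -> storage.copy_).
x = torch.randn(4, device="cuda:0")
print("Round-trip OK:", x.cpu().to("cuda:0").sum().item())

If this passes, re-running the training command with CUDA_LAUNCH_BLOCKING=1 exported in the shell (as the error message itself suggests) should at least point at the kernel that actually trips the assert.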
