[BUG] Maniskill3 crashes on D2H transfer after env rollout #2739

Open
AlexandreBrown opened this issue Feb 1, 2025 · 1 comment
Labels
bug Something isn't working


AlexandreBrown commented Feb 1, 2025

Describe the bug

Maniskill3 crashes after `env.rollout` when transferring the rollout data to host (CUDA to CPU).

for _ in tqdm(range(nb_iters), "Evaluation"):
    rollouts = self.eval_env.rollout(
        max_steps=self.env_max_frames_per_traj,
        policy=policy,
        auto_reset=False,
        auto_cast_to_device=False,
        tensordict=tensordict,
    ).to(device="cpu", non_blocking=False)
Traceback (most recent call last):
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/user/.vscode/extensions/ms-python.debugpy-2024.14.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 71, in <module>
    cli.main()
  File "/home/user/.vscode/extensions/ms-python.debugpy-2024.14.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 501, in main
    run()
  File "/home/user/.vscode/extensions/ms-python.debugpy-2024.14.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 351, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/home/user/.vscode/extensions/ms-python.debugpy-2024.14.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 310, in run_path
    return _run_module_code(code, init_globals, run_name, pkg_name=pkg_name, script_name=fname)
  File "/home/user/.vscode/extensions/ms-python.debugpy-2024.14.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 127, in _run_module_code
    _run_code(code, mod_globals, init_globals, mod_name, mod_spec, pkg_name, script_name)
  File "/home/user/.vscode/extensions/ms-python.debugpy-2024.14.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 118, in _run_code
    exec(code, run_globals)
  File "scripts/train_rl.py", line 118, in <module>
    main()
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "scripts/train_rl.py", line 107, in main
    trainer.train()
  File "/home/user/Documents/SegDAC/segdac_dev/src/segdac_dev/trainers/rl_trainer.py", line 90, in train
    eval_metrics = self.evaluator.evaluate(
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/Documents/SegDAC/segdac_dev/src/segdac_dev/evaluation/rl_evaluator.py", line 147, in evaluate
    eval_metrics = self.log_eval_metrics(agent, env_step)
  File "/home/user/Documents/SegDAC/segdac_dev/src/segdac_dev/evaluation/rl_evaluator.py", line 158, in log_eval_metrics
    eval_metrics = self.gather_eval_rollouts_metrics(policy)
  File "/home/user/Documents/SegDAC/segdac_dev/src/segdac_dev/evaluation/rl_evaluator.py", line 171, in gather_eval_rollouts_metrics
    rollouts = self.eval_env.rollout(
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/tensordict/base.py", line 10623, in to
    tensors = [to(t) for t in tensors]
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/tensordict/base.py", line 10623, in <listcomp>
    tensors = [to(t) for t in tensors]
  File "/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/tensordict/base.py", line 10595, in to
    return tensor.to(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[2025-01-30 19:23:26.032] [SAPIEN] [critical] Mem free failed with error code 700!

[2025-01-30 19:23:26.032] [SAPIEN] [critical] /buildAgent/work/eb2f45c4acc808a0/physx/source/gpucommon/src/PxgCudaMemoryAllocator.cpp
[2025-01-30 19:23:26.032] [SAPIEN] [critical] /buildAgent/work/eb2f45c4acc808a0/physx/source/gpucommon/src/PxgCudaMemoryAllocator.cpp
[2025-01-30 19:23:26.032] [SAPIEN] [critical] /buildAgent/work/eb2f45c4acc808a0/physx/source/gpucommon/src/PxgCudaMemoryAllocator.cpp
[2025-01-30 19:23:26.032] [SAPIEN] [critical] /buildAgent/work/eb2f45c4acc808a0/physx/source/gpucommon/src/PxgCudaMemoryAllocator.cpp
[2025-01-30 19:23:26.033] [SAPIEN] [critical] /buildAgent/work/eb2f45c4acc808a0/physx/source/gpucommon/src/PxgCudaMemoryAllocator.cpp
CUDA error at /__w/SAPIEN/SAPIEN/3rd_party/sapien-vulkan-2/src/core/buffer.cpp 103: an illegal memory access was encountered
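
Note that, as the traceback itself warns, CUDA kernel errors are reported asynchronously, so the illegal access most likely happened during the rollout itself and only surfaced at the device-to-host copy. A minimal diagnostic sketch (reusing the snippet above; the explicit `torch.cuda.synchronize()` is an addition for debugging, not part of the original code) that forces any pending error to be raised before the transfer:

import torch

for _ in tqdm(range(nb_iters), "Evaluation"):
    rollouts = self.eval_env.rollout(
        max_steps=self.env_max_frames_per_traj,
        policy=policy,
        auto_reset=False,
        auto_cast_to_device=False,
        tensordict=tensordict,
    )
    # Block until all queued kernels finish; an illegal memory access
    # from the rollout would be raised here rather than inside .to().
    torch.cuda.synchronize()
    rollouts = rollouts.to(device="cpu", non_blocking=False)

If the error is raised at the synchronize call, the copy itself is innocent and the fault lies in the rollout.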
vmoens commented Feb 3, 2025

I checked but I'm not sure what is going on here.
It looks like this is indeed triggered when you do the synchronous CPU transfer.
The call to `to` in tensordict looks like

def to(tensor):
    return tensor.to(
        device=device, dtype=dtype, non_blocking=sub_non_blocking
    )

where `sub_non_blocking=False`, `device="cpu"`, `dtype=None` (maybe `to` doesn't like being given a dtype when it doesn't need one? A quick hack to test this would be for you to modify `/home/user/miniconda3/envs/maniskill3_env/lib/python3.10/site-packages/tensordict/base.py`, line 10595, and remove the `dtype` argument).
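
For what it's worth, a hypothetical way to test that hypothesis without editing site-packages is to rebuild the tensordict leaf by leaf, passing only a device to each `Tensor.to` call (the `manual_cpu_transfer` helper below is illustrative, not part of tensordict):

from tensordict import TensorDict

def manual_cpu_transfer(td):
    # Copy every leaf tensor to host individually; no dtype is ever
    # passed, which isolates the dtype-argument hypothesis above.
    return TensorDict(
        {
            key: leaf.to("cpu", non_blocking=False)
            for key, leaf in td.items(include_nested=True, leaves_only=True)
        },
        batch_size=td.batch_size,
        device="cpu",
    )

rollouts_cpu = manual_cpu_transfer(rollouts)

If this crashes too, the dtype argument is not the culprit and the error predates the transfer.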

Also, have you tried with `CUDA_LAUNCH_BLOCKING=1`?
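
For reference, since that variable must be set before the CUDA context is created, one option (a sketch, equivalent to launching the process with `CUDA_LAUNCH_BLOCKING=1 python scripts/train_rl.py`) is to export it at the very top of the entry point:

import os

# Must run before torch initializes CUDA, hence before any import
# that touches the GPU.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after the env var on purpose

With launch blocking enabled, kernels run synchronously, so the stack trace should point at the operation that actually faulted instead of the later D2H copy.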
