
CI fails: torch.AcceleratorError: CUDA error: an illegal memory access was encountered #4228

@albertvillanova

Description

CI fails for all tests: https://github.com/huggingface/trl/actions/runs/18321987182/job/52177132846

torch.AcceleratorError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
FAILED tests/test_online_dpo_trainer.py::TestOnlineDPOTrainer::test_training_with_vllm_colocate - torch.AcceleratorError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
= 1 failed, 957 passed, 44 skipped, 1 xfailed, 1 xpassed, 156 warnings, 6 rerun in 926.60s (0:15:26) =
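For reference, the log's own suggestion can be followed by re-running just the failing test with `CUDA_LAUNCH_BLOCKING=1`, which makes kernel launches synchronous so the reported stack trace points at the launch that actually faulted. A minimal sketch, assuming the CI uses pytest as the summary line suggests; the environment variable has to be set before torch initializes its CUDA context:

```python
import os

# Assumption: setting this before importing torch (and before any CUDA work)
# forces synchronous kernel launches, so the error is reported at the
# faulting call instead of at a later API call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import pytest

# Re-run only the failing test from the log, in isolation.
pytest.main([
    "-x",
    "tests/test_online_dpo_trainer.py::TestOnlineDPOTrainer::test_training_with_vllm_colocate",
])
```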

Traceback:

>       trainer = OnlineDPOTrainer(
            model=model,
            reward_funcs=self.reward_model,
            args=training_args,
            train_dataset=dummy_dataset["train"],
            processing_class=tokenizer,
            reward_processing_classes=self.reward_tokenizer,
        )

tests/test_online_dpo_trainer.py:329: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
trl/trainer/online_dpo_trainer.py:468: in __init__
    super().__init__(
.venv/lib/python3.10/site-packages/transformers/utils/deprecation.py:172: in wrapped_func
    return func(*args, **kwargs)
.venv/lib/python3.10/site-packages/transformers/trainer.py:452: in __init__
    enable_full_determinism(self.args.seed) if self.args.full_determinism else set_seed(self.args.seed)
.venv/lib/python3.10/site-packages/transformers/trainer_utils.py:106: in set_seed
    torch.manual_seed(seed)
.venv/lib/python3.10/site-packages/torch/_compile.py:53: in inner
    return disable_fn(*args, **kwargs)
.venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:929: in _fn
    return fn(*args, **kwargs)
.venv/lib/python3.10/site-packages/torch/random.py:46: in manual_seed
    torch.cuda.manual_seed_all(seed)
.venv/lib/python3.10/site-packages/torch/cuda/random.py:131: in manual_seed_all
    _lazy_call(cb, seed_all=True)
.venv/lib/python3.10/site-packages/torch/cuda/__init__.py:341: in _lazy_call
    callable()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    def cb():
        for i in range(device_count()):
            default_generator = torch.cuda.default_generators[i]
>           default_generator.manual_seed(seed)
E           torch.AcceleratorError: CUDA error: an illegal memory access was encountered
E           CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
E           For debugging consider passing CUDA_LAUNCH_BLOCKING=1
E           Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

.venv/lib/python3.10/site-packages/torch/cuda/random.py:129: AcceleratorError
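Note that the raising frame is only the per-device RNG seeding callback in `torch/cuda/random.py`. Seeding itself is essentially trivial, so the illegal access was most likely triggered by an earlier kernel (for example from the colocated vLLM run in the same process) and is only surfacing at this point, consistent with the "asynchronously reported at some other API call" warning. A reduced sketch of what that callback does, with names taken from the traceback; the assumption is that the CUDA context is already corrupted by the time it runs:

```python
import torch

def seed_all_cuda_devices(seed: int) -> None:
    # Mirrors the callback shown in the traceback: it just seeds the
    # default generator of every visible CUDA device.
    for i in range(torch.cuda.device_count()):
        torch.cuda.default_generators[i].manual_seed(seed)

if torch.cuda.is_available():
    # On a healthy context this succeeds; the AcceleratorError above suggests
    # the context was already poisoned by an earlier asynchronous kernel failure.
    seed_all_cuda_devices(0)
```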
