Closed
Labels: 🏋 Online DPO, 🐛 bug
Description
CI fails in the full test run (1 failed, 957 passed): https://github.com/huggingface/trl/actions/runs/18321987182/job/52177132846
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
FAILED tests/test_online_dpo_trainer.py::TestOnlineDPOTrainer::test_training_with_vllm_colocate - torch.AcceleratorError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
= 1 failed, 957 passed, 44 skipped, 1 xfailed, 1 xpassed, 156 warnings, 6 rerun in 926.60s (0:15:26) =
Traceback:
>       trainer = OnlineDPOTrainer(
            model=model,
            reward_funcs=self.reward_model,
            args=training_args,
            train_dataset=dummy_dataset["train"],
            processing_class=tokenizer,
            reward_processing_classes=self.reward_tokenizer,
        )
tests/test_online_dpo_trainer.py:329:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
trl/trainer/online_dpo_trainer.py:468: in __init__
    super().__init__(
.venv/lib/python3.10/site-packages/transformers/utils/deprecation.py:172: in wrapped_func
    return func(*args, **kwargs)
.venv/lib/python3.10/site-packages/transformers/trainer.py:452: in __init__
    enable_full_determinism(self.args.seed) if self.args.full_determinism else set_seed(self.args.seed)
.venv/lib/python3.10/site-packages/transformers/trainer_utils.py:106: in set_seed
    torch.manual_seed(seed)
.venv/lib/python3.10/site-packages/torch/_compile.py:53: in inner
    return disable_fn(*args, **kwargs)
.venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:929: in _fn
    return fn(*args, **kwargs)
.venv/lib/python3.10/site-packages/torch/random.py:46: in manual_seed
    torch.cuda.manual_seed_all(seed)
.venv/lib/python3.10/site-packages/torch/cuda/random.py:131: in manual_seed_all
    _lazy_call(cb, seed_all=True)
.venv/lib/python3.10/site-packages/torch/cuda/__init__.py:341: in _lazy_call
    callable()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    def cb():
        for i in range(device_count()):
            default_generator = torch.cuda.default_generators[i]
>           default_generator.manual_seed(seed)
E torch.AcceleratorError: CUDA error: an illegal memory access was encountered
E CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
E For debugging consider passing CUDA_LAUNCH_BLOCKING=1
E Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
.venv/lib/python3.10/site-packages/torch/cuda/random.py:129: AcceleratorError
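
The traceback above ends in `torch.cuda.manual_seed_all`, but the error message itself warns that CUDA errors can be reported asynchronously at a later API call, so the seeding call may not be where the fault occurred. Below is a minimal sketch of how the failing test could be rerun with `CUDA_LAUNCH_BLOCKING=1`, as the log suggests. The helper name `build_debug_run` is hypothetical, and the sketch assumes it is run from a TRL checkout on a machine with a CUDA GPU.

```python
import os
import sys

# Test ID copied from the FAILED line in the CI log.
FAILING_TEST = (
    "tests/test_online_dpo_trainer.py::TestOnlineDPOTrainer::"
    "test_training_with_vllm_colocate"
)


def build_debug_run(test_id=FAILING_TEST, base_env=None):
    """Hypothetical helper: return (command, env) for a synchronous-launch rerun."""
    # Copy the environment so the caller's os.environ is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    # Make kernel launches synchronous so the error surfaces at the faulting call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    # -x stops pytest at the first failure.
    command = [sys.executable, "-m", "pytest", test_id, "-x"]
    return command, env


command, env = build_debug_run()
# To actually launch the rerun (assumes a GPU and an installed TRL dev environment):
#   import subprocess; subprocess.run(command, env=env)
```

If the test passes when run alone this way, one possible explanation is that an earlier test in the same session left the CUDA context in a bad state. In that case, running the whole file or suite with the same variable set may be more informative.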