ERROR: python trainer.py - [W1002 20:55:29.000000000 socket.cpp:755] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:54955 (system error: 10049 - The requested address is not valid in its context.). #332
If it could help, here are my environment details:
(ace_step) I:\AI Systems\ACE-Step>python -m torch.utils.collect_env
OS: Microsoft Windows 11 Home (10.0.26100, 64-bit)
Python version: 3.10.18 | packaged by Anaconda, Inc. | (main, Jun 5 2025, 13:08:55) [MSC v.1929 64 bit (AMD64)] (64-bit runtime)
CPU:
Versions of relevant libraries:
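If it helps narrow things down, I believe the distributed backends compiled into a given torch build can be checked with a quick sketch like this (nothing ACE-Step specific, just the standard torch.distributed helpers):

import torch
import torch.distributed as dist

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
# NCCL is generally not shipped in the Windows wheels, so this is typically False there
print("NCCL available:", dist.is_nccl_available())
print("Gloo available:", dist.is_gloo_available())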
What did I do wrong, and how can I fix it?
I successfully created the dataset for training my LoRA, but when I ran the training command (python trainer.py), I got this error:
(ace_step) I:\AI Systems\ACE-Step>python trainer.py --dataset_path "I:/AI Systems/ACE-Step/my_lora_dataset" --checkpoint_dir "I:/AI Systems/ComfyUI/ComfyUI-Easy-Install/ComfyUI/models/checkpoints/ACE Step v1" --lora_config_path "I:/AI Systems/ACE-Step/config/lora_config.json" --exp_name "summoning_stronghold_lora"
2025-10-02 20:55:05.183 | INFO | acestep.pipeline_ace_step:get_checkpoint_path:178 - Download models from Hugging Face: ACE-Step/ACE-Step-v1-3.5B, cache to: I:/AI Systems/ComfyUI/ComfyUI-Easy-Install/ComfyUI/models/checkpoints/ACE Step v1
Fetching 14 files: 100%|███████████████████████████████████████████████████████████████████████| 14/14 [00:00<?, ?it/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W1002 20:55:29.000000000 socket.cpp:755] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:54955 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
File "I:\AI Systems\ACE-Step\trainer.py", line 890, in
main(args)
File "I:\AI Systems\ACE-Step\trainer.py", line 860, in main
trainer.fit(
File "C:\Users\ML.conda\envs\ace_step\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 561, in fit
call._call_and_handle_interrupt(
File "C:\Users\ML.conda\envs\ace_step\lib\site-packages\pytorch_lightning\trainer\call.py", line 47, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "C:\Users\ML.conda\envs\ace_step\lib\site-packages\pytorch_lightning\strategies\launchers\subprocess_script.py", line 105, in launch
return function(*args, **kwargs)
File "C:\Users\ML.conda\envs\ace_step\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 599, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "C:\Users\ML.conda\envs\ace_step\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 968, in _run
self.strategy.setup_environment()
File "C:\Users\ML.conda\envs\ace_step\lib\site-packages\pytorch_lightning\strategies\ddp.py", line 154, in setup_environment
self.setup_distributed()
File "C:\Users\ML.conda\envs\ace_step\lib\site-packages\pytorch_lightning\strategies\ddp.py", line 203, in setup_distributed
_init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
File "C:\Users\ML.conda\envs\ace_step\lib\site-packages\lightning_fabric\utilities\distributed.py", line 298, in _init_dist_connection
torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
File "C:\Users\ML.conda\envs\ace_step\lib\site-packages\torch\distributed\c10d_logger.py", line 81, in wrapper
return func(*args, **kwargs)
File "C:\Users\ML.conda\envs\ace_step\lib\site-packages\torch\distributed\c10d_logger.py", line 95, in wrapper
func_return = func(*args, **kwargs)
File "C:\Users\ML.conda\envs\ace_step\lib\site-packages\torch\distributed\distributed_c10d.py", line 1764, in init_process_group
default_pg, _ = _new_process_group_helper(
File "C:\Users\ML.conda\envs\ace_step\lib\site-packages\torch\distributed\distributed_c10d.py", line 1999, in _new_process_group_helper
raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in
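From what I've read, NCCL isn't built into the Windows wheels of PyTorch, so the DDP setup would need the gloo backend instead. I don't know exactly how trainer.py constructs its Trainer, but would something like the following be the right direction? This is only a minimal sketch, assuming a standard pytorch_lightning.Trainer and a single local GPU:

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

# Force the gloo process-group backend, since NCCL is not available on Windows
trainer = Trainer(
    accelerator="gpu",
    devices=1,
    strategy=DDPStrategy(process_group_backend="gloo"),
)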