ERROR: python trainer.py - [W1002 20:55:29.000000000 socket.cpp:755] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:54955 (system error: 10049 - The requested address is not valid in its context.). #332
If it could help, here are my environment details:
(ace_step) I:\AI Systems\ACE-Step>python -m torch.utils.collect_env
OS: Microsoft Windows 11 Home (10.0.26100, 64-bit)
Python version: 3.10.18 | packaged by Anaconda, Inc. | (main, Jun 5 2025, 13:08:55) [MSC v.1929 64 bit (AMD64)] (64-bit runtime)
CPU:
Versions of relevant libraries:
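If it helps narrow things down, I believe the distributed backends compiled into a given torch build can be checked with a quick sketch like this (nothing ACE-Step specific, just the standard torch.distributed helpers):

import torch
import torch.distributed as dist

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
# NCCL is generally not shipped in the Windows wheels, so this is typically False there
print("NCCL available:", dist.is_nccl_available())
print("Gloo available:", dist.is_gloo_available())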
What did I do wrong, and how can I fix it?
I successfully created the dataset for training my LoRA, but when I ran the training command (python trainer.py), I got this error:
(ace_step) I:\AI Systems\ACE-Step>python trainer.py --dataset_path "I:/AI Systems/ACE-Step/my_lora_dataset" --checkpoint_dir "I:/AI Systems/ComfyUI/ComfyUI-Easy-Install/ComfyUI/models/checkpoints/ACE Step v1" --lora_config_path "I:/AI Systems/ACE-Step/config/lora_config.json" --exp_name "summoning_stronghold_lora"
2025-10-02 20:55:05.183 | INFO | acestep.pipeline_ace_step:get_checkpoint_path:178 - Download models from Hugging Face: ACE-Step/ACE-Step-v1-3.5B, cache to: I:/AI Systems/ComfyUI/ComfyUI-Easy-Install/ComfyUI/models/checkpoints/ACE Step v1
Fetching 14 files: 100%|███████████████████████████████████████████████████████████████████████| 14/14 [00:00<?, ?it/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W1002 20:55:29.000000000 socket.cpp:755] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:54955 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
File "I:\AI Systems\ACE-Step\trainer.py", line 890, in
main(args)
File "I:\AI Systems\ACE-Step\trainer.py", line 860, in main
trainer.fit(
File "C:\Users\ML.conda\envs\ace_step\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 561, in fit
call._call_and_handle_interrupt(
File "C:\Users\ML.conda\envs\ace_step\lib\site-packages\pytorch_lightning\trainer\call.py", line 47, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "C:\Users\ML.conda\envs\ace_step\lib\site-packages\pytorch_lightning\strategies\launchers\subprocess_script.py", line 105, in launch
return function(*args, **kwargs)
File "C:\Users\ML.conda\envs\ace_step\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 599, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "C:\Users\ML.conda\envs\ace_step\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 968, in _run
self.strategy.setup_environment()
File "C:\Users\ML.conda\envs\ace_step\lib\site-packages\pytorch_lightning\strategies\ddp.py", line 154, in setup_environment
self.setup_distributed()
File "C:\Users\ML.conda\envs\ace_step\lib\site-packages\pytorch_lightning\strategies\ddp.py", line 203, in setup_distributed
_init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
File "C:\Users\ML.conda\envs\ace_step\lib\site-packages\lightning_fabric\utilities\distributed.py", line 298, in _init_dist_connection
torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
File "C:\Users\ML.conda\envs\ace_step\lib\site-packages\torch\distributed\c10d_logger.py", line 81, in wrapper
return func(*args, **kwargs)
File "C:\Users\ML.conda\envs\ace_step\lib\site-packages\torch\distributed\c10d_logger.py", line 95, in wrapper
func_return = func(*args, **kwargs)
File "C:\Users\ML.conda\envs\ace_step\lib\site-packages\torch\distributed\distributed_c10d.py", line 1764, in init_process_group
default_pg, _ = _new_process_group_helper(
File "C:\Users\ML.conda\envs\ace_step\lib\site-packages\torch\distributed\distributed_c10d.py", line 1999, in _new_process_group_helper
raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in
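From what I've read, NCCL isn't built into the Windows wheels of PyTorch, so the DDP setup would need the gloo backend instead. I don't know exactly how trainer.py constructs its Trainer, but would something like the following be the right direction? This is only a minimal sketch, assuming a standard pytorch_lightning.Trainer and a single local GPU:

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

# Force the gloo process-group backend, since NCCL is not available on Windows
trainer = Trainer(
    accelerator="gpu",
    devices=1,
    strategy=DDPStrategy(process_group_backend="gloo"),
)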