torch.distributed.elastic.multiprocessing.errors.ChildFailedError

I am new to AI and am trying to run the `llama2` model locally using `pyllama`.

I tried several options, but none of them worked. I downloaded Llama using https://github.com/facebookresearch/llama.

Here is what I tried (see below for installed packages):
```
$ torchrun --nproc_per_node 1 example.py --ckpt_dir ../codellama/CodeLlama-7b/ --tokenizer_path ../codellama/CodeLlama-7b/tokenizer.model

Traceback (most recent call last):
  File "/home/xxxxx/pyllama/example.py", line 80, in <module>
    fire.Fire(main)
  File "/home/xxxxx/miniconda3/envs/llama2/lib/python3.11/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
..
 File "/home/xxxxx/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1268, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in
[2024-01-01 20:58:30,998] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1814953) of binary: /home/xxxxx/miniconda3/envs/llama2/bin/python
Traceback (most recent call last):
..
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```

The following command starts without errors, but I never get a response to the prompt:

```
KV_CACHE_IN_GPU=0 python inference.py --ckpt_dir ../codellama/CodeLlama-7b/ --tokenizer_path ../codellama/CodeLlama-7b/tokenizer.model
.. <after waiting for several seconds .. typed in the following command and pressed Enter> ..
Prompt:['I believe in '] 
<no response whatsoever>
```

------------------------------------------
I tried both the CUDA and non-CUDA PyTorch packages from https://pytorch.org/get-started/locally/ (for example, `conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia`), but I get the same NCCL error from `torchrun` and no output from `inference.py`.
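
For anyone trying to reproduce this, a small diagnostic along the following lines may help show which PyTorch build is actually active in the environment and whether it reports CUDA and NCCL support. This is only a sketch: `describe_backend_support` is a hypothetical helper name, not part of `pyllama` or PyTorch, and the script only reports what the installed build says about itself.

```python
import importlib.util


def describe_backend_support():
    """Report what the installed PyTorch build says about CUDA and distributed backends.

    Hypothetical helper for diagnosis only; it does not change any settings.
    """
    # Avoid a hard failure if PyTorch is not installed in this environment.
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch
    import torch.distributed as dist

    dist_ok = dist.is_available()
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # None for CPU-only builds, otherwise the CUDA version the build targets.
        "built_for_cuda": torch.version.cuda,
        # True only if a usable GPU and driver are visible at runtime.
        "cuda_available": torch.cuda.is_available(),
        "distributed_available": dist_ok,
        # The backend checks are only queried when distributed support exists.
        "nccl_available": bool(dist_ok and dist.is_nccl_available()),
        "gloo_available": bool(dist_ok and dist.is_gloo_available()),
    }


if __name__ == "__main__":
    for key, value in describe_backend_support().items():
        print(f"{key}: {value}")
```

If `nccl_available` or `cuda_available` comes back `False`, that would be consistent with the `RuntimeError` in the traceback above, although it does not by itself explain why.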

I am on an HP workstation running Ubuntu 23.04 (Lunar Lobster):
```
CPU(s):                  4
  On-line CPU(s) list:   0-3
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU           W3565  @ 3.20GHz
    CPU family:          6
    Model:               26
    Thread(s) per core:  1
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            5
```

```
$ sudo lshw -numeric -C display
..
  *-display                 
       description: VGA compatible controller
       product: G94GL [Quadro FX 1800] [10DE:638]
       vendor: NVIDIA Corporation [10DE]
...
```
