Assuming you have 2 A100-80GB GPUs and have downloaded and divided the Dromedary/LLaMA checkpoints into 2 shards:
bash scripts/demo_dromedary_stream_2shards.sh
Or, assuming you have 8 V100-32GB GPUs and have downloaded and divided the Dromedary/LLaMA checkpoints into 8 shards:
bash scripts/demo_dromedary_stream_8shards.sh
Since Dromedary is a 65B-parameter model, holding all of its weights in GPU memory requires at least 130GB in total across your GPUs (about 2 bytes per parameter at 16-bit precision).
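As a rough check of the memory requirement, the arithmetic can be sketched as follows. The 2 bytes per parameter (16-bit weights) is an assumption that matches the 130GB figure, the function names are hypothetical, and the estimate covers weights only, not activations or the attention cache.

```python
def weights_gb(params_billion, bytes_per_param=2):
    """Total weight memory in GB: 1e9 parameters at N bytes each is N GB."""
    return params_billion * bytes_per_param


def per_gpu_gb(params_billion, num_shards, bytes_per_param=2):
    """Weight memory each GPU holds when the weights are split evenly."""
    return weights_gb(params_billion, bytes_per_param) / num_shards


# The two setups mentioned above: 2 x A100-80GB and 8 x V100-32GB.
for num_gpus, gpu_memory_gb in ((2, 80), (8, 32)):
    needed = per_gpu_gb(65, num_gpus)
    print(f"{num_gpus} GPUs: {needed} GB of weights per GPU, "
          f"{gpu_memory_gb - needed} GB left per GPU")
```

Under these assumptions both setups fit the weights, with the remaining memory available for activations during inference.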
When using model parallelism on MP = 1, 2, 4, 8 GPUs, you should divide the model into MP shards with utils/convert_hf_weights_to_llama_ckpt.py:
python -u utils/convert_hf_weights_to_llama_ckpt.py \
--base_model "/path/to/your/llama-65b-hf" \
--lora_weights "/path/to/your/lora/weights" \
--output_dir "/path/to/your/sharded_ckpt" \
--total_ranks MP \
--lora_target_modules='[q_proj,k_proj,v_proj,o_proj]' \
--lora_r=16
When using model parallelism on MP = 3, 6, 9, 12 GPUs, you should use utils/convert_hf_weights_to_llama_expanded.py to divide the original checkpoint into shards, and install our customized llama_dromedary package for inference:
python -u utils/convert_hf_weights_to_llama_ckpt_expanded.py \
--base_model "/path/to/your/llama-65b-hf" \
--lora_weights "/path/to/your/lora/weights" \
--output_dir "/path/to/your/sharded_ckpt" \
--total_ranks MP \
--expanded_att_dim 9216 \
--expanded_ffn_dim 23040 \
--expanded_vocab_size 32256 \
--lora_target_modules='[q_proj,k_proj,v_proj,o_proj]' \
--lora_r=16
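One plausible reading of the expansion flags is that each expanded dimension must split evenly across MP ranks. The sketch below checks that for the values in the command above. The original LLaMA-65B sizes (8192, 22016, 32000) are an assumption not stated in this README, and the names are hypothetical.

```python
# Expanded sizes and MP values are taken from the command above.
EXPANDED = {"att_dim": 9216, "ffn_dim": 23040, "vocab_size": 32256}
# Assumed original LLaMA-65B sizes (not stated in this README).
ORIGINAL = {"att_dim": 8192, "ffn_dim": 22016, "vocab_size": 32000}


def splits_evenly(dims, mp):
    """True if every dimension is divisible by the number of ranks."""
    return all(size % mp == 0 for size in dims.values())


for mp in (3, 6, 9, 12):
    # Slice of each expanded dimension that one rank would hold.
    per_shard = {name: size // mp for name, size in EXPANDED.items()}
    print(f"MP={mp}: original splits evenly: {splits_evenly(ORIGINAL, mp)}, "
          f"expanded splits evenly: {splits_evenly(EXPANDED, mp)}, "
          f"per shard: {per_shard}")
```

If the assumption holds, this explains why the standard converter covers only MP = 1, 2, 4, 8: the assumed original sizes are not divisible by 3, while the expanded ones are divisible by every listed MP.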
For MP = 5, 10 GPUs, here is the recommended expansion configuration for llama_dromedary:
python -u utils/convert_hf_weights_to_llama_ckpt_expanded.py \
--base_model "/path/to/your/llama-65b-hf" \
--lora_weights "/path/to/your/lora/weights" \
--output_dir "/path/to/your/sharded_ckpt" \
--total_ranks MP \
--expanded_att_dim 8320 \
--expanded_ffn_dim 22400 \
--expanded_vocab_size 32000 \
--lora_target_modules='[q_proj,k_proj,v_proj,o_proj]' \
--lora_r=16
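The three cases above can be collected into a small lookup that picks the script and expansion flags for a given MP. This is only a sketch: the function and dictionary names are hypothetical, the script names and values are copied from the commands above, and the path and LoRA flags are left out and must be added as shown in those commands.

```python
# MP values, script names, and expansion sizes are copied from the
# commands above; everything else here is illustrative.
STANDARD_SCRIPT = "utils/convert_hf_weights_to_llama_ckpt.py"
EXPANDED_SCRIPT = "utils/convert_hf_weights_to_llama_ckpt_expanded.py"
STANDARD_MP = (1, 2, 4, 8)
EXPANDED_CONFIGS = {
    (3, 6, 9, 12): {"expanded_att_dim": 9216,
                    "expanded_ffn_dim": 23040,
                    "expanded_vocab_size": 32256},
    (5, 10): {"expanded_att_dim": 8320,
              "expanded_ffn_dim": 22400,
              "expanded_vocab_size": 32000},
}


def conversion_plan(mp):
    """Return (script, expansion flags) for a given number of ranks."""
    if mp in STANDARD_MP:
        return STANDARD_SCRIPT, {}
    for mp_values, dims in EXPANDED_CONFIGS.items():
        if mp in mp_values:
            return EXPANDED_SCRIPT, dict(dims)
    raise ValueError(f"no configuration listed for MP={mp}")


def conversion_args(mp):
    """Build the MP-dependent part of the conversion command line."""
    script, dims = conversion_plan(mp)
    args = ["python", "-u", script, "--total_ranks", str(mp)]
    for name, value in dims.items():
        args += [f"--{name}", str(value)]
    return args


print(" ".join(conversion_args(6)))
```

An MP value that is not listed above, such as 7, raises a ValueError rather than guessing a configuration.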