This directory contains scripts that showcase how to perform image-to-text generation and LoRA fine-tuning on Intel® Gaudi® AI Accelerators.
Habana FusedSDPA is a fused and optimized Gaudi implementation of `torch.nn.functional.scaled_dot_product_attention()`. For more details, refer to the Gaudi online documentation. Many models in `optimum/habana/transformers/models` are optimized with FusedSDPA; a model that is not optimized with FusedSDPA falls back to the stock SDPA implementation.
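For reference, here is a minimal pure-Python sketch of what the op computes, namely softmax(Q·Kᵀ/√d)·V; FusedSDPA performs the same math in a single optimized Gaudi kernel. This sketch is for illustration only and is not part of the example scripts.

```python
import math

def sdpa(q, k, v):
    # Naive scaled dot-product attention over lists of row vectors.
    d = len(q[0])
    out = []
    for qi in q:
        # Attention scores of this query against every key, scaled by sqrt(d).
        scores = [sum(qi[t] * kj[t] for t in range(d)) / math.sqrt(d) for kj in k]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of the value rows.
        out.append([sum(w * vj[t] for w, vj in zip(weights, v))
                    for t in range(len(v[0]))])
    return out
```

With identical keys the softmax weights are uniform, so the output reduces to the mean of the value rows.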
To run Llama inference with SDPA, use the following command:
```bash
PT_HPU_LAZY_MODE=1 python3 run_pipeline.py \
--model_name_or_path meta-llama/Llama-3.2-11B-Vision-Instruct \
--use_hpu_graphs \
--bf16 \
--sdp_on_bf16
```

Note: running SDPA in bf16 (`--sdp_on_bf16`) may introduce reduced precision.
To run inference with THUDM/glm-4v-9b, use the following command (note that you must set the environment variable `GLM=4v` to distinguish glm4v from chatglm, as these customized models share the same model type, "chatglm"):
```bash
PT_HPU_LAZY_MODE=1 GLM=4v python3 run_pipeline.py \
--model_name_or_path THUDM/glm-4v-9b \
--use_hpu_graphs \
--bf16 \
--sdp_on_bf16 \
--use_flash_attention \
--use_kv_cache
```

Use the following command to run Llama-3.2-90B-Vision-Instruct BF16 inference with FusedSDPA on 8 HPUs:
```bash
PT_HPU_LAZY_MODE=1 PT_HPU_ENABLE_LAZY_COLLECTIVES=true python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
--model_name_or_path meta-llama/Llama-3.2-90B-Vision-Instruct \
--image_path "https://llava-vl.github.io/static/images/view.jpg" \
--use_hpu_graphs \
--bf16 \
--use_flash_attention \
--flash_attention_recompute
```

Inference with FP8 precision is enabled using Intel Neural Compressor (INC), which provides model measurement and quantization capabilities in PyTorch. More information on enabling FP8 in SynapseAI is available here: Run Inference Using FP8
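The measure-then-quantize flow below is built on max-abs scaling. The sketch below illustrates the idea, assuming per-tensor FP8-E4M3 scales; the constant and helper names are illustrative, not INC's actual API.

```python
# Illustrative sketch of max-abs FP8 scaling: the measurement step
# records max(|x|) per tensor, and the quantization step derives a
# scale from it. Simplified; INC's real implementation differs.
FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3

def maxabs_scale(measured_absmax):
    # Choose the scale so the observed maximum maps to the FP8 limit.
    return measured_absmax / FP8_E4M3_MAX

def fake_quantize(x, scale):
    # Map to the FP8 range and back; real FP8 also rounds the mantissa.
    clipped = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x / scale))
    return clipped * scale
```

Values beyond the measured range are clipped, which is why the measurement pass over representative inputs comes first.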
Here is an example of measuring the tensor quantization statistics on Llava-v1.6-vicuna-13b with SDPA:
```bash
PT_HPU_LAZY_MODE=1 QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_pipeline.py \
--model_name_or_path llava-hf/llava-v1.6-vicuna-13b-hf \
--image_path "https://llava-vl.github.io/static/images/view.jpg" \
--use_hpu_graphs \
--bf16 \
--sdp_on_bf16
```

Here is an example of quantizing the model based on the previous measurements for Llava-v1.6-vicuna-13b with SDPA:
```bash
PT_HPU_LAZY_MODE=1 QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python run_pipeline.py \
--model_name_or_path llava-hf/llava-v1.6-vicuna-13b-hf \
--image_path "https://llava-vl.github.io/static/images/view.jpg" \
--use_hpu_graphs \
--bf16 \
--sdp_on_bf16
```

Here is an example of measuring the tensor quantization statistics on Llava-v1.6-mistral-7b with FusedSDPA on 8 HPUs:
```bash
PT_HPU_LAZY_MODE=1 QUANT_CONFIG=./quantization_config/maxabs_measure.json python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
--model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
--image_path "https://llava-vl.github.io/static/images/view.jpg" \
--use_hpu_graphs \
--bf16 \
--use_flash_attention \
--flash_attention_recompute
```

Here is an example of quantizing the model based on the previous measurements for Llava-v1.6-mistral-7b with FusedSDPA on 8 HPUs:
```bash
PT_HPU_LAZY_MODE=1 QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
--model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
--image_path "https://llava-vl.github.io/static/images/view.jpg" \
--use_hpu_graphs \
--bf16 \
--use_flash_attention \
--flash_attention_recompute
```

Here are single- and multi-device command examples for meta-llama/Llama-3.2-11B-Vision-Instruct.
```bash
PT_HPU_LAZY_MODE=1 python3 run_image2text_lora_finetune.py \
--model_name_or_path meta-llama/Llama-3.2-11B-Vision-Instruct \
--dataset_name nielsr/docvqa_1200_examples \
--bf16 True \
--output_dir ./model_lora_llama \
--num_train_epochs 2 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 8 \
--weight_decay 0.01 \
--logging_steps 25 \
--eval_strategy "no" \
--save_strategy "no" \
--learning_rate 5e-5 \
--warmup_steps 50 \
--lr_scheduler_type "constant" \
--input_column_names 'image' 'query' \
--output_column_names 'answers' \
--remove_unused_columns False \
--do_train \
--do_eval \
--use_habana \
--use_lazy_mode \
--lora_rank=8 \
--lora_alpha=8 \
--lora_dropout=0.1 \
--low_cpu_mem_usage True \
--max_seq_length=512 \
--use_hpu_graphs_for_inference True \
--lora_target_modules ".*(language_model).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$"
```

```bash
PT_HPU_LAZY_MODE=1 python3 ../gaudi_spawn.py \
--world_size 8 --use_mpi run_image2text_lora_finetune.py \
--model_name_or_path meta-llama/Llama-3.2-11B-Vision-Instruct \
--dataset_name nielsr/docvqa_1200_examples \
--bf16 True \
--output_dir ./model_lora_llama \
--num_train_epochs 2 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 8 \
--weight_decay 0.01 \
--logging_steps 25 \
--eval_strategy "no" \
--save_strategy "no" \
--learning_rate 5e-5 \
--warmup_steps 50 \
--lr_scheduler_type "constant" \
--input_column_names 'image' 'query' \
--output_column_names 'answers' \
--remove_unused_columns False \
--do_train \
--do_eval \
--use_habana \
--use_lazy_mode \
--lora_rank=8 \
--lora_alpha=8 \
--lora_dropout=0.1 \
--low_cpu_mem_usage True \
--max_seq_length=512 \
--use_hpu_graphs_for_inference True \
--lora_target_modules '".*(language_model).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$"'
```

The single-card training command for llava-hf/llava-1.5-7b-hf is similar.
For other models, adjust the training parameters and `lora_target_modules` accordingly. For example, for HuggingFaceM4/idefics2-8b, replace `lora_target_modules` with `'".*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$"'`.
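Before launching a run, a `lora_target_modules` pattern can be sanity-checked against candidate module names with Python's `re` module. The module names below are illustrative, not taken from an actual model dump; the convention assumed here, matching the full qualified module name against the regex, follows PEFT's behavior for string patterns.

```python
import re

# Pattern from the Llama-3.2 commands above.
pattern = r".*(language_model).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$"

# Language-model projection layers should match...
assert re.fullmatch(pattern, "language_model.layers.0.self_attn.q_proj")
assert re.fullmatch(pattern, "language_model.layers.3.mlp.gate_proj")
# ...while vision-tower modules should not.
assert not re.fullmatch(pattern, "vision_model.layers.0.self_attn.q_proj")
```

Running the snippet silently (no AssertionError) confirms the pattern targets the intended layers.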