v1.16.0: Deepseek V3, SynapseAI v1.20, Llama 405b, AWQ
SynapseAI v1.20
This release has been tested on and validated for SynapseAI v1.20.
New models
- Add Qwen2-VL #1542 @nngokhale
- Add video-llava model support #1522 @kaixuanliu
- Enable the i2vgen pipeline #1670 @yuanwu2017
- DeepSeek_v3 support #1735 @srajabos
Llama 405b
- Enable Llama 3.1 405B in FP8 #1745 @jaygala223
- Enable Llama3-405B text generation; add the DEEPSPEED_USE_HABANA_FRAMEWORKS_DETERMINISTIC_API flag #1812 @dsmertin
- Revert placing llama on cpu #1827 @ugolowic
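As a sketch of how the determinism flag above might be used (the model path, card count, and script arguments are illustrative and depend on the text-generation example in your checkout):

```shell
# Illustrative multi-card text-generation run for Llama 3.1 405B.
# DEEPSPEED_USE_HABANA_FRAMEWORKS_DETERMINISTIC_API is the flag added in #1812;
# all other arguments are assumptions based on the standard example scripts.
DEEPSPEED_USE_HABANA_FRAMEWORKS_DETERMINISTIC_API=1 \
python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
  --model_name_or_path meta-llama/Llama-3.1-405B-Instruct \
  --bf16 \
  --max_new_tokens 128
```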
AWQ
- Enable awq int4 in Gaudi #1691 @sywangyi
- Fix dependency issue with --load_quantized_model_with_autoawq #1759 @schoi-habana
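The new AWQ path can be exercised through the text-generation example. A minimal sketch, assuming an AWQ-quantized checkpoint (the model name and other arguments are illustrative; only `--load_quantized_model_with_autoawq` comes from the changes above):

```shell
# Illustrative: load a pre-quantized AWQ INT4 checkpoint via AutoAWQ,
# using the flag whose dependency issue was fixed in #1759.
python run_generation.py \
  --model_name_or_path TheBloke/Llama-2-7b-Chat-AWQ \
  --load_quantized_model_with_autoawq \
  --bf16 \
  --max_new_tokens 100
```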
Various model optimizations
- Optimizations and workarounds to support HPU execution for Detr-Resnet-50 #1334 @sandeep-maddipatla
- Optimized DeepSeek-v2 on Gaudi #1677 @gyou2021
- Add xlm-roberta model support for tei-gaudi use case #1715 @kaixuanliu
- Optimized SD3 pipeline #1682 @deepak-gowda-narayana
- Add clear hpu cache flag for stable perf #1634 @jaygala223
- Fix graph breaks in Mixtral #1705 @ShengYang1
- Add batch splitting in attention layer to hide NIC latency #1640 @kalyank007
- Fix llama FP8 perf issue, kvcache.update should be used since FP8 patches KVCache #1756 @sywangyi
- Add HPU fp8 Dynamic MOE #1761 @dudilester
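The FP8 items above are typically driven by a quantization config file in the Intel Neural Compressor format, selected via the `QUANT_CONFIG` environment variable. A representative fragment (field values and the dump path are illustrative, not taken from this release):

```json
{
  "method": "HOOKS",
  "mode": "QUANTIZE",
  "observer": "maxabs",
  "scale_method": "maxabs_hw",
  "dump_stats_path": "./hqt_output/measure"
}
```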
Other
- Fixed formatting #1693 @imangohari1
- Fix FLUX.1_dev guidance_batches bug for pad case in _split_inputs_into_batches #1607 @huijuanzh
- Fix peft error in Gaudi1 #1627 @sywangyi
- Update README.md #1678 @skaulintel
- Fix custom ops loading in diffusers #1655 @dsocek
- Fix ddpo finetune issue in torch2.5.1 #1666 @sywangyi
- Adding Deepspeed zero1 config #1675 @bhargaveede
- Enable warmup also for full prompt length case in text generation #1676 @yeonsily
- Add padding to input for mllama/paligemma/idefics2 #1671 @sywangyi
- Fix for Mixtral G1 pytest failures #1652 @12010486
- Fix textual_inversion_sdxl failure on docker 1.20 #1697 @atakaha
- Updated Encoder_decoder Tests #1688 @slokesha
- Add checks for parallel_state initialization #1680 @yafshar
- Update the readme to remove validated models #1703 @jiminha
- FP8 baichuan-13b gets OOM when running lm_eval with @Liangyx2
- Lm eval upgraded to 0.4.7 #1692 @12010486
- Enable attention selection in wav2vec-ac #1713 @ugolowic
- Fix bug when preparing quant files, starcoder model does not support #1672 @kaixuanliu
- Update training pytests to reduce total time #1712 @jiminha
- Dropping some ci tests from image_to_text and text_generation #1710 @hsubramony
- Add save_checkpoint arg for TIMM training to simplify validation #1701 @ZhengHongming888
- Added Unit Test for Gemma-2-27b model #1616 @slokesha
- Update TRL README.md to clean up models #1706 @shepark
- Support regional compilation #1618 @chaojun-zhang
- Fix text generation quality for bf16 models when sampling #1644 @skavulya
- Readme modification #1700 @libinta
- Fix mpt model generation #1696 @mengniwang95
- Fix lm_eval issue of llama #1606 @sywangyi
- Align diffusers CI tests with examples #1679 @dsocek
- Update audio-classification/requirements.txt to fix numpy version #1717 @hsubramony
- Improve automation for stable-diffusion training scripts in README #1651 @dsocek
- Fix video diffusion black output if --bf16 is set #1685 @sywangyi
- Fix sdxl mlperf time bug #1580 @huijuanzh
- Enabling minimize memory for zero3 runs #1724 @bhargaveede
- Add gated models to diffusers CI tests #1690 @dsocek
- Fix formatting of the kubeVersion range in Kubernetes helm chart #1733 @dmsuehir
- Fix llava/llava next issue when working with AutoProcessor #1674 @sywangyi
- fea(): reworked the 8x hpu skipping strategy #1694 @imangohari1
- Process getting killed while loading data for Llama3.2 90b, 8x #1723 @kalyank007
- Fix: Adjust recipe to fit within QueueComputeScal HBM global memory size limit #1722 @kalyank007
- Add PRC models to test_text_generation_example.py #1695 @wenbinc-Bin
- Added quant config files for new scenarios #1681 @ulivne
- Update README.md - correction in diffusers example #1742 @ramyij
- Update DS config to align with recommended settings #1730 @ckvermaAI
- Add dynamo cache size limit option #1619 @chaojun-zhang
- Resolve 'NoneType' object has no attribute 'gate_proj' err when applying EP in DeepSeek-V2 #1740 @IT-Forrest
- Edit mixtral quantization config file #1739 @dudilester
- Fix the incorrect output of sdxl inpaint #1737 @yuanwu2017
- Supports Bitsandbytes development on HPU #1714 @rsshaik1
- FLAN-T5 has bad performance when using regional compilation #1744 @chaojun-zhang
- Add batch dim idx to support latest deepspeed DistributedAttention #1725 @bhargaveede
- Add the inline_inbuilt_nn_modules option #1617 @chaojun-zhang
- Clean up README examples #1709 @yeonsily
- Accuracy fix for llama3.1-70B in eager/torch.compile mode #1746 @ckvermaAI
- Adjust baselines for lower number of epochs (improved perplexity, lower throughput) #1748 @emascarenhas
- Change clip-roberta/bridgetower not to use fast_ddp #1749 @jiminha
- Adds requirements.txt to sentence transformers training paraphrases #1753 @pi314ever
- Add requirements.txt to sentence transformer training sts #1754 @pi314ever
- Add diffuser tests for optimized sdxl flow on HPU #1554 @sushildubey171
- Fix the output length in image_to_text test #1751 @sywangyi
- Fix Experts Indexing in MoE for Mixtral: Align experts_max with Number of Available Experts #1755 @deepak-gowda-narayana
- Add requirements.txt to sentence transformers nli example #1767 @pi314ever
- UX code change #1764 @talexjohn
- Enable saving and loading FP8 model #1683 @xin3he
- Update measurements for Stable Diffusion XL #1773 @mkrze
- Add datasets to the requirements for Stable Diffusion training #1782 @yafshar
- Enable wav2vec-large model for speech_recognition test #1783 @jiminha
- Update multi-node-training environment variables for GaudiNIC #1779 @Jianhong-Zhang
- Fixed Gemma2 error when saving pretrain #1781 @kplau1128
- Support llava1.5 lora finetuning. #1487 @lkk12014402
- Fix DeepSeek-V2 expert-parallelism crash due to indexing error #1765 @skavulya
- Update transformer_engine._convert_model to skip LoRA layers #1766 @vivekgoe
- Create Habana_Validated_Models.md to list all the models validated #1778 @hsubramony
- Enable attention selection for wav2vec2 #1757 @ugolowic
- Add --attn_implementation to wav2vec2 slow tests #1788 @ugolowic
- Add sentencepiece to the requirements #1792 @hsubramony
- Fix LoRA weights loading in text-to-image generation sample script #1789 @dsocek
- Add trust_remote_code #1786 @atakaha
- Fix the restart issue for Sentence Transformer STS example in validation #1799 @ZhengHongming888
- Exp flags for acc issues #1795 @hsubramony
- Temporary WA for get_type error #1806 @12010486
- Fix Sentence Transformer STS restart issue #1814 @ZhengHongming888
- Fix broken link for GenerationConfig #1819 @xin3he
- Fix for text-generation, AttributeError: 'GenerationConfig' object has no attribute 'use_fused_rope' #1823 @hsubramony
- Fix dataset_version for ST example requirement.txt #1809 @ZhengHongming888
- Move model to device before wrapping with FSDP #1830 @skaulintel
- Update warmup ratio for adalora #1820 @astachowiczhabana
- Fix for attention selection in wav2vec2 #1836 @ugolowic
- Revert "Lm eval upgraded to 0.4.7 (#1692)" #1837 @astachowiczhabana
- Removing HL_DS_DISTRIBUTED_ATTENTION_SEQ_DIM as it's not needed from 1.20 #1726 @bhargaveede
- Temporary workaround to avoid segmentation fault #1798 @yafshar