v1.16.0: Deepseek V3, SynapseAI v1.20, Llama 405b, AWQ
SynapseAI v1.20
This release has been tested on and validated for SynapseAI v1.20.
New models
- Add Qwen2-VL #1542 @nngokhale
- Add video-llava model support #1522 @kaixuanliu
- Enable the i2vgen pipeline #1670 @yuanwu2017
- DeepSeek_v3 support #1735 @srajabos
Llama 405b
- Enable Llama 3.1 405B in FP8 #1745 @jaygala223
- Enable Llama3-405B text generation; add the DEEPSPEED_USE_HABANA_FRAMEWORKS_DETERMINISTIC_API flag #1812 @dsmertin
- Revert placing llama on cpu #1827 @ugolowic
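As a sketch of how the determinism flag above might be used (the model path, card count, and script arguments are illustrative and depend on the text-generation example in your checkout):

```shell
# Illustrative multi-card text-generation run for Llama 3.1 405B.
# DEEPSPEED_USE_HABANA_FRAMEWORKS_DETERMINISTIC_API is the flag added in #1812;
# all other arguments are assumptions based on the standard example scripts.
DEEPSPEED_USE_HABANA_FRAMEWORKS_DETERMINISTIC_API=1 \
python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
  --model_name_or_path meta-llama/Llama-3.1-405B-Instruct \
  --bf16 \
  --max_new_tokens 128
```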
AWQ
- Enable awq int4 in Gaudi #1691 @sywangyi
- Fix dependency issue with --load_quantized_model_with_autoawq #1759 @schoi-habana
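The new AWQ path can be exercised through the text-generation example. A minimal sketch, assuming an AWQ-quantized checkpoint (the model name and other arguments are illustrative; only `--load_quantized_model_with_autoawq` comes from the changes above):

```shell
# Illustrative: load a pre-quantized AWQ INT4 checkpoint via AutoAWQ,
# using the flag whose dependency issue was fixed in #1759.
python run_generation.py \
  --model_name_or_path TheBloke/Llama-2-7b-Chat-AWQ \
  --load_quantized_model_with_autoawq \
  --bf16 \
  --max_new_tokens 100
```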
Various model optimizations
- Optimizations and workarounds to support HPU execution for Detr-Resnet-50 #1334 @sandeep-maddipatla
- Optimized DeepSeek-v2 on Gaudi #1677 @gyou2021
- Add xlm-roberta model support for tei-gaudi use case #1715 @kaixuanliu
- Optimized SD3 pipeline #1682 @deepak-gowda-narayana
- Add clear hpu cache flag for stable perf #1634 @jaygala223
- Fix graph breaks in Mixtral #1705 @ShengYang1
- Add batch splitting in attention layer to hide NIC latency #1640 @kalyank007
- Fix llama FP8 perf issue, kvcache.update should be used since FP8 patches KVCache #1756 @sywangyi
- Add HPU fp8 Dynamic MOE #1761 @dudilester
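The FP8 items above are typically driven by a quantization config file in the Intel Neural Compressor format, selected via the `QUANT_CONFIG` environment variable. A representative fragment (field values and the dump path are illustrative, not taken from this release):

```json
{
  "method": "HOOKS",
  "mode": "QUANTIZE",
  "observer": "maxabs",
  "scale_method": "maxabs_hw",
  "dump_stats_path": "./hqt_output/measure"
}
```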
Other
- Fixed formatting #1693 @imangohari1
- Fix FLUX.1_dev guidance_batches bug for pad case in _split_inputs_into_batches #1607 @huijuanzh
- Fix peft error in Gaudi1 #1627 @sywangyi
- Update README.md #1678 @skaulintel
- Fix custom ops loading in diffusers #1655 @dsocek
- Fix ddpo finetune issue in torch2.5.1 #1666 @sywangyi
- Adding Deepspeed zero1 config #1675 @bhargaveede
- Enable warmup also for full prompt length case in text generation #1676 @yeonsily
- Add padding to input for mllama/paligemma/idefics2 #1671 @sywangyi
- Fix for Mixtral G1 pytest failures #1652 @12010486
- Fix textual_inversion_sdxl failure on docker 1.20 #1697 @atakaha
- Updated Encoder_decoder Tests #1688 @slokesha
- Add checks for parallel_state initialization #1680 @yafshar
- Update the readme to remove validated models #1703 @jiminha
- FP8 baichuan-13b gets OOM when running lm_eval with @Liangyx2
- Lm eval upgraded to 0.4.7 #1692 @12010486
- Enable attention selection in wav2vec-ac #1713 @ugolowic
- Fix bug when preparing quant files, starcoder model does not support #1672 @kaixuanliu
- Update training pytests to reduce total time #1712 @jiminha
- Dropping some ci tests from image_to_text and text_generation #1710 @hsubramony
- Add save_checkpoint arg for TIMM training to simplify validation #1701 @ZhengHongming888
- Added Unit Test for Gemma-2-27b model #1616 @slokesha
- Update TRL README.md to clean up models #1706 @shepark
- Support regional compilation #1618 @chaojun-zhang
- Fix text generation quality for bf16 models when sampling #1644 @skavulya
- Readme modification #1700 @libinta
- Fix mpt model generation #1696 @mengniwang95
- Fix lm_eval issue of llama #1606 @sywangyi
- Align diffusers CI tests with examples #1679 @dsocek
- Update audio-classification/requirements.txt to fix numpy version #1717 @hsubramony
- Improve automation for stable-diffusion training scripts in README #1651 @dsocek
- Fix video diffusion black output if --bf16 is set #1685 @sywangyi
- Fix sdxl mlperf time bug #1580 @huijuanzh
- Enabling minimize memory for zero3 runs #1724 @bhargaveede
- Add gated models to diffusers CI tests #1690 @dsocek
- Fix formatting of the kubeVersion range in Kubernetes helm chart #1733 @dmsuehir
- Fix llava/llava next issue when working with AutoProcessor #1674 @sywangyi
- fea(): reworked the 8x hpu skipping strategy #1694 @imangohari1
- Process getting killed while loading data for Llama3.2 90b, 8x #1723 @kalyank007
- Fix: Adjust recipe to fit within QueueComputeScal HBM global memory size limit #1722 @kalyank007
- Add PRC models to test_text_generation_example.py #1695 @wenbinc-Bin
- Added quant config files for new scenarios #1681 @ulivne
- Update README.md - correction in diffusers example #1742 @ramyij
- Update DS config to align with recommended settings #1730 @ckvermaAI
- Add dynamo cache size limit option #1619 @chaojun-zhang
- Resolve 'NoneType' object has no attribute 'gate_proj' err when applying EP in DeepSeek-V2 #1740 @IT-Forrest
- Edit mixtral quantization config file #1739 @dudilester
- Fix the incorrect output of sdxl inpaint #1737 @yuanwu2017
- Supports Bitsandbytes development on HPU #1714 @rsshaik1
- FLAN-T5 has bad performance when using regional compilation #1744 @chaojun-zhang
- Add batch dim idx to support latest deepspeed DistributedAttention #1725 @bhargaveede
- Add the inline_inbuilt_nn_modules option #1617 @chaojun-zhang
- Clean up README examples #1709 @yeonsily
- Accuracy fix for llama3.1-70B in eager/torch.compile mode #1746 @ckvermaAI
- Adjust baselines for lower number of epochs (improved perplexity, lower throughput) #1748 @emascarenhas
- Change clip-roberta/bridgetower not to use fast_ddp #1749 @jiminha
- Adds requirements.txt to sentence transformers training paraphrases #1753 @pi314ever
- Add requirements.txt to sentence transformer training sts #1754 @pi314ever
- Add diffuser tests for optimized sdxl flow on HPU #1554 @sushildubey171
- Fix the output length in image_to_text test #1751 @sywangyi
- Fix Experts Indexing in MoE for Mixtral: Align experts_max with Number of Available Experts #1755 @deepak-gowda-narayana
- Add requirements.txt to sentence transformers nli example #1767 @pi314ever
- UX code change #1764 @talexjohn
- Enable saving and loading FP8 model #1683 @xin3he
- Update measurements for Stable Diffusion XL #1773 @mkrze
- Add datasets to the requirements for Stable Diffusion training #1782 @yafshar
- Enable wav2vec-large model for speech_recognition test #1783 @jiminha
- Update multi-node-training environment variables for GaudiNIC #1779 @Jianhong-Zhang
- Fixed Gemma2 error when saving pretrain #1781 @kplau1128
- Support llava1.5 lora finetuning. #1487 @lkk12014402
- Fix DeepSeek-V2 expert-parallelism crash due to indexing error #1765 @skavulya
- Update transformer_engine._convert_model to skip LoRA layers #1766 @vivekgoe
- Create Habana_Validated_Models.md to list all the models validated #1778 @hsubramony
- Enable attention selection for wav2vec2 #1757 @ugolowic
- Add --attn_implementation to wav2vec2 slow tests #1788 @ugolowic
- Add sentencepiece to the requirements #1792 @hsubramony
- Fix LoRA weights loading in text-to-image generation sample script #1789 @dsocek
- Add trust_remote_code #1786 @atakaha
- Fix the restart issue for Sentence Transformer STS example in validation #1799 @ZhengHongming888
- Exp flags for acc issues #1795 @hsubramony
- Temporary WA for get_type error #1806 @12010486
- Fix Sentence Transformer STS restart issue #1814 @ZhengHongming888
- Fix broken link for GenerationConfig #1819 @xin3he
- Fix for text-generation, AttributeError: 'GenerationConfig' object has no attribute 'use_fused_rope' #1823 @hsubramony
- Fix dataset_version for ST example requirement.txt #1809 @ZhengHongming888
- Move model to device before wrapping with FSDP #1830 @skaulintel
- Update warmup ratio for adalora #1820 @astachowiczhabana
- Fix for attention selection in wav2vec2 #1836 @ugolowic
- Revert "Lm eval upgraded to 0.4.7 (#1692)" #1837 @astachowiczhabana
- Removing HL_DS_DISTRIBUTED_ATTENTION_SEQ_DIM as it's not needed from 1.20 #1726 @bhargaveede
- Temporary workaround to avoid segmentation fault #1798 @yafshar