Releases: huggingface/optimum-habana
v1.19.0: SynapseAI v1.22, GRPO, Snowflake Arctic, Diffusers v0.34
SynapseAI v1.22
- Upgrade to SynapseAI v1.22 8171a96 @astachowiczhabana
Diffusers v0.34
- Diffusers 0.34.0 #2152 @imangohari1
GRPO trainer
- Enable trl GRPO trainer #2088 @schoi-habana
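The new trainer plugs into the existing trl integration. Below is a minimal sketch, assuming the wrapper classes are named GaudiGRPOConfig and GaudiGRPOTrainer in line with the other Gaudi trl wrappers (GaudiSFTTrainer, GaudiDPOTrainer); the model, dataset, reward function, and hyperparameters are purely illustrative, so check the trl examples in this repository for the exact entry point.
```python
# Hedged sketch of GRPO fine-tuning on Gaudi. GaudiGRPOConfig/GaudiGRPOTrainer are
# assumed names following the existing Gaudi trl wrappers; everything else
# (model, dataset, reward) is illustrative.
from datasets import load_dataset
from optimum.habana import GaudiConfig
from optimum.habana.trl import GaudiGRPOConfig, GaudiGRPOTrainer  # assumed import path

dataset = load_dataset("trl-lib/tldr", split="train")

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 50 characters.
    return [-abs(50 - len(c)) for c in completions]

training_args = GaudiGRPOConfig(
    output_dir="qwen2-grpo",
    use_habana=True,       # run on HPU
    use_lazy_mode=True,    # lazy execution mode
)

trainer = GaudiGRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
    gaudi_config=GaudiConfig(use_fused_adam=True, use_fused_clip_norm=True),
)
trainer.train()
```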
FP8 with FSDP
- Add support for fp8 fsdpa in the Mixtral model #2026 @astachowiczhabana
Deepspeed regional compilation
- Deepspeed regional compilation #2021 @IlyasMoutawwakil
Stable Diffusion
Snowflake Arctic
- Enabling Snowflake Arctic on Gaudi 3 #1719 @pi314ever
Model optimizations
- rt-detr: optimize loss calculation #1998 @mgonchar
- Use FusedSDPA in self_attention of Bert model #2115 @miaojinc
- Enable FusedRMSNorm for FLUX #2011 @dsocek
- Enable distributed CFG for SD3 pipeline #2015 @dsocek
- Refactor Qwen2 Family - FP32 SDPA and max_position_embedding #2030 @Wei-Lin-Intel
- Add Qwen classification #2062 @tianyuan211
- Reduce index_copy to fp8 in llama2 - QDQ flow #2065 @Tiefen-boop
Safe softmax
- Safe_softmax demonstration (#263) #1950 @astachowiczhabana
Bitsandbytes
- Integrated NF4 inference tests to text-generation #2058 @rsshaik1
- Remove bitsandbytes monkey-patching (II) #2114 @ckvermaAI
Other
- Fix to limit inputs_embeds.clone() to training only as it affects inference #1992 @emascarenhas
- Add additional info about attn batch split flag #1990 @jaygala223
- Update readme files for explicit lazy mode #1921 @jasi306
- Fix SD3 flag in README example #2013 @dsocek
- Fix text-generation requirements #1989 @vidyasiv
- Migrate tests to upstream repos #2002 @IlyasMoutawwakil
- Fix makefile commands #2025 @IlyasMoutawwakil
- Use AutoAWQ version right before introduction of qwen3 #2033 @IlyasMoutawwakil
- Add token to single card tests CI #2034 @IlyasMoutawwakil
- Minor Code Comments and Formatting Improvements #2035 @leopardracer
- More makefile fixes #2036 @IlyasMoutawwakil
- Remove text-generation-inference folder #2068 @regisss
- Updated the readme for mediapipe support #2012 @imangohari1
- Use makefile in Sentence Transformers CI #2073 @IlyasMoutawwakil
- Remove capture_pre_autograd_graph call #2042 @astachowiczhabana
- Enable running lm_eval with log_samples #2046 @astachowiczhabana
- Fixed lost modules in regional compilation #2047 @astachowiczhabana
- Enable accuracy benchmark using torch compile #2049 @astachowiczhabana
- Add support for reduced model #2050 @astachowiczhabana
- Enable QDQ #2051 @astachowiczhabana
- Minor Documentation Updates and Comments Clarification #2048 @kilavvy
- Hot fix compiled fsdp model saving failure #2028 @IlyasMoutawwakil
- Use PT_ENABLE_INT64_SUPPORT=1 for trl examples #2089 @pbielak
- Remove loss_kwargs from Gemma2 model.forward() and add missing positional_embeddings for the Attention layer to sync with Transformers 4.49.0 #2100 @Luca-Calabria
- Silence Trainer.tokenizer warnings #2116 @pbielak
- Llama 3.2 - Fix the issue for eager mode (#260) #1976 @TANA-BHU
- Float inputs for Mixtral 8x7B #2043 @astachowiczhabana
- Fix diffuser tests #2054 @astachowiczhabana
- Ifeval and MMLU now better supported #2045 @astachowiczhabana
- Profiling improvements #1931 @ugolowic
- Add documentation workflow #2086 @echarlaix
- Add feature manager #1926 @astachowiczhabana
- Fix utils package #2141 @pbielak
- Use profiler in text-generation-pipeline #2154 @pbielak
- Add the PT_HPU_LAZY_MODE=1 env variable when testing in lazy mode #2161 @yafshar
- Updated peft version #2160 @imangohari1
- Fix version extraction regex and pip command in get_build() #2159 @yafshar
- Add warn0 utility to emit warnings only on main process #2157 @yafshar
- Remove DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED #2171 @yafshar
- Extract HabanaModelAdapter from run_lm_eval to new script file. #2170 @AKloniecki
- Remove `is_pt_flax_cross_test` from wav2vec tests #2174 @pbielak
- Fix test_model_weights_reload_no_missing_tied_weights #2175 @pbielak
- Update datasets to version 3.6.0 #2176 @alekseyfa
- Updated/Fixed the TIMM example readme #2172 @imangohari1
- Move torch, transformers and optimum.habana imports to local scope. #2183 @AKloniecki
- Move torch and transformers imports to local scope in run_generation.py. #2181 @AKloniecki
- Transformers deepseek-v3 Porting to optimum-habana #2186 @rkumar2patel
- Remove .float() conversion from Mixtral #2178 @pbielak
- Remove potential weakness reported by static code analysis -- CWE 569 -- in transformers/trainer.py #2196 @karol-brejna-i
- Ensure output directory exists before trying to write to output file. #2188 @AKloniecki
- Remove instances of logically dead code #2194 @ugolowic
- Remove unnecessary comparisons to None #2191 @ugolowic
- Fixes for bad use of potential None value #2198 @ugolowic
- qwen3: Fix missing max_position_embeddings init from config #2173 @mengker33
- Allow usage of cached books from Project Gutenberg. #2190 @AKloniecki
- Remove potential weakness reported by static code analysis -- CWE 398 -- redundant if #2199 @karol-brejna-i
- Fix PT_HPU_LAZY_MODE assertion to match updated default value #2189 @AKloniecki
- Remove unnecessary null checks - modeling_mpt.py #2204 @karol-brejna-i
- Protecting mask undefined value. #2203 @karol-brejna-i
- Protecting all_cross_attentions in optimum/habana/transformers/models/blip/modeling_blip_text.py #2202 @karol-brejna-i
- Remove unnecessary None checks for attention_mask #2205 @karol-brejna-i
- Configure qlora tests with additional arguments #2056 @ckvermaAI
- Skip unnecessary padding in text generation task #2055 @kyotoyx
- Unify SetTrueOrFalseOrNone and StoreTrueFalseAction #2119 @astachowiczhabana
- Fix profiler #2134 @astachowiczhabana
- Fix missing openorca dataset #2133 @astachowiczhabana
- Sync/videollava #2129 @yafshar
- Add support for local dataset loading for LibriSpeech and COCO #2136 @gplutop7
- Add sentencepiece to setup.py #2153 @pbielak
- Extract model adapter class from run_lm_eval.py to a new script file. #2184 @AKloniecki
- Fix for granite accuracy #2187 @12010486
- Temporarily revert SD quant files to fix promotion #2069 @astachowiczhabana
v1.18.1: Transformers v4.51, Qwen3, dynamic quantization
Transformers v4.51
This release supports and has been validated with Transformers v4.51.
Qwen3
This release adds optimized support for Qwen3 models on Gaudi.
- Add Qwen3 family #1948 @tianyuan211
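Qwen3 checkpoints can be run through the usual Gaudi-optimized path. Below is a minimal sketch, assuming a lazy-mode environment (PT_HPU_LAZY_MODE=1) and an illustrative checkpoint name; the examples/text-generation script exposes the full set of flags (HPU graphs, bucketing, quantization).
```python
# Minimal sketch of Qwen3 inference with the Gaudi-optimized model implementations.
# Assumes habana_frameworks is installed and PT_HPU_LAZY_MODE=1 is set; the model id
# and generation arguments are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()  # patch transformers with the Gaudi-optimized classes

model_id = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("hpu")

inputs = tokenizer("Gaudi accelerators are", return_tensors="pt").to("hpu")
outputs = model.generate(**inputs, max_new_tokens=32, lazy_mode=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```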
Dynamic Quantization
This release adds support for dynamic quantization.
- Enable dynamic quantization #2085 @astachowiczhabana
v1.18.0: SynapseAI v1.21, Accelerate, CogVideoX, Llava-onevision
SynapseAI v1.21
This release has been tested on and validated for SynapseAI v1.21.
Accelerate
Gaudi is now natively supported in Accelerate; check out the documentation for more information (a minimal usage sketch follows the list below).
- Update GaudiAccelerator #1876 @IlyasMoutawwakil
- Fix lost modules in regional compilation #1885 @xinyu-intel
- fix fsdp and get rid of GaudiPartialState #1942 @IlyasMoutawwakil
- Restore dynamic compilation setting and Fix compile_regions Call #1973 @yafshar
- Hot fix regional compilation #2005 @IlyasMoutawwakil
- Fix fp8 #2010 @IlyasMoutawwakil
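With native support, the stock Accelerator API is expected to pick up HPU devices directly, without a Gaudi-specific class. A minimal sketch, assuming habana_frameworks (SynapseAI) is installed so device detection works:
```python
# Minimal sketch of the native Accelerate path on Gaudi; no GaudiAccelerator needed.
import torch
from accelerate import Accelerator

accelerator = Accelerator()
print(accelerator.device)  # expected to be an hpu device on Gaudi machines

model = torch.nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = accelerator.prepare(model, optimizer)

x = torch.randn(8, 16, device=accelerator.device)
loss = model(x).sum()
accelerator.backward(loss)
optimizer.step()
```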
Diffusers
- fea(diffusers): Upgraded to version 0.32.0 #1939 @imangohari1
- fea(): Diffuser upgrade to 0.33.1 #1981 @imangohari1
CogVideoX
- Add cogvideox support for gaudi #1600 @nc-BobLee
GLM4V
- Add GLM4V #1668 @mengker33
Siglip and Llava-onevision
- Add support for Siglip and Llava Onevision #1883 @emascarenhas
Model optimizations
- Optimize memory utilization by keeping logits in BF16 #1859 @kalyank007
- Integrate DistributedAttention for Qwen2 #1860 @Jianhong-Zhang
- [Llama-vision] Add support for Fused RMS Norm #1892 @ANSHUMAN87
- Enable torch compile for llama 3.2 vision #1873 @jaygala223
- Flag to enable leaf promotion to avoid graph breaks in MLP for compile #1880 @bhargaveede
- Adding Deepspeed config for Llama3 Fine Tuning (#165) #1881 @bhargaveede
- Add trim_logits support in deepseekV3 #1933 @jthakurH
- Add flag to enable compiled_autograd with Deepspeed for training #1785 @vivekgoe
- chatglm: Fix a bug when attention mask is None #1896 @mengker33
- Optimized DeepSeek-V2 attention prefill with MHA #1791 @gyou2021
- [Llama-Vision] Trim logits #1894 @ANSHUMAN87
- Align cross_attention_mask for Llama 3.2 90B to avoid partial writes and graph retracing #1917 @kalyank007
- Add FSDP config for Granite model #1897 @kplau1128
- [Llama-Vision] Add support for bucketing #1895 @ANSHUMAN87
- Add Moonlight Support #1868 @jinyouzhi
- Add support for expert parallelism with mixtral #1908 @kwisniewski98
- Fix issue with in-place operation with requires grad with modeling_qwen2_vl.py #1970 @emascarenhas
- Adjust VideoLlavaProcessor to avoid performance regression on gaudi3 #1969 @kaixuanliu
- Speed up FLUX training over 2x with Gaudi optimized attention #1963 @dsocek
- [llama-vision] Remove token_idx_cpu parameter #2018 @ugolowic
Other
- Makefile improvements #1811 @jasi306
- [DeepSeek-V3] README update #1911 @ANSHUMAN87
- Skipping falcon rope scaling test #1916 @karol-brejna-i
- Workaround for DS issue in Llama #1932 @ugolowic
- Upgrade LM Eval to 0.4.7 #1901 @astachowiczhabana
- Disabling timers synchronization #1879 @bhargaveede
- Limit max pos embeds to 8k to prevent OOM #1923 @jaygala223
- Fix prompt argument handling in run_pipeline.py #1874 @varu060603
- Allow offline mode in CI tests #1924 @astachowiczhabana
- Adding memory and graph stats #1858 @jaygala223
- Enable QLoRA tests with torch.compile mode #1918 @ckvermaAI
- detr: fix possible incorrect tensor type #1899 @mgonchar
- Fix --save_last_ckpt if --save_strategy no is set #1934 @vidyasiv
- Reimplement HabanaGenerationTime #1920 @ugolowic
- Pad the examples for QLoRa finetuning test #1941 @ckvermaAI
- Reimplement HabanaGenerationTime fix for timer_checkpoint in sdxl training #1945 @gplutop7
- Move bitsandbytes requirements from setup.py to bnb tests #1946 @ckvermaAI
- Support allow_unspec_int_on_nn_module #1887 @xinyu-intel
- Tokenizer config fix for dynamic mode #1903 @pramodkumar-habanalabs
- Support compile from the 2nd iteration #1886 @xinyu-intel
- fea(): ReadMe remote_trust fixes #1940 @imangohari1
- Run upstream tests #1938 @IlyasMoutawwakil
- Fix READMEs - SD paths and LLM PEFT example #1949 @dsocek
- Add average latency metrics #1954 @RongLei-intel
- Bitsandbytes installation for qlora tests #1951 @ckvermaAI
- Update datasets requirement in examples #1956 @regisss
- Use data cache in slow_tests_8x #1914 @karol-brejna-i
- Add sentencepiece to requirements to support vicuna text generation #1962 @tthakkal
- Fix FLUX fine-tuning script #1960 @dsocek
- Fix typos #1967 @omahs
- Update t5-small samples_per_second value #1968 @12010486
- fea(): Added the --sdp_on_bf16 to textual inversion example #1964 @imangohari1
- pytest t5 roberta fix #1971 @imangohari1
- Update makefile for explicit lazy mode #1925 @jasi306
- fea(): Added PT_HPU_LAZY_MODE=1 for diffuser tests #1975 @imangohari1
- Fix deepspeed zero3 #1977 @IlyasMoutawwakil
- Enable regional compilation in text generation #1927 @karol-brejna-i
- README changes for Llama3.1 8B Finetuning with LoRA #1947 @bhargaveede
- pt2e quant changes into the main script #1875 @vivek5-ai
- Use IKS runners for CI #1953 @regisss
- Fix sentence-transformers CI with new runners #1980 @regisss
- Update dynamic env handling #1978 @yafshar
- Fix wrong calculation of e2e latency #1984 @RongLei-intel
- Update test baseline for mistralai/Mixtral-8x7B-v0.1 #1987 @yafshar
- Switch to Spawn in PyTorch DataLoader when num_worker>0 #1982 @Wei-Lin-Intel
- Enable mixtral 8x7b accuracy evaluation #1986 @rbogdano
- Update readme files for explicit lazy mode #1921 @jasi306
- Update README examples #2020 @pbielak
- Pin latest optimum to force mutual updates #2016 @IlyasMoutawwakil
v1.17.0: Transformers v4.49
Transformers v4.49
This release has been tested and validated for Transformers v4.49 and SynapseAI v1.20.
Model optimizations
- Use token_idx_cpu int instead of token_idx tensor in slicing #1848 @jaygala223
- Keep logits in bf16 #1835 @jaygala223
- Optimize SD3 Pipeline : Padding prompt Embeddings for softmax_hf8 compatibility and Efficient Utilization #1816 @deepak-gowda-narayana
- Add G3 perf WA for Qwen2VL #1884 @nngokhale
- Fix MPT regression #1857 @atakaha
Tests and CI
- Slow test updates #1804 @ugolowic
- Fix race condition when downloading nltk tokenizer #1802 @ugolowic
- fea(): Skipped the torch_fx tests #1797 @imangohari1
- Upstream tests #1834 @IlyasMoutawwakil
- test_examples: add missing clip-roberta baseline #1852 @uartie
- Separate slow tests by required number of cards #1803 @ugolowic
- Update PR doc build workflow #1904 @regisss
Other
- Disable HPU migration (future add-on to HF diffusers) for OH diffusers #1866 @dsocek
- Allow explicit control over flash_attention_fast_softmax setting #1851 @astachowiczhabana
v1.16.0: Deepseek V3, SynapseAI v1.20, Llama 405b, AWQ
SynapseAI v1.20
This release has been tested on and validated for SynapseAI v1.20.
New models
- Add Qwen2-VL #1542 @nngokhale
- Add video-llava model support #1522 @kaixuanliu
- Enable the i2vgen pipeline #1670 @yuanwu2017
- DeepSeek_v3 support #1735 @srajabos
Llama 405b
- Enable Llama 3.1 405B in FP8 #1745 @jaygala223
- v1.16 Llama3-405B text-generation. Added DEEPSPEED_USE_HABANA_FRAMEWORKS_DETERMINISTIC_API flag. #1812 @dsmertin
- Revert placing llama on cpu #1827 @ugolowic
AWQ
- Enable awq int4 in Gaudi #1691 @sywangyi
- Fix dependency issue with --load_quantized_model_with_autoawq #1759 @schoi-habana
Various model optimizations
- Optimizations and WAs to support HPU execution for Detr-Resnet-50 #1334 @sandeep-maddipatla
- Optimized DeepSeek-v2 on Gaudi #1677 @gyou2021
- Add xlm-roberta model support for tei-gaudi use case #1715 @kaixuanliu
- Optimized SD3 pipeline #1682 @deepak-gowda-narayana
- Add clear hpu cache flag for stable perf #1634 @jaygala223
- Fix graph breaks in Mixtral #1705 @ShengYang1
- Add batch splitting in attention layer to hide NIC latency #1640 @kalyank007
- Fix llama FP8 perf issue, kvcache.update should be used since FP8 patches KVCache #1756 @sywangyi
- Add HPU fp8 Dynamic MOE #1761 @dudilester
Sentence Transformers
CI
Other
- Fixed formatting #1693 @imangohari1
- Fix FLUX.1_dev guidance_batches bug for pad case in _split_inputs_into_batches #1607 @huijuanzh
- Fix peft error in Gaudi1 #1627 @sywangyi
- Update README.md #1678 @skaulintel
- Fix custom ops loading in diffusers #1655 @dsocek
- Fix ddpo finetune issue in torch2.5.1 #1666 @sywangyi
- Adding Deepspeed zero1 config #1675 @bhargaveede
- Enable warmup also for full prompt length case in text generation #1676 @yeonsily
- Add padding to input for mllama/paligemma/idefics2 #1671 @sywangyi
- Fix for Mixtral G1 pytest failures #1652 @12010486
- Fix textual_inversion_sdxl failure on docker 1.20 #1697 @atakaha
- Updated Encoder_decoder Tests #1688 @slokesha
- Add checks for parallel_state initialization #1680 @yafshar
- Update the readme to remove validated models #1703 @jiminha
- FP8 baichuan-13b gets OOM when running lm_eval @Liangyx2
- Lm eval upgraded to 0.4.7 #1692 @12010486
- Enable attention selection in wav2vec-ac #1713 @ugolowic
- Fix bug when preparing quant files, starcoder model does not support #1672 @kaixuanliu
- Update training pytests to reduce total time #1712 @jiminha
- Dropping some ci tests from image_to_text and text_generation #1710 @hsubramony
- Add save_checkpoint arg for TIMM training to simplify validation #1701 @ZhengHongming888
- Added Unit Test for Gemma-2-27b model #1616 @slokesha
- Update TRL README.md to clean up models #1706 @shepark
- Support regional compilation #1618 @chaojun-zhang
- Fix text generation quality for bf16 models when sampling #1644 @skavulya
- Readme modification #1700 @libinta
- Fix mpt model generation #1696 @mengniwang95
- Fix lm_eval issue of llama #1606 @sywangyi
- Align diffusers CI tests with examples #1679 @dsocek
- Update audio-classification/requirements.txt to fix numpy version #1717 @hsubramony
- Improve automation for stable-diffusion training scripts in README #1651 @dsocek
- Fix video diffusion black output if --bf16 is set #1685 @sywangyi
- Fix sdxl mlperf time bug #1580 @huijuanzh
- Enabling minimize memory for zero3 runs #1724 @bhargaveede
- Add gated models to diffusers CI tests #1690 @dsocek
- Fix formatting of the kubeVersion range in Kubernetes helm chart #1733 @dmsuehir
- Fix llava/llava next issue when working with AutoProcessor #1674 @sywangyi
- fea(): reworked the 8x hpu skipping strategy #1694 @imangohari1
- Process getting killed while loading data for Llama3.2 90b, 8x #1723 @kalyank007
- Fix: Adjust recipe to fit within QueueComputeScal HBM global memory size limit #1722 @kalyank007
- Add PRC models to test_text_generation_example.py #1695 @wenbinc-Bin
- Added quant config files for new scenarios #1681 @ulivne
- Update README.md - correction in diffusers example #1742 @ramyij
- Update DS config to align with recommended settings #1730 @ckvermaAI
- Add dynamo cache size limit option #1619 @chaojun-zhang
- Resolve 'NoneType' object has no attribute 'gate_proj' err when applying EP in DeepSeek-V2 #1740 @IT-Forrest
- Edit mixtral quantization config file #1739 @dudilester
- Fix the incorrect output of sdxl inpaint #1737 @yuanwu2017
- Supports Bitsandbytes development on HPU #1714 @rsshaik1
- FLAN-T5 has bad performance when using regional compilation #1744 @chaojun-zhang
- Add batch dim idx to support latest deepspeed DistributedAttention #1725 @bhargaveede
- Add the inline_inbuilt_nn_modules option #1617 @chaojun-zhang
- Clean up README examples #1709 @yeonsily
- Accuracy fix for llama3.1-70B in eager/torch.compile mode #1746 @ckvermaAI
- Adjust baselines for lower number of epochs (improved perplexity, lower throughput) #1748 @emascarenhas
- Change clip-roberta/bridgetower not to use fast_ddp #1749 @jiminha
- Adds requirements.txt to sentence transformers training paraphrases #1753 @pi314ever
- Add requirements.txt to sentence transformer training sts #1754 @pi314ever
- Add diffuser tests for optimized sdxl flow on HPU #1554 @sushildubey171
- Fix the output length in image_to_text test #1751 @sywangyi
- Fix Experts Indexing in MoE for Mixtral: Align experts_max with Number of Available Experts #1755 @deepak-gowda-narayana
- Add requirements.txt to sentence transformers nli example #1767 @pi314ever
- UX code change #1764 @talexjohn
- Enable saving and loading FP8 model #1683 @xin3he
- Update measurements for Stable Diffusion XL #1773 @mkrze
- Add datasets to the requirements for Stable Diffusion training #1782 @yafshar
- Enable wav2vec-large model for speech_recognition test #1783 @jiminha
- Update multi-node-training environment variables for GaudiNIC #1779 @Jianhong-Zhang
- Fixed Gemma2 error when saving pretrain #1781 @kplau1128
- Support llava1.5 lora finetuning. #1487 @lkk12014402
- Fix DeepSeek-V2 expert-parallelism crash due to indexing error #1765 @skavulya
- Update transformer_engine._convert_model to skip LoRA layers #1766 @vivekgoe
- Create Habana_Validated_Models.md to list all the models validated #1778 @hsubramony
- Enable attention selection for wav2vec2 #1757 @ugolowic
- Add --attn_implementation to wav2vec2 slow tests #1788 @ugolowic
- Add sentencepiece to the requirements #1792 @hsubramony
- Fix LoRA weights loading in text-to-image generation sample script #1789 @dsocek
- Add trust_remote_code #1786 @atakaha
- Fix the restart issue for Sentence Transformer STS example in validation #1799 @ZhengHongming888
- Exp flags for acc issues #1795 @hsubramony
- Temporary WA for get_type error #1806 @12010486
- Fix Sentence Transformer STS restart issue #1814 @ZhengHongming888
- Fix broken link for GenerationConfig #1819 @xin3he
- Fix for text-generation, AttributeError: 'GenerationConfig' object has no attribute 'use_fused_rope' #1823 @hsubramony
- Fix dataset_version for ST example requirement.txt #1809 @ZhengHongming888
- Move model to device before wrapping with FSDP #1830 @skaulintel
- Update warmup ratio for adalora #1820 @astachowiczhabana
- Fix for attention selection in wav2vec2 #1836 @ugolowic
- Revert "Lm eval upgraded to 0.4.7 (#1692)" #1837 @astachowiczhabana
- Removing HL_DS_DISTRIBUTED_ATTENTION_SEQ_DIM as it's not needed from 1.20 #1726 @bhargaveede
- Temporary workaround to avoid segmentation fault #1798 @yafshar
v1.15.0: SynapseAI v1.19.0, FLUX, Mllama, DeepSeek, Falcon 3
SynapseAI v1.19
This release has been tested on and validated for SynapseAI v1.19.
FLUX
- FLUX with diffusers 0.31.0 #1450 @dsocek
- FLUX Fine-Tuning for Gaudi #1482 @dsocek
- Flux Image-To-Image pipeline #1524 @dsocek
New models
- Optimized inference of Cohere model on HPU #1329 @XinyuYe-Intel
- Idefics2 #1270 @sywangyi
- Optimized inference of XGLM model on HPU #1323 @XinyuYe-Intel
- Add mllama support #1419 @sywangyi
- Enable paligemma model for image-to-text example #1407 @kaixuanliu
- Enable Gemma2 Inference on Gaudi #1504 @Luca-Calabria
- Minicpm enabling #1342 @pi314ever
- Enable Falcon-mamba #1480 @yuanwu2017
- Add support for Baichuan2 #1479 @xhaihao
- Enable DeepSeek-V2 #1475 @yao-matrix
- Add chatglm #1478 @mengker33
- Falcon Model Support #1612 @alekseyfa
Various model optimizations
- Enable flash attention for gemma #1454 @atakaha
- Support loading 4 bit Qwen2 #1476 @mengniwang95
- Fixed Gemma FP8 flash_attention lower throughput issue #1510 @kplau1128
- Disable default sdpa in Albert (#22) #1517 @astachowiczhabana
- Implement fused sdpa for wav2vec2 (#18) #1520 @astachowiczhabana
- Memory optimization for gpt_bitcode #1513 @astachowiczhabana
- Support beam search with reuse_cache and bucket_internal #1472 @Wei-Lin-Intel
- Add mixtral trl sft #1349 @lkk12014402
- Enable tiiuae/falcon-11B-vlm in image_to_text example #1490 @sywangyi
- Enable fusedsdpa kernel for vision part of mllama #1531 @sywangyi
- Enable dynamic compile for mpi(training) #1509 @chaojun-zhang
- Add DynamicMoE support for Mixtral #1511 @kwisniewski98
- Implemented fusedSDPA for stable diffusion (#36) #1545 @astachowiczhabana
- Fix Accuracy Calculation Issue in GPT-NeoX #1591 @yafshar
Sentence Transformers
- Update sentence transformer to v3.2.1 #1470 @ZhengHongming888
Textual Inversion XL
TIMM
- Enable pyTorch-IMage-Models (TIMM) with HPUs #1459 @ZhengHongming888
Context Parallelism
- Adding support for Context Parallelism using DeepSpeed's DistributedAttention #1501 @bhargaveede
- Move parallel_state.py to the distributed folder a6ee7c2044e6ddf7d19ae3ad663149e51d6f89e7 @regisss
CI improvements
- Tests for text gen output text #1411 @vidyasiv
- Add split runners to CI (2 devices per runner for fast tests) 72df37df46d1d2a2665c5d1be43b13704b7c8ada @regisss
- Fix fast CI to work with split runners #1534 @regisss
- Add Llama 3.1 ft to CI #1529 @MohitIntel
Documentation
Other
- Fix facebook/hf-seamless-m4t-medium crash #1433 @sywangyi
- Fix bias update in scoped all reduce #1456 @skavulya
- fea(pytests): Added skip for unsupported tests for mistral/mixtral #1462 @imangohari1
- Remove deprecated Mixed precision flags #1471 @vivekgoe
- Readme: replace tabs with spaces #1485 @mgonchar
- Move fast tests to Gaudi2 #1498 @regisss
- Remove torch req from LM example #1491 @astachowiczhabana
- Remove keep_input_mutations #1492 @astachowiczhabana
- Fix trust_remote_code #1493 @astachowiczhabana
- Upgrade ViT README with torch.compile #1494 @astachowiczhabana
- Corrected Throughput measure for GaudiDDPMPipeline #1460 @deepak-gowda-narayana
- [SW-196761] Add G3 in T5-L README #1523 @astachowiczhabana
- Fix tuple object error #1354 @SupreetSinghPalne
- Add warmup time and compile time log for the eval/prediction. #1489 @jiminha
- Add support for MLPERF optimized pipeline from example #1465 @ANSHUMAN87
- Add check_neural_compressor_min_version for 4 bit behavior #1500 @xin3he
- Pass "lazy_mode" arg to GaudiLlamaModel GaudiTrainer #1515 @astachowiczhabana
- Removed workaround for NaN bug causing graph break. #1516 @astachowiczhabana
- text_generation: improve parameters check #1527 @mgonchar
- transformers: fixed some typos #1528 @mgonchar
- Make the profiler's with_stack option configurable #1497 @ranzhejiang
- Fix dtype issue with valid sequence length in torch.compile bs=1 #1532 @wszczurekhabana
- Migrate OH CLIP (roberta-clip) training to torch.compile #1507 @chaojun-zhang
- test_text_generation: fix non-Gaudi2 case #1530 @mgonchar
- text-generation: improve output printing #1486 @mgonchar
- Text-generation, model set-up: torch.compile for attributes instead of models' types #1452 @dsmertin
- Fix bridgetower example #1481 @astachowiczhabana
- Migrate OH Wave2Vec-AC training to torch.compile - README update #1537 @astachowiczhabana
- Migrate OH T5-large training to torch.compile #1506 @chaojun-zhang
- trainer: fixed spelling #1538 @mgonchar
- Create CI Eager/Lazy for Language Modeling #1448 @Luca-Calabria
- Fixes for llava-next test failures in 1.19 #1535 @tthakkal
- Refactor Qwen2 Family #1541 @Wei-Lin-Intel
- Add support for optimized SDXL pipeline #1519 @sushildubey171
- Add the checkout parameters of falcon-mamba pytest #1540 @yuanwu2017
- Avoid negative values in eval metrics #1533 @deepak-gowda-narayana
- Fix lm_eval script for starcoder and gemma #1463 @skavulya
- Add option to use bf16 in PT sdp (#5) #1514 @astachowiczhabana
- Fix tests.test_peft_inference failure #1543 @sywangyi
- Update lm_eval version #1473 @alexey-belyakov
- Fix bad import in Baichuan code #1547 @regisss
- Restore performance in generate #1546 @ugolowic
- Fix for llava models not generating text with test failures in 1.19 #1548 @tthakkal
- Refactor KV cache and RoPE, reduce common code #1148 @abhilash1910
- Adjust Qwen2-7B test case #1551 @Wei-Lin-Intel
- [run_lm_eval.py] Fixed too many print dump json info #1553 @FocusLuo
- Fix for single_card llama7b and falcon40b CI errors #1549 @MohitIntel
- Apply --sdp_on_bf16 to image-to-text examples #1557 @schoi-habana
- Fix accuracy regression in Gemma #1556 @skavulya
- Fix FusedSDPA wrapper from TransformerEngine #1562 @pbielak
- Run albert-xxlarge-v1 CI as torch.compile mode #1563 @yeonsily
- Update README commands for the models to use --sdp_on_bf16 #1566 @yeonsily
- Minicpm patch #1567 @pi314ever
- Updated gemma_2b_it CI #1561 @Luca-Calabria
- Fixed Adalora Test for OH 1.15 #1564 @npiroozan
- Fixed LORACP Test for OH 1.15 #1568 @npiroozan
- Fix prefix llama ci failure #1570 @sywangyi
- Fix mllama test #1569 @sywangyi
- Fix lazy_mode assignment #1558 @vidyasiv
- Generation utils update (minor) #1468 @yafshar
- Style: removed tabs #1577 @mgonchar
- Enable num_return_sequences in beam search #1536 @mengker33
- gpt_bigcode: added internal bucketing fix #1526 @mgonchar
- Update the Gaudi trainer with transformers 4.45.2 #1398 @yafshar
- Revert "add check_neural_compressor_min_version for 4 bit behavior" #1578 @xin3he
- Revert PR #1473 #1582 @regisss
- Fixed spelling #1576 @mgonchar
- Update docs for baichuan2 training #1586 @xhaihao
- Add WA flag for falcon-180b to resolve text-gen critical reset error during tests #1590 @hchauhan123
- Update transformers tests generation util v4.45.2 #1441 @malkomes
- Limit position embeddings in inference #1598 @bhargaveede
- Verify model output is provided when check_output is enabled #1597 @vidyasiv
- Update README.md #1595 @skaulintel
- Fix scikit-learn to 1.5.2 to fix f1 evaluation crash in 1.6.0 #1596 @sywangyi
- Update language-modeling README file #1599 @vivekgoe
- Revert common KVCache not to check token_idx #1594 @jiminha
- Revert LlamaKVCache due to memory increase #1605 @jiminha
- Replace the UNET custom attention processors #1608 @yafshar
- Fix run_generation test commands for TRL out usage example #1621 @shepark
- Update sdp_on_bf16 option for ST example #1615 @ZhengHongming888
- Update save lora weights for diffusers with text_encoder_2 layers #1626 @skavulya
- Fix save_lora_weights in pipeline_utils.py #1643 @regisss
- Check rope_scaling attr #1609 @jiminha
- Skip certain tests for G1 with empty param list #1613 @hsubramony
- Revert "Update transformers tests generation util v4.45.2 (#1441)" #1614 @yeonsily
- Audio classification readme update #1604 @hsubramony
- Fix readme cmds for clip-roberta #1603 @hsubramony
- Add arbitrary scales #1625 @jiminha
- Modify Qwen2 TRL command to avoid OOM. #1630 @jiminha
- Fix distributed issue for ST Trainer #1649 @ZhengHongming888
- Fix distributed issue for timm #1653 @ZhengHongming888
- Refactor mixtral moe block. #1635 @lkk12014402
- Speech-recognition: downgrade datasets version #1646 @hsubramony
- Add sdp_on_bf16 to controlnet #1631 @skaulintel
- Quick fix for quantization/custom op list loading #1657 @dsocek
- Fix bug for GaudiMixtralAttentionLongSequence forward #1650 @kaixuanliu
v1.14.1: Patch release
- Enable DeepSpeed for image-to-text example #1455 @schoi-habana
- Fix bug when loading 4bit checkpoint quantized in INC #1447 @xin3he
- Fixes 'Tokenizer does not have padding token' introduced by #1444 for Llama3.1 #1457 @MohitIntel
Full Changelog: v1.14.0...v1.14.1
v1.14.0: Transformers v4.45, SynapseAI v1.18, Qwen2-MoE, text-to-video generation
Transformers v4.45
This release has been tested and validated for Transformers v4.45.
SynapseAI v1.18
This release has been tested on and validated for SynapseAI v1.18.
Qwen2-MoE
Text-to-video generation
- Enabling Text to Video Diffusion Model Generation #1109 @pi314ever
- Porting Stable Video Diffusion ControlNet to HPU #1037 @wenbinc-Bin
Depth-to-image generation
- Depth to Image Generation #1175 @pi314ever
Model optimizations
- Enable FusedSDPA for Mpt #1101 @Jianhong-Zhang
- Mixtral fp8 #1269 @imangohari1
- Prevent Graph break in Llama when using flash attention #1301 @pramodkumar-habanalabs
- Boost SDXL speed with initialized schedule step reset #1284 @dsocek
- Improve MPT fp8 #1256 @atakaha
- Add Whisper static generation #1275 @Spycsh
- Gemma: enabled HPU Graphs and Flash Attention #1173 @dsmertin
- Recommend jemalloc for gpt-neox-20b 8x #1350 @hsubramony
- Optimized inference of GPT-NEO model on HPU #1319 @XinyuYe-Intel
- Fix graph breaks for BART in torch.compile mode. #1379 @astachowiczhabana
- Gpt_bigcode: added internal_bucketing support #1218 @mgonchar
- refine bucket_internal for mpt #1194 @Jing1Ling
- Qwen finetuning bucketing #1130 @ssarkar2
- Enable FusedSDPA fp8 in Llama FT #1388 @pbielak
- Added gemma specific fp8 quantization file #1445 @yeonsily
Intel Neural Compressor
- Enable INC for llava models and change softmax to use torch.nn.functional.softmax as it is a module supported by INC #1325 @tthakkal
- Load INC GPTQ checkpoint & rename params #1364 @HolyFalafel
- Fix INC load weights compile error due to Transformers 4.45 upgrade. #1421 @jiminha
Vera/LN-tuning
Other
- Add callable workflow to post comments when code quality check failed #1263 @regisss
- Fix failed code quality check comment workflow #1264 @regisss
- Accelerate Diffusers CI #1265 @regisss
- Add profiler to SD3 #1267 @atakaha
- Fix profiling step with device finish execution for text-generation #1283 @libinta
- Update FusedSDPA calling method as per Gaudi documentation #1285 @yeonsily
- Switch failed code quality check comment to workflow_run #1297 @regisss
- Potential fix for the failed code quality check comment workflow #1299 @regisss
- Fix text-generation example lm_eval evaluation #1308 @changwangss
- Add section to README about Transformers development branch #1307 @regisss
- Fix eager mode in run_generation by removing graph logs #1231 @Vasud-ha
- Fix bug when running google/paligemma-3b-mix-224 #1279 @kaixuanliu
- Use native checkpointing under compile mode #1313 @xinyu-intel
- fixed fused_qkv object AttributeError due to 'LlamaConfig' #1203 @rkumar2patel
- Image to Image Generation Enabling #1196 @pi314ever
- Diffusers timing #1277 @imangohari1
- Fix eos issue in finetune/generation #1253 @sywangyi
- Update CI, tests and examples #1315 @regisss
- Fix Sentence Transformer HPU graphs for training with PEFT model #1320 @nngokhale
- Fix ZeroDivisionError in constrained beam search with static shapes #1317 @skavulya
- Update esmfold model not to use param_buffer_assignment #1324 @jiminha
- Falcon inference crash fix for falcon-40b model #1161 @yeonsily
- Add --use_kv_cache to image-to-text pipeline #1292 @KimBioInfoStudio
- Trl upgrade #1245 @sywangyi
- Fix uint4 url typo. #1340 @kding1
- Use eager attention for wav2vec2 #1333 @skaulintel
- Add _reorder_cache back to Llama for HPU #1233 @jiminha
- SDXL CI script throughput #1296 @imangohari1
- Add image so that transformers tests can run #1338 @skaulintel
- Fixes the no attribute error with the falcon multicard test #1344 @mounikamandava
- Add profiler to sdxl mlperf pipeline #1339 @Jianhong-Zhang
- Fix decoder only generation #948 @tjs-intel
- Upgrade gradient checkpointing #1347 @yafshar
- Run_generation example: fixed graph compilation statistics reporting #1352 @mgonchar
- Fix deepspeed crash with Sentence Transformer Trainer #1328 @nngokhale
- fea(ci): reduced slow test_diffusers timing. minor fixes #1330 @imangohari1
- Flash attn args for GaudiGemmaForCausalLM #1356 @kkoryun
- Transformer models generation supports user-provided input embeddings #1276 @zongwave
- Fixed the expected values after for img2img slice #1332 @imangohari1
- Gpt_big_code: make flash attention impl quantization friendly #1282 @mgonchar
- Fix OOM when inference with llama-3.1-70b #1302 @harborn
- Fix the conditional #1362 @yafshar
- Revert "use native checkpointing under compile mode" #1365 @xinyu-intel
- Remove repetitive pip install commands #1367 @MohitIntel
- Minor UX enhancement #1373 @MohitIntel
- Fix bug when running image-to-text example #1371 @kaixuanliu
- Gpt_bigcode: fixed wrong indentation #1376 @mgonchar
- Support for transformers without self.model to torch.compile #1380 @astachowiczhabana
- Only pass the use_kv_cache True to generator #1366 @yafshar
- Clean up the code and remove unnecessary class #1382 @yafshar
- Add the diffusers examples of inference Tech #1244 @yuanwu2017
- Enhance transformers test suite in Optimum-habana-4.43.4 (auto PR 07654de) #1387 @rkumar2patel
- Enhance transformers test suite in Optimum-habana-4.43.4 (auto PR 8926a4b) #1386 @rkumar2patel
- Add README.md for Sentence transformer examples with HPU device #1355 @ZhengHongming888
- Change Falcon/GPT-Neox rotary embedding function to use seq_len for #1368 @yeonsily
- Enhance Optimum-habana as per transformers-4.43.4 #1381 @rkumar2patel
- CI fix - Install stable-diffusion reqs #1389 @vidyasiv
- Fix error caused by uninitialized attn_weights #1391 @hsubramony
- Replace flash attention flag #1393 @skaulintel
- Fix DeepSpeed CI on Gaudi2 #1395 @regisss
- Truncate the cached max seq len #1394 @astachowiczhabana
- Fix gpt-neox training accuracy issue. #1397 @yeonsily
- Simplify HQT config files #1219 @Tiefen-boop
- unify_measurements.py script support to unify PCQ 70B 8x #1322 @Yantom1
- Add misc. training args #1346 @SanityRemnants
- Add quantization config for low bs case #1377 @ulivne
- Remove HQT from OHF #1257 @Yantom1
- Valid sequence length for sdpa #1183 @ssarkar2
- Multiple fixes (dynamo graph break, qwen-moe, multicard) #1410 @ssarkar2
- Change the image path for transformers tests back to the correct location #1401 @skaulintel
- Fix Gaudi2 regression tests #1403 @regisss
- Reverting some of transformer pytest funcs/values #1399 @imangohari1
- Fix StarCoder2 inference #1405 @regisss
- Change the order for test_diffusers #1406 @hsubramony
- Fix llama model text generation error #1402 @zongwave
- Datasets downgrade version to 2.21.0 #1413 @hsubramony
- Update ci sentence_transformer.sh #1424 @ZhengHongming888
- Update language-modeling README.md, add trust_remote_code for flan-t5-xl #1422 @hsubramony
- Update unify_measurements.py support info #1425 @shepark
- Fix GPT_neox incorrect output with batch query #1358 @Jianhong-Zhang
- Fix text-to-image example #1429 @regisss
- Add flag to run inference with partial dataset #1420 @pramodkumar-habanalabs
- Add peft generation example #1427 @sywangyi
- Added missing allocate_kv_cache() call in CausalLM class #1431 @yeonsily
- Fix merge error and update text-to-speech readme #1436 @hsubramony
- Fix OOM error for code llama #1437 @jiminha
- Fix error on 4bit checkpoint load with run_lm_eval on TF4.45.2 #1439 @jiminha
- GPT2 torch.compile fix #1434 @dsmertin
- Update text-gen README.md to add auto-gptq fork install steps #1442 @hsubramony
- Fix scoped linear all-reduce for starcoder model #1432 @skavulya
- Fixed recursion error in SentenceTransformer #1428 @yafshar
- Fix Llama 3.1 generation #1444 @regisss
- Remove cache folder from image data folder #1446 @shepark
v1.13.2: Patch release
Llava(-next) improvements
This patch release adds multi-card support for Llava(-next) and lets users turn recomputation for flash attention on or off (see the sketch after the list below).
- Llava: Added flash_attention_recompute arg to provide an option to enable/disable recompute #1278 @tthakkal
- Add the deepspeed injection_policy of mistral #1309 @yuanwu2017
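A minimal sketch of the new toggle, assuming the use_flash_attention and flash_attention_recompute kwargs are forwarded through generate() as in the image-to-text example; the checkpoint, image, and prompt are illustrative.
```python
# Hedged sketch: toggle flash-attention recomputation for Llava-next on Gaudi.
# Assumes the kwargs below are forwarded by generate() to the attention layers.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaNextForConditionalGeneration
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("hpu")

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
inputs = processor(text="USER: <image>\nWhat is shown here? ASSISTANT:", images=image, return_tensors="pt").to("hpu")

outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    use_flash_attention=True,
    flash_attention_recompute=True,  # recompute saves memory; set False to disable
)
print(processor.decode(outputs[0], skip_special_tokens=True))
```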
Full Changelog: v1.13.1...v1.13.2
v1.13.1: Patch release
Fixed memory regressions
- Remove _expand_inputs_for_generation for greedy search (#1266) @libinta
- Fix memory regression for modeling llama (#1271) @libinta
FSDP
FSDP checkpoint saving is fixed.
Known limitations
- ESMFold does not work on Gaudi1; this will be fixed in a future version
Full Changelog: v1.13.0...v1.13.1