Releases: huggingface/optimum-habana
v1.19.0: SynapseAI v1.22, GRPO, Snowflake Arctic, Diffusers v0.34
SynapseAI v1.22
- Upgrade to SynapseAI v1.22 8171a96 @astachowiczhabana
Diffusers v0.34
- Diffusers 0.34.0 #2152 @imangohari1
GRPO trainer
- Enable trl GRPO trainer #2088 @schoi-habana
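The new trainer plugs into the existing trl integration. Below is a minimal sketch, assuming the wrapper classes are named GaudiGRPOConfig and GaudiGRPOTrainer in line with the other Gaudi trl wrappers (GaudiSFTTrainer, GaudiDPOTrainer); the model, dataset, reward function, and hyperparameters are purely illustrative, so check the trl examples in this repository for the exact entry point.
```python
# Hedged sketch of GRPO fine-tuning on Gaudi. GaudiGRPOConfig/GaudiGRPOTrainer are
# assumed names following the existing Gaudi trl wrappers; everything else
# (model, dataset, reward) is illustrative.
from datasets import load_dataset
from optimum.habana import GaudiConfig
from optimum.habana.trl import GaudiGRPOConfig, GaudiGRPOTrainer  # assumed import path

dataset = load_dataset("trl-lib/tldr", split="train")

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 50 characters.
    return [-abs(50 - len(c)) for c in completions]

training_args = GaudiGRPOConfig(
    output_dir="qwen2-grpo",
    use_habana=True,       # run on HPU
    use_lazy_mode=True,    # lazy execution mode
)

trainer = GaudiGRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
    gaudi_config=GaudiConfig(use_fused_adam=True, use_fused_clip_norm=True),
)
trainer.train()
```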
FP8 with FSDP
- Add support for fp8 fsdpa in the Mixtral model #2026 @astachowiczhabana
Deepspeed regional compilation
- Deepspeed regional compilation #2021 @IlyasMoutawwakil
Stable Diffusion
Snowflake Arctic
- Enabling Snowflake Arctic on Gaudi 3 #1719 @pi314ever
Model optimizations
- rt-detr: optimize loss calculation #1998 @mgonchar
- Use FusedSDPA in self_attention of Bert model #2115 @miaojinc
- Enable FusedRMSNorm for FLUX #2011 @dsocek
- Enable distributed CFG for SD3 pipeline #2015 @dsocek
- Refactor Qwen2 Family - FP32 SDPA and max_position_embedding #2030 @Wei-Lin-Intel
- Add Qwen classification #2062 @tianyuan211
- Reduce index_copy to fp8 in llama2 - QDQ flow #2065 @Tiefen-boop
Safe softmax
- Safe_softmax demonstration (#263) #1950 @astachowiczhabana
Bitsandbytes
- Integrated NF4 inference tests to text-generation #2058 @rsshaik1
- Remove bitsandbytes monkey-patching (II) #2114 @ckvermaAI
Other
- Fix to limit inputs_embeds.clone() to training only as it affects inference #1992 @emascarenhas
- Add additional info about attn batch split flag #1990 @jaygala223
- Update readme files for explicit lazy mode #1921 @jasi306
- Fix SD3 flag in README example #2013 @dsocek
- Fix text-generation requirements #1989 @vidyasiv
- Migrate tests to upstream repos #2002 @IlyasMoutawwakil
- Fix makefile commands #2025 @IlyasMoutawwakil
- Use AutoAWQ version right before introduction of qwen3 #2033 @IlyasMoutawwakil
- Add token to single card tests CI #2034 @IlyasMoutawwakil
- Minor Code Comments and Formatting Improvements #2035 @leopardracer
- More makefile fixes #2036 @IlyasMoutawwakil
- Remove text-generation-inference folder #2068 @regisss
- Updated the readme for mediapipe support #2012 @imangohari1
- Use makefile in Sentence Transformers CI #2073 @IlyasMoutawwakil
- Remove capture_pre_autograd_graph call #2042 @astachowiczhabana
- Enable running lm_eval with log_samples #2046 @astachowiczhabana
- Fixed lost modules in regional compilation #2047 @astachowiczhabana
- Enable accuracy benchmark using torch compile #2049 @astachowiczhabana
- Add support for reduced model #2050 @astachowiczhabana
- Enable QDQ #2051 @astachowiczhabana
- Minor Documentation Updates and Comments Clarification #2048 @kilavvy
- Hot fix compiled fsdp model saving failure #2028 @IlyasMoutawwakil
- Use PT_ENABLE_INT64_SUPPORT=1 for trl examples #2089 @pbielak
- Remove loss_kwargs from Gemma2 model.forward() and add missing positional_embeddings for the Attention layer to sync with Transformers 4.49.0 #2100 @Luca-Calabria
- Silence Trainer.tokenizer warnings #2116 @pbielak
- Llama 3.2 - Fix the issue for eager mode (#260) #1976 @TANA-BHU
- Float inputs for Mixtral 8x7B #2043 @astachowiczhabana
- Fix diffuser tests #2054 @astachowiczhabana
- Ifeval and MMLU now better supported #2045 @astachowiczhabana
- Profiling improvements #1931 @ugolowic
- Add documentation workflow #2086 @echarlaix
- Add feature manager #1926 @astachowiczhabana
- Fix utils package #2141 @pbielak
- Use profiler in text-generation-pipeline #2154 @pbielak
- Add the PT_HPU_LAZY_MODE=1 env variable when testing in lazy mode #2161 @yafshar
- Updated peft version #2160 @imangohari1
- Fix version extraction regex and pip command in get_build() #2159 @yafshar
- Add warn0 utility to emit warnings only on main process #2157 @yafshar
- Remove DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED #2171 @yafshar
- Extract HabanaModelAdapter from run_lm_eval to new script file. #2170 @AKloniecki
- Remove `is_pt_flax_cross_test` from wav2vec tests #2174 @pbielak
- Fix test_model_weights_reload_no_missing_tied_weights #2175 @pbielak
- Update datasets to version 3.6.0 #2176 @alekseyfa
- Updated/Fixed the TIMM example readme #2172 @imangohari1
- Move torch, transformers and optimum.habana imports to local scope. #2183 @AKloniecki
- Move torch and transformers imports to local scope in run_generation.py. #2181 @AKloniecki
- Transformers deepseek-v3 Porting to optimum-habana #2186 @rkumar2patel
- Remove .float() conversion from Mixtral #2178 @pbielak
- Remove potential weakness reported by static code analysis -- CWE 569 -- in transformers/trainer.py #2196 @karol-brejna-i
- Ensure output directory exists before trying to write to output file. #2188 @AKloniecki
- Remove instances of logically dead code #2194 @ugolowic
- Remove unnecessary comparisons to None #2191 @ugolowic
- Fixes for bad use of potential None value #2198 @ugolowic
- qwen3: Fix missing max_position_embeddings init from config #2173 @mengker33
- Allow usage of cached books from Project Gutenberg. #2190 @AKloniecki
- Remove potential weakness reported by static code analysis -- CWE 398 -- redundant if #2199 @karol-brejna-i
- Fix PT_HPU_LAZY_MODE assertion to match updated default value #2189 @AKloniecki
- Remove unnecessary null checks - modeling_mpt.py #2204 @karol-brejna-i
- Protecting mask undefined value. #2203 @karol-brejna-i
- Protecting all_cross_attentions in optimum/habana/transformers/models/blip/modeling_blip_text.py #2202 @karol-brejna-i
- Remove unnecessary None checks for attention_mask #2205 @karol-brejna-i
- Configure qlora tests with additional arguments #2056 @ckvermaAI
- Skip unnecessary padding in text generation task #2055 @kyotoyx
- Unify SetTrueOrFalseOrNone and StoreTrueFalseAction #2119 @astachowiczhabana
- Fix profiler #2134 @astachowiczhabana
- Fix missing openorca dataset #2133 @astachowiczhabana
- Sync/videollava #2129 @yafshar
- Add support for local dataset loading for LibriSpeech and COCO #2136 @gplutop7
- Add sentencepiece to setup.py #2153 @pbielak
- Extract model adapter class from run_lm_eval.py to a new script file. #2184 @AKloniecki
- Fix for granite accuracy #2187 @12010486
- Temporarily revert SD quant files to fix promotion #2069 @astachowiczhabana
v1.18.1: Transformers v4.51, Qwen3, dynamic quantization
Transformers v4.51
This release supports and has been validated with Transformers v4.51.
Qwen3
This release adds optimized support for Qwen3 models on Gaudi.
- Add Qwen3 family #1948 @tianyuan211
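Qwen3 checkpoints can be run through the usual Gaudi-optimized path. Below is a minimal sketch, assuming a lazy-mode environment (PT_HPU_LAZY_MODE=1) and an illustrative checkpoint name; the examples/text-generation script exposes the full set of flags (HPU graphs, bucketing, quantization).
```python
# Minimal sketch of Qwen3 inference with the Gaudi-optimized model implementations.
# Assumes habana_frameworks is installed and PT_HPU_LAZY_MODE=1 is set; the model id
# and generation arguments are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()  # patch transformers with the Gaudi-optimized classes

model_id = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("hpu")

inputs = tokenizer("Gaudi accelerators are", return_tensors="pt").to("hpu")
outputs = model.generate(**inputs, max_new_tokens=32, lazy_mode=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```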
Dynamic Quantization
This release adds support for dynamic quantization.
- Enable dynamic quantization #2085 @astachowiczhabana
v1.18.0: SynapseAI v1.21, Accelerate, CogVideoX, Llava-onevision
SynapseAI v1.21
This release has been tested on and validated for SynapseAI v1.21.
Accelerate
Gaudi is now natively supported in Accelerate; check out the documentation for more information (a minimal usage sketch follows the list below).
- Update GaudiAccelerator #1876 @IlyasMoutawwakil
- Fix lost modules in regional compilation #1885 @xinyu-intel
- fix fsdp and get rid of GaudiPartialState #1942 @IlyasMoutawwakil
- Restore dynamic compilation setting and Fix compile_regions Call #1973 @yafshar
- Hot fix regional compilation #2005 @IlyasMoutawwakil
- Fix fp8 #2010 @IlyasMoutawwakil
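With native support, the stock Accelerator API is expected to pick up HPU devices directly, without a Gaudi-specific class. A minimal sketch, assuming habana_frameworks (SynapseAI) is installed so device detection works:
```python
# Minimal sketch of the native Accelerate path on Gaudi; no GaudiAccelerator needed.
import torch
from accelerate import Accelerator

accelerator = Accelerator()
print(accelerator.device)  # expected to be an hpu device on Gaudi machines

model = torch.nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = accelerator.prepare(model, optimizer)

x = torch.randn(8, 16, device=accelerator.device)
loss = model(x).sum()
accelerator.backward(loss)
optimizer.step()
```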
Diffusers
- fea(diffusers): Upgraded to version 0.32.0 #1939 @imangohari1
- fea(): Diffuser upgrade to 0.33.1 #1981 @imangohari1
CogVideoX
- Add cogvideox support for gaudi #1600 @nc-BobLee
GLM4V
- Add GLM4V #1668 @mengker33
Siglip and Llava-onevision
- Add support for Siglip and Llava Onevision #1883 @emascarenhas
Model optimizations
- Optimize memory utilization by keeping logits in BF16 #1859 @kalyank007
- Integrate DistributedAttention for Qwen2 #1860 @Jianhong-Zhang
- [Llama-vision] Add support for Fused RMS Norm #1892 @ANSHUMAN87
- Enable torch compile for llama 3.2 vision #1873 @jaygala223
- Flag to enable leaf promotion to avoid graph breaks in MLP for compile #1880 @bhargaveede
- Adding Deepspeed config for Llama3 Fine Tuning (#165) #1881 @bhargaveede
- Add trim_logits support in deepseekV3 #1933 @jthakurH
- Add flag to enable compiled_autograd with Deepspeed for training #1785 @vivekgoe
- chatglm: Fix a bug when attention mask is None #1896 @mengker33
- Optimized DeepSeek-V2 attention prefill with MHA #1791 @gyou2021
- [Llama-Vision] Trim logits #1894 @ANSHUMAN87
- Align cross_attention_mask for Llama 3.2 90B to avoid partial writes and graph retracing #1917 @kalyank007
- Add FSDP config for Granite model #1897 @kplau1128
- [Llama-Vision] Add support for bucketing #1895 @ANSHUMAN87
- Add Moonlight Support #1868 @jinyouzhi
- Add support for expert parallelism with mixtral #1908 @kwisniewski98
- Fix issue with in-place operation with requires grad with modeling_qwen2_vl.py #1970 @emascarenhas
- Adjust VideoLlavaProcessor to avoid performance regression on gaudi3 #1969 @kaixuanliu
- Speed up FLUX training over 2x with Gaudi optimized attention #1963 @dsocek
- [llama-vision] Remove token_idx_cpu parameter #2018 @ugolowic
Other
- Makefile improvements #1811 @jasi306
- [DeepSeek-V3] README update #1911 @ANSHUMAN87
- Skipping falcon rope scaling test #1916 @karol-brejna-i
- Workaround for DS issue in Llama #1932 @ugolowic
- Upgrade LM Eval to 0.4.7 #1901 @astachowiczhabana
- Disabling timers synchronization #1879 @bhargaveede
- Limit max pos embeds to 8k to prevent OOM #1923 @jaygala223
- Fix prompt argument handling in run_pipeline.py #1874 @varu060603
- Allow offline mode in CI tests #1924 @astachowiczhabana
- Adding memory and graph stats #1858 @jaygala223
- Enable QLoRA tests with torch.compile mode #1918 @ckvermaAI
- detr: fix possible incorrect tensor type #1899 @mgonchar
- Fix --save_last_ckpt if --save_strategy no is set #1934 @vidyasiv
- Reimplement HabanaGenerationTime #1920 @ugolowic
- Pad the examples for QLoRa finetuning test #1941 @ckvermaAI
- Reimplement HabanaGenerationTime fix for timer_checkpoint in sdxl training #1945 @gplutop7
- Move bitsandbytes requirements from setup.py to bnb tests #1946 @ckvermaAI
- Support allow_unspec_int_on_nn_module #1887 @xinyu-intel
- Tokenizer config fix for dynamic mode #1903 @pramodkumar-habanalabs
- Support compile from the 2nd iteration #1886 @xinyu-intel
- fea(): ReadMe remote_trust fixes #1940 @imangohari1
- Run upstream tests #1938 @IlyasMoutawwakil
- Fix READMEs - SD paths and LLM PEFT example #1949 @dsocek
- Add average latency metrics #1954 @RongLei-intel
- Bitsandbytes installation for qlora tests #1951 @ckvermaAI
- Update datasets requirement in examples #1956 @regisss
- Use data cache in slow_tests_8x #1914 @karol-brejna-i
- Add sentencepiece to requirements to support vicuna text generation #1962 @tthakkal
- Fix FLUX fine-tuning script #1960 @dsocek
- Fix typos #1967 @omahs
- Update t5-small samples_per_second value #1968 @12010486
- fea(): Added the --sdp_on_bf16 to textual inversion example #1964 @imangohari1
- pytest t5 roberta fix #1971 @imangohari1
- Update makefile for explicit lazy mode #1925 @jasi306
- fea(): Added PT_HPU_LAZY_MODE=1 for diffuser tests #1975 @imangohari1
- Fix deepspeed zero3 #1977 @IlyasMoutawwakil
- Enable regional compilation in text generation #1927 @karol-brejna-i
- README changes for Llama3.1 8B Finetuning with LoRA #1947 @bhargaveede
- pt2e quant changes into the main script #1875 @vivek5-ai
- Use IKS runners for CI #1953 @regisss
- Fix sentence-transformers CI with new runners #1980 @regisss
- Update dynamic env handling #1978 @yafshar
- Fix wrong calculation of e2e latency #1984 @RongLei-intel
- Update test baseline for mistralai/Mixtral-8x7B-v0.1 #1987 @yafshar
- Switch to Spawn in PyTorch DataLoader when num_worker>0 #1982 @Wei-Lin-Intel
- Enable mixtral 8x7b accuracy evaluation #1986 @rbogdano
- Update readme files for explicit lazy mode #1921 @jasi306
- Update README examples #2020 @pbielak
- Pin latest optimum to force mutual updates #2016 @IlyasMoutawwakil
v1.17.0: Transformers v4.49
Transformers v4.49
This release has been tested and validated for Transformers v4.49 and SynapseAI v1.20.
Model optimizations
- Use token_idx_cpu int instead of token_idx tensor in slicing #1848 @jaygala223
- Keep logits in bf16 #1835 @jaygala223
- Optimize SD3 Pipeline : Padding prompt Embeddings for softmax_hf8 compatibility and Efficient Utilization #1816 @deepak-gowda-narayana
- Add G3 perf WA for Qwen2VL #1884 @nngokhale
- Fix MPT regression #1857 @atakaha
Tests and CI
- Slow test updates #1804 @ugolowic
- Fix race condition when downloading nltk tokenizer #1802 @ugolowic
- fea(): Skipped the torch_fx tests #1797 @imangohari1
- Upstream tests #1834 @IlyasMoutawwakil
- test_examples: add missing clip-roberta baseline #1852 @uartie
- Separate slow tests by required number of cards #1803 @ugolowic
- Update PR doc build workflow #1904 @regisss
Other
- Disable HPU migration (future add-on to HF diffusers) for OH diffusers #1866 @dsocek
- Allow explicit control over flash_attention_fast_softmax setting #1851 @astachowiczhabana
v1.16.0: Deepseek V3, SynapseAI v1.20, Llama 405b, AWQ
SynapseAI v1.20
This release has been tested on and validated for SynapseAI v1.20.
New models
- Add Qwen2-VL #1542 @nngokhale
- Add video-llava model support #1522 @kaixuanliu
- Enable the i2vgen pipeline #1670 @yuanwu2017
- DeepSeek_v3 support #1735 @srajabos
Llama 405b
- Enable Llama 3.1 405B in FP8 #1745 @jaygala223
- v1.16 Llama3-405B text-generation. Added DEEPSPEED_USE_HABANA_FRAMEWORKS_DETERMINISTIC_API flag. #1812 @dsmertin
- Revert placing llama on cpu #1827 @ugolowic
AWQ
- Enable awq int4 in Gaudi #1691 @sywangyi
- Fix dependency issue with --load_quantized_model_with_autoawq #1759 @schoi-habana
Various model optimizations
- Optimizations and WAs to support HPU execution for Detr-Resnet-50 #1334 @sandeep-maddipatla
- Optimized DeepSeek-v2 on Gaudi #1677 @gyou2021
- Add xlm-roberta model support for tei-gaudi use case #1715 @kaixuanliu
- Optimized SD3 pipeline #1682 @deepak-gowda-narayana
- Add clear hpu cache flag for stable perf #1634 @jaygala223
- Fix graph breaks in Mixtral #1705 @ShengYang1
- Add batch splitting in attention layer to hide NIC latency #1640 @kalyank007
- Fix llama FP8 perf issue, kvcache.update should be used since FP8 patches KVCache #1756 @sywangyi
- Add HPU fp8 Dynamic MOE #1761 @dudilester
Sentence Transformers
CI
Other
- Fixed formatting #1693 @imangohari1
- Fix FLUX.1_dev guidance_batches bug for pad case in _split_inputs_into_batches #1607 @huijuanzh
- Fix peft error in Gaudi1 #1627 @sywangyi
- Update README.md #1678 @skaulintel
- Fix custom ops loading in diffusers #1655 @dsocek
- Fix ddpo finetune issue in torch2.5.1 #1666 @sywangyi
- Adding Deepspeed zero1 config #1675 @bhargaveede
- Enable warmup also for full prompt length case in text generation #1676 @yeonsily
- Add padding to input for mllama/paligemma/idefics2 #1671 @sywangyi
- Fix for Mixtral G1 pytest failures #1652 @12010486
- Fix textual_inversion_sdxl failure on docker 1.20 #1697 @atakaha
- Updated Encoder_decoder Tests #1688 @slokesha
- Add checks for parallel_state initialization #1680 @yafshar
- Update the readme to remove validated models #1703 @jiminha
- FP8 baichuan-13b gets OOM when running lm_eval @Liangyx2
- Lm eval upgraded to 0.4.7 #1692 @12010486
- Enable attention selection in wav2vec-ac #1713 @ugolowic
- Fix bug when preparing quant files, starcoder model does not support #1672 @kaixuanliu
- Update training pytests to reduce total time #1712 @jiminha
- Dropping some ci tests from image_to_text and text_generation #1710 @hsubramony
- Add save_checkpoint arg for TIMM training to simplify validation #1701 @ZhengHongming888
- Added Unit Test for Gemma-2-27b model #1616 @slokesha
- Update TRL README.md to clean up models #1706 @shepark
- Support regional compilation #1618 @chaojun-zhang
- Fix text generation quality for bf16 models when sampling #1644 @skavulya
- Readme modification #1700 @libinta
- Fix mpt model generation #1696 @mengniwang95
- Fix lm_eval issue of llama #1606 @sywangyi
- Align diffusers CI tests with examples #1679 @dsocek
- Update audio-classification/requirements.txt to fix numpy version #1717 @hsubramony
- Improve automation for stable-diffusion training scripts in README #1651 @dsocek
- Fix video diffusion black output if --bf16 is set #1685 @sywangyi
- Fix sdxl mlperf time bug #1580 @huijuanzh
- Enabling minimize memory for zero3 runs #1724 @bhargaveede
- Add gated models to diffusers CI tests #1690 @dsocek
- Fix formatting of the kubeVersion range in Kubernetes helm chart #1733 @dmsuehir
- Fix llava/llava next issue when working with AutoProcessor #1674 @sywangyi
- fea(): reworked the 8x hpu skipping strategy #1694 @imangohari1
- Process getting killed while loading data for Llama3.2 90b, 8x #1723 @kalyank007
- Fix: Adjust recipe to fit within QueueComputeScal HBM global memory size limit #1722 @kalyank007
- Add PRC models to test_text_generation_example.py #1695 @wenbinc-Bin
- Added quant config files for new scenarios #1681 @ulivne
- Update README.md - correction in diffusers example #1742 @ramyij
- Update DS config to align with recommended settings #1730 @ckvermaAI
- Add dynamo cache size limit option #1619 @chaojun-zhang
- Resolve 'NoneType' object has no attribute 'gate_proj' err when applying EP in DeepSeek-V2 #1740 @IT-Forrest
- Edit mixtral quantization config file #1739 @dudilester
- Fix the incorrect output of sdxl inpaint #1737 @yuanwu2017
- Supports Bitsandbytes development on HPU #1714 @rsshaik1
- FLAN-T5 has bad performance when using regional compilation #1744 @chaojun-zhang
- Add batch dim idx to support latest deepspeed DistributedAttention #1725 @bhargaveede
- Add the inline_inbuilt_nn_modules option #1617 @chaojun-zhang
- Clean up README examples #1709 @yeonsily
- Accuracy fix for llama3.1-70B in eager/torch.compile mode #1746 @ckvermaAI
- Adjust baselines for lower number of epochs (improved perplexity, lower throughput) #1748 @emascarenhas
- Change clip-roberta/bridgetower not to use fast_ddp #1749 @jiminha
- Adds requirements.txt to sentence transformers training paraphrases #1753 @pi314ever
- Add requirements.txt to sentence transformer training sts #1754 @pi314ever
- Add diffuser tests for optimized sdxl flow on HPU #1554 @sushildubey171
- Fix the output length in image_to_text test #1751 @sywangyi
- Fix Experts Indexing in MoE for Mixtral: Align experts_max with Number of Available Experts #1755 @deepak-gowda-narayana
- Add requirements.txt to sentence transformers nli example #1767 @pi314ever
- UX code change #1764 @talexjohn
- Enable saving and loading FP8 model #1683 @xin3he
- Update measurements for Stable Diffusion XL #1773 @mkrze
- Add datasets to the requirements for Stable Diffusion training #1782 @yafshar
- Enable wav2vec-large model for speech_recognition test #1783 @jiminha
- Update multi-node-training environment variables for GaudiNIC #1779 @Jianhong-Zhang
- Fixed Gemma2 error when saving pretrain #1781 @kplau1128
- Support llava1.5 lora finetuning. #1487 @lkk12014402
- Fix DeepSeek-V2 expert-parallelism crash due to indexing error #1765 @skavulya
- Update transformer_engine._convert_model to skip LoRA layers #1766 @vivekgoe
- Create Habana_Validated_Models.md to list all the models validated #1778 @hsubramony
- Enable attention selection for wav2vec2 #1757 @ugolowic
- Add --attn_implementation to wav2vec2 slow tests #1788 @ugolowic
- Add sentencepiece to the requirements #1792 @hsubramony
- Fix LoRA weights loading in text-to-image generation sample script #1789 @dsocek
- Add trust_remote_code #1786 @atakaha
- Fix the restart issue for Sentence Transformer STS example in validation #1799 @ZhengHongming888
- Exp flags for acc issues #1795 @hsubramony
- Temporary WA for get_type error #1806 @12010486
- Fix Sentence Transformer STS restart issue #1814 @ZhengHongming888
- Fix broken link for GenerationConfig #1819 @xin3he
- Fix for text-generation, AttributeError: 'GenerationConfig' object has no attribute 'use_fused_rope' #1823 @hsubramony
- Fix dataset_version for ST example requirement.txt #1809 @ZhengHongming888
- Move model to device before wrapping with FSDP #1830 @skaulintel
- Update warmup ratio for adalora #1820 @astachowiczhabana
- Fix for attention selection in wav2vec2 #1836 @ugolowic
- Revert "Lm eval upgraded to 0.4.7 (#1692)" #1837 @astachowiczhabana
- Removing HL_DS_DISTRIBUTED_ATTENTION_SEQ_DIM as it's not needed from 1.20 #1726 @bhargaveede
- Temporary workaround to avoid segmentation fault #1798 @yafshar
v1.15.0: SynapseAI v1.19.0, FLUX, Mllama, DeepSeek, Falcon 3
SynapseAI v1.19
This release has been tested on and validated for SynapseAI v1.19.
FLUX
- FLUX with diffusers 0.31.0 #1450 @dsocek
- FLUX Fine-Tuning for Gaudi #1482 @dsocek
- Flux Image-To-Image pipeline #1524 @dsocek
New models
- Optimized inference of Cohere model on HPU #1329 @XinyuYe-Intel
- Idefics2 #1270 @sywangyi
- Optimized inference of XGLM model on HPU #1323 @XinyuYe-Intel
- Add mllama support #1419 @sywangyi
- Enable paligemma model for image-to-text example #1407 @kaixuanliu
- Enable Gemma2 Inference on Gaudi #1504 @Luca-Calabria
- Minicpm enabling #1342 @pi314ever
- Enable Falcon-mamba #1480 @yuanwu2017
- Add support for Baichuan2 #1479 @xhaihao
- Enable DeepSeek-V2 #1475 @yao-matrix
- Add chatglm #1478 @mengker33
- Falcon Model Support #1612 @alekseyfa
Various model optimizations
- Enable flash attention for gemma #1454 @atakaha
- Support loading 4 bit Qwen2 #1476 @mengniwang95
- Fixed Gemma FP8 flash_attention lower throughput issue #1510 @kplau1128
- Disable default sdpa in Albert (#22) #1517 @astachowiczhabana
- Implement fused sdpa for wav2vec2 (#18) #1520 @astachowiczhabana
- Memory optimization for gpt_bitcode #1513 @astachowiczhabana
- Support beam search with reuse_cache and bucket_internal #1472 @Wei-Lin-Intel
- Add mixtral trl sft #1349 @lkk12014402
- Enable tiiuae/falcon-11B-vlm in image_to_text example #1490 @sywangyi
- Enable fusedsdpa kernel for vision part of mllama #1531 @sywangyi
- Enable dynamic compile for mpi(training) #1509 @chaojun-zhang
- Add DynamicMoE support for Mixtral #1511 @kwisniewski98
- Implemented fusedSDPA for stable diffusion (#36) #1545 @astachowiczhabana
- Fix Accuracy Calculation Issue in GPT-NeoX #1591 @yafshar
Sentence Transformers
- Update sentence transformer to v3.2.1 #1470 @ZhengHongming888
Textual Inversion XL
TIMM
- Enable pyTorch-IMage-Models (TIMM) with HPUs #1459 @ZhengHongming888
Context Parallelism
- Adding support for Context Parallelism using DeepSpeed's DistributedAttention #1501 @bhargaveede
- Move parallel_state.py to the distributed folder a6ee7c2044e6ddf7d19ae3ad663149e51d6f89e7 @regisss
CI improvements
- Tests for text gen output text #1411 @vidyasiv
- Add split runners to CI (2 devices per runner for fast tests) 72df37df46d1d2a2665c5d1be43b13704b7c8ada @regisss
- Fix fast CI to work with split runners #1534 @regisss
- Add Llama 3.1 ft to CI #1529 @MohitIntel
Documentation
Other
- Fix facebook/hf-seamless-m4t-medium crash #1433 @sywangyi
- Fix bias update in scoped all reduce #1456 @skavulya
- fea(pytests): Added skip for unsupported tests for mistral/mixtral #1462 @imangohari1
- Remove deprecated Mixed precision flags #1471 @vivekgoe
- Readme: replace tabs with spaces #1485 @mgonchar
- Move fast tests to Gaudi2 #1498 @regisss
- Remove torch req from LM example #1491 @astachowiczhabana
- Remove keep_input_mutations #1492 @astachowiczhabana
- Fix trust_remote_code #1493 @astachowiczhabana
- Upgrade ViT README with torch.compile #1494 @astachowiczhabana
- Corrected Throughput measure for GaudiDDPMPipeline #1460 @deepak-gowda-narayana
- [SW-196761] Add G3 in T5-L README #1523 @astachowiczhabana
- Fix tuple object error #1354 @SupreetSinghPalne
- Add warmup time and compile time log for the eval/prediction. #1489 @jiminha
- Add support for MLPERF optimized pipeline from example #1465 @ANSHUMAN87
- Add check_neural_compressor_min_version for 4 bit behavior #1500 @xin3he
- Pass "lazy_mode" arg to GaudiLlamaModel GaudiTrainer #1515 @astachowiczhabana
- Removed workaround for NaN bug causing graph break. #1516 @astachowiczhabana
- text_generation: improve parameters check #1527 @mgonchar
- transformers: fixed some typos #1528 @mgonchar
- Make the profiler's with_stack option configurable #1497 @ranzhejiang
- Fix dtype issue with valid sequence length in torch.compile bs=1 #1532 @wszczurekhabana
- Migrate OH CLIP (roberta-clip) training to torch.compile #1507 @chaojun-zhang
- test_text_generation: fix non-Gaudi2 case #1530 @mgonchar
- text-generation: improve output printing #1486 @mgonchar
- Text-generation, model set-up: torch.compile for attributes instead of models' types #1452 @dsmertin
- Fix bridgetower example #1481 @astachowiczhabana
- Migrate OH Wave2Vec-AC training to torch.compile - README update #1537 @astachowiczhabana
- Migrate OH T5-large training to torch.compile #1506 @chaojun-zhang
- trainer: fixed spelling #1538 @mgonchar
- Create CI Eager/Lazy for Language Modeling #1448 @Luca-Calabria
- Fixes for llava-next test failures in 1.19 #1535 @tthakkal
- Refactor Qwen2 Family #1541 @Wei-Lin-Intel
- Add support for optimized SDXL pipeline #1519 @sushildubey171
- Add the checkout parameters of falcon-mamba pytest #1540 @yuanwu2017
- Avoid negative values in eval metrics #1533 @deepak-gowda-narayana
- Fix lm_eval script for starcoder and gemma #1463 @skavulya
- Add option to use bf16 in PT sdp (#5) #1514 @astachowiczhabana
- Fix tests.test_peft_inference failure #1543 @sywangyi
- Update lm_eval version #1473 @alexey-belyakov
- Fix bad import in Baichuan code #1547 @regisss
- Restore performance in generate #1546 @ugolowic
- Fix for llava models not generating text with test failures in 1.19 #1548 @tthakkal
- Refactor KV cache and RoPE, reduce common code #1148 @abhilash1910
- Adjust Qwen2-7B test case #1551 @Wei-Lin-Intel
- [run_lm_eval.py] Fixed too many print dump json info #1553 @FocusLuo
- Fix for single_card llama7b and falcon40b CI errors #1549 @MohitIntel
- Apply --sdp_on_bf16 to image-to-text examples #1557 @schoi-habana
- Fix accuracy regression in Gemma #1556 @skavulya
- Fix FusedSDPA wrapper from TransformerEngine #1562 @pbielak
- Run albert-xxlarge-v1 CI as torch.compile mode #1563 @yeonsily
- Update README commands for the models to use --sdp_on_bf16 #1566 @yeonsily
- Minicpm patch #1567 @pi314ever
- Updated gemma_2b_it CI #1561 @Luca-Calabria
- Fixed Adalora Test for OH 1.15 #1564 @npiroozan
- Fixed LORACP Test for OH 1.15 #1568 @npiroozan
- Fix prefix llama ci failure #1570 @sywangyi
- Fix mllama test #1569 @sywangyi
- Fix lazy_mode assignment #1558 @vidyasiv
- Generation utils update (minor) #1468 @yafshar
- Style: removed tabs #1577 @mgonchar
- Enable num_return_sequences in beam search #1536 @mengker33
- gpt_bigcode: added internal bucketing fix #1526 @mgonchar
- Update the Gaudi trainer with transformers 4.45.2 #1398 @yafshar
- Revert "add check_neural_compressor_min_version for 4 bit behavior" #1578 @xin3he
- Revert PR #1473 #1582 @regisss
- Fixed spelling #1576 @mgonchar
- Update docs for baichuan2 training #1586 @xhaihao
- Add WA flag for falcon-180b to resolve text-gen critical reset error during tests #1590 @hchauhan123
- Update transformers tests generation util v4.45.2 #1441 @malkomes
- Limit position embeddings in inference #1598 @bhargaveede
- Verify model output is provided when check_output is enabled #1597 @vidyasiv
- Update README.md #1595 @skaulintel
- Fix scikit-learn to 1.5.2 to fix f1 evaluation crash in 1.6.0 #1596 @sywangyi
- Update language-modeling README file #1599 @vivekgoe
- Revert common KVCache not to check token_idx #1594 @jiminha
- Revert LlamaKVCache due to memory increase #1605 @jiminha
- Replace the UNET custom attention processors #1608 @yafshar
- Fix run_generation test commands for TRL out usage example #1621 @shepark
- Update sdp_on_bf16 option for ST example #1615 @ZhengHongming888
- Update save lora weights for diffusers with text_encoder_2 layers #1626 @skavulya
- Fix save_lora_weights in pipeline_utils.py #1643 @regisss
- Check rope_scaling attr #1609 @jiminha
- Skip certain tests for G1 with empty param list #1613 @hsubramony
- Revert "Update transformers tests generation util v4.45.2 (#1441)" #1614 @yeonsily
- Audio classification readme update #1604 @hsubramony
- Fix readme cmds for clip-roberta #1603 @hsubramony
- Add arbitrary scales #1625 @jiminha
- Modify Qwen2 TRL command to avoid OOM. #1630 @jiminha
- Fix distributed issue for ST Trainer #1649 @ZhengHongming888
- Fix distributed issue for timm #1653 @ZhengHongming888
- Refactor mixtral moe block. #1635 @lkk12014402
- Speech-recognition: downgrade datasets version #1646 @hsubramony
- Add sdp_on_bf16 to controlnet #1631 @skaulintel
- Quick fix for quantization/custom op list loading #1657 @dsocek
- Fix bug for GaudiMixtralAttentionLongSequence forward #1650 @kaixuanliu
v1.14.1: Patch release
- Enable DeepSpeed for image-to-text example #1455 @schoi-habana
- Fix bug when loading 4bit checkpoint quantized in INC #1447 @xin3he
- Fixes 'Tokenizer does not have padding token' introduced by #1444 for Llama3.1 #1457 @MohitIntel
Full Changelog: v1.14.0...v1.14.1
v1.14.0: Transformers v4.45, SynapseAI v1.18, Qwen2-MoE, text-to-video generation
Transformers v4.45
This release has been tested and validated for Transformers v4.45.
SynapseAI v1.18
This release has been tested on and validated for SynapseAI v1.18.
Qwen2-MoE
Text-to-video generation
- Enabling Text to Video Diffusion Model Generation #1109 @pi314ever
- Porting Stable Video Diffusion ControlNet to HPU #1037 @wenbinc-Bin
Depth-to-image generation
- Depth to Image Generation #1175 @pi314ever
Model optimizations
- Enable FusedSDPA for Mpt #1101 @Jianhong-Zhang
- Mixtral fp8 #1269 @imangohari1
- Prevent Graph break in Llama when using flash attention #1301 @pramodkumar-habanalabs
- Boost SDXL speed with initialized schedule step reset #1284 @dsocek
- Improve MPT fp8 #1256 @atakaha
- Add Whisper static generation #1275 @Spycsh
- Gemma: enabled HPU Graphs and Flash Attention #1173 @dsmertin
- Recommend jemalloc for gpt-neox-20b 8x #1350 @hsubramony
- Optimized inference of GPT-NEO model on HPU #1319 @XinyuYe-Intel
- Fix graph breaks for BART in torch.compile mode. #1379 @astachowiczhabana
- Gpt_bigcode: added internal_bucketing support #1218 @mgonchar
- refine bucket_internal for mpt #1194 @Jing1Ling
- Qwen finetuning bucketing #1130 @ssarkar2
- Enable FusedSDPA fp8 in Llama FT #1388 @pbielak
- Added gemma specific fp8 quantization file #1445 @yeonsily
Intel Neural Compressor
- Enable INC for llava models and change softmax to use torch.nn.functional.softmax as it is a module supported by INC #1325 @tthakkal
- Load INC GPTQ checkpoint & rename params #1364 @HolyFalafel
- Fix INC load weights compile error due to Transformers 4.45 upgrade. #1421 @jiminha
Vera/LN-tuning
Other
- Add callable workflow to post comments when code quality check failed #1263 @regisss
- Fix failed code quality check comment workflow #1264 @regisss
- Accelerate Diffusers CI #1265 @regisss
- Add profiler to SD3 #1267 @atakaha
- Fix profiling step with device finish execution for text-generation #1283 @libinta
- Update FusedSDPA calling method as per Gaudi documentation #1285 @yeonsily
- Switch failed code quality check comment to workflow_run #1297 @regisss
- Potential fix for the failed code quality check comment workflow #1299 @regisss
- Fix text-generation example lm_eval evaluation #1308 @changwangss
- Add section to README about Transformers development branch #1307 @regisss
- Fix eager mode in run_generation by removing graph logs #1231 @Vasud-ha
- Fix bug when running google/paligemma-3b-mix-224 #1279 @kaixuanliu
- Use native checkpointing under compile mode #1313 @xinyu-intel
- fixed fused_qkv object AttributeError due to 'LlamaConfig' #1203 @rkumar2patel
- Image to Image Generation Enabling #1196 @pi314ever
- Diffusers timing #1277 @imangohari1
- Fix eos issue in finetune/generation #1253 @sywangyi
- Update CI, tests and examples #1315 @regisss
- Fix Sentence Transformer HPU graphs for training with PEFT model #1320 @nngokhale
- Fix ZeroDivisionError in constrained beam search with static shapes #1317 @skavulya
- Update esmfold model not to use param_buffer_assignment #1324 @jiminha
- Falcon inference crash fix for falcon-40b model #1161 @yeonsily
- Add --use_kv_cache to image-to-text pipeline #1292 @KimBioInfoStudio
- Trl upgrade #1245 @sywangyi
- Fix uint4 url typo. #1340 @kding1
- Use eager attention for wav2vec2 #1333 @skaulintel
- Add _reorder_cache back to Llama for HPU #1233 @jiminha
- SDXL CI script throughput #1296 @imangohari1
- Add image so that transformers tests can run #1338 @skaulintel
- Fixes the no attribute error with the falcon multicard test #1344 @mounikamandava
- Add profiler to sdxl mlperf pipeline #1339 @Jianhong-Zhang
- Fix decoder only generation #948 @tjs-intel
- Upgrade gradient checkpointing #1347 @yafshar
- Run_generation example: fixed graph compilation statistics reporting #1352 @mgonchar
- Fix deepspeed crash with Sentence Transformer Trainer #1328 @nngokhale
- fea(ci): reduced slow test_diffusers timing. minor fixes #1330 @imangohari1
- Flash attn args for GaudiGemmaForCausalLM #1356 @kkoryun
- Transformer models generation supports user-provided input embeddings #1276 @zongwave
- Fixed the expected values after for img2img slice #1332 @imangohari1
- Gpt_big_code: make flash attention impl quantization friendly #1282 @mgonchar
- Fix OOM when inference with llama-3.1-70b #1302 @harborn
- Fix the conditional #1362 @yafshar
- Revert "use native checkpointing under compile mode" #1365 @xinyu-intel
- Remove repetitive pip install commands #1367 @MohitIntel
- Minor UX enhancement #1373 @MohitIntel
- Fix bug when running image-to-text example #1371 @kaixuanliu
- Gpt_bigcode: fixed wrong indentation #1376 @mgonchar
- Support for transformers without self.model to torch.compile #1380 @astachowiczhabana
- Only pass the use_kv_cache True to generator #1366 @yafshar
- Clean up the code and remove unnecessary class #1382 @yafshar
- Add the diffusers examples of inference Tech #1244 @yuanwu2017
- Enhance transformers test suite in Optimum-habana-4.43.4 (auto PR 07654de) #1387 @rkumar2patel
- Enhance transformers test suite in Optimum-habana-4.43.4 (auto PR 8926a4b) #1386 @rkumar2patel
- Add README.md for Sentence transformer examples with HPU device #1355 @ZhengHongming888
- Change Falcon/GPT-Neox rotary embedding function to use seq_len for #1368 @yeonsily
- Enhance Optimum-habana as per transformers-4.43.4 #1381 @rkumar2patel
- CI fix - Install stable-diffusion reqs #1389 @vidyasiv
- Fix error caused by uninitialized attn_weights #1391 @hsubramony
- Replace flash attention flag #1393 @skaulintel
- Fix DeepSpeed CI on Gaudi2 #1395 @regisss
- Truncate the cached max seq len #1394 @astachowiczhabana
- Fix gpt-neox training accuracy issue. #1397 @yeonsily
- Simplify HQT config files #1219 @Tiefen-boop
- unify_measurements.py script support to unify PCQ 70B 8x #1322 @Yantom1
- Add misc. training args #1346 @SanityRemnants
- Add quantization config for low bs case #1377 @ulivne
- Remove HQT from OHF #1257 @Yantom1
- Valid sequence length for sdpa #1183 @ssarkar2
- Multiple fixes (dynamo graph break, qwen-moe, multicard) #1410 @ssarkar2
- Change the image path for transformers tests back to the correct location #1401 @skaulintel
- Fix Gaudi2 regression tests #1403 @regisss
- Reverting some of transformer pytest funcs/values #1399 @imangohari1
- Fix StarCoder2 inference #1405 @regisss
- Change the order for test_diffusers #1406 @hsubramony
- Fix llama model text generation error #1402 @zongwave
- Datasets downgrade version to 2.21.0 #1413 @hsubramony
- Update ci sentence_transformer.sh #1424 @ZhengHongming888
- Update language-modeling README.md, add trust_remote_code for flan-t5-xl #1422 @hsubramony
- Update unify_measurements.py support info #1425 @shepark
- Fix GPT_neox incorrect output with batch query #1358 @Jianhong-Zhang
- Fix text-to-image example #1429 @regisss
- Add flag to run inference with partial dataset #1420 @pramodkumar-habanalabs
- Add peft generation example #1427 @sywangyi
- Added missing allocate_kv_cache() call in CausalLM class #1431 @yeonsily
- Fix merge error and update text-to-speech readme #1436 @hsubramony
- Fix OOM error for code llama #1437 @jiminha
- Fix error on 4bit checkpoint load with run_lm_eval on TF4.45.2 #1439 @jiminha
- GPT2 torch.compile fix #1434 @dsmertin
- Update text-gen README.md to add auto-gptq fork install steps #1442 @hsubramony
- Fix scoped linear all-reduce for starcoder model #1432 @skavulya
- Fixed recursion error in SentenceTransformer #1428 @yafshar
- Fix Llama 3.1 generation #1444 @regisss
- Remove cache folder from image data folder #1446 @shepark
v1.13.2: Patch release
Llava(-next) improvements
This patch release adds multi-card support for Llava(-next) and lets users turn recomputation for flash attention on or off (see the sketch after the list below).
- Llava: Added flash_attention_recompute arg to provide an option to enable/disable recompute #1278 @tthakkal
- Add the deepspeed injection_policy of mistral #1309 @yuanwu2017
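A minimal sketch of the new toggle, assuming the use_flash_attention and flash_attention_recompute kwargs are forwarded through generate() as in the image-to-text example; the checkpoint, image, and prompt are illustrative.
```python
# Hedged sketch: toggle flash-attention recomputation for Llava-next on Gaudi.
# Assumes the kwargs below are forwarded by generate() to the attention layers.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaNextForConditionalGeneration
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("hpu")

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
inputs = processor(text="USER: <image>\nWhat is shown here? ASSISTANT:", images=image, return_tensors="pt").to("hpu")

outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    use_flash_attention=True,
    flash_attention_recompute=True,  # recompute saves memory; set False to disable
)
print(processor.decode(outputs[0], skip_special_tokens=True))
```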
Full Changelog: v1.13.1...v1.13.2
v1.13.1: Patch release
Fixed memory regressions
- Remove _expand_inputs_for_generation for greedy search (#1266) @libinta
- Fix memory regression for modeling llama (#1271) @libinta
FSDP
FSDP checkpoint saving is fixed.
Known limitations
- ESMFold does not work on Gaudi1; this will be fixed in a future version
Full Changelog: v1.13.0...v1.13.1