v1.18.0: SynapseAI v1.21, Accelerate, CogVideoX, Llava-onevision
SynapseAI v1.21
This release has been tested on and validated for SynapseAI v1.21.
Accelerate
Gaudi is now natively supported in Accelerate; check out the documentation for more information. A minimal usage sketch follows the list of related changes below.
- Update GaudiAccelerator #1876 @IlyasMoutawwakil
- Fix lost modules in regional compilation #1885 @xinyu-intel
- Fix FSDP and get rid of GaudiPartialState #1942 @IlyasMoutawwakil
- Restore dynamic compilation setting and fix compile_regions call #1973 @yafshar
- Hot fix regional compilation #2005 @IlyasMoutawwakil
- Fix fp8 #2010 @IlyasMoutawwakil
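Since Gaudi support now lives in Accelerate itself, the standard device-agnostic Accelerate workflow applies on HPU. Below is a minimal training-loop sketch, assuming a machine with the Habana PyTorch bridge installed; the model, optimizer, and data are placeholders:

```python
import torch
from accelerate import Accelerator

# Accelerator() detects the available device (HPU on a Gaudi machine) automatically.
accelerator = Accelerator()

# Toy model, optimizer, and data standing in for a real training setup.
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)

# prepare() moves everything to the detected device and wraps it for distributed runs.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for inputs, labels in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)  # use accelerator.backward() instead of loss.backward()
    optimizer.step()
```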
Diffusers
- fea(diffusers): Upgraded to version 0.32.0 #1939 @imangohari1
- fea(): Diffusers upgrade to 0.33.1 #1981 @imangohari1
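The Gaudi-specific pipelines keep the same entry points across the Diffusers upgrade. A minimal text-to-image sketch with `GaudiStableDiffusionPipeline`; the checkpoint and Gaudi config below are illustrative:

```python
from optimum.habana.diffusers import GaudiDDIMScheduler, GaudiStableDiffusionPipeline

model_name = "CompVis/stable-diffusion-v1-4"  # illustrative checkpoint
scheduler = GaudiDDIMScheduler.from_pretrained(model_name, subfolder="scheduler")
pipeline = GaudiStableDiffusionPipeline.from_pretrained(
    model_name,
    scheduler=scheduler,
    use_habana=True,      # run on HPU
    use_hpu_graphs=True,  # capture HPU graphs to reduce host overhead
    gaudi_config="Habana/stable-diffusion",
)
images = pipeline(prompt="An astronaut riding a horse on Mars").images
```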
CogVideoX
- Add CogVideoX support for Gaudi #1600 @nc-BobLee
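For CogVideoX on Gaudi, the sketch below is hypothetical: the class name `GaudiCogVideoXPipeline` and its arguments are assumed from the naming convention of the other Gaudi diffusion pipelines, and the checkpoint is illustrative; see the text-to-video example in the repository for the actual entry point.

```python
# Hypothetical sketch: the pipeline class and arguments are assumptions, not a confirmed API.
from optimum.habana.diffusers import GaudiCogVideoXPipeline  # assumed class name

pipeline = GaudiCogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",   # illustrative checkpoint
    use_habana=True,        # run on HPU
    use_hpu_graphs=True,
    gaudi_config="Habana/stable-diffusion",  # illustrative Gaudi config
)
frames = pipeline(prompt="A panda playing guitar by a lake", num_frames=49).frames
```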
GLM4V
- Add GLM4V #1668 @mengker33
Siglip and Llava-onevision
- Add support for Siglip and Llava Onevision #1883 @emascarenhas
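Llava-OneVision runs through the usual transformers classes on Gaudi once the library has been patched by optimum-habana. A minimal sketch, assuming `adapt_transformers_to_gaudi()` is called before loading (as the repository examples do); the checkpoint and image URL are illustrative:

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

# Patch transformers with the Gaudi-optimized model implementations.
adapt_transformers_to_gaudi()

model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"  # illustrative checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("hpu")

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What is shown in this image?"}]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to("hpu", torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))
```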
Model optimizations
- Optimize memory utilization by keeping logits in BF16 #1859 @kalyank007
- Integrate DistributedAttention for Qwen2 #1860 @Jianhong-Zhang
- [Llama-vision] Add support for Fused RMS Norm #1892 @ANSHUMAN87
- Enable torch compile for llama 3.2 vision #1873 @jaygala223
- Flag to enable leaf promotion to avoid graph breaks in MLP for compile #1880 @bhargaveede
- Adding Deepspeed config for Llama3 Fine Tuning (#165) #1881 @bhargaveede
- Add trim_logits support in deepseekV3 #1933 @jthakurH
- Add flag to enable compiled_autograd with Deepspeed for training #1785 @vivekgoe
- chatglm: Fix a bug when attention mask is None #1896 @mengker33
- Optimized DeepSeek-V2 attention prefill with MHA #1791 @gyou2021
- [Llama-Vision] Trim logits #1894 @ANSHUMAN87
- Align cross_attention_mask for Llama 3.2 90B to avoid partial writes and graph retracing #1917 @kalyank007
- Add FSDP config for Granite model #1897 @kplau1128
- [Llama-Vision] Add support for bucketing #1895 @ANSHUMAN87
- Add Moonlight Support #1868 @jinyouzhi
- Add support for expert parallelism with mixtral #1908 @kwisniewski98
- Fix in-place operation on a tensor that requires grad in modeling_qwen2_vl.py #1970 @emascarenhas
- Adjust VideoLlavaProcessor to avoid performance regression on gaudi3 #1969 @kaixuanliu
- Speed up FLUX training over 2x with Gaudi optimized attention #1963 @dsocek
- [llama-vision] Remove token_idx_cpu parameter #2018 @ugolowic
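Several of the items above enable or tune torch.compile paths on Gaudi (e.g. #1873, #1785, #1880, #1918). For reference, compiling a model for HPU goes through the dedicated `hpu_backend`; a minimal sketch, assuming the Habana PyTorch bridge is installed and using an illustrative checkpoint:

```python
import torch
import habana_frameworks.torch.core as htcore  # noqa: F401  # registers the HPU device
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("hpu")

# Compile with the Gaudi backend; the first forward pass triggers graph compilation.
model = torch.compile(model, backend="hpu_backend")

inputs = tokenizer("Deep learning is", return_tensors="pt").to("hpu")
with torch.no_grad():
    logits = model(**inputs).logits
```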
Other
- Makefile improvements #1811 @jasi306
- [DeepSeek-V3] README update #1911 @ANSHUMAN87
- Skipping falcon rope scaling test #1916 @karol-brejna-i
- Workaround for DS issue in Llama #1932 @ugolowic
- Upgrade LM Eval to 0.4.7 #1901 @astachowiczhabana
- Disabling timers synchronization #1879 @bhargaveede
- Limit max pos embeds to 8k to prevent OOM #1923 @jaygala223
- Fix prompt argument handling in run_pipeline.py #1874 @varu060603
- Allow offline mode in CI tests #1924 @astachowiczhabana
- Adding memory and graph stats #1858 @jaygala223
- Enable QLoRA tests with torch.compile mode #1918 @ckvermaAI
- detr: fix possible incorrect tensor type #1899 @mgonchar
- Fix --save_last_ckpt if --save_strategy no is set #1934 @vidyasiv
- Reimplement HabanaGenerationTime #1920 @ugolowic
- Pad the examples for QLoRa finetuning test #1941 @ckvermaAI
- Reimplement HabanaGenerationTime fix for timer_checkpoint in sdxl training #1945 @gplutop7
- Move bitsandbytes requirements from setup.py to bnb tests #1946 @ckvermaAI
- Support allow_unspec_int_on_nn_module #1887 @xinyu-intel
- Tokenizer config fix for dynamic mode #1903 @pramodkumar-habanalabs
- Support compile from the 2nd iteration #1886 @xinyu-intel
- fea(): ReadMe remote_trust fixes #1940 @imangohari1
- Run upstream tests #1938 @IlyasMoutawwakil
- Fix READMEs - SD paths and LLM PEFT example #1949 @dsocek
- Add average latency metrics #1954 @RongLei-intel
- Bitsandbytes installation for qlora tests #1951 @ckvermaAI
- Update datasets requirement in examples #1956 @regisss
- Use data cache in slow_tests_8x #1914 @karol-brejna-i
- Add sentencepiece to requirements to support vicuna text generation #1962 @tthakkal
- Fix FLUX fine-tuning script #1960 @dsocek
- Fix typos #1967 @omahs
- Update t5-small samples_per_second value #1968 @12010486
- fea(): Added the --sdp_on_bf16 to textual inversion example #1964 @imangohari1
- pytest t5 roberta fix #1971 @imangohari1
- Update makefile for explicit lazy mode #1925 @jasi306
- fea(): Added PT_HPU_LAZY_MODE=1 for diffuser tests #1975 @imangohari1
- Fix deepspeed zero3 #1977 @IlyasMoutawwakil
- Enable regional compilation in text generation #1927 @karol-brejna-i
- README changes for Llama3.1 8B Finetuning with LoRA #1947 @bhargaveede
- Integrate pt2e quantization changes into the main script #1875 @vivek5-ai
- Use IKS runners for CI #1953 @regisss
- Fix sentence-transformers CI with new runners #1980 @regisss
- Update dynamic env handling #1978 @yafshar
- Fix wrong calculation of e2e latency #1984 @RongLei-intel
- Update test baseline for mistralai/Mixtral-8x7B-v0.1 #1987 @yafshar
- Switch to spawn in PyTorch DataLoader when num_workers > 0 #1982 @Wei-Lin-Intel
- Enable mixtral 8x7b accuracy evaluation #1986 @rbogdano
- Update readme files for explicit lazy mode #1921 @jasi306
- Update README examples #2020 @pbielak
- Pin latest optimum to force mutual updates #2016 @IlyasMoutawwakil