Releases: InternLM/lmdeploy
v0.7.3
What's Changed
🚀 Features
- Add Qwen3 and Qwen3MoE by @lzhangzz in #3305 (see the pipeline sketch after this list)
- [Feature] support qwen3 and qwen3-moe for pytorch engine by @CUHKSZzxy in #3315
- [ascend] support deepseekv2 by @yao-fengchen in #3206
- support ascend w8a8 graph_mode by @yao-fengchen in #3267
- support Llama4 by @grimoire in #3408
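As a quick orientation for the newly added model families, below is a minimal sketch of serving one of them through the high-level `pipeline` API; the model ID `Qwen/Qwen3-8B` is an illustrative assumption, not something this release pins down.

```python
# Minimal sketch, assuming the standard high-level pipeline API.
# The model ID is an illustrative assumption.
from lmdeploy import pipeline

pipe = pipeline('Qwen/Qwen3-8B')  # fetched from the hub if not cached locally
responses = pipe(['Give me a one-line intro to LMDeploy.'])
print(responses[0].text)
```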
💥 Improvements
- Add spaces_between_special_tokens to /v1/interactive and make it compatible with empty text by @AllentDan in #3283 (see the request sketch after this list)
- add env var to control timeout by @CUHKSZzxy in #3291
- optimize mla, remove load v by @grimoire in #3334
- refactor dlinfer rope by @yao-fengchen in #3326
- enable qwenvl2.5 graph mode on ascend by @jinminxi104 in #3367
- Optimize ascend moe by @yao-fengchen in #3364
- find port by @grimoire in #3429
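For the `spaces_between_special_tokens` change in #3283, here is a hedged sketch of passing the new field to the interactive endpoint; the route and port follow the usual `lmdeploy serve api_server` defaults and may differ in your deployment.

```python
# Hedged sketch: the exact route and extra fields depend on the server
# version; port 23333 is the api_server default.
import requests

resp = requests.post(
    'http://localhost:23333/v1/chat/interactive',
    json={
        'prompt': 'Hello',
        'spaces_between_special_tokens': False,  # field added in #3283
    },
)
print(resp.json())
```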
🐞 Bug fixes
- fix activation grid oversize by @grimoire in #3282
- Set ensure_ascii=False for tool calling by @AllentDan in #3295
- add v check by @grimoire in #3307
- Fix Qwen3MoE config parsing by @lzhangzz in #3336
- Fix finish reasons by @AllentDan in #3338
- remove think_end_token_id in streaming content by @AllentDan in #3327
- Fix the finish_reason by @AllentDan in #3350
- support List[dict] prompt input without do_preprocess by @irexyc in #3385 (see the sketch after this list)
- fix tensor dispatch in dynamo by @wanfengcxz in #3417
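The `List[dict]` input from #3385 lets OpenAI-style messages bypass chat-template preprocessing. A self-contained sketch, with the model ID again an illustrative assumption:

```python
# Sketch of the List[dict] input path from #3385: pass OpenAI-style messages
# directly while skipping chat-template preprocessing.
from lmdeploy import pipeline

pipe = pipeline('Qwen/Qwen3-8B')  # illustrative model ID
messages = [{'role': 'user', 'content': 'What is LMDeploy?'}]
print(pipe(messages, do_preprocess=False))
```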
📚 Documentations
- update ascend doc by @yao-fengchen in #3420
🌐 Other
- bump version to v0.7.2.post1 by @lvhan028 in #3298
- Optimize internvit by @caikun-pjlab in #3316
- bump version to v0.7.3 by @lvhan028 in #3416
New Contributors
- @wanfengcxz made their first contribution in #3417
- @caikun-pjlab made their first contribution in #3316
Full Changelog: v0.7.2...v0.7.3
v0.7.2.post1
What's Changed
💥 Improvements
- Add spaces_between_special_tokens to /v1/interactive and make it compatible with empty text by @AllentDan in #3283
- add env var to control timeout by @CUHKSZzxy in #3291
🐞 Bug fixes
- fix activation grid oversize by @grimoire in #3282
- Set ensure_ascii=False for tool calling by @AllentDan in #3295
Full Changelog: v0.7.2...v0.7.2.post1
v0.7.2
What's Changed
🚀 Features
- [Feature] support qwen2.5-vl for pytorch engine by @CUHKSZzxy in #3194
- Support reward models by @lvhan028 in #3192
- Add collective communication kernels by @lzhangzz in #3163
- PytorchEngine multi-node support v2 by @grimoire in #3147
- Add flash mla by @AllentDan in #3218
- Add gemma3 implementation by @AllentDan in #3272
💥 Improvements
- remove update badwords by @grimoire in #3183
- default executor ray by @grimoire in #3210
- change ascend&camb default_batch_size to 256 by @jinminxi104 in #3251
- Tool reasoning parsers and streaming function call by @AllentDan in #3198 (see the streaming sketch after this list)
- remove torchelastic flag by @grimoire in #3242
- disable flashmla warning on sm<90 by @grimoire in #3271
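For the streaming function-call support in #3198, a hedged sketch against the OpenAI-compatible endpoint; the tool schema and model name are illustrative assumptions.

```python
# Hedged sketch: streams tool-call argument deltas from an lmdeploy
# api_server. Tool schema and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url='http://localhost:23333/v1', api_key='none')
stream = client.chat.completions.create(
    model='internlm2_5-7b-chat',  # whichever model the server hosts
    messages=[{'role': 'user', 'content': "What's the weather in Paris?"}],
    tools=[{
        'type': 'function',
        'function': {
            'name': 'get_weather',
            'parameters': {
                'type': 'object',
                'properties': {'city': {'type': 'string'}},
            },
        },
    }],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.tool_calls:
        print(delta.tool_calls[0].function.arguments or '', end='')
```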
🐞 Bug fixes
- Fix missing cli chat option by @lzhangzz in #3209
- [ascend] fix multi-card distributed inference failures by @tangzhiyi11 in #3215
- fix for small cache-max-entry-count by @grimoire in #3221
- [dlinfer] fix glm-4v graph mode on ascend by @jinminxi104 in #3235
- fix qwen2.5 pytorch engine dtype error on NPU by @tcye in #3247
- [Fix] failed to update the tokenizer's eos_token_id into stop_word list by @lvhan028 in #3257
- fix dsv3 gate scaling by @grimoire in #3263
- Fix the bug for reading dict error by @GxjGit in #3196
- Fix get ppl by @lvhan028 in #3268
📚 Documentations
- Specify lmdeploy version in benchmark guide by @lyj0309 in #3216
- [ascend] add Ascend docker image by @jinminxi104 in #3239
🌐 Other
- [ci] testcase refactoring by @zhulinJulia24 in #3151
- [ci] add testcase for native communicator by @zhulinJulia24 in #3217
- [ci] add volc evaluation testcase by @zhulinJulia24 in #3240
- [ci] remove v100 testconfig by @zhulinJulia24 in #3253
- add rdma dependencies into docker file by @CUHKSZzxy in #3262
- docs: update ascend docs for docker running by @CyCle1024 in #3266
- bump version to v0.7.2 by @lvhan028 in #3252
Full Changelog: v0.7.1...v0.7.2
v0.7.1
What's Changed
🚀 Features
- support release pipeline by @irexyc in #3069
- [feature] add dlinfer w8a8 support. by @Reinerzhou in #2988
- [maca] support deepseekv2 for maca backend. by @Reinerzhou in #2918
- [Feature] support deepseek-vl2 for pytorch engine by @CUHKSZzxy in #3149
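For the new VLM support such as deepseek-vl2, a hedged sketch of the usual image-plus-prompt call; the model ID and image path are illustrative assumptions.

```python
# Hedged sketch of a VLM call; model ID and image path are assumptions.
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('deepseek-ai/deepseek-vl2')
image = load_image('demo.jpg')  # a local file or URL
print(pipe(('Describe this image.', image)))
```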
💥 Improvements
- use weights iterator while loading by @RunningLeon in #2886
- Add deepseek-r1 chat template by @AllentDan in #3072
- Update tokenizer by @lvhan028 in #3061
- Set max concurrent requests by @AllentDan in #2961
- remove logitswarper by @grimoire in #3109
- Update benchmark script and user guide by @lvhan028 in #3110
- support eos_token list in turbomind by @irexyc in #3044
- Use aiohttp inside proxy server && add --disable-cache-status argument by @AllentDan in #3020
- Update runtime package dependencies by @zgjja in #3142
- Make turbomind support embedding inputs on GPU by @chengyuma in #3177
🐞 Bug fixes
- [dlinfer] fix ascend qwen2_vl graph_mode by @yao-fengchen in #3045
- fix error in interactive api by @lvhan028 in #3074
- fix sliding window mgr by @grimoire in #3068
- More arguments in api_client, update docstrings by @AllentDan in #3077
- Add system role to deepseek chat template by @AllentDan in #3031
- Fix xcomposer2d5 by @irexyc in #3087
- fix user guide about cogvlm deployment by @lvhan028 in #3088
- fix positional argument by @lvhan028 in #3086
- Fix UT of deepseek chat template by @lvhan028 in #3125
- Fix internvl2.5 error after eviction by @grimoire in #3122
- Fix cogvlm and phi3vision by @RunningLeon in #3137
- [fix] fix vl gradio, use pipeline api and remove interactive chat by @irexyc in #3136
- fix the issue that stop_token may be less than defined in model.py by @irexyc in #3148
- fix typing by @lz1998 in #3153
- fix min length penalty by @irexyc in #3150
- fix default temperature value by @irexyc in #3166
- Use pad_token_id as image_token_id for vl models by @RunningLeon in #3158
- Fix tool call prompt for InternLM and Qwen by @AllentDan in #3156
- Update qwen2.py by @GxjGit in #3174
- fix temperature=0 by @grimoire in #3176
- fix blocked fp8 moe by @grimoire in #3181
- fix deepseekv2 has no attribute use_mla error by @CUHKSZzxy in #3188
- fix unstoppable chat by @lvhan028 in #3189
🌐 Other
- [ci] add internlm3 into testcase by @zhulinJulia24 in #3038
- add internlm3 to supported models by @lvhan028 in #3041
- update pre-commit config by @lvhan028 in #2683
- [maca] add cudagraph support on maca backend. by @Reinerzhou in #2834
- bump version to v0.7.0.post1 by @lvhan028 in #3076
- bump version to v0.7.0.post2 by @lvhan028 in #3094
- [Fix] fix the URL judgment problem in Windows by @Lychee-acaca in #3103
- bump version to v0.7.0.post3 by @lvhan028 in #3115
- [ci] fix some fail in daily testcase by @zhulinJulia24 in #3134
- Bump version to v0.7.1 by @lvhan028 in #3178
New Contributors
- @Lychee-acaca made their first contribution in #3103
- @lz1998 made their first contribution in #3153
- @GxjGit made their first contribution in #3174
- @chengyuma made their first contribution in #3177
- @CUHKSZzxy made their first contribution in #3149
Full Changelog: v0.7.0...v0.7.1
v0.7.0.post3
What's Changed
💥 Improvements
- Set max concurrent requests by @AllentDan in #2961
- remove logitswarper by @grimoire in #3109
🐞 Bug fixes
- fix user guide about cogvlm deployment by @lvhan028 in #3088
- fix positional argument by @lvhan028 in #3086
🌐 Other
- [Fix] fix the URL judgment problem in Windows by @Lychee-acaca in #3103
- bump version to v0.7.0.post3 by @lvhan028 in #3115
New Contributors
- @Lychee-acaca made their first contribution in #3103
Full Changelog: v0.7.0.post2...v0.7.0.post3
LMDeploy Release v0.7.0.post2
What's Changed
💥 Improvements
- Add deepseek-r1 chat template by @AllentDan in #3072
- Update tokenizer by @lvhan028 in #3061
🐞 Bug fixes
- Add system role to deepseek chat template by @AllentDan in #3031
- Fix xcomposer2d5 by @irexyc in #3087
Full Changelog: v0.7.0.post1...v0.7.0.post2
LMDeploy Release v0.7.0.post1
What's Changed
💥 Improvements
- use weights iterator while loading by @RunningLeon in #2886
🐞 Bug fixes
- [dlinfer] fix ascend qwen2_vl graph_mode by @yao-fengchen in #3045
- fix error in interactive api by @lvhan028 in #3074
- fix sliding window mgr by @grimoire in #3068
- More arguments in api_client, update docstrings by @AllentDan in #3077
🌐 Other
- [ci] add internlm3 into testcase by @zhulinJulia24 in #3038
- add internlm3 to supported models by @lvhan028 in #3041
- update pre-commit config by @lvhan028 in #2683
- [maca] add cudagraph support on maca backend. by @Reinerzhou in #2834
- bump version to v0.7.0.post1 by @lvhan028 in #3076
Full Changelog: v0.7.0...v0.7.0.post1
LMDeploy Release v0.7.0
What's Changed
🚀 Features
- Support moe w8a8 in pytorch engine by @grimoire in #2894
- Support DeepseekV3 fp8 by @grimoire in #2967
- support new backend cambricon by @JackWeiw in #3002
- support moe fp8 by @RunningLeon in #3007
- add internlm3-dense(turbomind) & chat template by @irexyc in #3024
- support internlm3 on pt by @RunningLeon in #3026
- Support internlm3 quantization by @AllentDan in #3027
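Once an internlm3 checkpoint is quantized (e.g. with the `lmdeploy lite auto_awq` CLI), loading it follows the usual pattern; the model ID below is an illustrative assumption.

```python
# Hedged sketch: serving an AWQ-quantized checkpoint with the turbomind
# backend. The model ID is an illustrative assumption.
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    'internlm/internlm3-8b-instruct-awq',
    backend_config=TurbomindEngineConfig(model_format='awq'),
)
print(pipe(['Summarize AWQ in one sentence.'])[0].text)
```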
💥 Improvements
- Optimize awq kernel in pytorch engine by @grimoire in #2965
- Support fp8 w8a8 for pt backend by @RunningLeon in #2959
- Optimize lora kernel by @grimoire in #2975
- Remove threadsafe by @grimoire in #2907
- Refactor async engine & turbomind IO by @lzhangzz in #2968
- [dlinfer] rope refine by @JackWeiw in #2984
- Expose spaces_between_special_tokens by @AllentDan in #2991
- [dlinfer] change llm op interface of paged_prefill_attention by @JackWeiw in #2977
- Update request logger by @lvhan028 in #2981
- remove decoding by @grimoire in #3016
🐞 Bug fixes
- Fix build crash in nvcr.io/nvidia/pytorch:24.06-py3 image by @zgjja in #2964
- add tool role in BaseChatTemplate as tool response in messages by @AllentDan in #2979
- Fix ascend dockerfile by @jinminxi104 in #2989
- fix internvl2 qk norm by @grimoire in #2987
- fix xcomposer2 when transformers is upgraded greater than 4.46 by @irexyc in #3001
- Fix get_ppl & get_logits by @lvhan028 in #3008
- Fix typo in w4a16 guide by @Yan-Xiangjun in #3018
- fix blocked fp8 moe kernel by @grimoire in #3009
- Fix async engine by @lzhangzz in #3029
- [hotfix] Fix get_ppl by @lvhan028 in #3023
- Fix MoE gating for DeepSeek V2 by @lzhangzz in #3030
- Fix empty response for pipeline by @lzhangzz in #3034
- Fix potential hang during TP model initialization by @lzhangzz in #3033
🌐 Other
- [ci] add w8a8 and internvl2.5 models into testcase by @zhulinJulia24 in #2949
- bump version to v0.7.0 by @lvhan028 in #3010
New Contributors
- @zgjja made their first contribution in #2964
- @Yan-Xiangjun made their first contribution in #3018
Full Changelog: 0.6.5...v0.7.0
LMDeploy Release v0.6.5
What's Changed
🚀 Features
- [dlinfer] feat: add DlinferFlashAttention to support qwen vl. by @Reinerzhou in #2952
💥 Improvements
- refactor PyTorchEngine check env by @grimoire in #2870
- refine multi-backend setup.py by @jinminxi104 in #2880
- Refactor VLM modules by @lvhan028 in #2810
- [dlinfer] only compile the language model in vl models by @tangzhiyi11 in #2893
- Optimize tp broadcast by @grimoire in #2889
- unfreeze torch version in dockerfile by @RunningLeon in #2906
- support tp > n_kv_heads for pt engine by @RunningLeon in #2872 (see the sketch after this list)
- replicate kv for some models when tp is divisible by kv_head_num by @irexyc in #2874
- Fall back to pytorch engine when the model is quantized by smooth quant by @lvhan028 in #2953
- Torchrun launching multiple api_server by @AllentDan in #2402
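For the tensor-parallel items above (#2872, #2874), a hedged sketch of requesting a TP degree that may exceed the model's kv-head count on the PyTorch engine; the model ID and tp value are illustrative assumptions.

```python
# Hedged sketch: tp=4 on the PyTorch engine; with #2872/#2874 this may now
# exceed the model's kv-head count. Model ID and tp value are assumptions.
from lmdeploy import pipeline, PytorchEngineConfig

pipe = pipeline(
    'internlm/internlm2_5-7b-chat',
    backend_config=PytorchEngineConfig(tp=4),
)
print(pipe(['Hello'])[0].text)
```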
🐞 Bug fixes
- [Feature] Support for loading lora adapter weights in safetensors format by @Galaxy-Husky in #2860
- fix cpu cache by @grimoire in #2881
- Fix args type in docstring by @Galaxy-Husky in #2888
- Fix llama3.1 chat template by @fzyzcjy in #2862
- Fix typo by @ghntd in #2916
- fix: Incorrect stats size during inference of throughput benchmark when concurrency > num_prompts by @pancak3 in #2928
- fix lora name and rearrange wqkv for internlm2 by @RunningLeon in #2912
- [dlinfer] fix moe op for dlinfer. by @Reinerzhou in #2917
- [side effect] fix vlm quant failed by @lvhan028 in #2914
- fix torch_dtype by @RunningLeon in #2933
- support unaligned qkv heads by @grimoire in #2930
- fix mllama inference without image by @RunningLeon in #2947
- Support torch_dtype modification and update FAQs for AWQ quantization by @AllentDan in #2898
- Fix exception handler for proxy server by @AllentDan in #2901
- Fix torch_dtype in lite by @AllentDan in #2956
- [side-effect] bring back quantization of qwen2-vl, glm4v and etc. by @lvhan028 in #2954
- add a thread pool executor to control the vl engine traffic by @lvhan028 in #2970
- [side-effect] fix gradio demo error by @lvhan028 in #2976
🌐 Other
- [dlinfer] fix engine checker by @tangzhiyi11 in #2891
- Bump version to v0.6.5 by @lvhan028 in #2955
New Contributors
- @Galaxy-Husky made their first contribution in #2860
- @fzyzcjy made their first contribution in #2862
- @ghntd made their first contribution in #2916
- @pancak3 made their first contribution in #2928
Full Changelog: v0.6.4...0.6.5
LMDeploy Release v0.6.4
What's Changed
🚀 Features
- feature: support qwen2.5 function_call by @akai-shuuichi in #2737
- [Feature] support minicpm-v_2_6 for pytorch engine. by @Reinerzhou in #2767
- Support qwen2-vl AWQ quantization by @AllentDan in #2787
- Add DeepSeek-V2 support by @lzhangzz in #2763
- [ascend] feat: support kv int8 by @yao-fengchen in #2736
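For the Ascend kv int8 feature (#2736), a hedged sketch; `quant_policy=8` conventionally requests an int8 kv-cache, and the device type and model ID are illustrative assumptions.

```python
# Hedged sketch of kv int8 on Ascend (#2736): quant_policy=8 requests an
# int8 kv-cache. device_type and model ID are illustrative assumptions.
from lmdeploy import pipeline, PytorchEngineConfig

pipe = pipeline(
    'internlm/internlm2_5-7b-chat',
    backend_config=PytorchEngineConfig(device_type='ascend', quant_policy=8),
)
```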
💥 Improvements
- Optimize update_step_ctx on Ascend by @jinminxi104 in #2804
- Add Ascend installation adapter by @zhabuye in #2817
- Refactor turbomind (2/N) by @lzhangzz in #2818
- add openssh-server installation in dockerfile by @lvhan028 in #2830
- Add version restrictions in runtime_ascend.txt to ensure functionality by @zhabuye in #2836
- better kv allocate by @grimoire in #2814
- Update internvl chat template by @AllentDan in #2832
- profile throughput without new threads by @grimoire in #2826
- [dlinfer] change dlinfer kv_cache layout and adjust paged_prefill_attention api by @Reinerzhou in #2847
- [maca] add env to support different mm layout on maca. by @Reinerzhou in #2835
- Supports W8A8 quantization for more models by @AllentDan in #2850
🐞 Bug fixes
- disable prefix-caching for vl model by @grimoire in #2825
- Fix gemma2 accuracy through the correct softcapping logic by @AllentDan in #2842
- fix accessing before initialization by @lvhan028 in #2845
- fix the logic to verify whether AutoAWQ has been successfully installed by @grimoire in #2844
- check whether backend_config is None or not before accessing its attr by @lvhan028 in #2848
- [ascend] convert kv cache to nd format in ascend graph mode by @tangzhiyi11 in #2853
📚 Documentations
- Update supported models & Ascend doc by @jinminxi104 in #2765
- update supported models by @lvhan028 in #2849
🌐 Other
- [CI] Split vl testcases into turbomind and pytorch backend by @zhulinJulia24 in #2751
- [dlinfer] Fix qwenvl rope error for dlinfer backend by @JackWeiw in #2795
- [CI] add more testcase for mllm models by @zhulinJulia24 in #2791
- Update dlinfer-ascend version in runtime_ascend.txt by @jinminxi104 in #2865
- bump version to v0.6.4 by @lvhan028 in #2864
New Contributors
- @akai-shuuichi made their first contribution in #2737
- @JackWeiw made their first contribution in #2795
- @zhabuye made their first contribution in #2817
Full Changelog: v0.6.3...v0.6.4