DOC: using discord instead of slack & updating model to qwen2.5 in getting started doc #2775

Merged 4 commits on Jan 22, 2025
5 changes: 3 additions & 2 deletions README.md
@@ -13,6 +13,7 @@
[![PyPI Latest Release](https://img.shields.io/pypi/v/xinference.svg?style=for-the-badge)](https://pypi.org/project/xinference/)
[![License](https://img.shields.io/pypi/l/xinference.svg?style=for-the-badge)](https://github.com/xorbitsai/inference/blob/main/LICENSE)
[![Build Status](https://img.shields.io/github/actions/workflow/status/xorbitsai/inference/python.yaml?branch=main&style=for-the-badge&label=GITHUB%20ACTIONS&logo=github)](https://actions-badge.atrox.dev/xorbitsai/inference/goto?ref=main)
[![Discord](https://img.shields.io/badge/join_Discord-5462eb.svg?logo=discord&style=for-the-badge&logoColor=%23f5f5f5)](https://discord.gg/Xw9tszSkr5)
[![Slack](https://img.shields.io/badge/join_Slack-781FF5.svg?logo=slack&style=for-the-badge)](https://join.slack.com/t/xorbitsio/shared_invite/zt-1o3z9ucdh-RbfhbPVpx7prOVdM1CAuxg)
[![Twitter](https://img.shields.io/twitter/follow/xorbitsio?logo=x&style=for-the-badge)](https://twitter.com/xorbitsio)

@@ -33,7 +34,7 @@
researcher, developer, or data scientist, Xorbits Inference empowers you to unleash the full
potential of cutting-edge AI models.

<div align="center">
<i><a href="https://join.slack.com/t/xorbitsio/shared_invite/zt-1z3zsm9ep-87yI9YZ_B79HLB2ccTq4WA">👉 Join our Slack community!</a></i>
<i><a href="https://discord.gg/Xw9tszSkr5">👉 Join our Discord community!</a></i>
</div>

## 🔥 Hot Topics
@@ -47,14 +48,14 @@
- Support speech recognition model: [#929](https://github.com/xorbitsai/inference/pull/929)
- Metrics support: [#906](https://github.com/xorbitsai/inference/pull/906)
### New Models
- Built-in support for [MeloTTS](https://github.com/myshell-ai/MeloTTS): [#2760](https://github.com/xorbitsai/inference/pull/2760)
- Built-in support for [CogAgent](https://github.com/THUDM/CogAgent): [#2740](https://github.com/xorbitsai/inference/pull/2740)
- Built-in support for [HunyuanVideo](https://github.com/Tencent/HunyuanVideo): [#2721](https://github.com/xorbitsai/inference/pull/2721)
- Built-in support for [HunyuanDiT](https://github.com/Tencent/HunyuanDiT): [#2727](https://github.com/xorbitsai/inference/pull/2727)
- Built-in support for [Marco-o1](https://github.com/AIDC-AI/Marco-o1): [#2749](https://github.com/xorbitsai/inference/pull/2749)
- Built-in support for [Stable Diffusion 3.5](https://huggingface.co/collections/stabilityai/stable-diffusion-35-671785cca799084f71fa2838): [#2706](https://github.com/xorbitsai/inference/pull/2706)
- Built-in support for [CosyVoice 2](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B): [#2684](https://github.com/xorbitsai/inference/pull/2684)
- Built-in support for [Fish Speech V1.5](https://huggingface.co/fishaudio/fish-speech-1.5): [#2672](https://github.com/xorbitsai/inference/pull/2672)
- Built-in support for [F5-TTS](https://github.com/SWivid/F5-TTS): [#2626](https://github.com/xorbitsai/inference/pull/2626)
### Integrations
- [Dify](https://docs.dify.ai/advanced/model-configuration/xinference): an LLMOps platform that enables developers (and even non-developers) to quickly build useful applications based on large language models, ensuring they are visual, operable, and improvable.
- [FastGPT](https://github.com/labring/FastGPT): a knowledge-based platform built on LLMs that offers out-of-the-box data processing and model invocation capabilities, and allows workflow orchestration through Flow visualization.
2 changes: 1 addition & 1 deletion README_zh_CN.md
@@ -43,14 +43,14 @@
- Support speech recognition models: [#929](https://github.com/xorbitsai/inference/pull/929)
- Add Metrics support: [#906](https://github.com/xorbitsai/inference/pull/906)
### New Models
- Built-in support for [MeloTTS](https://github.com/myshell-ai/MeloTTS): [#2760](https://github.com/xorbitsai/inference/pull/2760)
- Built-in support for [CogAgent](https://github.com/THUDM/CogAgent): [#2740](https://github.com/xorbitsai/inference/pull/2740)
- Built-in support for [HunyuanVideo](https://github.com/Tencent/HunyuanVideo): [#2721](https://github.com/xorbitsai/inference/pull/2721)
- Built-in support for [HunyuanDiT](https://github.com/Tencent/HunyuanDiT): [#2727](https://github.com/xorbitsai/inference/pull/2727)
- Built-in support for [Marco-o1](https://github.com/AIDC-AI/Marco-o1): [#2749](https://github.com/xorbitsai/inference/pull/2749)
- Built-in support for [Stable Diffusion 3.5](https://huggingface.co/collections/stabilityai/stable-diffusion-35-671785cca799084f71fa2838): [#2706](https://github.com/xorbitsai/inference/pull/2706)
- Built-in support for [CosyVoice 2](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B): [#2684](https://github.com/xorbitsai/inference/pull/2684)
- Built-in support for [Fish Speech V1.5](https://huggingface.co/fishaudio/fish-speech-1.5): [#2672](https://github.com/xorbitsai/inference/pull/2672)
- Built-in support for [F5-TTS](https://github.com/SWivid/F5-TTS): [#2626](https://github.com/xorbitsai/inference/pull/2626)
### Integrations
- [FastGPT](https://doc.fastai.site/docs/development/custom-models/xinference/): an open-source AI knowledge-base platform built on LLMs, providing out-of-the-box data processing, model invocation, RAG retrieval, and visual AI workflow orchestration to help you easily handle complex Q&A scenarios.
- [Dify](https://docs.dify.ai/advanced/model-configuration/xinference): an LLMOps platform covering the development, deployment, maintenance, and optimization of large language models.
6 changes: 3 additions & 3 deletions doc/source/conf.py
@@ -96,9 +96,9 @@

if version_match != 'zh-cn':
html_theme_options['icon_links'].extend([{
"name": "Slack",
"url": "https://join.slack.com/t/xorbitsio/shared_invite/zt-1o3z9ucdh-RbfhbPVpx7prOVdM1CAuxg",
"icon": "fa-brands fa-slack",
"name": "Discord",
"url": "https://discord.gg/Xw9tszSkr5",
"icon": "fa-brands fa-discord",
"type": "fontawesome",
},
{
72 changes: 49 additions & 23 deletions doc/source/getting_started/using_xinference.rst
@@ -8,7 +8,7 @@ Using Xinference
Run Xinference Locally
======================

Let's start by running Xinference on a local machine and running a classic LLM model: ``llama-2-chat``.
Let's start by running Xinference on a local machine and running a classic LLM model: ``qwen2.5-instruct``.

After this quickstart, you will move on to learning how to deploy Xinference in a cluster environment.

@@ -82,14 +82,21 @@
The command line tool is ``xinference``. You can list the commands that can be used:
--help Show this message and exit.

Commands:
cached
cal-model-mem
chat
engine
generate
launch
list
login
register
registrations
remove-cache
stop-cluster
terminate
unregister
vllm-models
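
Each subcommand accepts ``--help`` for its full flag list; for example:

.. code-block:: bash

   # show the options accepted by the launch subcommand
   xinference launch --help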


You can install the Xinference Python client with minimal dependencies using the following command.
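The command itself falls in a collapsed part of this hunk; for reference, the minimal client is published as its own PyPI package:

.. code-block:: bash

   # client-only install, without the server components
   pip install xinference-client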
@@ -110,6 +117,7 @@
Currently, xinference supports the following inference engines:
* ``sglang``
* ``llama.cpp``
* ``transformers``
* ``MLX``

For details about these inference engines, please refer to :ref:`here <inference_backend>`.

@@ -145,11 +153,31 @@
you need to additionally pass the ``model_engine`` parameter.
You can retrieve information about the supported inference engines and their related parameter combinations
through the ``xinference engine`` command.
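
For example, a query along these lines lists the engine, format, and quantization combinations available for a given model (the ``--model-name`` flag is an assumption here; check ``xinference engine --help`` for the exact options in your version):

.. code-block:: bash

   # inspect which engines can serve qwen2.5-instruct on this machine
   xinference engine --model-name qwen2.5-instruct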

.. note::

Here are some recommendations on when to use which engine:

- **Linux**

- When possible, prioritize using **vLLM** or **SGLang** for better performance.
- If resources are limited, consider using **llama.cpp**, as it offers more quantization options.
- For other cases, consider using **Transformers**, which supports nearly all models.

- **Windows**

- It is recommended to use **WSL**, and in this case, follow the same choices as Linux.
- Otherwise, prefer **llama.cpp**, and for unsupported models, opt for **Transformers**.

- **Mac**

- If supported by the model, use the **MLX engine**, as it delivers the best performance (see the launch sketch after this note).
- For other cases, prefer **llama.cpp**, and for unsupported models, choose **Transformers**.
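
On Apple silicon, for instance, the recommendation above would translate into a launch command roughly like the following (a sketch: the ``mlx`` engine and format identifiers are assumptions, and the model must ship MLX weights):

.. code-block:: bash

   # launch the 0.5B qwen2.5-instruct through the MLX engine on a Mac
   xinference launch --model-engine mlx -n qwen2.5-instruct -s 0_5 -f mlx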


Run Llama-2
-----------
Run qwen2.5-instruct
--------------------

Let's start by running a built-in model: ``llama-2-chat``. When you start a model for the first time, Xinference will
Let's start by running a built-in model: ``qwen2.5-instruct``. When you start a model for the first time, Xinference will
download the model parameters from HuggingFace, which might take a few minutes depending on the size of the model weights.
We cache the model files locally, so there's no need to redownload them for subsequent starts.

@@ -163,13 +191,13 @@
XINFERENCE_MODEL_SRC=modelscope xinference-local --host 0.0.0.0 --port 9997

We can specify the model's UID using the ``--model-uid`` or ``-u`` flag. If not specified, Xinference will generate a unique ID.
This creates a new model instance with the unique ID ``my-llama-2``:
The default unique ID will be identical to the model name.

.. tabs::

.. code-tab:: bash shell

xinference launch --model-engine <inference_engine> -u my-llama-2 -n llama-2-chat -s 13 -f pytorch
xinference launch --model-engine <inference_engine> -n qwen2.5-instruct -s 0_5 -f pytorch

.. code-tab:: bash cURL

@@ -179,10 +207,9 @@
-H 'Content-Type: application/json' \
-d '{
"model_engine": "<inference_engine>",
"model_uid": "my-llama-2",
"model_name": "llama-2-chat",
"model_name": "qwen2.5-instruct",
"model_format": "pytorch",
"size_in_billions": 13
"size_in_billions": "0_5"
}'

.. code-tab:: python
@@ -191,16 +218,15 @@
from xinference.client import RESTfulClient
client = RESTfulClient("http://127.0.0.1:9997")
model_uid = client.launch_model(
model_engine="<inference_engine>",
model_uid="my-llama-2",
model_name="llama-2-chat",
model_name="qwen2.5-instruct",
model_format="pytorch",
size_in_billions=13
size_in_billions="0_5"
)
print('Model uid: ' + model_uid)

.. code-tab:: bash output

Model uid: my-llama-2
Model uid: qwen2.5-instruct
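
To verify the launch, one option is to list the running models through the same client; a minimal sketch using the documented ``RESTfulClient``:

.. code-block:: python

   from xinference.client import RESTfulClient

   client = RESTfulClient("http://127.0.0.1:9997")
   # the returned mapping should include the "qwen2.5-instruct" uid
   print(client.list_models())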

.. note::
For some engines, such as vllm, users need to specify the engine-related parameters when
launching the model:
@@ -209,11 +235,11 @@

.. code-block:: bash

xinference launch --model-engine vllm -u my-llama-2 -n llama-2-chat -s 13 -f pytorch --gpu_memory_utilization 0.9
xinference launch --model-engine vllm -n qwen2.5-instruct -s 0_5 -f pytorch --gpu_memory_utilization 0.9

``gpu_memory_utilization=0.9`` will be passed to vLLM when launching the model.

Congrats! You now have ``llama-2-chat`` running with Xinference. Once the model is running, we can try it out either via cURL,
Congrats! You now have ``qwen2.5-instruct`` running with Xinference. Once the model is running, we can try it out either via cURL,
or via Xinference's python client:

.. tabs::
@@ -225,7 +251,7 @@
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "my-llama-2",
"model": "qwen2.5-instruct",
"messages": [
{
"role": "system",
@@ -242,7 +268,7 @@

from xinference.client import RESTfulClient
client = RESTfulClient("http://127.0.0.1:9997")
model = client.get_model("my-llama-2")
model = client.get_model("qwen2.5-instruct")
model.chat(
messages=[
{"role": "user", "content": "Who won the world series in 2020?"}
@@ -253,7 +279,7 @@

{
"id": "chatcmpl-8d76b65a-bad0-42ef-912d-4a0533d90d61",
"model": "my-llama-2",
"model": "qwen2.5-instruct",
"object": "chat.completion",
"created": 1688919187,
"choices": [
@@ -281,7 +307,7 @@
Xinference provides OpenAI-compatible APIs for its supported models, so you can use them directly with the official OpenAI Python client:

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:9997/v1", api_key="not used actually")

response = client.chat.completions.create(
model="my-llama-2",
model="qwen2.5-instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the largest animal?"}
@@ -345,17 +371,17 @@
When you no longer need a model that is currently running, you can remove it in the following way:

.. code-tab:: bash shell

xinference terminate --model-uid "my-llama-2"
xinference terminate --model-uid "qwen2.5-instruct"

.. code-tab:: bash cURL

curl -X DELETE http://127.0.0.1:9997/v1/models/my-llama-2
curl -X DELETE http://127.0.0.1:9997/v1/models/qwen2.5-instruct

.. code-tab:: python

from xinference.client import RESTfulClient
client = RESTfulClient("http://127.0.0.1:9997")
client.terminate_model(model_uid="my-llama-2")
client.terminate_model(model_uid="qwen2.5-instruct")

Deploy Xinference In a Cluster
==============================
@@ -398,7 +424,7 @@
On each of the other servers where you want to run Xinference workers, run the following command:

.. code-block:: bash

xinference launch -n llama-2-chat -s 13 -f pytorch -e "http://${supervisor_host}:9997"
xinference launch -n qwen2.5-instruct -s 0_5 -f pytorch -e "http://${supervisor_host}:9997"
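
For context, starting the cluster itself is covered by the collapsed lines above; per the surrounding docs, the supervisor and workers are brought up roughly as follows (host values are placeholders):

.. code-block:: bash

   # on the supervisor server
   xinference-supervisor -H "${supervisor_host}"

   # on each worker server
   xinference-worker -e "http://${supervisor_host}:9997" -H "${worker_host}"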

Using Xinference With Docker
=============================
8 changes: 4 additions & 4 deletions doc/source/index.rst
@@ -250,10 +250,10 @@ Getting Involved

:fab:`weixin` Find community on WeChat

.. grid-item-card::
:link: https://join.slack.com/t/xorbitsio/shared_invite/zt-1o3z9ucdh-RbfhbPVpx7prOVdM1CAuxg
:fab:`slack` Find community on Slack
.. grid-item-card::
:link: https://discord.gg/Xw9tszSkr5

:fab:`discord` Find community on Discord

.. grid-item-card::
:link: https://github.com/xorbitsai/inference/issues/new/choose