Add IPEX-LLM with GPU #24

Merged (19 commits) on May 23, 2024
14 changes: 8 additions & 6 deletions .github/workflows/publish_sub_package.yml
@@ -4,14 +4,19 @@ on:
   push:
     branches:
       - main
+  pull_request:
+    branches: [main, ipex-llm-llm-gpu]
+    paths:
+      - ".github/workflows/publish_sub_package.yml"
+      - "llama-index-integrations/**"
 
 env:
   POETRY_VERSION: "1.6.1"
   PYTHON_VERSION: "3.10"
 
 jobs:
   publish_subpackage_if_needed:
-    if: github.repository == 'run-llama/llama_index'
+    # if: github.repository == 'run-llama/llama_index'
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v3
@@ -30,14 +35,11 @@ jobs:
         run: |
           echo "changed_files=$(git diff --name-only ${{ github.event.before }} ${{ github.event.after }} | grep -v llama-index-core | grep llama-index | grep pyproject | xargs)" >> $GITHUB_OUTPUT
       - name: Publish changed packages
-        env:
-          PYPI_TOKEN: ${{ secrets.LLAMA_INDEX_PYPI_TOKEN }}
         run: |
-          for file in ${{ steps.changed-files.outputs.changed_files }}; do
+          for file in llama-index-integrations/llms/llama-index-llms-ipex-llm/pyproject.toml; do
             cd `echo $file | sed 's/\/pyproject.toml//g'`
             poetry lock
             pip install -e .
-            poetry config pypi-token.pypi $PYPI_TOKEN
-            poetry publish --build
+            poetry publish --build --dry-run
             cd -
           done
285 changes: 285 additions & 0 deletions docs/docs/examples/llm/ipex_llm_gpu.ipynb
@@ -0,0 +1,285 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# IPEX-LLM \n",
"\n",
"> [IPEX-LLM](https://github.com/intel-analytics/ipex-llm/) is a PyTorch library for running LLM on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) with very low latency.\n",
"\n",
"This example goes over how to use LlamaIndex to interact with [`ipex-llm`](https://github.com/intel-analytics/ipex-llm/) for text generation and chat on GPU. \n",
"\n",
"For more examples and usage, refer to [Examples](https://github.com/run-llama/llama_index/tree/main/llama-index-integrations/llms/llama-index-llms-ipex-llm/examples)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Install `llama-index-llms-ipex-llm`. This will also install `ipex-llm` and its dependencies."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"```bash\n",
"pip install llama-index-llms-ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example we'll use [HuggingFaceH4/zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha) model for demostration. It requires updating `transformers` and `tokenizers` packages."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```bash\n",
"pip install -U transformers==4.37.0 tokenizers==0.15.2\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before loading the Zephyr model, you'll need to define `completion_to_prompt` and `messages_to_prompt` for formatting prompts. This is essential for preparing inputs that the model can interpret accurately."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Transform a string into input zephyr-specific input\n",
"def completion_to_prompt(completion):\n",
" return f\"<|system|>\\n</s>\\n<|user|>\\n{completion}</s>\\n<|assistant|>\\n\"\n",
"\n",
"\n",
"# Transform a list of chat messages into zephyr-specific input\n",
"def messages_to_prompt(messages):\n",
" prompt = \"\"\n",
" for message in messages:\n",
" if message.role == \"system\":\n",
" prompt += f\"<|system|>\\n{message.content}</s>\\n\"\n",
" elif message.role == \"user\":\n",
" prompt += f\"<|user|>\\n{message.content}</s>\\n\"\n",
" elif message.role == \"assistant\":\n",
" prompt += f\"<|assistant|>\\n{message.content}</s>\\n\"\n",
"\n",
" # ensure we start with a system prompt, insert blank if needed\n",
" if not prompt.startswith(\"<|system|>\\n\"):\n",
" prompt = \"<|system|>\\n</s>\\n\" + prompt\n",
"\n",
" # add final assistant prompt\n",
" prompt = prompt + \"<|assistant|>\\n\"\n",
"\n",
" return prompt"
]
},
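{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check (not part of the original example), you can call these helpers directly to preview the Zephyr-style prompt strings they produce. `ChatMessage` is the same class used in the chat examples later in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from llama_index.core.llms import ChatMessage\n",
"\n",
"# Preview the Zephyr-formatted prompts produced by the helpers above\n",
"sample_messages = [\n",
"    ChatMessage(role=\"system\", content=\"You are a helpful assistant.\"),\n",
"    ChatMessage(role=\"user\", content=\"What is IPEX-LLM?\"),\n",
"]\n",
"print(messages_to_prompt(sample_messages))\n",
"print(completion_to_prompt(\"What is IPEX-LLM?\"))"
]
},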
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Basic Usage\n",
"\n",
"Load the Zephyr model locally using IpexLLM using `IpexLLM.from_model_id`. It will load the model directly in its Huggingface format and convert it automatically to low-bit format for inference. Use `device_map` to load the model to xpu. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"\n",
"warnings.filterwarnings(\n",
" \"ignore\", category=UserWarning, message=\".*padding_mask.*\"\n",
")\n",
"\n",
"from llama_index.llms.ipex_llm import IpexLLM\n",
"\n",
"llm = IpexLLM.from_model_id(\n",
" model_name=\"HuggingFaceH4/zephyr-7b-alpha\",\n",
" tokenizer_name=\"HuggingFaceH4/zephyr-7b-alpha\",\n",
" context_window=512,\n",
" max_new_tokens=128,\n",
" generate_kwargs={\"do_sample\": False},\n",
" completion_to_prompt=completion_to_prompt,\n",
" messages_to_prompt=messages_to_prompt,\n",
" device_map=\"xpu\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you can proceed to use the loaded model for text completion and interactive chat. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Text Completion"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"completion_response = llm.complete(\"Once upon a time, \")\n",
"print(completion_response.text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Streaming Text Completion"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response_iter = llm.stream_complete(\"Once upon a time, there's a little girl\")\n",
"for response in response_iter:\n",
" print(response.delta, end=\"\", flush=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Chat"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from llama_index.core.llms import ChatMessage\n",
"\n",
"message = ChatMessage(role=\"user\", content=\"Explain Big Bang Theory briefly\")\n",
"resp = llm.chat([message])\n",
"print(resp)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Streaming Chat"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"message = ChatMessage(role=\"user\", content=\"What is AI?\")\n",
"resp = llm.stream_chat([message], max_tokens=256)\n",
"for r in resp:\n",
" print(r.delta, end=\"\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Save/Load Low-bit Model\n",
"Alternatively, you might save the low-bit model to disk once and use `from_model_id_low_bit` instead of `from_model_id` to reload it for later use - even across different machines. It is space-efficient, as the low-bit model demands significantly less disk space than the original model. And `from_model_id_low_bit` is also more efficient than `from_model_id` in terms of speed and memory usage, as it skips the model conversion step."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To save the low-bit model, use `save_low_bit` as follows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"saved_lowbit_model_path = (\n",
" \"./zephyr-7b-alpha-low-bit\" # path to save low-bit model\n",
")\n",
"\n",
"llm._model.save_low_bit(saved_lowbit_model_path)\n",
"del llm"
]
},
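{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the space savings mentioned above, here is a minimal sketch (plain standard-library code, not part of the IPEX-LLM API) that sums the size of the files in the saved directory:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Rough on-disk size of the saved low-bit model, in GB\n",
"total_bytes = sum(\n",
"    os.path.getsize(os.path.join(root, f))\n",
"    for root, _, files in os.walk(saved_lowbit_model_path)\n",
"    for f in files\n",
")\n",
"print(f\"Low-bit model size: {total_bytes / 1024**3:.2f} GB\")"
]
},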
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the model from saved lowbit model path as follows. Also use `device_map` to load the model to xpu. \n",
"> Note that the saved path for the low-bit model only includes the model itself but not the tokenizers. If you wish to have everything in one place, you will need to manually download or copy the tokenizer files from the original model's directory to the location where the low-bit model is saved."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"llm_lowbit = IpexLLM.from_model_id_low_bit(\n",
" model_name=saved_lowbit_model_path,\n",
" tokenizer_name=\"HuggingFaceH4/zephyr-7b-alpha\",\n",
" # tokenizer_name=saved_lowbit_model_path, # copy the tokenizers to saved path if you want to use it this way\n",
" context_window=512,\n",
" max_new_tokens=64,\n",
" completion_to_prompt=completion_to_prompt,\n",
" generate_kwargs={\"do_sample\": False},\n",
" device_map=\"xpu\",\n",
")"
]
},
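{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer to keep the tokenizer next to the low-bit weights, one optional approach (a minimal sketch using the standard `transformers` API, not required by IPEX-LLM) is to save a copy of the tokenizer into the same directory; you could then pass `tokenizer_name=saved_lowbit_model_path` when reloading."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from transformers import AutoTokenizer\n",
"\n",
"# Optionally place a copy of the tokenizer next to the low-bit weights\n",
"tokenizer = AutoTokenizer.from_pretrained(\"HuggingFaceH4/zephyr-7b-alpha\")\n",
"tokenizer.save_pretrained(saved_lowbit_model_path)"
]
},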
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Try stream completion using the loaded low-bit model. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response_iter = llm_lowbit.stream_complete(\"What is Large Language Model?\")\n",
"for response in response_iter:\n",
" print(response.delta, end=\"\", flush=True)"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}