Commit 2aeb875

Merge pull request #24 from ivy-lv11/ipex-llm-llm-gpu
Add IPEX-LLM with GPU
2 parents 8d17508 + f76713b commit 2aeb875

File tree: 8 files changed (+478, −15 lines)

Lines changed: 184 additions & 0 deletions
# IPEX-LLM

> [IPEX-LLM](https://github.com/intel-analytics/ipex-llm/) is a PyTorch library for running LLMs on Intel CPU and GPU (e.g., a local PC with iGPU, or a discrete GPU such as Arc, Flex and Max) with very low latency.

This example goes over how to use LlamaIndex to interact with [`ipex-llm`](https://github.com/intel-analytics/ipex-llm/) for text generation and chat on Intel GPU.

> **Note**
>
> You can refer to [here](https://github.com/run-llama/llama_index/tree/main/llama-index-integrations/llms/llama-index-llms-ipex-llm/examples) for full examples of `IpexLLM`. Note that to run on Intel GPU, you need to specify `-d 'xpu'` as a command argument when running the examples.

## Install Prerequisites

To benefit from IPEX-LLM on Intel GPUs, there are several prerequisite steps for tool installation and environment preparation.

If you are a Windows user, visit the [Install IPEX-LLM on Windows with Intel GPU Guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_windows_gpu.html) and follow [**Install Prerequisites**](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_windows_gpu.html#install-prerequisites) to update the GPU driver (optional) and install Conda.

If you are a Linux user, visit [Install IPEX-LLM on Linux with Intel GPU](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html) and follow [**Install Prerequisites**](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#install-prerequisites) to install the GPU driver, Intel® oneAPI Base Toolkit 2024.0, and Conda.

## Install `llama-index-llms-ipex-llm`

After installing the prerequisites, you should have a conda environment with everything in place. Activate that environment and install `llama-index-llms-ipex-llm` as follows:

```bash
conda activate <your-conda-env-name>

pip install llama-index-llms-ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```

This step will also install `ipex-llm` and its dependencies.

> **Note**
>
> You can also use `https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/` as the `extra-index-url`.

## Runtime Configuration

For optimal performance, it is recommended to set several environment variables based on your device:

### For Windows Users with Intel Core Ultra integrated GPU

In Anaconda Prompt:

```
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

### For Linux Users with Intel Arc A-Series GPU

```bash
# Configure oneAPI environment variables. Required step for APT or offline installed oneAPI.
# Skip this step for PIP-installed oneAPI since the environment has already been configured in LD_LIBRARY_PATH.
source /opt/intel/oneapi/setvars.sh

# Recommended Environment Variables for optimal performance
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
```

> **Note**
>
> The first time each model runs on an Intel iGPU/Intel Arc A300-Series or Pro A60, it may take several minutes to compile.
>
> For other GPU types, please refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#runtime-configuration) for Windows users and [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#id5) for Linux users.

## `IpexLLM`

Setting `device_map="xpu"` when initializing `IpexLLM` will put the model on the Intel GPU and benefit from IPEX-LLM optimizations.

Before loading the Zephyr model, you'll need to define `completion_to_prompt` and `messages_to_prompt` for formatting prompts. Follow the prompt format for zephyr-7b-alpha described in its [model card](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha); this is essential for preparing inputs that the model can interpret accurately. Then load the Zephyr model locally with `IpexLLM.from_model_id`. It loads the model directly in its Hugging Face format and automatically converts it to low-bit format for inference.

```python
# Transform a string into zephyr-specific input
def completion_to_prompt(completion):
    return f"<|system|>\n</s>\n<|user|>\n{completion}</s>\n<|assistant|>\n"


# Transform a list of chat messages into zephyr-specific input
def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>\n{message.content}</s>\n"
        elif message.role == "user":
            prompt += f"<|user|>\n{message.content}</s>\n"
        elif message.role == "assistant":
            prompt += f"<|assistant|>\n{message.content}</s>\n"

    # ensure we start with a system prompt, insert blank if needed
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt

    # add final assistant prompt
    prompt = prompt + "<|assistant|>\n"

    return prompt


from llama_index.llms.ipex_llm import IpexLLM

llm = IpexLLM.from_model_id(
    model_name="HuggingFaceH4/zephyr-7b-alpha",
    tokenizer_name="HuggingFaceH4/zephyr-7b-alpha",
    context_window=512,
    max_new_tokens=128,
    generate_kwargs={"do_sample": False},
    completion_to_prompt=completion_to_prompt,
    messages_to_prompt=messages_to_prompt,
    device_map="xpu",
)
```

> Please note that in this example we'll use the [HuggingFaceH4/zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha) model for demonstration. It requires updating the `transformers` and `tokenizers` packages.
>
> ```bash
> pip install -U transformers==4.37.0 tokenizers==0.15.2
> ```

You can then run the completion task or chat task as usual:

```python
print("----------------- Complete ------------------")
completion_response = llm.complete("Once upon a time, ")
print(completion_response.text)
print("----------------- Stream Complete ------------------")
response_iter = llm.stream_complete("Once upon a time, there's a little girl")
for response in response_iter:
    print(response.delta, end="", flush=True)
print("----------------- Chat ------------------")
from llama_index.core.llms import ChatMessage

message = ChatMessage(role="user", content="Explain Big Bang Theory briefly")
resp = llm.chat([message])
print(resp)
print("----------------- Stream Chat ------------------")
message = ChatMessage(role="user", content="What is AI?")
resp = llm.stream_chat([message], max_tokens=256)
for r in resp:
    print(r.delta, end="")
```

Alternatively, you might save the low-bit model to disk once and use `from_model_id_low_bit` instead of `from_model_id` to reload it for later use, even across different machines. It is space-efficient, as the low-bit model demands significantly less disk space than the original model. And `from_model_id_low_bit` is also more efficient than `from_model_id` in terms of speed and memory usage, as it skips the model conversion step.

To save the low-bit model, use `save_low_bit` as follows. Then load the model from the saved low-bit model path, again using `device_map` to put the model on the XPU.

> Note that the saved path for the low-bit model only includes the model itself, not the tokenizers. If you wish to have everything in one place, you will need to manually download or copy the tokenizer files from the original model's directory to the location where the low-bit model is saved.

Finally, try stream completion using the loaded low-bit model.

```python
saved_lowbit_model_path = (
    "./zephyr-7b-alpha-low-bit"  # path to save low-bit model
)

llm._model.save_low_bit(saved_lowbit_model_path)
del llm

llm_lowbit = IpexLLM.from_model_id_low_bit(
    model_name=saved_lowbit_model_path,
    tokenizer_name="HuggingFaceH4/zephyr-7b-alpha",
    # tokenizer_name=saved_lowbit_model_path,  # copy the tokenizers to the saved path if you want to use it this way
    context_window=512,
    max_new_tokens=64,
    completion_to_prompt=completion_to_prompt,
    generate_kwargs={"do_sample": False},
    device_map="xpu",
)

response_iter = llm_lowbit.stream_complete("What is Large Language Model?")
for response in response_iter:
    print(response.delta, end="", flush=True)
```
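Once loaded, the `IpexLLM` instance can be used wherever LlamaIndex expects an LLM. For example (a minimal sketch, not part of the notebook above), you can register it as the global default so that downstream query and chat engines pick it up automatically:

```python
# Register the IPEX-LLM-backed model as the default LLM for LlamaIndex components.
# Settings is the standard llama-index-core global configuration object.
from llama_index.core import Settings

Settings.llm = llm_lowbit  # query/chat engines created afterwards will use this LLM
```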

llama-index-integrations/llms/llama-index-llms-ipex-llm/README.md

Lines changed: 6 additions & 0 deletions
````diff
@@ -10,6 +10,12 @@
 pip install llama-index-llms-ipex-llm
 ```
 
+### On GPU
+
+```bash
+pip install llama-index-llms-ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+```
+
 ## Usage
 
 ```python
````

llama-index-integrations/llms/llama-index-llms-ipex-llm/examples/README.md

Lines changed: 39 additions & 3 deletions
````diff
@@ -12,16 +12,52 @@ Install `llama-index-llms-ipex-llm`. This will also install `ipex-llm` and its dependencies.
 pip install llama-index-llms-ipex-llm
 ```
 
+### On GPU
+
+Install `llama-index-llms-ipex-llm`. This will also install `ipex-llm` and its dependencies.
+
+```bash
+pip install llama-index-llms-ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+```
+
 ## List of Examples
 
+### Basic Example
+
+The example [basic.py](./basic.py) shows how to run `IpexLLM` on Intel CPU or GPU and conduct tasks such as text completion. Run the example as follows:
+
+```bash
+python basic.py -m <path_to_model> -d <cpu_or_xpu> -q <query_to_LLM>
+```
+
+> Please note that in this example we'll use the [HuggingFaceH4/zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha) model for demonstration. It requires updating the `transformers` and `tokenizers` packages.
+>
+> ```bash
+> pip install -U transformers==4.37.0 tokenizers==0.15.2
+> ```
+
+### Low Bit Example
+
+The example [low_bit.py](./low_bit.py) shows how to save and load a low-bit model with `IpexLLM` on Intel CPU or GPU and conduct tasks such as text completion. Run the example as follows:
+
+```bash
+python low_bit.py -m <path_to_model> -d <cpu_or_xpu> -q <query_to_LLM> -s <save_low_bit_dir>
+```
+
+> Please note that in this example we'll use the [HuggingFaceH4/zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha) model for demonstration. It requires updating the `transformers` and `tokenizers` packages.
+>
+> ```bash
+> pip install -U transformers==4.37.0 tokenizers==0.15.2
+> ```
+
 ### More Data Types Example
 
-By default, `IpexLLM` loads the model in int4 format. To load a model in different data formats like `sym_int5`, `sym_int8`, etc., you can use the `load_in_low_bit` option in `IpexLLM`.
+By default, `IpexLLM` loads the model in int4 format. To load a model in different data formats like `sym_int5`, `sym_int8`, etc., you can use the `load_in_low_bit` option in `IpexLLM`. To load a model on a different device like `cpu` or `xpu`, you can use the `device_map` option in `IpexLLM`.
 
-The example [more_data_type.py](./more_data_type.py) shows how to use the `load_in_low_bit` option. Run the example as following:
+The example [more_data_type.py](./more_data_type.py) shows how to use the `load_in_low_bit` and `device_map` options. Run the example as follows:
 
 ```bash
-python more_data_type.py -m <path_to_model> -t <path_to_tokenizer> -l <low_bit_format>
+python more_data_type.py -m <path_to_model> -t <path_to_tokenizer> -l <low_bit_format> -d <device> -q <query_to_LLM>
 ```
 
 > Note: If you're using [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) model in this example, it is recommended to use transformers version
````
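As an illustration of combining those two options (a hypothetical sketch, not part of this diff; it assumes `IpexLLM.from_model_id` accepts `load_in_low_bit` and `device_map` keyword arguments, as the README text above describes), loading a model in `sym_int5` format on an Intel GPU might look like:

```python
# Hypothetical sketch: load a model in a non-default low-bit format on the Intel GPU,
# using the load_in_low_bit and device_map options described in the README above.
from llama_index.llms.ipex_llm import IpexLLM

llm = IpexLLM.from_model_id(
    model_name="<path_to_model>",          # placeholder, as in the command above
    tokenizer_name="<path_to_tokenizer>",  # placeholder, as in the command above
    context_window=512,
    max_new_tokens=64,
    load_in_low_bit="sym_int5",  # e.g. sym_int5, sym_int8, ...
    device_map="xpu",            # or "cpu"
)
```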
Lines changed: 89 additions & 0 deletions
```python
import argparse

from llama_index.core.llms import ChatMessage
from llama_index.llms.ipex_llm import IpexLLM


# Transform a string into zephyr-specific input
def completion_to_prompt(completion):
    return f"<|system|>\n</s>\n<|user|>\n{completion}</s>\n<|assistant|>\n"


# Transform a list of chat messages into zephyr-specific input
def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>\n{message.content}</s>\n"
        elif message.role == "user":
            prompt += f"<|user|>\n{message.content}</s>\n"
        elif message.role == "assistant":
            prompt += f"<|assistant|>\n{message.content}</s>\n"

    # ensure we start with a system prompt, insert blank if needed
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt

    # add final assistant prompt
    prompt = prompt + "<|assistant|>\n"

    return prompt


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="IpexLLM Basic Usage Example")
    parser.add_argument(
        "--model-name",
        "-m",
        type=str,
        default="HuggingFaceH4/zephyr-7b-alpha",
        help="The huggingface repo id for the LLM model to be downloaded"
        ", or the path to the huggingface checkpoint folder",
    )
    parser.add_argument(
        "--device",
        "-d",
        type=str,
        default="cpu",
        choices=["cpu", "xpu"],
        help="The device (Intel CPU or Intel GPU) the LLM model runs on",
    )
    parser.add_argument(
        "--query",
        "-q",
        type=str,
        default="What is IPEX-LLM?",
        help="The query sentence to send to the LLM",
    )

    args = parser.parse_args()
    model_name = args.model_name
    device = args.device
    query = args.query

    llm = IpexLLM.from_model_id(
        model_name=model_name,
        tokenizer_name=model_name,
        context_window=512,
        max_new_tokens=128,
        generate_kwargs={"do_sample": False},
        completion_to_prompt=completion_to_prompt,
        messages_to_prompt=messages_to_prompt,
        device_map=device,
    )

    print("\n----------------- Complete ------------------")
    completion_response = llm.complete(query)
    print(completion_response.text)
    print("\n----------------- Stream Complete ------------------")
    response_iter = llm.stream_complete(query)
    for response in response_iter:
        print(response.delta, end="", flush=True)
    print("\n----------------- Chat ------------------")
    message = ChatMessage(role="user", content=query)
    resp = llm.chat([message])
    print(resp)
    print("\n----------------- Stream Chat ------------------")
    message = ChatMessage(role="user", content=query)
    resp = llm.stream_chat([message], max_tokens=256)
    for r in resp:
        print(r.delta, end="")
```
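Following the command format from the examples README above, this script can be run on an Intel GPU with, for example, `python basic.py -m HuggingFaceH4/zephyr-7b-alpha -d xpu -q "What is IPEX-LLM?"` (the model id and query shown here are simply the script's defaults).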

0 commit comments
