# 🍋 Lemonade SDK

### RELOCATION NOTICE

The Lemonade SDK project has moved to https://github.com/lemonade-sdk/lemonade, and this documentation now lives at https://github.com/lemonade-sdk/lemonade/blob/main/docs/README.md. The new PyPI package name is `lemonade-sdk`, and the new Lemonade_Server_Installer.exe is published at https://github.com/lemonade-sdk/lemonade/releases. Please migrate to the new repository and package as soon as possible, for example: `pip uninstall turnkeyml` followed by `pip install lemonade-sdk[YOUR_EXTRAS]` (e.g., `pip install lemonade-sdk[llm-oga-hybrid]`).

The legacy documentation below is kept for reference.

*The long-term objective of the Lemonade SDK is to provide the ONNX ecosystem with the same kind of tools available in the GGUF ecosystem.*

Lemonade SDK is built on top of [OnnxRuntime GenAI (OGA)](https://github.com/microsoft/onnxruntime-genai), an ONNX LLM inference engine developed by Microsoft to improve the LLM experience on AI PCs, especially those with accelerator hardware such as Neural Processing Units (NPUs).

The Lemonade SDK provides everything needed to get up and running quickly with LLMs on OGA:

| **Feature** | **Description** |
|-------------|-----------------|
| **🌐 Local LLM server with OpenAI API compatibility (Lemonade Server)** | Replace cloud-based LLMs with private and free LLMs that run locally on your own PC's NPU and GPU. |
| **🖥️ CLI with tools for prompting, benchmarking, and accuracy tests** | Enables convenient interoperability between models, frameworks, devices, accuracy tests, and deployment options. |
| **🐍 Python API based on `from_pretrained()`** | Provides easy integration with Python applications for loading and using LLMs. |

## Table of Contents
- [Installation](#installation)
  - [Installing Lemonade Server via Executable](#installing-from-lemonade_server_installerexe)
  - [Installing Lemonade SDK From PyPI](#installing-from-pypi)
  - [Installing Lemonade SDK From Source](#installing-from-source)
- [CLI Commands](#cli-commands)
  - [Prompting](#prompting)
  - [Accuracy](#accuracy)
  - [Benchmarking](#benchmarking)
  - [LLM Report](#llm-report)
  - [Memory Usage](#memory-usage)
  - [Serving](#serving)
- [API](#api)
  - [High-Level APIs](#high-level-apis)
  - [Low-Level API](#low-level-api)
- [Contributing](#contributing)
# Installation

There are three ways to install the Lemonade SDK:

1. Use the [Lemonade Server Installer](#installing-from-lemonade_server_installerexe). This provides a no-code way to run LLMs locally and integrate with OpenAI-compatible applications.
2. Use [PyPI installation](#installing-from-pypi) by installing the `turnkeyml` package with the appropriate extras for your backend. This installs the full set of TurnkeyML and Lemonade SDK tools, including Lemonade Server, the API, and the CLI commands.
3. Use [source installation](#installing-from-source) if you plan to contribute to or customize the Lemonade SDK.
## Installing From Lemonade_Server_Installer.exe

Lemonade Server is available as a standalone tool with a one-click Windows installer `.exe`. Check out the [Lemonade_Server_Installer.exe guide](lemonade_server_exe.md) for installation instructions and the [server spec](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_spec.md) to learn more about the functionality.

The Lemonade Server [examples folder](https://github.com/onnx/turnkeyml/tree/main/examples/lemonade/server) has guides for using Lemonade Server with a collection of applications that we have tested.
## Installing From PyPI

To install the Lemonade SDK from PyPI:

1. Create and activate a [miniconda](https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe) environment.
   ```bash
   conda create -n lemon python=3.10
   ```

   ```bash
   conda activate lemon
   ```

2. Install Lemonade for your backend of choice:
   - [OnnxRuntime GenAI with CPU backend](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/ort_genai_igpu.md):
     ```bash
     pip install turnkeyml[llm-oga-cpu]
     ```
   - [OnnxRuntime GenAI with Integrated GPU (iGPU, DirectML) backend](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/ort_genai_igpu.md):
     > Note: Requires Windows and a DirectML-compatible iGPU.
     ```bash
     pip install turnkeyml[llm-oga-igpu]
     ```
   - OnnxRuntime GenAI with Ryzen AI Hybrid (NPU + iGPU) backend:
     > Note: Ryzen AI Hybrid requires a Windows 11 PC with an AMD Ryzen™ AI 300-series processor.

     Follow the environment setup instructions [here](https://ryzenai.docs.amd.com/en/latest/llm/high_level_python.html).
   - Hugging Face (PyTorch) LLMs for CPU backend:
     ```bash
     pip install turnkeyml[llm]
     ```
   - llama.cpp: see the [instructions](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/llamacpp.md).

3. Use `lemonade -h` to explore the LLM tools, and see the [command](#cli-commands) and [API](#api) examples below.
## Installing From Source

The Lemonade SDK can be installed from source code by cloning this repository and following the instructions [here](source_installation_inst.md).
# CLI Commands

The `lemonade` CLI uses a unique command syntax that enables convenient interoperability between models, frameworks, devices, accuracy tests, and deployment options.

Each unit of functionality (e.g., loading a model, running a test, deploying a server, etc.) is called a `Tool`, and a single call to `lemonade` can invoke any number of `Tools`. Each `Tool` will perform its functionality, then pass its state to the next `Tool` in the command.

You can read each command out loud to understand what it is doing. For example, a command like this:

```bash
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 llm-prompt -p "Hello, my thoughts are"
```

can be read like this:

> Run `lemonade` on the input (`-i`) checkpoint `microsoft/Phi-3-mini-4k-instruct`. First, load it in the OnnxRuntime GenAI framework (`oga-load`), onto the integrated GPU device (`--device igpu`), in the int4 data type (`--dtype int4`). Then, pass the OGA model to the prompting tool (`llm-prompt`) with the prompt (`-p`) "Hello, my thoughts are" and print the response.

The `lemonade -h` command will show you which options and Tools are available, and `lemonade TOOL -h` will tell you more about that specific Tool.
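As a sketch of this chaining model, the command above can be assembled piece by piece: global options come first, then each `Tool` name followed by that `Tool`'s own flags. The `subprocess` call mentioned in the comment is just one hypothetical way to invoke it from Python, and requires the `lemonade` CLI to be installed.

```python
# Sketch of lemonade's tool-chaining syntax (illustration only).
# Global options lead, then Tools run left to right, each with its own flags.
argv = (
    ["lemonade", "-i", "microsoft/Phi-3-mini-4k-instruct"]  # global input option
    + ["oga-load", "--device", "igpu", "--dtype", "int4"]   # Tool 1: load the model
    + ["llm-prompt", "-p", "Hello, my thoughts are"]        # Tool 2: prompt it
)
print(" ".join(argv))
# To actually execute it (with lemonade installed):
# subprocess.run(argv, check=True)
```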
108 | | - |
109 | | -## Prompting |
110 | | - |
111 | | -To prompt your LLM, try one of the following: |
112 | | - |
113 | | -OGA iGPU: |
114 | | -```bash |
115 | | - lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 llm-prompt -p "Hello, my thoughts are" -t |
116 | | -``` |
117 | | - |
118 | | -Hugging Face: |
119 | | -```bash |
120 | | - lemonade -i facebook/opt-125m huggingface-load llm-prompt -p "Hello, my thoughts are" -t |
121 | | -``` |
122 | | - |
123 | | -The LLM will run with your provided prompt, and the LLM's response to your prompt will be printed to the screen. You can replace the `"Hello, my thoughts are"` with any prompt you like. |
124 | | -
|
125 | | -You can also replace the `facebook/opt-125m` with any Hugging Face checkpoint you like, including LLaMA-2, Phi-2, Qwen, Mamba, etc. |
126 | | -
|
127 | | -You can also set the `--device` argument in `oga-load` and `huggingface-load` to load your LLM on a different device. |
128 | | -
|
129 | | -The `-t` (or `--template`) flag instructs lemonade to insert the prompt string into the model's chat template. |
130 | | -This typically results in the model returning a higher quality response. |
131 | | - |
132 | | -Run `lemonade huggingface-load -h` and `lemonade llm-prompt -h` to learn more about these tools. |
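To illustrate what the `-t` flag does conceptually, here is a minimal sketch. The role markup shown is a generic example invented for illustration, not Lemonade's implementation and not necessarily the template of any particular model; chat-tuned models each ship their own template.

```python
# Minimal illustration of chat templating (not Lemonade's code).
# Chat-tuned models expect prompts wrapped in role markup; the tags
# below are a generic, made-up example of that markup.
def apply_chat_template(user_prompt: str) -> str:
    return f"<|user|>\n{user_prompt}\n<|assistant|>\n"

templated = apply_chat_template("Hello, my thoughts are")
print(templated)
```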
133 | | - |
134 | | -## Accuracy |
135 | | - |
136 | | -To measure the accuracy of an LLM using MMLU (Measuring Massive Multitask Language Understanding), try the following: |
137 | | - |
138 | | -OGA iGPU: |
139 | | -```bash |
140 | | - lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 accuracy-mmlu --tests management |
141 | | -``` |
142 | | - |
143 | | -Hugging Face: |
144 | | -```bash |
145 | | - lemonade -i facebook/opt-125m huggingface-load accuracy-mmlu --tests management |
146 | | -``` |
147 | | - |
148 | | -This command will run just the management test from MMLU on your LLM and save the score to the lemonade cache at `~/.cache/lemonade`. You can also run other subject tests by replacing management with the new test subject name. For the full list of supported subjects, see the [MMLU Accuracy Read Me](mmlu_accuracy.md). |
149 | | - |
150 | | -You can run the full suite of MMLU subjects by omitting the `--test` argument. You can learn more about this with `lemonade accuracy-mmlu -h`. |
151 | | - |
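For intuition, MMLU subjects are multiple-choice tests, so each subject's score reduces to the fraction of questions answered correctly. A minimal sketch of that scoring (an illustration of the metric, not Lemonade's implementation):

```python
# Illustrative only: how a multiple-choice accuracy score is computed.
def accuracy(predictions: list, answers: list) -> float:
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Three of four hypothetical answers match the answer key.
score = accuracy(["A", "C", "B", "D"], ["A", "C", "D", "D"])
print(score)  # 0.75
```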
## Benchmarking

To measure the time-to-first-token and tokens/second of an LLM, try the following:

OGA iGPU:
```bash
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 oga-bench
```

Hugging Face:
```bash
lemonade -i facebook/opt-125m huggingface-load huggingface-bench
```

This command will run a few warm-up iterations, then a few generation iterations during which performance data is collected.

The prompt size, number of output tokens, and number of iterations are all parameters. Learn more by running `lemonade oga-bench -h` or `lemonade huggingface-bench -h`.
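The two reported metrics can be sketched from generation timestamps as follows. This illustrates the common definitions, not Lemonade's benchmarking code, and measuring throughput over the decode phase only is an assumption about convention:

```python
# Illustrative definitions of the two benchmark metrics (not Lemonade's code).
def time_to_first_token(request_time: float, first_token_time: float) -> float:
    # Seconds from submitting the prompt until the first generated token.
    return first_token_time - request_time

def tokens_per_second(new_tokens: int, first_token_time: float,
                      last_token_time: float) -> float:
    # Decode-phase throughput: tokens after the first, over the decode time.
    return (new_tokens - 1) / (last_token_time - first_token_time)

print(time_to_first_token(0.0, 0.5))    # 0.5
print(tokens_per_second(31, 0.5, 1.5))  # 30.0
```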
## LLM Report

To see a report that contains all of the benchmarking results and all of the accuracy results, use the `report` tool with the `--perf` flag:

`lemonade report --perf`

The results can be filtered by model name, device type, and data type. See how by running `lemonade report -h`.
## Memory Usage

The peak memory used by the `lemonade` build is captured in the build output. To capture more granular memory usage information, use the `--memory` flag. For example:

OGA iGPU:
```bash
lemonade --memory -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 oga-bench
```

Hugging Face:
```bash
lemonade --memory -i facebook/opt-125m huggingface-load huggingface-bench
```

In this case, a `memory_usage.png` file will be generated and stored in the build folder. This file contains a figure plotting the memory usage over the build time. Learn more by running `lemonade -h`.
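For intuition, a memory-over-time plot like `memory_usage.png` is built from periodic samples of the process's memory footprint. The sketch below records a few such samples; it is an illustration of the idea, not Lemonade's implementation, and uses the Unix-only `resource` module (`ru_maxrss` is in KiB on Linux and bytes on macOS).

```python
# Illustrative only: sampling (timestamp, peak-memory) pairs over time,
# the kind of data behind a memory-usage-versus-time figure.
import resource
import time

samples = []
for _ in range(3):
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    samples.append((time.monotonic(), peak))
    time.sleep(0.01)

print(len(samples))  # 3
```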
## Serving

You can launch an OpenAI-compatible server with:

```bash
lemonade serve
```

Visit the [server spec](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_spec.md) to learn more about the endpoints provided, as well as how to launch the server with more detailed informational messages enabled.

See the Lemonade Server [examples folder](https://github.com/onnx/turnkeyml/tree/main/examples/lemonade/server) for a collection of applications that we have tested with Lemonade Server.
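Because the server speaks the OpenAI API, any OpenAI-style client can talk to it. The sketch below builds a standard chat-completions request using only the standard library; the base URL, port, and model id are assumptions for illustration, so check the `lemonade serve` startup output and the server spec for the actual values.

```python
# Sketch: calling an OpenAI-compatible chat endpoint with the stdlib.
# BASE_URL and the model id below are assumptions -- verify them against
# your running server before use.
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # hypothetical host/port

payload = {
    "model": "Phi-3-mini-4k-instruct",  # hypothetical model id
    "messages": [{"role": "user", "content": "Hello, my thoughts are"}],
    "max_tokens": 30,
}

def chat(body: dict) -> dict:
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# With a server running, the reply text would be at:
# chat(payload)["choices"][0]["message"]["content"]
```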
# API

Lemonade is also available via API.

## High-Level APIs

The high-level Lemonade API abstracts loading models from any supported framework (e.g., Hugging Face, OGA) and backend (e.g., CPU, iGPU, Hybrid) using the popular `from_pretrained()` function. This makes it easy to integrate Lemonade LLMs into Python applications. For more information on recipes and compatibility, see the [Lemonade API ReadMe](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/lemonade_api.md).

OGA iGPU:
```python
from lemonade.api import from_pretrained

model, tokenizer = from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", recipe="oga-igpu")

input_ids = tokenizer("This is my prompt", return_tensors="pt").input_ids
response = model.generate(input_ids, max_new_tokens=30)

print(tokenizer.decode(response[0]))
```

You can learn more about the high-level APIs [here](https://github.com/onnx/turnkeyml/tree/main/examples/lemonade).
## Low-Level API

The low-level API is useful for designing custom experiments, for example sweeping over specific checkpoints, devices, and/or tools.

Here's a quick example of how to prompt a Hugging Face LLM using the low-level API, which calls the load and prompt tools one by one:

```python
import lemonade.tools.torch_llm as tl
import lemonade.tools.prompt as pt
from turnkeyml.state import State

state = State(cache_dir="cache", build_name="test")

state = tl.HuggingfaceLoad().run(state, input="facebook/opt-125m")
state = pt.Prompt().run(state, prompt="hi", max_new_tokens=15)

print("Response:", state.response)
```
# Contributing

Contributions are welcome! If you decide to contribute, please:

- Do so via a pull request.
- Write your code in keeping with the same style as the rest of this repo's code.
- Add a test under `test/lemonade` that provides coverage of your new feature.

The best way to contribute is to add new tools to cover more devices and usage scenarios.

To add a new tool:

1. (Optional) Create a new `.py` file under `src/lemonade/tools` (or use an existing file if your tool fits into a pre-existing family of tools).
2. Define a new class that inherits the `Tool` class from TurnkeyML.
3. Register the class by adding it to the list of `tools` near the top of `src/lemonade/cli.py`.

You can learn more about contributing on the repository's [contribution guide](https://github.com/onnx/turnkeyml/blob/main/docs/contribute.md).
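As a rough sketch of that pattern, a tool is a class whose `run()` method transforms the shared state and returns it for the next tool in the chain. The `Tool` base class below is a stand-in defined here only so the example is self-contained (the real one comes from TurnkeyML and carries more responsibilities), and both the tool name and the `unique_name` attribute are hypothetical:

```python
# Stand-in base class for illustration only; the real Tool class is
# provided by TurnkeyML -- check the repository for its actual interface.
class Tool:
    def run(self, state, **kwargs):
        raise NotImplementedError

class UppercaseResponse(Tool):
    """Hypothetical tool: upper-cases the response left by a previous tool."""

    unique_name = "uppercase-response"  # hypothetical CLI name

    def run(self, state, **kwargs):
        # Read the incoming state, transform it, and pass it along.
        state.response = state.response.upper()
        return state

# Demo with a minimal stand-in for the shared state object.
from types import SimpleNamespace
demo_state = UppercaseResponse().run(SimpleNamespace(response="hi there"))
print(demo_state.response)  # HI THERE
```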
Thank you for using Lemonade SDK!