---
title: Creating Model Images
---

:::note
This section shows how to create a custom image with models of your choosing. If you want to use one of the pre-made models, skip to [running models](#running-models).
:::

Create an `aikitfile.yaml` with the following structure:

```yaml
#syntax=ghcr.io/sozercan/aikit:latest
apiVersion: v1alpha1
models:
  - name: llama-2-7b-chat
    source: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
```

:::tip
This is the simplest way to get started building an image. For the full `aikitfile` specification, see [specs](docs/specs.md).
:::

First, create a buildx builder instance. Alternatively, if you are using Docker v24 with the [containerd image store](https://docs.docker.com/storage/containerd/) enabled, you can skip this step.

```bash
docker buildx create --use --name aikit-builder
```
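
If you want to confirm the builder was created and is currently selected, `docker buildx ls` lists the available builders (plain Docker CLI, nothing AIKit-specific):

```bash
# list buildx builders; the active builder is marked with an asterisk
docker buildx ls
```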

Then build your image with:

```bash
docker buildx build . -t my-model -f aikitfile.yaml --load
```
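
If you'd rather push the image to a registry than load it into the local image store, buildx also accepts `--push`; the registry and repository below are placeholders, not anything defined by AIKit:

```bash
# build and push to a registry instead of loading locally
# (replace ghcr.io/<your-user>/my-model with your own registry/repository)
docker buildx build . -t ghcr.io/<your-user>/my-model -f aikitfile.yaml --push
```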

This will build a local container image with your model(s). You can see the image with:

```bash
docker images
REPOSITORY    TAG       IMAGE ID       CREATED             SIZE
my-model      latest    e7b7c5a4a2cb   About an hour ago   5.51GB
```
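
If you want more detail than the `docker images` listing, `docker image inspect` works on the built image like on any other image (standard Docker, shown here only as a quick check):

```bash
# print the image size in bytes using a Go template
docker image inspect my-model --format '{{.Size}}'
```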

### Running models

You can start the inferencing server for your models with:

```bash
# for pre-made models, replace "my-model" with the image name
docker run -d --rm -p 8080:8080 my-model
```
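
To confirm the server is up before sending requests, you can follow the container logs; the `/v1/models` call below assumes the server exposes the usual OpenAI-compatible model listing alongside chat completions, so treat it as a sketch rather than a documented endpoint:

```bash
# find the running container and follow its startup logs
docker ps --filter ancestor=my-model
docker logs -f <container-id>

# assuming an OpenAI-compatible /v1/models endpoint, list the loaded models
curl http://localhost:8080/v1/models
```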

You can then send requests to `localhost:8080` to run inference from your models. For example:

```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "llama-2-7b-chat",
    "messages": [{"role": "user", "content": "explain kubernetes in a sentence"}]
}'
{"created":1701236489,"object":"chat.completion","id":"dd1ff40b-31a7-4418-9e32-42151ab6875a","model":"llama-2-7b-chat","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"\nKubernetes is a container orchestration system that automates the deployment, scaling, and management of containerized applications in a microservices architecture."}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
```
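
Because the response is plain JSON, you can pull out just the assistant message with `jq` (assuming `jq` is installed on your machine):

```bash
# same request as above, piped through jq to print only the generated text
curl -s http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "llama-2-7b-chat",
  "messages": [{"role": "user", "content": "explain kubernetes in a sentence"}]
}' | jq -r '.choices[0].message.content'
```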

---
title: Demos
---

## Building an image with a Llama 2 model

[![Building an image with a Llama 2 model](https://asciinema.org/a/J9bitkONKPvedSfU1RkrmVEhD.svg 'Building an image with a Llama 2 model')](https://asciinema.org/a/J9bitkONKPvedSfU1RkrmVEhD)

## Inference

[![Inference](https://asciinema.org/a/DYh5bCQMNPSis1whhsfPeMOoM.svg 'Inference')](https://asciinema.org/a/DYh5bCQMNPSis1whhsfPeMOoM)

## Vision with LLaVA

[![Vision with LLaVA](https://asciinema.org/a/626553.svg 'Vision with LLaVA')](https://asciinema.org/a/626553)

> See [llava.yaml](https://github.com/sozercan/aikit/blob/main/examples/llava.yaml) for the configuration used in the demo.

---
title: GPU Acceleration
---

:::note
At this time, only NVIDIA GPU acceleration is supported. Please open an issue if you'd like to see support for other GPU vendors.
:::

## NVIDIA

AIKit supports GPU-accelerated inferencing with the [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-container-toolkit). You must also have the [NVIDIA drivers](https://www.nvidia.com/en-us/drivers/unix/) installed on your host machine.

For Kubernetes, the [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator) provides a streamlined way to install the NVIDIA drivers and container toolkit and configure your cluster to use GPUs.
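
Before building, it can help to verify that containers on the host can see the GPU at all. This is a generic NVIDIA Container Toolkit check rather than an AIKit command, and the CUDA image tag is only an example:

```bash
# sanity check: run nvidia-smi inside a container with GPU access
# (any CUDA-enabled image that ships nvidia-smi will do)
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```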

To get started with GPU-accelerated inferencing, set the following in your `aikitfile` and build your model:

```yaml
runtime: cuda # use NVIDIA CUDA runtime
```

For the `llama` backend, set the following in your `config`:

```yaml
f16: true # use float16 precision
gpu_layers: 35 # number of layers to offload to GPU
low_vram: true # for devices with low VRAM
```

:::tip
Make sure to customize these values based on your model and GPU specs; a quick way to check available VRAM is shown below.
:::
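
One way to pick sensible values is to check how much VRAM the GPU has before deciding how many layers to offload; `nvidia-smi` can report this directly on the host (this is a host-side check, not part of the `aikitfile`):

```bash
# report total and currently used GPU memory to help size gpu_layers
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
```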

:::note
For `exllama` and `exllama2` backends, GPU acceleration is enabled by default and cannot be disabled.
:::

After building the model, you can run it with the [`--gpus all`](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/docker-specialized.html#gpu-enumeration) flag to enable GPU support:

```bash
# for pre-made models, replace "my-model" with the image name
docker run --rm --gpus all -p 8080:8080 my-model
```

If GPU acceleration is working, you'll see output similar to the following in the debug logs:

```bash
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr ggml_init_cublas: found 1 CUDA devices:
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr   Device 0: Tesla T4, compute capability 7.5
...
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: using CUDA for GPU acceleration
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: mem required = 70.41 MB (+ 2048.00 MB per state)
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: offloading 32 repeating layers to GPU
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: offloading non-repeating layers to GPU
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: offloading v cache to GPU
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: offloading k cache to GPU
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: offloaded 35/35 layers to GPU
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: VRAM used: 5869 MB
```
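
These lines come from the container's output, so one way to watch for them is to follow the container logs while sending a request; the container ID below is a placeholder:

```bash
# follow the model container's logs and filter for the CUDA offloading messages
docker logs -f <container-id> 2>&1 | grep -i cuda
```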