- [Hugging Face Token](https://huggingface.co/settings/tokens)
- Access to [Meta Llama](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct/) model.
- Access to [Meta Llama Guard](https://huggingface.co/meta-llama/Llama-Guard-3-8B/) model.
- Some of the example scripts use `jq`, a JSON parsing utility, which you can install via `brew install jq` on macOS or via your favorite package manager on Linux (a quick check is sketched below).
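
For example, a quick way to confirm `jq` is available before running the example scripts (the `dnf`/`apt-get` package name `jq` is an assumption about your distribution; adjust for your package manager):

```bash
# Verify jq is on the PATH; install it if missing.
if ! command -v jq >/dev/null 2>&1; then
  sudo dnf install -y jq        # Fedora / RHEL
  # sudo apt-get install -y jq  # Debian / Ubuntu
fi

# Sanity check: extract a field from a small JSON document.
echo '{"model": "llama-3-2-3b-instruct"}' | jq -r '.model'
```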
### Supported Models

```
model: llama-3-2-3b-instruct
model: llama-guard-3-8b (shield)
```

# Deploying RAG Blueprint Step by Step
## Step 1: Deploy LLM Services
```bash
make install-llm-service NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct SAFETY=llama-guard-3-8b
```
This command deploys the LLM services that power the RAG application. Here's what happens when you run it:
1. The command creates the namespace `llama-stack-rag` if it doesn't already exist
2. It creates required secrets, including the Hugging Face token secret
3. It deploys the `llm-service` Helm chart with specific model configurations:
   - Enables the `llama-3-2-3b-instruct` model as the main LLM
   - Enables the `llama-guard-3-8b` model as the safety filter
Each enabled model triggers the creation of:
- A Persistent Volume Claim (PVC) to store the model files
- A KServe InferenceService resource that defines how to run the model
The actual model pods are created by the KServe operator in OpenShift AI, which processes the InferenceService resources. Each model runs in its own vLLM instance, which is a high-performance inference engine for LLMs.
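
If you want to watch the rollout, the commands below are one way to inspect what the chart created (a minimal sketch: it assumes the `llama-stack-rag` namespace from the install command above and standard `oc`/KServe resource names):

```bash
NAMESPACE=llama-stack-rag

# InferenceService resources created for each enabled model (KServe custom resources).
oc get inferenceservice -n "$NAMESPACE"

# PVCs that hold the downloaded model files.
oc get pvc -n "$NAMESPACE"

# vLLM model pods; pulling images and downloading weights can take several minutes.
oc get pods -n "$NAMESPACE" -w
```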
Note that if you don't see pods being created after running this command, ensure the following (a few commands for checking these conditions are sketched after the list):
- Your OpenShift cluster has the OpenShift AI operator installed and properly configured
- You have sufficient GPU resources available (if using GPU versions of the models)
- The Hugging Face token provided has access to the requested models
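
A few commands that can help check those conditions (a sketch only; the operator CSV name, GPU resource name, and secret layout are assumptions and may differ in your cluster and chart version):

```bash
# OpenShift AI operator: look for its ClusterServiceVersion (names vary by version/channel).
oc get csv -A | grep -i rhods

# GPU capacity: nodes should report an allocatable GPU resource (assumes NVIDIA's nvidia.com/gpu).
oc describe nodes | grep -i 'nvidia.com/gpu'

# Hugging Face token: the secret created during install should exist in the target namespace
# (the exact secret name is chart-dependent).
oc get secrets -n llama-stack-rag

# Recent events often explain a stuck InferenceService or a pending pod.
oc get events -n llama-stack-rag --sort-by=.lastTimestamp | tail -n 20
```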
# Deploying RAG Blueprint All at once
Use the taint key from above as the `LLM_TOLERATION` and `SAFETY_TOLERATION`