We provide a conversation demo with a multi-modal agent, using the [chainlit](https://github.com/Chainlit/chainlit) framework. For more information, please visit the official chainlit documentation [here](https://docs.chainlit.io/get-started/overview).
For a simple chat experience, we load an LLM, such as [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), by specifying its configuration like so:
```yaml
task: text-generation
model: "meta-llama/Meta-Llama-3-8B-Instruct"
do_sample: false
max_new_tokens: 300
```
Then, run the following:
```sh
CONFIG=config/regular_chat.yaml chainlit run fastrag/ui/chainlit_no_rag.py
```
For a chat using a RAG pipeline, specify the tools you wish to use in the following format:
```yaml
description: 'useful for when you need to retrieve text to answer questions. Use the following format: {{ "input": [your tool input here ] }}.'
output_variable: "documents"
```
Then, run the application using the command:
```sh
CONFIG=config/rag_pipeline_chat.yaml chainlit run fastrag/ui/chainlit_pipeline.py
```
## Screenshot

# Multi-Modal Conversational Agent with Chainlit
In this demo, we use the [`xtuner/llava-llama-3-8b-v1_1-transformers`](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers) model as a conversational agent that can decide which retriever to use to respond to the user's query.
To do this, we use dynamic reasoning with [ReAct](https://arxiv.org/abs/2210.03629) prompts, resulting in multiple logical turns.
To explore all the steps to build the agent system, you can check out our [Example Notebook](../examples/multi_modal_react_agent.ipynb).
For more information on how to use ReAct, feel free to visit [Haystack's original tutorial](https://haystack.deepset.ai/tutorials/25_customizing_agent), which our demo is based on.
To run the demo, simply run:
```sh
CONFIG=config/visual_chat_agent.yaml chainlit run fastrag/ui/chainlit_multi_modal_agent.py
```
## Screenshot

# Available Chat Templates
## Default Template
```
The following is a conversation between a human and an AI. Do not generate the user response to your output.
{memory}
Human: {query}
AI:
```
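
For illustration, assuming `{memory}` holds the prior turns as alternating `Human:`/`AI:` lines (an assumption about how the conversation memory is serialized), a rendered prompt might look like:

```
The following is a conversation between a human and an AI. Do not generate the user response to your output.
Human: What is fastRAG?
AI: fastRAG is a research framework for building efficient retrieval augmented generation pipelines.
Human: Which models does it support?
AI:
```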
## Llama 2 Template (Llama2)
```
<s>[INST] <<SYS>>
The following is a conversation between a human and an AI. Do not generate the user response to your output.
<</SYS>>

{memory}{query} [/INST]
```
Note that here, the user messages will be:
```
<s>[INST] {USER_QUERY} [/INST]
```
And the model messages will be:
```
{ASSISTANT_RESPONSE} </s>
```
## User-Assistant (UserAssistant)
```
### System:
The following is a conversation between a human and an AI. Do not generate the user response to your output.
{memory}

### User: {query}
### Assistant:
```
## User-Assistant for Llava (UserAssistantLlava)
For the v1.5 Llava models, we define a specific template, as shown in [this post regarding Llava models](https://huggingface.co/docs/transformers/model_doc/llava).
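
Based on the prompt format described in that documentation, the template follows this structure (a sketch; the exact template string used in fastRAG may differ):

```
USER: <image>
{query}
ASSISTANT:
```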
Comments, suggestions, issues and pull-requests are welcome! :heart:
> [!IMPORTANT]
> Now compatible with Haystack v2+. Please report any possible issues you find.
## :mega: Updates
- **2024-04**: [(Extra Demo)](extras/rag_on_client.ipynb) **Chat with your documents on Intel Meteor Lake iGPU**.
- **2023-12**: Gaudi2 and ONNX runtime support; Optimized Embedding models; Multi-modality and Chat demos; [REPLUG](https://arxiv.org/abs/2301.12652) text generation.
- **2023-06**: ColBERT index modification: adding/removing documents; see [IndexUpdater](libs/colbert/colbert/index_updater.py).
- **2023-05**: [RAG with LLM and dynamic prompt synthesis example](examples/rag-prompt-hf.ipynb).
- **2023-04**: Qdrant `DocumentStore` support.
## REPLUG

REPLUG lets us process a larger number of retrieved documents without limiting ourselves to the LLM's context window size. The method works with any LLM and requires no fine-tuning; see ([Shi et al. 2023](#shiREPLUGRetrievalAugmentedBlackBox2023)) for more details.

We provide an implementation of REPLUG ensembling inference using the `ReplugHFLocalInvocationLayer` invocation layer. Our implementation supports most Hugging Face models with `.generate()` capabilities (i.e., models that implement the generation mixin). For a complete example, see [REPLUG Parallel
### Loading the Quantized Model
Now that our model is quantized, we can load it in our framework by using the `ORTGenerator` generator.
```python
generator = ORTGenerator(
    model="my/local/path/quantized",
    task="text-generation",
    generation_kwargs={
        "max_new_tokens": 100,
    }
)
```
## fastRAG running quantized LLMs using OpenVINO
We provide a method for running quantized LLMs with [OpenVINO](https://docs.openvino.ai/2024/home.html) and [optimum-intel](https://github.com/huggingface/optimum-intel).
We recommend checking out our [notebook](examples/rag_with_openvino.ipynb) with all the details, including the quantization and pipeline construction.
### Installation
Run the following command to install our dependencies:
```sh
pip install -e .[openvino]
```
For more information regarding the installation process, we recommend checking out the guides provided by [OpenVINO](https://docs.openvino.ai/2024/home.html) and [optimum-intel](https://github.com/huggingface/optimum-intel).
### LLM Quantization
We can use the [OpenVINO tutorial notebook](https://github.com/openvinotoolkit/openvino_notebooks/blob/main/notebooks/254-llm-chatbot/254-llm-chatbot.ipynb) to quantize an LLM to our liking.
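
As a minimal sketch, weight-only 8-bit compression can also be done directly with optimum-intel; the model id and output path below are placeholders, and the linked notebook covers the full, recommended flow:

```python
from optimum.intel import OVModelForCausalLM

# Export the model to OpenVINO IR with 8-bit weight compression (placeholder model id).
model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    export=True,
    load_in_8bit=True,
)

# Save the compressed model so it can be loaded by the generator below.
model.save_pretrained("my/local/path/openvino-quantized")
```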
### Loading the Quantized Model
Now that our model is quantized, we can load it in our framework by using the `OpenVINOGenerator` component.
```python
from fastrag.generators.openvino import OpenVINOGenerator
```
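
A usage sketch mirroring the `ORTGenerator` example above; the constructor arguments shown here (compressed model directory, OpenVINO device, generation settings) are assumptions based on that pattern rather than a confirmed signature, so check the linked notebook for the exact API:

```python
from fastrag.generators.openvino import OpenVINOGenerator

# All argument names and values below are illustrative assumptions.
generator = OpenVINOGenerator(
    model="microsoft/phi-2",                                  # base model id (placeholder)
    compressed_model_dir="my/local/path/openvino-quantized",  # quantized model from the step above
    device_openvino="CPU",                                    # OpenVINO target device
    task="text-generation",
    generation_kwargs={
        "max_new_tokens": 100,
    },
)
```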
## fastRAG Running RAG Pipelines with LLMs on a Llama CPP backend
To run LLMs effectively on CPUs, especially on client-side machines, we offer a method for running LLMs using [llama.cpp](https://github.com/ggerganov/llama.cpp).
We recommend checking out our [tutorial notebook](examples/client_inference_with_Llama_cpp.ipynb) with all the details, including processes such as downloading GGUF models.
### Installation
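A plausible install command, following the same extras pattern as the OpenVINO installation above (the exact extra name is an assumption; check the project's setup configuration for the supported extras):

```sh
# Extra name assumed; see the repository's setup configuration for the exact extras.
pip install -e .[llama_cpp]
```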
## Optimized Embedding Models
Bi-encoder embedders are key components of retrieval augmented generation pipelines, used mainly for indexing documents and for online re-ranking. We provide support for quantized `int8` models that have low latency and high throughput, using the [`optimum-intel`](https://github.com/huggingface/optimum-intel) framework.