Looking to build and deploy a real-world RAG application? Welcome!
This repo provides a base skeleton for building a Retrieval-Augmented Generation (RAG)-enhanced LLM application. It is intended as a starting point for developers building real-world production RAG applications.
It lets you quickly set up a web application, operated via a web API, into which you can load large amounts of custom data for use with your LLM.
The project allows you to:
- Locally deploy (via Docker) a vector database.
- Locally deploy an API to add/remove custom textual data from the vector DB.
- Follow instructions on how to deploy to AWS.
- Send chat messages to the LLM via the API, which can be used to ask questions about the custom data. The LLM can "look up" data from the vector DB for use as part of the response.
- Access chat history via the API; it is stored in a local Postgres.
- Stream the LLM's response through the API.
This repo is built by Hipposys Ltd. and serves as a starting point for new RAG projects for our clients. It is open-sourced both for educational purposes and to serve as a base for commercial projects.
The current skeleton supports Amazon Bedrock or OpenAI as the LLM provider, and Milvus or Chroma as the vector database. Additional LLM providers and vector databases are planned to be added in the near future.
- Web server providing endpoints for:
  - Adding and removing custom data.
  - Sending and receiving chat messages, with support for streaming the LLM's response.
- Amazon Bedrock or OpenAI as an LLM provider.
- Milvus and Chroma vector database integrations.
- Built on top of LangChain.
- Chat history stored in a local Postgres and accessible via the API.
Currently, you must have an Amazon Bedrock or OpenAI account to use this project.
You'll also need Docker for the local deploy.
Looking for help or have questions? Contact us at [email protected].
We work with clients on a variety of AI engineering and data engineering projects.
The local deployment relies on having Docker installed.
It also relies on having access to Amazon Bedrock or OpenAI models, which are used as the LLM provider of the application.
- Clone the repository:

  ```shell
  git clone https://github.com/hipposys-ltd/rag-app-skeleton.git
  ```

- Navigate to the project directory:

  ```shell
  cd rag-app-skeleton
  ```

- Make sure you have Docker installed and running.
- Create an `.env` file:

  ```shell
  cp .env-template .env
  ```

- Fill it in with the necessary credentials and settings.
- For the initial local deployment, the most important credentials are the ones defining your LLM provider:
  - `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` if you'll be using AWS.
  - `OPENAI_API_KEY` if you'll be using OpenAI.
- Other credentials may be suitable for local development, but should be replaced when deploying to a remote server (e.g. prod) for additional security.
- If you want to use a local LLM or embedding model, change `LLM_MODEL_ID` and `EMBEDDING_MODEL` to a model available on Ollama's model search, using the `ollama:` prefix (e.g., `LLM_MODEL_ID='ollama:llama3.2:1b'`; `EMBEDDING_MODEL='ollama:mxbai-embed-large'`). Note: the selected LLM must support tools for compatibility with this architecture. If you modify the default values, remember to update the `LLM` and `EMBEDDING_MODEL` arguments in `docker-compose.local-models.yml`.
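As an illustration, an `.env` fragment for a local Ollama setup might look like the sketch below. The model names come from the example above; the exact set of variables is defined by `.env-template`, so treat this only as a shape, not as the authoritative file contents:

```shell
# Hypothetical .env fragment for local models via Ollama
LLM_MODEL_ID='ollama:llama3.2:1b'
EMBEDDING_MODEL='ollama:mxbai-embed-large'

# Comment out credentials of providers you are not using, e.g.:
# OPENAI_API_KEY=
# AWS_ACCESS_KEY_ID=
# AWS_SECRET_ACCESS_KEY=
```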
- Build and run the project via Docker. The base command is:

  ```shell
  docker compose -f docker-compose.yml up -d --build
  ```

  Add one or more of the following compose files, depending on your use case:
  - Milvus: `-f docker-compose.milvus.yml`
  - Chroma: `-f docker-compose.chromadb.yml`
  - To use local models: `-f docker-compose.local-models.yml`
  - To enable the UI for chat and the Milvus DB: `-f docker-compose.ui.yml`
- After running Docker, you should have multiple services running.
  - You can check the status of the services with `docker ps -a`.
  - Make sure the `fastapi`, `postgres` and `milvus-standalone` containers are running.
- Go to `localhost:8080/hello-world` to see a `{"hello": "world"}` response from the server.
- You now have a running instance of the RAG application.
- Make sure that you have a Bedrock model available in your AWS account:
  - Log into the AWS console.
  - Navigate to the `Amazon Bedrock` service.
  - In the left navigation pane: `Bedrock configurations` -> `Model access`.
  - We currently use `Claude 3.5 Sonnet` for inference and `Titan Text Embeddings V2` for embeddings.
    - Note that this may change: you can either change it yourself, or find that it has already been changed in the code.
    - Note that these models may not be available in all regions; we currently use `us-east-1` (N. Virginia).
  - If these models are not enabled, you'll have to request access. Access should be granted immediately upon request.
- You'll need to generate access credentials for your Amazon account for use in the application.
- In your `.env` file:
  - Add your `OPENAI_API_KEY`.
  - Set `LLM_MODEL_ID` to an OpenAI-compatible model with the `openai:` prefix (e.g., `openai:gpt-3.5-turbo`).
  - Comment out any unused environment variables (e.g., AWS-related variables).
- In `app/models/__init__.py`, update the code to use OpenAI models instead of the default Bedrock models. Make sure to adjust both the inference model and the embeddings model.
- Finally, restart Docker Compose to apply the `.env` changes.
We're now going to give a simple example of how to use the API. The plan is to:
- Query our LLM, via the API, and ask for specific "inside" information, which it does not have access to.
- Use the API to add the information to the vector database via simple textual data.
- Query the LLM again and ask for the same information, which it now has access to.
Note that by default, the repo is configured to return plain-text responses when running in local mode (controlled via `.env`) and JSON-formatted responses when running in non-local modes (e.g. prod).
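A client that may run against either mode therefore needs to handle both response shapes. A minimal sketch of such a helper (the `"message"` key used for the JSON case is an assumed field name; check the server code for the real schema):

```python
import json


def parse_chat_reply(body: str) -> str:
    """Return the assistant text from a /chat/ask response body.

    Local mode returns plain text; non-local modes return JSON.
    The "message" field name is an assumption for illustration.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return body  # local mode: plain text passes through unchanged
    if isinstance(payload, dict):
        return payload.get("message", body)
    return body


print(parse_chat_reply("hello"))                  # plain text
print(parse_chat_reply('{"message": "hi"}'))      # JSON is unwrapped
```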
The following command sends a chat query via the API:

```shell
curl \
  -i \
  -X POST \
  --no-buffer \
  -b cookies.tmp.txt -c cookies.tmp.txt \
  -H 'Content-Type: application/json' \
  -d '{"message": "What headphones are recommended by the company for listening to podcasts?"}' \
  http://localhost:8080/chat/ask
```

Highlighting the important parts of the command:
- We are hitting the `/chat/ask` endpoint to actually ask the LLM a question.
- We are using `-b` and `-c` to save the cookies from the server. This lets the server continue our chat session, so additional requests to `/chat/ask` will be part of the same chat session.
- The message itself is the chat message we are sending to the LLM.
The output should be a message of not finding anything in the company's internal documents about a headphone recommendation. There will also likely be a general message trying to help.
We'll add two sources of information about headphone choices to the vector database:
```shell
curl \
  -i \
  -X POST \
  --no-buffer \
  -b cookies.tmp.uc.txt -c cookies.tmp.uc.txt \
  -H 'Content-Type: application/json' \
  -d '{"source_id": "1001", "source_name": "Headphones Guide I", "text": "The recommended headphones to use while listening to podcasts are AirPods Pro", "modified_at": "2024-09-22T17:04"}' \
  http://localhost:8080/embeddings/text/store
```

```shell
curl \
  -i \
  -X POST \
  --no-buffer \
  -b cookies.tmp.uc.txt -c cookies.tmp.uc.txt \
  -H 'Content-Type: application/json' \
  -d '{"source_id": "1001", "source_name": "Headphones Guide II", "text": "The recommended headphones to use while listening to music is BoseQC35", "modified_at": "2024-09-22T17:04"}' \
  http://localhost:8080/embeddings/text/store
```

Here, we are sending data to the `/embeddings/text/store` endpoint. This endpoint is responsible for storing the text data in the vector database. We store the data itself, as well as metadata about the source of the data: the source name, the source id, and the modification date.
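Programmatically, the body for `/embeddings/text/store` can be assembled with a small helper. The field names and the `YYYY-MM-DDTHH:MM` timestamp format are taken from the curl examples above; treat anything beyond that as an assumption:

```python
import json
from datetime import datetime


def build_store_payload(source_id: str, source_name: str,
                        text: str, modified_at: datetime) -> str:
    """Serialize a document for POSTing to /embeddings/text/store.

    Field names mirror the curl examples above.
    """
    return json.dumps({
        "source_id": source_id,
        "source_name": source_name,
        "text": text,
        # Matches the "2024-09-22T17:04" format used in the examples.
        "modified_at": modified_at.strftime("%Y-%m-%dT%H:%M"),
    })


body = build_store_payload(
    "1001",
    "Headphones Guide I",
    "The recommended headphones to use while listening to podcasts are AirPods Pro",
    datetime(2024, 9, 22, 17, 4),
)
print(body)
```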
Now we can query the LLM again and ask for the same information, which it now has access to:

```shell
curl \
  -i \
  -X POST \
  --no-buffer \
  -b cookies.tmp.txt -c cookies.tmp.txt \
  -H 'Content-Type: application/json' \
  -d '{"message": "What headphones are recommended by the company?"}' \
  http://localhost:8080/chat/ask
```

This time, you should see a response from the LLM that includes the information we added to the vector database.
You can delete a source using the `/embeddings/text/delete` endpoint:

```shell
curl \
  -i \
  -X DELETE \
  --no-buffer \
  -b cookies.tmp.txt -c cookies.tmp.txt \
  -H 'Content-Type: application/json' \
  -d '{"source_id": "1001"}' \
  http://localhost:8080/embeddings/text/delete
```

Currently, the project supports Chroma and Milvus, with plans to add more vector databases in the future. By default, Milvus is used, but switching to another supported database is simple:
- Open `app/databases/vector/__init__.py` and update the `VectorDB` assignment. For example, to switch to Chroma:

  ```python
  from app.databases.vector.chroma import Chroma
  # from app.databases.vector.milvus import Milvus

  VectorDB = Chroma
  ```

- When running `docker compose`, use the `docker-compose` configuration file that matches the database you've chosen. For example, to use Chroma:

  ```shell
  docker compose \
    -f docker-compose.yml \
    -f docker-compose.chromadb.yml \
    up -d --build
  ```
To run the tests, use the following command:

```shell
docker exec -it fastapi bash -c "pytest app/"
```

For faster test execution, at the expense of cleaner output, you can add the `-n` option to parallelize tests across multiple workers:

```shell
docker exec -it fastapi bash -c "pytest -n 5 app/"
```

In this example, 5 parallel workers will execute the tests.
A more complete guide to deploying to production will be added later.
For now, you can check the notes in prod/README.md and the other files in that directory.
When starting the project locally (following the instructions above), a Jupyter Lab server will automatically start. The server configuration is defined in docker-compose.yml.
To access Jupyter Lab, open http://localhost:8890 in your web browser. On your first visit, you'll need to provide a login token. You can retrieve the token from the logs of the `jupyter` Docker container by running:

```shell
docker logs jupyter 2>&1 | grep token= | tail -n 1 | grep -E '=.+$'
```

After logging in, navigate to `/work/notebooks` to access the existing notebooks or create new ones.