A Personal Assistant built on Retrieval-Augmented Generation (RAG) and the LLaMA-3.1-8B-Instant Large Language Model (LLM). This tool answers questions about a PDF document by combining retrieval over the document's contents with LLM-based generation.
Retrieval-Augmented Generation (RAG) is a powerful technique in natural language processing (NLP) that combines retrieval-based methods with generative models to produce more accurate and contextually relevant outputs. This approach was introduced in the 2020 paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" by Facebook AI Research (FAIR).
For further reading and a deeper understanding of RAG, refer to the original paper by Facebook AI Research: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
The RAG model consists of three main components:
- Indexer: This component creates an index of the corpus to facilitate efficient retrieval of relevant documents.
- Retriever: This component retrieves relevant documents from the indexed corpus based on the input query.
- Generator: This component generates responses conditioned on the retrieved documents.
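The indexer preprocesses the corpus $D = \{d_1, \dots, d_N\}$ into an index that supports efficient lookup.
The retriever selects the top $k$ documents $d_1, \dots, d_k$ for a query $q$, ranked by the relevance score $s(q, d_i)$.
The generator produces a response $r$ conditioned on the query and a retrieved document, with probability $P(r \mid q, d_i)$.
The final probability of generating a response $r$ is obtained by marginalizing over the retrieved documents:
$$P(r \mid q) = \sum_{i=1}^{k} P(d_i \mid q) \, P(r \mid q, d_i)$$
Here, $P(d_i \mid q)$ is the relevance score $s(q, d_i)$ normalized into a probability distribution over the top $k$ documents (for example, with a softmax).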
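The marginalization over retrieved documents can be illustrated numerically. The following is a minimal sketch, not the paper's implementation: it assumes the relevance scores are converted to document probabilities with a softmax, and the scores and per-document response probabilities are made-up values for illustration.

```python
import math


def marginal_response_probability(scores, response_probs):
    """Return P(r | q) = sum_i P(d_i | q) * P(r | q, d_i).

    scores: relevance scores s(q, d_i) for the top-k documents.
    response_probs: P(r | q, d_i) for the same documents, in the same order.
    """
    if len(scores) != len(response_probs):
        raise ValueError("scores and response_probs must have the same length")
    # Assumption: a softmax over the scores gives P(d_i | q).
    # Subtracting the maximum keeps the exponentials numerically stable.
    max_score = max(scores)
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    doc_probs = [w / total for w in weights]
    return sum(p_d * p_r for p_d, p_r in zip(doc_probs, response_probs))


# Hypothetical values for k = 3 retrieved documents.
scores = [2.0, 1.0, 0.0]
response_probs = [0.9, 0.5, 0.1]
print(round(marginal_response_probability(scores, response_probs), 4))
```

Documents with higher relevance scores contribute more to the final probability, so a response supported mainly by the best-matching document still dominates the sum.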
The RAG model is trained in three stages:
- Indexer Training: The indexer is trained to create an efficient and accurate mapping of queries to documents.
- Retriever Training: The retriever is trained to maximize the relevance score $s(q, d_i)$ for relevant documents.
- Generator Training: The generator is trained to maximize the probability $P(r \mid q, d_i)$ for the ground-truth responses.
During inference, the RAG model follows these steps:
- Indexing: The corpus is indexed to facilitate efficient retrieval.
- Retrieval: The top $k$ documents are retrieved for a given query based on their relevance scores.
- Generation: A response is generated conditioned on the input query and the retrieved documents. The final response is obtained by marginalizing over the retrieved documents as described above.
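The inference steps can be sketched end to end with a toy example. This is an illustrative sketch only: it uses bag-of-words vectors and cosine similarity as a stand-in for the relevance score, whereas this project relies on learned embeddings, and all function names here are hypothetical. The generation step is reduced to building the prompt that would be sent to an LLM.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (a deliberate simplification)."""
    return text.lower().split()


def build_index(corpus):
    """Indexing: pair each document with its bag-of-words vector."""
    return [(doc, Counter(tokenize(doc))) for doc in corpus]


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(index, query, k=2):
    """Retrieval: return the k (score, document) pairs with the highest score."""
    q_vec = Counter(tokenize(query))
    scored = [(cosine(q_vec, d_vec), doc) for doc, d_vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(query, retrieved):
    """Generation input: the query plus retrieved context for an LLM."""
    context = "\n".join(doc for _, doc in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}"


corpus = [
    "reverse engineering of binaries",
    "cooking pasta at home",
    "engineering a retrieval index",
]
index = build_index(corpus)
top = retrieve(index, "reverse engineering", k=2)
print(build_prompt("reverse engineering", top))
```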
RAG leverages the strengths of indexing, retrieval-based, and generation-based models to produce more accurate and informative responses. By conditioning the generation on retrieved documents, RAG can incorporate external knowledge from large corpora, leading to better performance on various tasks.
The combination of indexer, retriever, and generator in the RAG model makes it a powerful approach for tasks that require access to external knowledge and the ability to generate coherent and contextually appropriate responses.
- To select a Conda environment in Visual Studio Code, press the play button in the next cell, which opens a command prompt, then select `Python Environments...`.
- In the new command prompt, select `+ Create Python Environment`.
- In the next command prompt, select `Conda Creates a .conda Conda environment in the current workspace`.
- In the next command prompt, select `* Python 3.11`.
!conda create -n pa python=3.11 -y
For the Conda environment to become available, close VSCode, reopen it, and click the kernel selector in the top-right of VSCode.
- In the VSCode pop-up command window, select `Select Another Kernel...`.
- In the next command window, select `Python Environments...`.
- In the next command window, select `pa (Python 3.11.9)`.
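After switching kernels, it can help to confirm that the notebook is really running on the new environment. The snippet below is a small sketch using only the standard library; the helper name `kernel_info` is hypothetical. If the selection worked, the executable path should point inside the `pa` environment and the version should read 3.11.

```python
import sys


def kernel_info():
    """Describe the Python interpreter backing the current kernel."""
    return {
        "executable": sys.executable,
        "version": f"{sys.version_info.major}.{sys.version_info.minor}",
    }


print(kernel_info())
```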
!conda install -n pa \
pytorch \
torchvision \
torchaudio \
cpuonly \
-c pytorch \
-c conda-forge \
--yes
%pip install -U ipywidgets
%pip install -U requests
%pip install -U llama-index
%pip install -U llama-index-embeddings-huggingface
%pip install -U llama-index-llms-groq
%pip install -U groq
%pip install -U gradio
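Once the installs finish, a quick check can confirm that every package is visible to the active kernel. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `missing_packages` is hypothetical, and the list simply mirrors the `%pip install` lines above.

```python
from importlib import metadata

# Distribution names taken from the %pip install commands in this notebook.
REQUIRED = [
    "ipywidgets",
    "requests",
    "llama-index",
    "llama-index-embeddings-huggingface",
    "llama-index-llms-groq",
    "groq",
    "gradio",
]


def missing_packages(names):
    """Return the distribution names that are not installed in this environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


print("Missing:", missing_packages(REQUIRED) or "none")
```

If anything is reported as missing, the kernel is most likely attached to a different environment than the one the packages were installed into.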
import os
import platform
import subprocess
import requests


def install_tesseract():
    """
    Installs Tesseract OCR based on the operating system.
    """
    os_name = platform.system()
    if os_name == "Linux":
        print("Detected Linux. Installing Tesseract using apt-get...")
        subprocess.run(["sudo", "apt-get", "update"], check=True)
        subprocess.run(["sudo", "apt-get", "install", "-y", "tesseract-ocr"], check=True)
    elif os_name == "Darwin":
        print("Detected macOS. Installing Tesseract using Homebrew...")
        subprocess.run(["brew", "install", "tesseract"], check=True)
    elif os_name == "Windows":
        # Download the installer; it is only saved locally and must be run to finish the installation.
        tesseract_installer_url = "https://github.com/UB-Mannheim/tesseract/releases/download/v5.4.0.20240606/tesseract-ocr-w64-setup-5.4.0.20240606.exe"
        installer_path = "tesseract-ocr-w64-setup-5.4.0.20240606.exe"
        response = requests.get(tesseract_installer_url, timeout=60)
        response.raise_for_status()
        with open(installer_path, "wb") as file:
            file.write(response.content)
        # Add the default install location to PATH for this session.
        tesseract_path = r"C:\Program Files\Tesseract-OCR"
        os.environ["PATH"] += os.pathsep + tesseract_path
        try:
            result = subprocess.run(["tesseract", "--version"], check=True, capture_output=True, text=True)
            print(result.stdout)
        except (subprocess.CalledProcessError, FileNotFoundError) as e:
            # FileNotFoundError is raised when the tesseract executable is not on PATH.
            print(f"Error running Tesseract: {e}")
    else:
        print(f"Unsupported OS: {os_name}")


install_tesseract()
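Whichever platform branch ran, it may be worth confirming that the `tesseract` executable is actually reachable before continuing. A minimal sketch with the standard library's `shutil.which` follows; the helper name `find_tesseract` is hypothetical.

```python
import shutil


def find_tesseract(command="tesseract"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(command)


path = find_tesseract()
print(path if path else "tesseract not found on PATH")
```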
import webbrowser
url = "https://www.ilovepdf.com/ocr-pdf"
webbrowser.open_new(url)
import os
from llama_index.core import (
    Settings,
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.groq import Groq
import gradio as gr
Visit https://console.groq.com/keys and set up an API key, then replace `<GROQ_API_KEY>` below with the newly generated key.
os.environ["GROQ_API_KEY"] = "<GROQ_API_KEY>"
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
os.environ["TOKENIZERS_PARALLELISM"] = "false"
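Pasting the key directly into the notebook makes it easy to commit by accident. One alternative, sketched here as an assumption rather than part of the original notebook, is to read the key from the environment and fall back to an interactive prompt. The helper name `resolve_api_key` and its parameters are hypothetical.

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, name="GROQ_API_KEY", prompt=getpass):
    """Return the API key from the environment, prompting if it is unset.

    A value that still looks like the "<GROQ_API_KEY>" placeholder is treated as unset.
    """
    key = env.get(name, "").strip()
    if not key or key.startswith("<"):
        key = prompt(f"Enter {name}: ").strip()
    if not key:
        raise ValueError(f"{name} is required")
    return key


# Usage in the notebook (prompts only when the variable is missing):
# GROQ_API_KEY = resolve_api_key()
```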
def load_single_pdf_from_directory(directory_path):
    """
    Load a single PDF file from the specified directory.

    Args:
        directory_path (str): The path to the directory containing the PDF files.

    Returns:
        documents (list): The loaded documents if exactly one PDF is found, otherwise None.
    """
    pdf_files = [file for file in os.listdir(directory_path) if file.lower().endswith(".pdf")]
    if len(pdf_files) != 1:
        print("Error: There must be exactly one PDF file in the directory.")
        return None
    file_path = os.path.join(directory_path, pdf_files[0])
    reader = SimpleDirectoryReader(input_files=[file_path])
    documents = reader.load_data()
    return documents


directory_path = "files"
documents = load_single_pdf_from_directory(directory_path)
if documents is not None:
    print(f"Successfully loaded {len(documents)} page(s).")
else:
    print("ABORTING NOTEBOOK!")
Successfully loaded 2 page(s).
text_splitter = SentenceSplitter(chunk_size=2048, chunk_overlap=200)
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
llm = Groq(model="llama-3.1-8b-instant", api_key=GROQ_API_KEY)
Settings.embed_model = embed_model
Settings.llm = llm
print("VectorStoreIndex initialization")
# Pass the splitter as a transformation so its chunk settings are applied during indexing.
vector_index = VectorStoreIndex.from_documents(
    documents,
    show_progress=True,
    transformations=[text_splitter]
)
VectorStoreIndex initialization
Parsing nodes: 0%| | 0/2 [00:00<?, ?it/s]
Generating embeddings: 0%| | 0/10 [00:00<?, ?it/s]
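The `chunk_size` and `chunk_overlap` settings control how much text each chunk holds and how much neighboring chunks share. The sketch below illustrates the idea with a plain sliding window over a token list. It is a simplification and not how `SentenceSplitter` works internally, since the real splitter also respects sentence boundaries; the function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split tokens into windows of chunk_size that share chunk_overlap items."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = list(range(10))
print(chunk_tokens(tokens, chunk_size=4, chunk_overlap=1))
# → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

A larger overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary, at the cost of embedding more redundant text.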
vector_index.storage_context.persist(persist_dir="./storage_mini")
storage_context = StorageContext.from_defaults(persist_dir="./storage_mini")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()
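Because the index is persisted to disk, later runs do not have to rebuild it. The sketch below shows one possible load-or-build pattern. The `build` and `load` callables are hypothetical placeholders: in this notebook they would wrap the `VectorStoreIndex` construction plus `persist` call and the `load_index_from_storage` call, respectively.

```python
import os


def load_or_build(persist_dir, build, load):
    """Load from persist_dir when it already holds files, otherwise build.

    build and load are caller-supplied callables that receive persist_dir.
    """
    if os.path.isdir(persist_dir) and os.listdir(persist_dir):
        return load(persist_dir)
    return build(persist_dir)


# Hypothetical usage with stand-in callables:
result = load_or_build(
    "./storage_mini_example_that_does_not_exist",
    build=lambda path: f"built index for {path}",
    load=lambda path: f"loaded index from {path}",
)
print(result)
```

Note that a reused index reflects the PDF it was built from, so the storage directory should be cleared when the source document changes.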
def query_function(query):
    """
    Processes a query using the query engine and returns the response.

    Args:
        query (str): The query string to be processed by the query engine.

    Returns:
        str: The response generated by the query engine based on the input query.

    Example:
        >>> query_function("What is Reverse Engineering?")
        'Reverse engineering is the process of deconstructing an object to understand its design, architecture, and functionality.'
    """
    response = query_engine.query(query)
    # Convert the response object to plain text for the Gradio textbox.
    return str(response)
iface = gr.Interface(
    fn=query_function,
    inputs=gr.Textbox(label="Query"),
    outputs=gr.Textbox(label="Response")
)
iface.launch()
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
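For a document assistant, it can be useful to show which pages an answer was drawn from. The sketch below formats a response together with its sources. It assumes the response object exposes a `source_nodes` list whose items carry a `score` and a `node` with a `metadata` dictionary, and that the PDF reader stores a `page_label` entry in that metadata; these are assumptions about the LlamaIndex response shape, and the helper name `format_response` is hypothetical.

```python
def format_response(response, max_sources=3):
    """Return the answer text followed by one line per source, if any."""
    lines = [str(response)]
    # Objects without source_nodes (such as plain strings) yield the answer only.
    for item in getattr(response, "source_nodes", [])[:max_sources]:
        page = item.node.metadata.get("page_label", "?")
        score = getattr(item, "score", None)
        score_text = f"{score:.2f}" if score is not None else "n/a"
        lines.append(f"[source] page {page} (score {score_text})")
    return "\n".join(lines)


print(format_response("An answer without source information."))
```

A function like this could be returned from `query_function` in place of the plain string, so the Gradio textbox shows the answer followed by its supporting pages.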
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.