RAG - Local index #1715
Conversation
Move the logic concerning linking files to a vector store into the Collection model. The index manager should only be the interface to the index
Test that add_files_to_index works as expected
Codecov Report: all tests successful, no failed tests found.
Just a preliminary review - I'll need to come back to this.
```python
        return embeddings.embed_query(content)

    def chunk_content(self, text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
        text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
```
is it necessary to do different splitting based on the llm provider? I would have thought the chunking was independent.
Also thinking we should look at https://github.com/microsoft/markitdown and then we can always use the markdown splitter
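For what it's worth, the provider-independent part is small — here's a rough sketch of chunking as a plain character window with overlap (a stand-in for `RecursiveCharacterTextSplitter`, not the PR's actual implementation):

```python
# Minimal provider-independent chunking sketch: fixed-size character
# windows that overlap by chunk_overlap characters. Illustrative only;
# the real code uses RecursiveCharacterTextSplitter.
def chunk_content(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is fully covered by the previous chunk.
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = chunk_content("abcdefghij", chunk_size=4, chunk_overlap=2)
# → ["abcd", "cdef", "efgh", "ghij"]
```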
> is it necessary to do different splitting based on the llm provider?
You're right, it isn't. I changed it here a bit.
Refactor form logic around which form fields to enable/disable when editing a collection
Consume the whole iterator in the test
apps/files/models.py
Outdated
```diff
@@ -171,7 +177,7 @@ class FileChunkEmbedding(BaseTeamModel):
     chunk_number = models.PositiveIntegerField()
     text = models.TextField()
     page_number = models.PositiveIntegerField(blank=True)
-    embedding = VectorField(dimensions=1024)
+    embedding = VectorField(dimensions=settings.EMBEDDING_VECTOR_SIZE)
```
In light of this article, I'm thinking of switching this for HalfVectorField
(same for the index).
```python
    ),
    # The Django ORM doesn't seem to support half vector fields with HNSW,
    # so we need to use raw SQL for the index creation.
    migrations.RunSQL(
        sql="""
```
This example didn't quite work: I got a `ProgrammingError` saying the syntax wasn't correct. I used

```python
indexes = [
    HnswIndex(
        OpClass(Cast('embedding', HalfVectorField(dimensions=3)), name='halfvec_cosine_ops'),
        name='embedding_index',
        m=16,
        ef_construction=64,
    )
]
```

I could not find a solution for it, but creating the index using SQL works.
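For reference, here is a sketch of what the raw SQL could look like, following pgvector's documented `halfvec` cast syntax. The table name, index name, and the 1024 dimension are assumptions, not necessarily the PR's actual values:

```python
# Hypothetical raw SQL for an HNSW index over a half-precision vector,
# to be passed to migrations.RunSQL(sql=...). Table/index names and the
# dimension are illustrative assumptions.
CREATE_HALFVEC_INDEX_SQL = """
CREATE INDEX embedding_halfvec_index
ON files_filechunkembedding
USING hnsw ((embedding::halfvec(1024)) halfvec_cosine_ops)
WITH (m = 16, ef_construction = 64);
"""
```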
Mostly small comments, but worth resolving before merge, especially the query-in-a-loop one.
```html
@@ -94,6 +94,11 @@ <h2 class="text-lg font-semibold">Files</h2>
        <div class="text-xs badge badge-ghost">{{ collection_file.file_size_mb }} MB</div>
    </div>
    {% if collection.is_index %}
        {% if not collection.is_remote_index %}
            <div class="flex flex-row items-center gap-2">
                <span class="text-xs text-base-content/70">{{ collection_file.chunk_count }} chunks</span>
```
This is doing a query inside a loop. It would be much better to annotate the objects with the count in the main query.
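To illustrate, here's the same idea in plain SQL with sqlite3: one query with a join and GROUP BY replaces a COUNT query per file. Table and column names here are hypothetical; in Django this would be an `annotate(chunk_count=Count(...))` on the main queryset.

```python
# Sketch of "annotate the count in the main query" instead of querying
# per file inside a template loop. Hypothetical schema for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE collection_file (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE file_chunk_embedding (id INTEGER PRIMARY KEY, file_id INTEGER);
    INSERT INTO collection_file VALUES (1, 'a.pdf'), (2, 'b.pdf');
    INSERT INTO file_chunk_embedding (file_id) VALUES (1), (1), (2);
""")

# One query for all files, rather than one COUNT query per file.
rows = conn.execute("""
    SELECT f.name, COUNT(e.id) AS chunk_count
    FROM collection_file f
    LEFT JOIN file_chunk_embedding e ON e.file_id = f.id
    GROUP BY f.id
    ORDER BY f.id
""").fetchall()
# → [('a.pdf', 2), ('b.pdf', 1)]
```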
```python
# Create versions of file chunk embeddings and add them to the new collection
for embedding in self.filechunkembedding_set.iterator(chunk_size=50):
    embedding_version = embedding.create_new_version(save=False)
    embedding_version.collection = new_version

    file_version_id = file_versions[embedding.file_id]
    embedding_version.file_id = file_version_id
    embedding_version.save()
```
Oof, this is going to blow up the storage.
```python
def get(self):
    return self.client.vector_stores.retrieve(self.index_id)

def create_remote_index(self, name: str, file_ids: list = None) -> str:
```
This method feels a bit odd to me. All the other methods on the class are scoped to the index passed in at the constructor; this one overrides that index ID with a new one.

What about making it a class method, like this:

```python
@classmethod
def create_remote_index(cls, client, name: str, file_ids: list = None) -> Self:
    file_ids = file_ids or []
    vector_store = client.vector_stores.create(name=name, file_ids=file_ids)
    return cls(client, vector_store.id)
```

Then you could make `index_id` a required param in the `__init__` method.
```python
else:
    self._handle_local_indexing(*args, **kwargs)

def _handle_remote_indexing(
```
Fine to leave these here, but personally I think they belong in the index manager base classes.
Happy to put it there. I was a bit back-and-forth on where to put this logic.
Co-authored-by: Simon Kelly <[email protected]>
Description
Resolves #1648
Note
I still need to write some docs and fix the failing test.
Users can now choose an embedding model and upload files to OCS, after which we'll chunk and index those files. Since we don't yet have the infra in place to query the vectors, the "local" indexes are not yet usable. In the pipeline views where we query all team collections, I added a filter to return only the remote (and thus usable) ones. Maybe we should prevent users from creating local indexes instead?
A big change that was needed was to separate the logic around indexing files for remote versus local indexes, so I introduced the concepts of a local and a remote index. A remote index is one where the service provider is responsible for hosting the index and managing chunking and embeddings. A local index is one where OCS is responsible for chunking and embeddings. OpenAI is probably the only remote index we'll support, but it should be easy enough to add more by creating new subclasses. Dev docs can be found here.
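As a rough illustration of the split (class and method names here are hypothetical, not the PR's actual API):

```python
# Illustrative sketch of the local/remote index split: the remote manager
# delegates chunking/embedding to the provider; a local manager would do
# both itself. Names are hypothetical, not the PR's actual classes.
from abc import ABC, abstractmethod

class BaseIndexManager(ABC):
    @abstractmethod
    def add_files(self, file_ids: list[str]) -> list[str]:
        """Link files to the index; return identifiers of linked files."""

class RemoteIndexManager(BaseIndexManager):
    """The provider (e.g. OpenAI) hosts the index and handles chunking/embedding."""
    def __init__(self, client, index_id: str):
        self.client = client
        self.index_id = index_id

    def add_files(self, file_ids):
        return [self.client.link(self.index_id, fid) for fid in file_ids]

class FakeClient:
    """Stand-in for a vector-store client, just for demonstration."""
    def link(self, index_id, file_id):
        return f"{index_id}:{file_id}"

manager = RemoteIndexManager(FakeClient(), "vs_123")
linked = manager.add_files(["f1", "f2"])
# → ["vs_123:f1", "vs_123:f2"]
```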
Previously, the `index_collection_files` task in the `documents` app had the logic to upload and link files to the index. In this PR, that logic moved into the `Collection` model, which is the one that "knows" how to add files to remote and local indexes. Happy to discuss a more appropriate place for this if needed.

Tip
I would recommend reviewing this commit by commit. There is some morphing happening as I went along, but each commit should be small enough that it's not confusing.
User Impact
Demo
https://www.loom.com/share/6707fae51f6f40b391bb4e6a46ed8038?sid=4ee65561-2d49-4025-a89b-6ceb2004b0f1
Docs and Changelog
Pending