RAG - Local index #1715

Open · wants to merge 50 commits into main

Conversation

@SmittieC (Contributor) commented Jun 5, 2025

Description

Resolves #1648

Note

I still need to write docs and fix the failing test.

Users can now choose an embedding model and upload files to OCS, after which we'll chunk and index those files. Since we don't yet have the infrastructure in place to query the vectors, these "local" indexes aren't usable yet. In the pipeline views where we query all team collections, I added a filter to return only those that are remote, and thus usable. Maybe we should prevent users from creating local indexes instead?

A big change that was needed was to separate the logic for indexing files at remote and local indexes, so I introduced the concept of a local and a remote index. A remote index is one where the service provider is responsible for hosting the index and managing chunking and embeddings. A local index is hosted by OCS, which is then responsible for chunking and embeddings. OpenAI is probably the only remote index we'll support, but it should be easy enough to add more by creating new subclasses. Dev docs can be found here.
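To make the split concrete, here's a rough sketch of the shape it takes; the class and method names are illustrative, not necessarily the exact ones in the PR:

from abc import ABC, abstractmethod


class RemoteIndexManager(ABC):
    """The service provider (e.g. OpenAI) hosts the index and manages chunking/embedding."""

    @abstractmethod
    def add_files(self, file_ids: list[str]) -> None:
        """Attach already-uploaded provider files to the provider-hosted index."""


class LocalIndexManager(ABC):
    """OCS hosts the index and is itself responsible for chunking and embedding."""

    @abstractmethod
    def add_files(self, file_ids: list[str]) -> None:
        """Chunk and embed the files, storing the vectors locally."""


# e.g. an OpenAI-backed subclass of RemoteIndexManager wraps the vector store API;
# supporting another provider means adding another subclass.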

Previously, the index_collection_files task in the documents app contained the logic to upload and link files to the index. In this PR, that logic moved into the Collection model, which is now the thing that "knows" how to add files to remote and local indexes. Happy to discuss a more appropriate place for this if needed.
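Roughly, the dispatch looks like this (a sketch only; add_files_to_index and is_remote_index appear in the PR, but the helper names below are assumptions):

class Collection(BaseTeamModel):
    ...

    def add_files_to_index(self, files):
        if self.is_remote_index:
            # Provider-hosted index: upload the files and let the provider chunk/embed them.
            self.get_remote_index_manager().add_files(files)  # assumed helper
        else:
            # Local (OCS) index: chunk + embed here and persist FileChunkEmbedding rows.
            self.get_local_index_manager().add_files(files)  # assumed helper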

Tip

I would recommend reviewing this commit by commit. There is some morphing as I went along, but each commit should generally be small enough that it's not confusing.

User Impact

  • Users can choose OCS as the index host, in which case OCS will manage chunking/embedding and do RAG itself.
  • Since the search part of the RAG implementation is still missing, an OCS-based collection is not usable yet.

Demo

https://www.loom.com/share/6707fae51f6f40b391bb4e6a46ed8038?sid=4ee65561-2d49-4025-a89b-6ceb2004b0f1

Docs and Changelog

Pending

SmittieC added 30 commits June 5, 2025 16:25
Allow users to select an embedding provider model depending on which provider they selected. Additionally, if an OpenAI provider is selected, users need to be able to choose if they want to use OpenAI or OCS to manage the index
Move the logic concerning linking files to a vector store into the Collection model. The index manager should only be the interface to the index
Test that add_files_to_index works as expected
@codecov-commenter commented Jun 5, 2025

Codecov Report

Attention: Patch coverage is 90.92437% with 54 lines in your changes missing coverage. Please review.

✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ps/service_providers/llm_service/index_managers.py | 81.72% | 17 Missing ⚠️ |
| apps/documents/views.py | 46.15% | 14 Missing ⚠️ |
| apps/documents/forms.py | 78.26% | 5 Missing ⚠️ |
| apps/documents/tasks.py | 71.42% | 4 Missing ⚠️ |
| apps/service_providers/utils.py | 20.00% | 4 Missing ⚠️ |
| apps/files/models.py | 75.00% | 3 Missing ⚠️ |
| apps/service_providers/models.py | 76.92% | 3 Missing ⚠️ |
| apps/service_providers/llm_service/main.py | 83.33% | 2 Missing ⚠️ |
| apps/assistants/sync.py | 90.90% | 1 Missing ⚠️ |
| apps/documents/models.py | 98.61% | 1 Missing ⚠️ |


@SmittieC SmittieC requested review from snopoke and stephherbers June 5, 2025 14:55
@SmittieC SmittieC marked this pull request as ready for review June 5, 2025 14:55
@snopoke (Contributor) left a comment

Just a preliminary review - I'll need to come back to this.

    return embeddings.embed_query(content)

def chunk_content(self, text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
Contributor:

is it necessary to do different splitting based on the llm provider? I would have thought the chunking was independent.

I'm also thinking we should look at https://github.com/microsoft/markitdown, and then we can always use the markdown splitter.

Contributor Author (@SmittieC):

is it necessary to do different splitting based on the llm provider?

You're right, it isn't. I changed it here a bit.
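For reference, provider-independent chunking with this splitter looks roughly like the sketch below (not the exact code in the PR; the import path varies across langchain versions):

from langchain_text_splitters import RecursiveCharacterTextSplitter


def chunk_content(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Token-aware splitting via tiktoken; nothing here depends on the LLM provider.
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    return text_splitter.split_text(text)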

@@ -171,7 +177,7 @@ class FileChunkEmbedding(BaseTeamModel):
     chunk_number = models.PositiveIntegerField()
     text = models.TextField()
     page_number = models.PositiveIntegerField(blank=True)
-    embedding = VectorField(dimensions=1024)
+    embedding = VectorField(dimensions=settings.EMBEDDING_VECTOR_SIZE)
Contributor Author (@SmittieC):

In light of this article, I'm thinking of switching this to HalfVectorField (same for the index).
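If we do switch, the field side of it should be small; a sketch, assuming pgvector's Django integration (which ships HalfVectorField):

from django.conf import settings
from pgvector.django import HalfVectorField

embedding = HalfVectorField(dimensions=settings.EMBEDDING_VECTOR_SIZE)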

    ),
    # The Django ORM doesn't seem to support Half vector field with HNSW, so we need to use raw SQL for the index creation.
    migrations.RunSQL(
        sql="""
Contributor Author @SmittieC commented Jun 11, 2025:

This example didn't quite work; I got a ProgrammingError saying the syntax wasn't correct.

I used

indexes = [
    HnswIndex(
        OpClass(Cast('embedding', HalfVectorField(dimensions=3)), name='halfvec_cosine_ops'),
        name='embedding_index',
        m=16,
        ef_construction=64
    )
]

I could not find a solution for it, but creating the index with raw SQL works.
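For anyone following along, the raw-SQL version looks roughly like this; the table name and dimensions are assumptions, not the PR's actual migration:

migrations.RunSQL(
    sql="""
    CREATE INDEX IF NOT EXISTS embedding_index
    ON documents_filechunkembedding
    USING hnsw ((embedding::halfvec(1024)) halfvec_cosine_ops)
    WITH (m = 16, ef_construction = 64);
    """,
    reverse_sql="DROP INDEX IF EXISTS embedding_index;",
)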

@snopoke (Contributor) left a comment

Mostly small comments, but worth resolving before merge, especially the query-in-a-loop one.

@@ -94,6 +94,11 @@ <h2 class="text-lg font-semibold">Files</h2>
      <div class="text-xs badge badge-ghost">{{ collection_file.file_size_mb }} MB</div>
    </div>
    {% if collection.is_index %}
      {% if not collection.is_remote_index %}
        <div class="flex flex-row items-center gap-2">
          <span class="text-xs text-base-content/70">{{ collection_file.chunk_count }} chunks</span>
Contributor:

This is doing a query inside a loop. It would be much better to annotate the objects with the count in the main query.
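Something along these lines (a sketch; the model and relation names are assumptions based on what this PR touches):

from django.db.models import Count

# Annotate once in the view so the template can read collection_file.chunk_count
# without issuing a query per row.
collection_files = CollectionFile.objects.filter(collection=collection).annotate(
    chunk_count=Count("file__filechunkembedding")
)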

Comment on lines +197 to +204
# Create versions of file chunk embeddings and add them to the new collection
for embedding in self.filechunkembedding_set.iterator(chunk_size=50):
    embedding_version = embedding.create_new_version(save=False)
    embedding_version.collection = new_version

    file_version_id = file_versions[embedding.file_id]
    embedding_version.file_id = file_version_id
    embedding_version.save()
Contributor:

Oof, this is going to blow up the storage.

def get(self):
    return self.client.vector_stores.retrieve(self.index_id)

def create_remote_index(self, name: str, file_ids: list = None) -> str:
Contributor:

This method feels a bit odd to me. All the other methods on the class are scoped to the index passed to the constructor; this one overrides that index ID with a new one.

What about making it a class method like this:

@classmethod
def create_remote_index(cls, client, name: str, file_ids: list = None) -> Self:
    file_ids = file_ids or []
    vector_store = client.vector_stores.create(name=name, file_ids=file_ids)
    return cls(client, vector_store.id)

Then you could make index_id a required param in the __init__ method.
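With the constructor then being just (sketch):

def __init__(self, client, index_id: str):
    self.client = client
    self.index_id = index_id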

    else:
        self._handle_local_indexing(*args, **kwargs)

def _handle_remote_indexing(
Contributor:

Fine to leave these here, but personally I think they belong in the index manager base classes.

Contributor Author (@SmittieC):

Happy to put it there. I was a bit back-and-forth on where to put this logic.

Successfully merging this pull request may close these issues: RAG - Save file embeddings.