RAG - Local index #1715
Conversation
Move the logic concerning linking files to a vector store into the Collection model. The index manager should only be the interface to the index
Test that add_files_to_index works as expected
Codecov Report: all tests successful, no failed tests found.
Just a preliminary review - I'll need to come back to this.
```python
        return embeddings.embed_query(content)

    def chunk_content(self, text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
        text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
```
is it necessary to do different splitting based on the llm provider? I would have thought the chunking was independent.
Also thinking we should look at https://github.com/microsoft/markitdown and then we can always use the markdown splitter
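For what it's worth, the provider-independent part is small — here's a rough sketch of chunking as a plain character window with overlap (a stand-in for `RecursiveCharacterTextSplitter`, not the PR's actual implementation):

```python
# Minimal provider-independent chunking sketch: fixed-size character
# windows that overlap by chunk_overlap characters. Illustrative only;
# the real code uses RecursiveCharacterTextSplitter.
def chunk_content(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is fully covered by the previous chunk.
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = chunk_content("abcdefghij", chunk_size=4, chunk_overlap=2)
# → ["abcd", "cdef", "efgh", "ghij"]
```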
> is it necessary to do different splitting based on the llm provider?
You're right, it isn't. I changed it here a bit.
Refactor form logic around which form fields to enable/disable when editing a collection
Consume the whole iterator in the test
apps/files/models.py
Outdated
```diff
@@ -171,7 +177,7 @@ class FileChunkEmbedding(BaseTeamModel):
     chunk_number = models.PositiveIntegerField()
     text = models.TextField()
     page_number = models.PositiveIntegerField(blank=True)
-    embedding = VectorField(dimensions=1024)
+    embedding = VectorField(dimensions=settings.EMBEDDING_VECTOR_SIZE)
```
In light of this article, I'm thinking of switching this for HalfVectorField
(same for the index).
```python
    ),
    # The Django ORM doesn't seem to support half vector fields with HNSW,
    # so we need to use raw SQL for the index creation.
    migrations.RunSQL(
        sql="""
```
This example didn't quite work: I got a `ProgrammingError` saying the syntax wasn't correct. I used

```python
indexes = [
    HnswIndex(
        OpClass(Cast('embedding', HalfVectorField(dimensions=3)), name='halfvec_cosine_ops'),
        name='embedding_index',
        m=16,
        ef_construction=64,
    )
]
```

I could not find a solution for it, but creating the index using SQL works.
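For reference, here is a sketch of what the raw SQL could look like, following pgvector's documented `halfvec` cast syntax. The table name, index name, and the 1024 dimension are assumptions, not necessarily the PR's actual values:

```python
# Hypothetical raw SQL for an HNSW index over a half-precision vector,
# to be passed to migrations.RunSQL(sql=...). Table/index names and the
# dimension are illustrative assumptions.
CREATE_HALFVEC_INDEX_SQL = """
CREATE INDEX embedding_halfvec_index
ON files_filechunkembedding
USING hnsw ((embedding::halfvec(1024)) halfvec_cosine_ops)
WITH (m = 16, ef_construction = 64);
"""
```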
Mostly small comments, but worth resolving before merge, especially the query-in-a-loop one.
```html
@@ -94,6 +94,11 @@ <h2 class="text-lg font-semibold">Files</h2>
        <div class="text-xs badge badge-ghost">{{ collection_file.file_size_mb }} MB</div>
    </div>
    {% if collection.is_index %}
        {% if not collection.is_remote_index %}
            <div class="flex flex-row items-center gap-2">
                <span class="text-xs text-base-content/70">{{ collection_file.chunk_count }} chunks</span>
```
This is doing a query inside a loop. It would be much better to annotate the objects with the count in the main query.
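To illustrate, here's the same idea in plain SQL with sqlite3: one query with a join and GROUP BY replaces a COUNT query per file. Table and column names here are hypothetical; in Django this would be an `annotate(chunk_count=Count(...))` on the main queryset.

```python
# Sketch of "annotate the count in the main query" instead of querying
# per file inside a template loop. Hypothetical schema for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE collection_file (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE file_chunk_embedding (id INTEGER PRIMARY KEY, file_id INTEGER);
    INSERT INTO collection_file VALUES (1, 'a.pdf'), (2, 'b.pdf');
    INSERT INTO file_chunk_embedding (file_id) VALUES (1), (1), (2);
""")

# One query for all files, rather than one COUNT query per file.
rows = conn.execute("""
    SELECT f.name, COUNT(e.id) AS chunk_count
    FROM collection_file f
    LEFT JOIN file_chunk_embedding e ON e.file_id = f.id
    GROUP BY f.id
    ORDER BY f.id
""").fetchall()
# → [('a.pdf', 2), ('b.pdf', 1)]
```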
```python
# Create versions of file chunk embeddings and add them to the new collection
for embedding in self.filechunkembedding_set.iterator(chunk_size=50):
    embedding_version = embedding.create_new_version(save=False)
    embedding_version.collection = new_version

    file_version_id = file_versions[embedding.file_id]
    embedding_version.file_id = file_version_id
    embedding_version.save()
```
Oof, this is going to blow up the storage.
```python
def get(self):
    return self.client.vector_stores.retrieve(self.index_id)

def create_remote_index(self, name: str, file_ids: list = None) -> str:
```
This method feels a bit odd to me. All the other methods on the class are scoped to the index passed in at the constructor; this one overrides that index ID with a new one.

What about making it a class method, like this:

```python
@classmethod
def create_remote_index(cls, client, name: str, file_ids: list = None) -> Self:
    file_ids = file_ids or []
    vector_store = client.vector_stores.create(name=name, file_ids=file_ids)
    return cls(client, vector_store.id)
```

Then you could make `index_id` a required param in the `__init__` method.
```python
else:
    self._handle_local_indexing(*args, **kwargs)

def _handle_remote_indexing(
```
Fine to leave these here, but personally I think they belong in the index manager base classes.
Happy to put it there. I was a bit back-and-forth on where to put this logic.
Co-authored-by: Simon Kelly <[email protected]>
Description
Resolves #1648
Note
I still need to write some docs and fix the failing test.
Users can now choose an embedding model and upload files to OCS, after which we'll chunk and index those files. Since we don't yet have the infra in place to query the vectors, the "local" indexes are not yet usable. In the pipeline views where we query all team collections, I added a filter to return only the remote (and thus usable) ones. Maybe we should prevent users from creating local indexes instead?
A big change that was needed was to separate the logic around indexing files for remote versus local indexes, so I introduced the concepts of a local and a remote index. A remote index is one where the service provider is responsible for hosting the index and managing chunking and embeddings. A local index is one where OCS is responsible for chunking and embeddings. OpenAI is probably the only remote index we'll support, but it should be easy enough to add more by creating new subclasses. Dev docs can be found here.
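As a rough illustration of the split (class and method names here are hypothetical, not the PR's actual API):

```python
# Illustrative sketch of the local/remote index split: the remote manager
# delegates chunking/embedding to the provider; a local manager would do
# both itself. Names are hypothetical, not the PR's actual classes.
from abc import ABC, abstractmethod

class BaseIndexManager(ABC):
    @abstractmethod
    def add_files(self, file_ids: list[str]) -> list[str]:
        """Link files to the index; return identifiers of linked files."""

class RemoteIndexManager(BaseIndexManager):
    """The provider (e.g. OpenAI) hosts the index and handles chunking/embedding."""
    def __init__(self, client, index_id: str):
        self.client = client
        self.index_id = index_id

    def add_files(self, file_ids):
        return [self.client.link(self.index_id, fid) for fid in file_ids]

class FakeClient:
    """Stand-in for a vector-store client, just for demonstration."""
    def link(self, index_id, file_id):
        return f"{index_id}:{file_id}"

manager = RemoteIndexManager(FakeClient(), "vs_123")
linked = manager.add_files(["f1", "f2"])
# → ["vs_123:f1", "vs_123:f2"]
```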
Previously, the `index_collection_files` task in the `documents` app had the logic to upload and link files to the index. In this PR, that logic moved into the `Collection` model, which is the one that "knows" how to add files to remote and local indexes. Happy to discuss a more appropriate place for this if needed.

Tip
I would recommend reviewing this commit by commit. There is some morphing happening as I went along, but each commit should be small enough that it's not confusing.
User Impact
Demo
https://www.loom.com/share/6707fae51f6f40b391bb4e6a46ed8038?sid=4ee65561-2d49-4025-a89b-6ceb2004b0f1
Docs and Changelog
Pending