New plan: make a rag directory in core

jwm4 · web-flow · commit e96172774e11 · 2024-12-13T12:00:28.000-05:00
Signed-off-by: Bill Murdock <bmurdock@redhat.com>
diff --git a/docs/retrieval-augmented-generation/rag-repo.md b/docs/retrieval-augmented-generation/rag-repo.md
@@ -1,4 +1,4 @@
-# New repository for RAG
+# Code location for RAG
 
 | Created  | Dec 5, 2024 |
 | -------- | -------- |
@@ -17,60 +17,57 @@ Many InstructLab users want to train a model and then use it to RAG.  Often they
 - Building their own RAG is extra work.
 - Users who are not experts on RAG might not build a RAG that provides outstanding results.
 
-There is a very simple RAG capability at <https://github.com/instructlab/rag> .  It is not tightly integrated with InstructLab and it does not use any advanced RAG capabilities.  However, it might have some existing users so we can't just unilaterally delete it or replace it with something radically different.
+There is a very simple RAG capability at <https://github.com/instructlab/rag>.  It is not tightly integrated with InstructLab and it does not use any advanced RAG capabilities.  However, we have a directive from product management not to unilaterally delete it or replace it with something radically different.
 
 ## Goals
 
-Provide a built-in alternative for users who do not want to build their own RAG.  Do not break the existing capability at <https://github.com/instructlab/rag> .
+Provide a built-in alternative for users who do not want to build their own RAG.  Keep the existing capability at <https://github.com/instructlab/rag> available, though possibly in a different location than it is now (e.g., in a new branch of the existing repository).
 
 ## Non-goals
 
 Evaluation of RAG will be addressed in one or more other development documents.  That topic is out of scope for this document.
 
 ## Decision
 
-- There will be a new repository for RAG.
-- It will be located at <https://github.com/instructlab/retrieve>
-- By mid-January, it will be available and working but not integrated with InstructLab.
-- By mid-March, it will be integrated with InstructLab with the new repository being invoked by the core repository and maybe also by the SDG repository.
-- Eventually, it will be integrated with InstructLab with the new repository being invoked only by the core repository.
+- For now, RAG will be located in its own directory, `src/instructlab/rag`, in the core InstructLab repository (<https://github.com/instructlab/instructlab>).
 
 ## How
 
-### By mid-January
+### Phase 1
 
-- There will be a new repository for RAG.
-- It will be located at <https://github.com/instructlab/retrieve>
-- It will *not* be referenced by any other InstructLab repositories at that time.
-- We will provide a sample Python notebook that *at least* shows how to use this repository to do both of the following:
-  - After someone has run `ilab data generate`, the code in the notebook pulls extracted content from documents that were stored during the execution of that command and indexes that content in a vector database.  (Note that this will also require updates to the SDG code base to ensure that the extracted content is stored; currently only the legacy format Docling outputs are stored.)
-  - The code in the notebook takes a session history plus a query and does both retrieval and response generation to produce answers (i.e., run-time RAG).  That code shows how to do this both with an unmodified open-source LLM for response generation *and* with an InstructLab fine-tuned LLM to compare before-and-after fine-tuning behavior.
+- RAG will be located in its own directory, `src/instructlab/rag`, in the core InstructLab repository (<https://github.com/instructlab/instructlab>).
+- This directory will include all of the following:
+  - Loading the content from Docling-format JSON files produced by SDG preprocessing.
+  - Chunking that content to sizes that fit the requirements of the selected embedding model for vector database storage and retrieval.
+  - Storing those chunks with their vector representations in a vector database.
+  - End-to-end runtime RAG.  The initial version of this includes the following:
+    - Taking as input a session history (including a current user query) and providing a response (e.g., something along the lines of the [OpenAI chat completion API](https://platform.openai.com/docs/api-reference/chat/create)).
+    - During that processing, it retrieves relevant search results from the vector database, converts them into a prompt for the response generation model, prompts that model, and returns the model's response.
+- This will be invoked from the existing `ilab` CLI, as described in the [RAG ingestion and chat pipelines](https://github.com/instructlab/dev-docs/pull/161) dev doc.
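
Reviewer note: the Phase 1 flow listed above (load, chunk, store, retrieve, build a prompt) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code planned for `src/instructlab/rag`. Every name here (`load_texts`, `chunk_text`, `embed`, `InMemoryVectorStore`, `build_prompt`) is hypothetical, the `{"texts": [{"text": ...}]}` input layout is an assumed simplification of the Docling-format JSON, the bag-of-words `embed` function stands in for a real embedding model, and the in-memory store stands in for a real vector database. The call to the response generation model is left out.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def load_texts(doc: dict) -> list[str]:
    """Pull text passages from a parsed JSON document.

    The {"texts": [{"text": ...}]} layout is an assumption for this sketch.
    """
    return [item["text"] for item in doc.get("texts", []) if item.get("text")]


def chunk_text(text: str, max_words: int = 50) -> list[str]:
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> dict[str, float]:
    """Toy stand-in for an embedding model: L2-normalized bag of words."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


@dataclass
class InMemoryVectorStore:
    """Hypothetical in-memory replacement for a vector database."""
    rows: list[tuple[str, dict[str, float]]] = field(default_factory=list)

    def add(self, chunks: list[str]) -> None:
        """Store each chunk together with its vector representation."""
        self.rows.extend((chunk, embed(chunk)) for chunk in chunks)

    def search(self, query: str, k: int = 2) -> list[str]:
        """Return the k chunks with the highest cosine similarity to the query."""
        q = embed(query)

        def score(row: tuple[str, dict[str, float]]) -> float:
            return sum(w * row[1].get(tok, 0.0) for tok, w in q.items())

        return [chunk for chunk, _ in sorted(self.rows, key=score, reverse=True)[:k]]


def build_prompt(messages: list[dict], store: InMemoryVectorStore, k: int = 2) -> str:
    """Retrieve context for the latest message and format a prompt string."""
    query = messages[-1]["content"]
    context = "\n".join(f"- {chunk}" for chunk in store.search(query, k))
    history = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    return f"Use the context to answer.\n\nContext:\n{context}\n\n{history}\nassistant:"
```

In this sketch the session history is a list of `{"role": ..., "content": ...}` dictionaries, loosely following the chat completion style the doc mentions, and the last entry is treated as the current user query. A real implementation would size chunks to the selected embedding model's limits rather than use a fixed word count.
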
 
-### By mid-March
+### Future phases
 
-- The capabilities in `retrieve` will be improved with more advanced functionality.  (Details TBD)
-- The InstructLab command-line interface will be updated as follows:
-  - There will be one or more commands that can be configured to index content in a vector database.  Those commands will call out to `retrieve` to do this indexing.
-  - The existing chat capability will be configurable to either use RAG (which would involve calling out to `retrieve`) or not use RAG (which would provide the current behavior).
-- Both indexing and run-time RAG will be implemented by calls that go directly from the core repo (`instructlab/instructlab`) to  `retrieve` *unless* the work to migrate SDG preprocessing to the core repo is not complete by then (in which case there may need to be calls from SDG to `retrieve` until the migration completes).
-
-### Beyond mid-March
-
-- The capabilities in `retrieve` will continue to be improved with even more advanced functionality.  (Details TBD)
-- Both indexing and run-time RAG will be implemented by calls that go directly from the core repo to `retrieve`.  The SDG repository will not directly invoke the retrieve repository since the document processing will all be done in the core repository and the outputs of that document processing will be consumed by both `retrieve` and SDG.
+- In the near future, RAG might be moved to the existing <https://github.com/instructlab/rag> repository.
+  - If so, something will be done with the existing code in <https://github.com/instructlab/rag>, e.g., moving it to a branch of that repository or moving it to a different repository.
+- Alternatively, some or all of it might move to a new repository.
+  - For example, maybe the indexing and retrieval portions move to a separate retrieval repository while the rest of end-to-end runtime RAG might move somewhere else.
+- If/when we move ahead with any of these options, *we will open a new ADR for that decision*.
+- Also, the RAG capabilities will continue to be improved and extended with more functionality.
 
 ## Alternatives
 
+- Put the indexing and run-time RAG code in a new repository.
+  - Pro: Having a dedicated repository gives the RAG team the most freedom and flexibility to make technical decisions that work for that team.
+  - Pro: Starting with a new repository provides a blank slate that can be set up in whatever way makes the most sense for that functionality.
+  - Pro: Having the capability in one repository makes it easier for consumers such as RamaLama to reuse it for their purposes too.
+  - Con: Creating and configuring a new repository is some work.  (This is a fairly small con, but a real one.)
+  - Con: Integrating a new repository into the continuous integration and delivery capabilities for both upstream InstructLab and downstream consumers is a *lot* of work.  This is a much bigger con.
+  - Con: All that extra work would almost certainly result in slower time to market.  This risks missing some market opportunities.
 - Put the indexing code in <https://github.com/instructlab/sdg> (SDG) and the run-time RAG code in <https://github.com/instructlab/instructlab> (core)
   - Pro: This has the advantage of not adding any new dependencies.
   - Pro: The document processing is already in SDG and chat functionality is already in core so this would require the fewest code changes.
   - Con: Splitting the RAG functionality across multiple repositories makes it more complicated to reuse in other applications outside of InstructLab.
   - Con: Many things we will want to do to add advanced functionality to make RAG more effective will require changes to both indexing and run-time RAG.  If those components are split across multiple repositories, that will make delivering such changes more complicated.
-- Put the indexing and run-time RAG code in <https://github.com/instructlab/instructlab> (core)
-  - Pro: This has the advantage of not adding any new dependencies.
-  - Pro: However, since the existing document processing is in SDG, the flow for indexing for RAG would be a bit complicated (i.e., it starts with a CLI call handled by the core repo then goes to SDG for some of the document processing and then back to the core for vector database indexing).   That drawback will be eliminated if/when the document processing moves into the core repository.
-  - Con: Putting the RAG functionality in the core repository requires any application that wants to use this functionality to bring in the entire core which then brings in all of the libraries it depend on, so this becomes an enormous dependency.  This would discourage reuse in other applications.
-  - Con: Building a great RAG capability is a long-term grand challenge that will require a lot of dedicated investment.  That makes it a poor fit for a repository that also has a lot of existing responsibilities.
 - Start by putting the code into existing InstructLab repositories (either of the above options) and then split it off into its own repository later.
   - Pro: Gets us integrated into InstructLab sooner.
   - Con: Adds extra work to the second phase where we have to split it off into its own repository.
@@ -99,9 +96,11 @@ Evaluation of RAG will be addressed in one or more other development documents.
 
 ## Risks
 
-- Adding a new repo and dependencies adds more continuous integration complexity and work.
-- That extra work is why we're not planning calls from InstructLab core to the RAG capability until the mid-March time frame.  This risks missing some market opportunities.
+- Putting the RAG functionality in the core repository requires any application that wants to use this functionality to bring in the entire core, which in turn brings in all of the libraries it depends on, so this becomes an enormous dependency.  This discourages reuse in other applications.  It *encourages* either of the following behaviors that would be unfortunate:
+  - Other applications pull directly from <https://github.com/redhat-et/PaRAGon> and in doing so duplicate the ongoing effort to harden/productize that code base.
+  - Other applications may implement their own RAG solutions or pull from some other upstream unrelated to ours.
 - As noted earlier, putting the capability inside <https://github.com/instructlab/> signals that this is a component of InstructLab and not a generally useful feature.  That creates a risk that the work could miss out on additional opportunities for impact.  We hope to mitigate that risk by spinning it off to its own open source project when it is mature enough, but there is a risk that we will get distracted by other things and never get around to this.
+- The flow for document processing for InstructLab winds up being quite complicated in this proposal.  Since the existing document processing is in SDG, indexing for RAG starts with a CLI call handled by the core repo, then goes to SDG for some of the document processing, and then returns to the core `/data` directory, which calls out to the `core/rag` directory for chunking and vector database indexing.  Having the document processing move from core to SDG and back to core and forward to RAG makes that capability more difficult to understand and maintain.  This complexity will be partially mitigated when the preprocessing code moves from SDG to core.  It will be further mitigated by having a clear, well-documented contract between core and the RAG repository indicating the responsibilities of each.
 
 ## References