-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new Databricks Vector Search langchain native tool VectorSearchRetrieverTool #24
Conversation
text_column: Optional[str] = Field(None, description="If using a direct-access index or delta-sync index, specify the text column.") | ||
embedding: Optional[Embeddings] = Field(None, description="Embedding model for self-managed embeddings.") | ||
# TODO: Confirm if we can add this endpoint field | ||
endpoint: Optional[str] = Field(None, description="Endpoint for DatabricksVectorSearch.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This field was added because of this restriction in databricks-langchain. I felt that if we threw this error without giving the ability for the user to rectify it, it would be a poor user experience. Alternatively maybe we pin databricks-vectorsearch to be >=0.35.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's valid to require databricks-vectorsearch >= 0.35 especially because this is new - that might be the better considering we don't need endpoint
for any other reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah reasonable to require new versions of other clients!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out we already mark the "databricks-vectorsearch>=0.40" as a dependency here, so I'll just remove this argument.
text_column: Optional[str] = Field(None, description="If using a direct-access index or delta-sync index, specify the text column.") | ||
embedding: Optional[Embeddings] = Field(None, description="Embedding model for self-managed embeddings.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two fields are required for direct-access indexes or delta-sync indexes with self-managed embeddings. Should we support these additional fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like if we support it for DatabricksVectorSearch it makes sense to support it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, seems reasonable to support these, though I'd say it's worth asking vector search folks how commonly direct access indexes are used, if it's infrequent we could drop this to start with to simplify the API/testing surface
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to block this PR on that though, I figure we'll need this eventually anyways, would just be good for us to know
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def get_tool_description(): | ||
default_tool_description = "A vector search-based retrieval tool for querying indexed embeddings." | ||
index_details = IndexDetails(dbvs.index) | ||
if index_details.is_delta_sync_index(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
direct access indexes don't have an associated source table so we'll just use the default tool description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious what the existing langchain-databricks
DatabricksVectorSearch.as_retriever(...).as_tool(...) ends up generating as the tool description
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One way to tell would be to use it as a tool with payload logging enabled & see what the tools
argument to the LLM API in model serving looks like
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This generally looks reasonable, just curious if we can keep it in sync with the existing behavior/default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lol yep makes sense, the updated version in this PR is definitely better
index_name: str = Field(..., description="The name of the index to use, format: 'catalog.schema.index'.") | ||
num_results: int = Field(10, description="The number of results to return.") | ||
columns: Optional[List[str]] = Field(None, description="Columns to return when doing the search.") | ||
filters: Optional[Dict[str, Any]] = Field(None, description="Filters to apply to the search.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QQ, does this get sent to the LLM as the parameter description? If so I wonder if it's worth including examples like the ones in https://docs.databricks.com/api/workspace/vectorsearchindexes/queryindex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh nvm, this is in the init, not in the tool call
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But seems like there is a way we can specify the description of the params for the LLM too: https://chatgpt.com/share/6764d76f-69a0-8009-8a8f-f58977753057
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See also https://python.langchain.com/docs/how_to/custom_tools/#subclass-basetool (we can use args_schema
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to include VectorSearchRetrieverToolInput as an args_schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks good! Just had some small comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with one comment - we should update the tests to reflect the most recent changes. 🙇♀️
tool_description: Optional[str], | ||
embedding: Optional[Any], | ||
text_column: Optional[str], | ||
endpoint: Optional[str], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't need the endpoint
argument anymore right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch updated!
from typing import Generator | ||
from unittest import mock | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we update these tests to also assert that the tool description + args description are properly set? Lmk if it's already done and I missed it (just looking at changes since my last review rn)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a new test test_vector_search_retriever_tool_description_generation
to explicitly test these changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will stamp after my testing comment is addressed ,thanks Leon!
What does this PR do?
This PR introduces the class
VectorSearchRetrieverTool
into thedatabricks-langchain
package which allows the user to instantiate a langchain native tool that calls Databricks Vector Search when invoked. Under the hood, the core logic is fromDatabricksVectorSearch
in thelangchain-databricks
package.How was it tested?
test_vector_search_retriever_tool.py
which verifies the new tool object can be instantiated correctly and can be invoked by an arbitrary langchain llm