[BUG]: Vector Search using API #2937
Comments
Thank you for the detailed write-up on this issue! I have just opened the PR to fix the similarity score bug you pointed out: when there is an exact match, the similarity should not be 0 but should actually be 1. Regarding the metadata being embedded in each text chunk, this is intentional. We do this because it helps when users request certain information about a document by its name, and allows them to refer to it by name while also asking more about the document. This improves the RAG results in our testing. It also allows them to ask things like "when was that information collected", and the chunk has enough info to respond to the user that way.
Thanks for adding the fix, Sean, and for explaining the expected results! Could you point me to where in the source code the metadata is added to the content string when embedding? I'd like to modify the behavior in my local installation so that only the content string is embedded when adding a document to a workspace.
@TheNeeloy You will see a line like this in each vector db provider (LanceDB example below)
Which comes from this class method
You can modify the method to return what you would like, or simply return
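Since the actual implementation is JavaScript (in server/models/documents.js and the vector db providers), here is only a conceptual Python sketch of the idea being discussed: a metadata header is prepended to each chunk's text before embedding, and stripping it leaves just the raw content. The `<document_metadata>` format is inferred from the text field shown later in this issue; the function names are hypothetical.

```python
# Hypothetical sketch, NOT the actual AnythingLLM code (which is JavaScript).
# Mirrors the <document_metadata> header visible in the vector-search results below.

def build_chunk_text(content: str, metadata: dict) -> str:
    """Prepend a metadata header to the chunk content before embedding."""
    header_lines = "\n".join(f"{k}: {v}" for k, v in metadata.items())
    return f"<document_metadata>\n{header_lines}\n</document_metadata>\n\n{content}"

def strip_metadata_header(chunk_text: str) -> str:
    """Return only the raw content, dropping the metadata header if present."""
    marker = "</document_metadata>"
    if marker in chunk_text:
        return chunk_text.split(marker, 1)[1].lstrip()
    return chunk_text

chunk = build_chunk_text("Arr matey!", {"sourceDocument": "pirate.txt"})
print(strip_metadata_header(chunk))  # -> Arr matey!
```

A local patch along these lines (applied to the real JavaScript method Tim points at) would embed only the content string.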
Thanks Tim, that's really helpful!
How are you running AnythingLLM?
Docker (local)
What happened?
Hi, thanks for releasing such an awesome project; it really is helping with swapping out models and providers quickly during LLM experiments. I have a question about the intended effect of the api/v1/workspace/{slug}/vector-search API endpoint, based off of these issues and PRs: #2811, #2812, #2815.

TLDR:

When testing the new vector-search API endpoint, I found that I needed to add metadata to my query to retrieve the vector with distance 0. However, I thought that the vector search was originally based purely on the page content, excluding metadata. Below, I wrote down my environment setup, testing process, expectations, results, and questions. Thanks for your time!
Workspace and System Setup:
My AnythingLLM instance is hosted locally via Docker. It is using the default, out-of-the-box AnythingLLM embedding provider and LanceDB vector database settings. I set up a workspace using Ollama as the provider, running a llama3.2:1b LLM.
This is the response from /api/v1/workspace/{slug} (my workspace slug is testing_api):

This is the response from /api/v1/system:

Goal:
I've written a simple Python API to interface with the AnythingLLM instance, and I want to test out its functionality before using the API in my experiments. I've implemented functions for creating a folder, uploading a raw-text document, moving a file into a folder, adding a file to a workspace, and performing vector search within a workspace given a query. Every function works as expected, except for the vector_search function.

Below is my Python API and testing script (it assumes the AnythingLLM API key is set as the environment variable ANYTHINGLLM_API_KEY):

Expectation:
My understanding (and correct me if I am wrong) is that:
1. The api/v1/document/raw-text endpoint will add a JSON document to the system with fields like its id, title, description, pageContent, etc. I see that the pageContent is taken directly from the textContent field of the request.
2. When using the api/v1/workspace/{slug}/update-embeddings endpoint to embed and add the document to a workspace, ONLY the pageContent field will be split into chunks, passed through the embedder, and stored into LanceDB.
3. When querying the api/v1/workspace/{slug}/vector-search endpoint, the query string will similarly be passed through the embedder. The endpoint returns the chunks that are most similar to the given query.

Based on my assumptions, I expect that, if I query the workspace's vector database using the same exact textContent I used to add a document to the server, the query should return a vector with a distance of 0 and a similarity of 1. I tested with textContent of fewer than 5 tokens, so the text is not chunked separately, and the entire text should be returned with a similarity of 1.

You can see the tests above at the bottom of the provided script.
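To make the expectation concrete, here is a toy sketch (my own illustration, using a trivial bag-of-words "embedder", not AnythingLLM's real one): any deterministic embedder maps the same string to the same vector, so the cosine distance for an exact-match query is 0; but if a metadata header is prepended before embedding, the raw-text query no longer matches exactly.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; stands in for the real embedder,
    # which is deterministic in the same way: same input -> same vector.
    return Counter(text.lower().split())

def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm

doc = "I am a pirate"
print(cosine_distance(toy_embed(doc), toy_embed(doc)))  # 0.0 -> exact match
with_meta = "<document_metadata> sourceDocument: pirate.txt </document_metadata> " + doc
print(cosine_distance(toy_embed(with_meta), toy_embed(doc)))  # > 0 -> metadata shifts the vector
```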
Reality:
The response of the first call to my vector_search function returns vectors with nonzero distance and low score.

Below is the resulting response (it has two search results because I ran the script multiple times, so the workspace has multiple files in it with the same name and contents):
Debugging:
I was weirded out that the scores of the two documents returned in the response were not the same, even though their content is exactly the same. The only difference was their metadata and text fields.

So, I tried running the vector search again, but using the exact string from the text field of one of the documents in the response (i.e. test_query = '<document_metadata>\nsourceDocument: pirate.txt ......).

And this time, the query returned a response with a vector of distance 0 and similarity 0:
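For reference, assuming the vector DB reports cosine distance, a natural conversion is similarity = 1 - distance, so a distance of 0 should map to a similarity of 1, which matches the fix described in the first comment above. A minimal sketch of that conversion (my own, not the repo's JavaScript):

```python
def distance_to_similarity(distance: float) -> float:
    """Map a cosine distance in [0, 2] to a similarity score.

    Assumes the vector DB reports cosine *distance*; an exact match
    (distance 0) should therefore score a similarity of 1, not 0.
    """
    return 1.0 - distance

print(distance_to_similarity(0.0))   # 1.0 -> exact match
print(distance_to_similarity(0.25))  # 0.75
```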
I'm not fluent in JavaScript, but I took a look at the following source files for debugging, and I thought the code should return the closest vector based on the text alone, excluding the metadata:
anything-llm/collector/processRawText/index.js
Line 37 in c6547ec
anything-llm/server/endpoints/api/workspace/index.js
Line 511 in c6547ec
anything-llm/server/models/documents.js
Line 82 in c6547ec
anything-llm/server/utils/vectorDbProviders/lance/index.js
Line 279 in c6547ec
anything-llm/server/endpoints/api/workspace/index.js
Line 958 in c6547ec
anything-llm/server/utils/vectorDbProviders/lance/index.js
Line 157 in c6547ec
anything-llm/server/utils/vectorDbProviders/lance/index.js
Line 29 in c6547ec
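For anyone who wants to poke at the endpoint directly, here is a hedged sketch of a minimal vector-search client (my own illustration; I am assuming bearer-token auth and a JSON body with a "query" field, which you should verify against the API docs bundled with your instance). To keep it testable without a running server, the function only builds the request rather than sending it:

```python
import json

def build_vector_search_request(base_url: str, slug: str, query: str, api_key: str):
    """Construct (url, headers, body) for the vector-search endpoint.

    Assumptions (verify against your instance's API docs): bearer-token
    auth and a JSON body shaped like {"query": ...}.
    """
    url = f"{base_url}/api/v1/workspace/{slug}/vector-search"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"query": query})
    return url, headers, body

url, headers, body = build_vector_search_request(
    "http://localhost:3001", "testing_api", "I am a pirate", "sk-example"
)
print(url)  # http://localhost:3001/api/v1/workspace/testing_api/vector-search
```

Sending it is then one `requests.post(url, headers=headers, data=body)` call away.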
Questions:
1. Could there be a different endpoint added where the vector search is based purely on the text content from the raw-text endpoint?
2. Why is the similarity 0 when the distance is 0 in the example above?

Are there known steps to reproduce?
No response