
[BUG]: Vector Search using API #2937

Closed
TheNeeloy opened this issue Jan 4, 2025 · 4 comments · Fixed by #2986
Labels
possible bug Bug was reported but is not confirmed or is unable to be replicated.

Comments

@TheNeeloy

How are you running AnythingLLM?

Docker (local)

What happened?

Hi, thanks for releasing such an awesome project; it's really helping me swap out models and providers quickly during LLM experiments. I have a question about the intended behavior of the api/v1/workspace/{slug}/vector-search API endpoint, based on these issues and PRs: #2811, #2812, #2815

TLDR:

When testing the new vector-search API endpoint, I found that I needed to add metadata to my query to retrieve the vector with distance 0. However, I thought the vector search was based purely on the page content, excluding metadata. Below I describe my environment setup, testing process, expectations, results, and questions. Thanks for your time!

Workspace and System Setup:

My AnythingLLM instance is hosted locally via Docker, using the default out-of-the-box AnythingLLM embedding provider and LanceDB vector database settings. I set up a workspace using Ollama as the provider, running a llama3.2:1b LLM.

This is the response from /api/v1/workspace/{slug} (my workspace slug is testing_api):

{
  "workspace": [
    {
      "id": 7,
      "name": "testing_api",
      "slug": "testing_api",
      "vectorTag": null,
      "createdAt": "2025-01-04T01:25:26.088Z",
      "openAiTemp": 0.7,
      "openAiHistory": 20,
      "lastUpdatedAt": "2025-01-04T01:25:26.088Z",
      "openAiPrompt": "Given the following conversation, relevant context, and a follow up question, reply with an answer to the current question the user is asking. Return only your response to the question given the above information following the users instructions as needed.",
      "similarityThreshold": 0.25,
      "chatProvider": "ollama",
      "chatModel": "llama3.2:1b",
      "topN": 4,
      "chatMode": "chat",
      "pfpFilename": null,
      "agentProvider": null,
      "agentModel": null,
      "queryRefusalResponse": "There is no relevant information in this workspace to answer your query.",
      "documents": [
        {
          "id": 9,
          "docId": "efd8d182-048e-41d4-aa61-3dcb0c98fff2",
          "filename": "raw-pirate-cab0d2bf-4cf4-4020-a5ff-233e02c5067f.json",
          "docpath": "testing_temp_folder/raw-pirate-cab0d2bf-4cf4-4020-a5ff-233e02c5067f.json",
          "workspaceId": 7,
          "metadata": "{\"id\":\"cab0d2bf-4cf4-4020-a5ff-233e02c5067f\",\"url\":\"file://pirate.txt\",\"title\":\"pirate.txt\",\"docAuthor\":\"\",\"description\":\"what a pirate says\",\"docSource\":\"\",\"chunkSource\":\"\",\"published\":\"1/4/2025, 6:17:35 AM\",\"wordCount\":3,\"token_count_estimate\":4}",
          "pinned": false,
          "watched": false,
          "createdAt": "2025-01-04T06:17:36.080Z",
          "lastUpdatedAt": "2025-01-04T06:17:36.080Z"
        },
        {
          "id": 10,
          "docId": "cddb338d-d6ba-4636-a229-b5be995cef93",
          "filename": "raw-long_file-efe0f77c-db8f-4b79-9531-c797621f251e.json",
          "docpath": "testing_temp_folder/raw-long_file-efe0f77c-db8f-4b79-9531-c797621f251e.json",
          "workspaceId": 7,
          "metadata": "{\"id\":\"efe0f77c-db8f-4b79-9531-c797621f251e\",\"url\":\"file://long_file.txt\",\"title\":\"long_file.txt\",\"docAuthor\":\"\",\"description\":\"bunch of as\",\"docSource\":\"\",\"chunkSource\":\"\",\"published\":\"1/4/2025, 6:17:36 AM\",\"wordCount\":2001,\"token_count_estimate\":2001}",
          "pinned": false,
          "watched": false,
          "createdAt": "2025-01-04T06:17:36.876Z",
          "lastUpdatedAt": "2025-01-04T06:17:36.876Z"
        },
        {
          "id": 11,
          "docId": "ea926340-3c71-46df-8e1e-72e1629b5de0",
          "filename": "raw-pirate-a17387aa-8307-4e92-b07a-8dd92d26b68e.json",
          "docpath": "testing_temp_folder/raw-pirate-a17387aa-8307-4e92-b07a-8dd92d26b68e.json",
          "workspaceId": 7,
          "metadata": "{\"id\":\"a17387aa-8307-4e92-b07a-8dd92d26b68e\",\"url\":\"file://pirate.txt\",\"title\":\"pirate.txt\",\"docAuthor\":\"\",\"description\":\"what a pirate says\",\"docSource\":\"\",\"chunkSource\":\"\",\"published\":\"1/4/2025, 6:21:05 AM\",\"wordCount\":3,\"token_count_estimate\":4}",
          "pinned": false,
          "watched": false,
          "createdAt": "2025-01-04T06:21:06.143Z",
          "lastUpdatedAt": "2025-01-04T06:21:06.143Z"
        },
        {
          "id": 12,
          "docId": "89e8bb51-8ff6-433f-ac56-6d12d5f76158",
          "filename": "raw-long_file-f83a5abe-e69c-40c8-907f-2a2fece3e3b1.json",
          "docpath": "testing_temp_folder/raw-long_file-f83a5abe-e69c-40c8-907f-2a2fece3e3b1.json",
          "workspaceId": 7,
          "metadata": "{\"id\":\"f83a5abe-e69c-40c8-907f-2a2fece3e3b1\",\"url\":\"file://long_file.txt\",\"title\":\"long_file.txt\",\"docAuthor\":\"\",\"description\":\"bunch of as\",\"docSource\":\"\",\"chunkSource\":\"\",\"published\":\"1/4/2025, 6:21:06 AM\",\"wordCount\":2001,\"token_count_estimate\":2001}",
          "pinned": false,
          "watched": false,
          "createdAt": "2025-01-04T06:21:06.927Z",
          "lastUpdatedAt": "2025-01-04T06:21:06.927Z"
        },
        {
          "id": 13,
          "docId": "e11d5f8f-1029-4881-a8cf-62ca09adc97f",
          "filename": "raw-pirate-9c2c0cdf-246e-485e-a1fa-15bca9782b54.json",
          "docpath": "testing_temp_folder/raw-pirate-9c2c0cdf-246e-485e-a1fa-15bca9782b54.json",
          "workspaceId": 7,
          "metadata": "{\"id\":\"9c2c0cdf-246e-485e-a1fa-15bca9782b54\",\"url\":\"file://pirate.txt\",\"title\":\"pirate.txt\",\"docAuthor\":\"\",\"description\":\"what a pirate says\",\"docSource\":\"\",\"chunkSource\":\"\",\"published\":\"1/4/2025, 6:24:00 AM\",\"wordCount\":3,\"token_count_estimate\":4}",
          "pinned": false,
          "watched": false,
          "createdAt": "2025-01-04T06:24:00.807Z",
          "lastUpdatedAt": "2025-01-04T06:24:00.807Z"
        },
        {
          "id": 14,
          "docId": "d076c9bf-84da-4894-b8c4-d734c375ec8b",
          "filename": "raw-long_file-703df755-c8b4-4449-9918-afe877621d4f.json",
          "docpath": "testing_temp_folder/raw-long_file-703df755-c8b4-4449-9918-afe877621d4f.json",
          "workspaceId": 7,
          "metadata": "{\"id\":\"703df755-c8b4-4449-9918-afe877621d4f\",\"url\":\"file://long_file.txt\",\"title\":\"long_file.txt\",\"docAuthor\":\"\",\"description\":\"bunch of as\",\"docSource\":\"\",\"chunkSource\":\"\",\"published\":\"1/4/2025, 6:24:00 AM\",\"wordCount\":2001,\"token_count_estimate\":2001}",
          "pinned": false,
          "watched": false,
          "createdAt": "2025-01-04T06:24:01.638Z",
          "lastUpdatedAt": "2025-01-04T06:24:01.638Z"
        }
      ],
      "threads": [
        {
          "user_id": null,
          "slug": "19f23a3c-9ecb-4b34-9750-d974829d65f6"
        },
        {
          "user_id": null,
          "slug": "5be87d7a-7abc-436c-8860-77d2e55d6718"
        }
      ]
    }
  ]
}
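One gotcha when consuming this response: each document's metadata field is itself a JSON-encoded string, so it needs a second json.loads pass. A minimal example, using a trimmed copy of the first document entry above:

```python
import json

# A trimmed copy of the first document entry from the response above.
document = {
    "filename": "raw-pirate-cab0d2bf-4cf4-4020-a5ff-233e02c5067f.json",
    "metadata": "{\"title\":\"pirate.txt\",\"description\":\"what a pirate says\",\"wordCount\":3}",
}

# The metadata field is double-encoded JSON, so parse it separately.
metadata = json.loads(document["metadata"])
print(metadata["title"])  # pirate.txt
```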

This is the response from /api/v1/system:

{
  "settings": {
    "RequiresAuth": false,
    "AuthToken": false,
    "JWTSecret": false,
    "StorageDir": "/app/server/storage",
    "MultiUserMode": false,
    "DisableTelemetry": "true",
    "EmbeddingEngine": "native",
    "HasExistingEmbeddings": true,
    "HasCachedEmbeddings": true,
    "VoyageAiApiKey": false,
    "GenericOpenAiEmbeddingApiKey": false,
    "GenericOpenAiEmbeddingMaxConcurrentChunks": 500,
    "GeminiEmbeddingApiKey": false,
    "VectorDB": "lancedb",
    "PineConeKey": false,
    "ChromaApiKey": false,
    "MilvusPassword": false,
    "LLMProvider": "ollama",
    "OpenAiKey": false,
    "OpenAiModelPref": "gpt-4o",
    "AzureOpenAiKey": false,
    "AzureOpenAiTokenLimit": 4096,
    "AnthropicApiKey": false,
    "AnthropicModelPref": "claude-2",
    "GeminiLLMApiKey": true,
    "GeminiLLMModelPref": "gemini-pro",
    "GeminiSafetySetting": "BLOCK_MEDIUM_AND_ABOVE",
    "LocalAiApiKey": false,
    "OllamaLLMBasePath": "http://172.17.0.1:11434",
    "OllamaLLMModelPref": "llama3.2:1b",
    "OllamaLLMTokenLimit": "4096",
    "OllamaLLMKeepAliveSeconds": "300",
    "OllamaLLMPerformanceMode": "base",
    "NovitaLLMApiKey": false,
    "TogetherAiApiKey": false,
    "FireworksAiLLMApiKey": false,
    "PerplexityApiKey": true,
    "OpenRouterApiKey": false,
    "MistralApiKey": false,
    "GroqApiKey": false,
    "HuggingFaceLLMAccessToken": false,
    "TextGenWebUIAPIKey": false,
    "LiteLLMApiKey": false,
    "GenericOpenAiKey": false,
    "AwsBedrockLLMConnectionMethod": "iam",
    "AwsBedrockLLMAccessKeyId": false,
    "AwsBedrockLLMAccessKey": false,
    "AwsBedrockLLMSessionToken": false,
    "CohereApiKey": false,
    "DeepSeekApiKey": false,
    "ApipieLLMApiKey": false,
    "XAIApiKey": false,
    "WhisperProvider": "local",
    "WhisperModelPref": "Xenova/whisper-small",
    "TextToSpeechProvider": "native",
    "TTSOpenAIKey": false,
    "TTSElevenLabsKey": false,
    "TTSPiperTTSVoiceModel": "en_US-hfc_female-medium",
    "TTSOpenAICompatibleKey": false,
    "AgentGoogleSearchEngineId": null,
    "AgentGoogleSearchEngineKey": null,
    "AgentSearchApiKey": null,
    "AgentSearchApiEngine": "google",
    "AgentSerperApiKey": null,
    "AgentBingSearchApiKey": null,
    "AgentSerplyApiKey": null,
    "AgentSearXNGApiUrl": null,
    "AgentTavilyApiKey": null,
    "DisableViewChatHistory": false
  }
}

Goal:

I've written a simple Python client to interface with the AnythingLLM instance, and I want to test its functionality before using it in my experiments. I've implemented functions for creating a folder, uploading a raw text document, moving a file into a folder, adding a file to a workspace, and performing vector search within a workspace given a query. Every function works as expected except for the vector_search function.

Below are my Python client and testing script (they assume the AnythingLLM API key is set in the environment variable ANYTHINGLLM_API_KEY):

# Standard
import os
import json
from pprint import pprint

# 3rd Party
from requests import get, post


def create_folder(ipv4, port, api_key, folder_name, verbose=False):
    """
    Create empty folder in server's root storage directory.

    Returns <Failure State>.
    Failure State is True if failed.
    """

    url = f'http://{ipv4}:{port}/api/v1/document/create-folder'
    headers = {
        'accept': 'application/json',
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json'
    }
    data = {
        'name': folder_name
    }

    try:
        response = post(url, headers=headers, json=data, stream=False)
    except Exception as exception:
        print(exception)
        return True
    
    response_dict = json.loads(response.text)
    if verbose:
        pprint(response_dict)
    return not response_dict['success']


def upload_raw_text(ipv4, port, api_key, content, title, description="", verbose=False):
    """
    Uploads document with raw text to database.
    Title is required, but description is not.
    Other fields in metadata (e.g. url, published) will
    automatically be filled in.

    Returns <Failure State, Saved File Path>.
    Failure State is True if failed and should not use Saved File Path.
    Saved File Path can be then used to add document to a workspace.
    """

    url = f'http://{ipv4}:{port}/api/v1/document/raw-text'
    headers = {
        'accept': 'application/json',
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json'
    }
    data = {
        'textContent': content,
        'metadata': {
            'title': title,
            'description': description,
            'docAuthor': '',
            'docSource': '',
            'chunkSource': ''
        }
    }

    try:
        response = post(url, headers=headers, json=data, stream=False)
    except Exception as exception:
        print(exception)
        return True, None
    
    response_dict = json.loads(response.text)
    file_path = response_dict['documents'][0]['location']
    if verbose:
        pprint(response_dict)
    return not response_dict['success'], file_path


def move_file(ipv4, port, api_key, from_file_path, to_folder, verbose=False):
    """
    Move file from one folder to another.

    Returns <Failure State, New Saved File Path>.
    Failure State is True if failed and should not use New Saved File Path.
    New Saved File Path can be then used to add document to a workspace.
    """

    file_name = from_file_path.split('/')[-1]
    to_file_path = '/'.join([to_folder, file_name])

    url = f'http://{ipv4}:{port}/api/v1/document/move-files'
    headers = {
        'accept': 'application/json',
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json'
    }
    data = {
        'files': [{
            'from': from_file_path,
            'to': to_file_path
        }]
    }

    try:
        response = post(url, headers=headers, json=data, stream=False)
    except Exception as exception:
        print(exception)
        return True, None
    
    response_dict = json.loads(response.text)
    if verbose:
        pprint(response_dict)
    return not response_dict['success'], to_file_path


def add_file_to_workspace(ipv4, port, slug, api_key, file_path, verbose=False):
    """
    Adds file from server to specific workspace by slug.
    Will embed file if not already cached.

    Returns <Failure State>.
    Failure State is True if failed.
    """

    url = f'http://{ipv4}:{port}/api/v1/workspace/{slug}/update-embeddings'
    headers = {
        'accept': 'application/json',
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json'
    }
    data = {
        'adds': [file_path]
    }

    try:
        response = post(url, headers=headers, json=data, stream=False)
    except Exception as exception:
        print(exception)
        return True
    
    # Check for an error body before attempting to parse it as JSON.
    if response.text == 'Internal Server Error':
        return True

    response_dict = json.loads(response.text)
    if verbose:
        pprint(response_dict)
    return False


def vector_search(ipv4, port, slug, api_key, query, top_n, score_threshold, verbose=False):
    """
    Searches for closest vectors to query.

    Returns <Failure State, Response>.
    Failure State is True if failed and should not access Response.
    """

    url = f'http://{ipv4}:{port}/api/v1/workspace/{slug}/vector-search'
    headers = {
        'accept': 'application/json',
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json'
    }
    data = {
        'query': query,
        'topN': top_n,
        'scoreThreshold': score_threshold
    }

    try:
        response = post(url, headers=headers, json=data, stream=False)
    except Exception as exception:
        print(exception)
        return True, None
    
    response_dict = json.loads(response.text)
    if verbose:
        pprint(response_dict)
    return False, response_dict


if __name__ == '__main__':
    """
    Testing API functions.
    """

    # Create temp folder for testing
    fail = create_folder('localhost', '3001', os.environ.get('ANYTHINGLLM_API_KEY'), 'testing_temp_folder', verbose=True)

    # Add raw text to server
    fail, file_path = upload_raw_text('localhost', '3001', os.environ.get('ANYTHINGLLM_API_KEY'), 'Yo ho ho!', 'pirate', 'what a pirate says', verbose=True)
    
    # Move file into temp folder
    fail, file_path = move_file('localhost', '3001', os.environ.get('ANYTHINGLLM_API_KEY'), file_path, 'testing_temp_folder', verbose=True)

    # Embed file and add to workspace
    fail = add_file_to_workspace('localhost', '3001', 'testing_api', os.environ.get('ANYTHINGLLM_API_KEY'), file_path, verbose=True)

    # Add really long text with 2K tokens to server
    text = "a " * 2000
    fail, file_path = upload_raw_text('localhost', '3001', os.environ.get('ANYTHINGLLM_API_KEY'), text, 'long_file', 'bunch of as', verbose=True)

    # Move long file into temp folder
    fail, file_path = move_file('localhost', '3001', os.environ.get('ANYTHINGLLM_API_KEY'), file_path, 'testing_temp_folder', verbose=True)

    # Embed long file and add to workspace
    fail = add_file_to_workspace('localhost', '3001', 'testing_api', os.environ.get('ANYTHINGLLM_API_KEY'), file_path, verbose=True)

    # Query workspace vectors
    fail, response = vector_search('localhost', '3001', 'testing_api', os.environ.get('ANYTHINGLLM_API_KEY'), 'Yo ho ho!', 2, 0.0, verbose=True)

    # Query workspace vectors
    test_query = '<document_metadata>\nsourceDocument: pirate.txt\npublished: 1/4/2025, 6:17:35 AM\n</document_metadata>\n\nYo ho ho!'
    fail, response = vector_search('localhost', '3001', 'testing_api', os.environ.get('ANYTHINGLLM_API_KEY'), test_query, 2, 0.0, verbose=True)

Expectation:

My understanding (and correct me if I am wrong) is that:

  • The api/v1/document/raw-text endpoint will add a json document to the system with fields like its id, title, description, pageContent, etc. I see that the pageContent is directly taken from the textContent field from the request.
  • Then, when using the api/v1/workspace/{slug}/update-embeddings endpoint to embed and add the document to a workspace, ONLY the pageContent field will be split into chunks, passed through the embedder, and stored into LanceDB.
  • The metadata itself is not chunked or passed through the embedder.
  • Finally, when calling the api/v1/workspace/{slug}/vector-search endpoint, the query string will similarly be passed through the embedder. The endpoint returns chunks that are most similar to the given query.

Based on my assumptions, I expect that, if I query the workspace's vector database using the exact same textContent I used to add a document to the server, the query should return a vector with a distance of 0 and a similarity of 1.
I tested with textContent of fewer than 5 tokens, so the text is not split into multiple chunks, and the entire text should be returned with a similarity of 1.
The tests are at the bottom of the provided script.
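The distance-0 expectation follows from how cosine distance behaves: an embedding compared with itself has distance 0 (and hence similarity 1). A toy check with made-up vectors, not real embeddings:

```python
import math

def cosine_distance(a, b):
    # Cosine distance: 1 minus the cosine of the angle between a and b.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

v = [0.1, 0.5, -0.3]
assert cosine_distance(v, v) < 1e-12  # identical vectors: distance ~0
```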

Reality:

The first call to my vector_search function returns vectors with nonzero distances and low scores.
Below is the resulting response (it has two search results because I ran the script multiple times, so the workspace contains multiple files with the same name and contents):

{'results': [{'distance': 0.8602899312973022,
              'id': 'c83063a4-46a5-412c-b1c4-38cc1adce2a1',
              'metadata': {'author': None,
                           'chunkSource': None,
                           'description': 'what a pirate says',
                           'docSource': None,
                           'published': '1/4/2025, 6:21:05 AM',
                           'title': 'pirate.txt',
                           'tokenCount': 4,
                           'url': 'file://pirate.txt',
                           'wordCount': 3},
              'score': 0.13971006870269775,
              'text': '<document_metadata>\n'
                      'sourceDocument: pirate.txt\n'
                      'published: 1/4/2025, 6:21:05 AM\n'
                      '</document_metadata>\n'
                      '\n'
                      'Yo ho ho!'},
             {'distance': 0.8745168447494507,
              'id': '2f3ae7dd-cbe3-47f8-b885-edada0c850c4',
              'metadata': {'author': None,
                           'chunkSource': None,
                           'description': 'what a pirate says',
                           'docSource': None,
                           'published': '1/4/2025, 6:17:35 AM',
                           'title': 'pirate.txt',
                           'tokenCount': 4,
                           'url': 'file://pirate.txt',
                           'wordCount': 3},
              'score': 0.12548315525054932,
              'text': '<document_metadata>\n'
                      'sourceDocument: pirate.txt\n'
                      'published: 1/4/2025, 6:17:35 AM\n'
                      '</document_metadata>\n'
                      '\n'
                      'Yo ho ho!'}]}
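As an aside, the scores in this response look like a plain score = 1 - distance conversion; that is my inference from the numbers alone, not from the source code:

```python
# (distance, score) pairs copied from the two results above.
results = [
    (0.8602899312973022, 0.13971006870269775),
    (0.8745168447494507, 0.12548315525054932),
]

# Each score matches 1 - distance to within float precision.
for distance, score in results:
    assert abs((1 - distance) - score) < 1e-9
```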

Debugging:

I was puzzled that the scores of the two returned documents were not the same even though their page content is identical; the only differences were their metadata and text fields.
So I ran the vector search again, this time using the exact string from the text field of one of the documents in the response (i.e. test_query = '<document_metadata>\nsourceDocument: pirate.txt ......).

And this time, the query returned a response with a vector of distance 0, but a similarity of 0 rather than 1:

{'results': [{'distance': 0,
              'id': '2f3ae7dd-cbe3-47f8-b885-edada0c850c4',
              'metadata': {'author': None,
                           'chunkSource': None,
                           'description': 'what a pirate says',
                           'docSource': None,
                           'published': '1/4/2025, 6:17:35 AM',
                           'title': 'pirate.txt',
                           'tokenCount': 4,
                           'url': 'file://pirate.txt',
                           'wordCount': 3},
              'score': 0,
              'text': '<document_metadata>\n'
                      'sourceDocument: pirate.txt\n'
                      'published: 1/4/2025, 6:17:35 AM\n'
                      '</document_metadata>\n'
                      '\n'
                      'Yo ho ho!'},
             {'distance': 0.008605420589447021,
              'id': 'b8525a99-6563-4869-9248-237b20f1ed84',
              'metadata': {'author': None,
                           'chunkSource': None,
                           'description': 'what a pirate says',
                           'docSource': None,
                           'published': '1/4/2025, 6:24:00 AM',
                           'title': 'pirate.txt',
                           'tokenCount': 4,
                           'url': 'file://pirate.txt',
                           'wordCount': 3},
              'score': 0.991394579410553,
              'text': '<document_metadata>\n'
                      'sourceDocument: pirate.txt\n'
                      'published: 1/4/2025, 6:24:00 AM\n'
                      '</document_metadata>\n'
                      '\n'
                      'Yo ho ho!'}]}

I'm not fluent in JavaScript, but I took a look at the following source files while debugging, and my reading was that the search should return the closest vector based on the text alone, excluding the metadata:

Questions:

  1. What is the expected result of embedding a raw-text content and querying with the vector-search endpoint?
  2. If including the metadata is required to retrieve the closest vector with distance 0, could a separate endpoint be added where the vector search is based purely on the text content submitted to the raw-text endpoint?
  3. Why is the similarity 0 when the distance is 0 in the example above?

Are there known steps to reproduce?

No response

@TheNeeloy TheNeeloy added the possible bug Bug was reported but is not confirmed or is unable to be replicated. label Jan 4, 2025
@shatfield4 shatfield4 linked a pull request Jan 18, 2025 that will close this issue
@shatfield4
Collaborator

Thank you for the detailed write-up on this issue! I have just opened the PR to fix the similarity score bug you pointed out: on an exact match the similarity should be 1, not 0.

Regarding the metadata being embedded in each text chunk, this is intentional. It helps when users ask about a document by its name, letting them refer to it by name while asking more about its contents, and in our testing it improves RAG results. It also lets users ask things like "when was that information collected", since each chunk carries enough info to answer.

@TheNeeloy
Author

Thanks for adding the fix Sean and explaining the expected results! Could you point me to where in the source code the metadata is added to the content string when embedding? I'd like to modify the behavior in my local installation so that only the content string is embedded when adding a document to a workspace.

@timothycarambat
Member

@TheNeeloy You will see a line like this in each vector DB provider (LanceDB example below):

chunkHeaderMeta: TextSplitter.buildHeaderMeta(metadata),

Which comes from this class method

static buildHeaderMeta(metadata = {}) {

You can modify the method to return whatever you would like, or simply return "" from it to do nothing.
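Judging from the text fields in the vector-search responses above, the header that method builds looks roughly like the sketch below. This is illustrative Python, not the actual JavaScript implementation, and the field names are inferred from the responses rather than from the source:

```python
def build_header_meta(metadata=None):
    """Sketch of a chunk-header builder: renders selected metadata fields
    inside <document_metadata> tags, matching the shape of the `text`
    fields seen in the vector-search responses. Returning "" here would
    reproduce the 'embed only the content' behavior."""
    if not metadata:
        return ""
    # Field names inferred from the responses above, not from the source.
    fields = {
        "sourceDocument": metadata.get("title"),
        "published": metadata.get("published"),
    }
    lines = [f"{key}: {value}" for key, value in fields.items() if value]
    if not lines:
        return ""
    return "<document_metadata>\n" + "\n".join(lines) + "\n</document_metadata>\n\n"

header = build_header_meta({"title": "pirate.txt", "published": "1/4/2025, 6:17:35 AM"})
chunk_text = header + "Yo ho ho!"
```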

@TheNeeloy
Author

Thanks Tim, that's really helpful!
