ValueError: doc_id: 624_6 not found in corpus_data.

import os

import click
from autorag.evaluator import Evaluator
from dotenv import load_dotenv

root_path = os.path.dirname(os.path.realpath(__file__))
data_path = os.path.join(root_path, 'data')


@click.command()
@click.option('--config', type=click.Path(exists=True), default=os.path.join(root_path, 'config/tutorial.yaml'))
@click.option('--qa_data_path', type=click.Path(exists=True), default=os.path.join(data_path, 'qa_test.parquet'))
@click.option('--corpus_data_path', type=click.Path(exists=True), default=os.path.join(data_path, 'corpus.parquet'))
@click.option('--project_dir', type=click.Path(exists=False), default=os.path.join(root_path, 'benchmark'))
def main(config, qa_data_path, corpus_data_path, project_dir):
    load_dotenv()
    if os.getenv('OPENAI_API_KEY') is None:
        raise ValueError('OPENAI_API_KEY environment variable is not set')
    if not os.path.exists(project_dir):
        os.makedirs(project_dir)
    evaluator = Evaluator(qa_data_path, corpus_data_path, project_dir=project_dir)
    evaluator.start_trial(config, skip_validation=True)


if __name__ == '__main__':
    main()


Running the code above raises the following error:

![Image](https://github.com/user-attachments/assets/82c80e1a-d5f1-4b19-bf16-dd0903f58cbc)

Contents of tutorial.yaml:
node_lines:
- node_line_name: retrieve_node_line
  nodes:
  - node_type: retrieval
    top_k: 3
    modules:
    - module_type: bm25
      bm25_tokenizer: [porter_stemmer,space]
#    - module_type: vectordb
#      vectordb: chroma_large
    strategy:
      metrics:
      - retrieval_f1
      - retrieval_recall
      - retrieval_precision
- node_line_name: post_retrieve_node_line
  nodes:
  - node_type: prompt_maker
    modules:
    - module_type: fstring
      prompt: "Read the passages and answer the given question. \n Question: {query} \n Passage: {retrieved_contents} \n Answer : "
    strategy:
      generator_modules:
      - batch: 2
        llm: openai
        module_type: llama_index_llm
      metrics:
      - bleu
      - meteor
      - rouge
  - node_type: generator
    modules:
    - batch: 2
      llm: openai
      model: gpt-3.5-turbo-16k
      module_type: llama_index_llm
    strategy:
      metrics:
      - metric_name: bleu
      - metric_name: meteor
      - embedding_model: openai
        metric_name: sem_score

corpus.parquet and qa_test.parquet are from [MarkrAI/msmarco_sample_autorag](https://huggingface.co/datasets/MarkrAI/msmarco_sample_autorag), downloaded through the mirror at [hf-mirror.com](https://hf-mirror.com/datasets/MarkrAI/msmarco_sample_autorag).
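
One possible cause, which nothing in this report confirms, is that the two parquet files do not belong together (for example, a partial download or files from different revisions), so that the QA file references a doc_id the corpus file lacks. The sketch below lists every ground-truth id that is absent from the corpus. It assumes the usual AutoRAG dataset layout, where the QA file has a nested `retrieval_gt` column and the corpus file has a `doc_id` column; `flatten_ids`, `find_missing_doc_ids`, and the toy data are hypothetical names and values used only for illustration.

```python
from typing import Iterable, List, Set


def flatten_ids(value) -> List[str]:
    """Flatten a nested retrieval_gt value (lists or arrays of ids) into a flat list."""
    if isinstance(value, str):
        return [value]
    ids: List[str] = []
    for item in value:
        ids.extend(flatten_ids(item))
    return ids


def find_missing_doc_ids(retrieval_gt_rows: Iterable, corpus_doc_ids: Iterable[str]) -> List[str]:
    """Return ground-truth ids that do not appear in the corpus, in first-seen order."""
    known: Set[str] = set(corpus_doc_ids)
    seen: Set[str] = set()
    missing: List[str] = []
    for row in retrieval_gt_rows:
        for doc_id in flatten_ids(row):
            if doc_id not in known and doc_id not in seen:
                seen.add(doc_id)
                missing.append(doc_id)
    return missing


if __name__ == "__main__":
    # Toy data only (assumption): one QA row points at an id the corpus lacks.
    qa_rows = [[["624_6"]], [["1_0", "2_3"]]]
    corpus_ids = ["1_0", "2_3"]
    print(find_missing_doc_ids(qa_rows, corpus_ids))  # prints ['624_6']
```

With the real files, and assuming those column names hold, the inputs could be `pd.read_parquet(qa_data_path)["retrieval_gt"]` and `pd.read_parquet(corpus_data_path)["doc_id"]`. A non-empty result would suggest a data mismatch rather than a problem in the config.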
