ValueError: doc_id: 624_6 not found in corpus_data. #5

Open
@zhoujiamei-git

Description

import os

import click
from autorag.evaluator import Evaluator
from dotenv import load_dotenv

root_path = os.path.dirname(os.path.realpath(__file__))
data_path = os.path.join(root_path, 'data')


@click.command()
@click.option('--config', type=click.Path(exists=True), default=os.path.join(root_path, 'config/tutorial.yaml'))
@click.option('--qa_data_path', type=click.Path(exists=True), default=os.path.join(data_path, 'qa_test.parquet'))
@click.option('--corpus_data_path', type=click.Path(exists=True), default=os.path.join(data_path, 'corpus.parquet'))
@click.option('--project_dir', type=click.Path(exists=False), default=os.path.join(root_path, 'benchmark'))
def main(config, qa_data_path, corpus_data_path, project_dir):
    load_dotenv()
    if os.getenv('OPENAI_API_KEY') is None:
        raise ValueError('OPENAI_API_KEY environment variable is not set')
    if not os.path.exists(project_dir):
        os.makedirs(project_dir)
    evaluator = Evaluator(qa_data_path, corpus_data_path, project_dir=project_dir)
    evaluator.start_trial(config, skip_validation=True)


if __name__ == '__main__':
    main()

Running this code raises the error shown below:

Image

tutorial.yaml
node_lines:
  - node_line_name: retrieve_node_line
    nodes:
      - node_type: retrieval
        top_k: 3
        modules:
          - module_type: bm25
            bm25_tokenizer: [porter_stemmer, space]
#          - module_type: vectordb
#            vectordb: chroma_large
        strategy:
          metrics:
            - retrieval_f1
            - retrieval_recall
            - retrieval_precision
  - node_line_name: post_retrieve_node_line
    nodes:
      - node_type: prompt_maker
        modules:
          - module_type: fstring
            prompt: "Read the passages and answer the given question. \n Question: {query} \n Passage: {retrieved_contents} \n Answer : "
        strategy:
          generator_modules:
            - batch: 2
              llm: openai
              module_type: llama_index_llm
          metrics:
            - bleu
            - meteor
            - rouge
      - node_type: generator
        modules:
          - batch: 2
            llm: openai
            model: gpt-3.5-turbo-16k
            module_type: llama_index_llm
        strategy:
          metrics:
            - metric_name: bleu
            - metric_name: meteor
            - embedding_model: openai
              metric_name: sem_score

corpus.parquet and qa_test.parquet are from [MarkrAI/msmarco_sample_autorag](https://huggingface.co/datasets/MarkrAI/msmarco_sample_autorag).
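
To narrow this down, it may help to check which ids referenced by the QA file are absent from the corpus file. The following is a minimal diagnostic sketch, not part of AutoRAG: the helper names `flatten_ids` and `find_missing_doc_ids` are hypothetical, and it assumes the QA data has a `retrieval_gt` column holding (possibly nested) lists of doc ids and the corpus has a `doc_id` column. The toy data stands in for the real parquet files.

```python
from typing import Iterable, List, Set


def flatten_ids(value) -> List[str]:
    """Flatten a possibly nested retrieval_gt cell into a flat list of doc ids."""
    if isinstance(value, str):
        return [value]
    ids: List[str] = []
    for item in value:
        ids.extend(flatten_ids(item))
    return ids


def find_missing_doc_ids(retrieval_gt_rows: Iterable, corpus_doc_ids: Iterable[str]) -> Set[str]:
    """Return every doc id referenced by the QA rows that is not in the corpus."""
    known = set(corpus_doc_ids)
    missing: Set[str] = set()
    for row in retrieval_gt_rows:
        for doc_id in flatten_ids(row):
            if doc_id not in known:
                missing.add(doc_id)
    return missing


# Toy data standing in for the real files (assumed column layout, see above).
qa_rows = [[["624_6"]], [["10_1", "10_2"]]]
corpus_ids = ["10_1", "10_2", "99_0"]
missing = find_missing_doc_ids(qa_rows, corpus_ids)
print(sorted(missing))

# With the real files, assuming pandas and a parquet engine are installed:
#   import pandas as pd
#   qa = pd.read_parquet('data/qa_test.parquet')
#   corpus = pd.read_parquet('data/corpus.parquet')
#   print(find_missing_doc_ids(qa['retrieval_gt'], corpus['doc_id']))
```

If the returned set is non-empty for the real files, the QA and corpus files are probably not from the same split or version of the dataset.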
