
12th of July Updates #7

alebjanes opened this issue Jul 12, 2024 · 1 comment

@alebjanes (Contributor)

RAG Evaluation

  1. 100 questions
     Types of questions:
     • 60 on general trade
     • 12 on growth/variation
     • 28 on rankings
  2. RAG evaluation results

Best combination tested so far: multi-qa-mpnet-base-cos-v1 (embeddings) + gpt-3.5-turbo (LLM)

  • Accuracy: 73% (27 wrong out of 100)
    • Answers missing data: 9
    • Answers missing context: 14
    • Incorrect answers: 4
  • Average latency: 4.19 s

Of the 27 wrong answers:

  • 24 were general trade questions
  • 3 were growth questions (the lowest)
  • 0 were ranking questions
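For reference, here's a minimal sketch of how this embedding + LLM combination can be wired together, assuming the sentence-transformers and openai Python packages; the documents, prompt wording, and retrieval depth are placeholders for illustration, not our actual corpus or evaluation harness:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from openai import OpenAI

encoder = SentenceTransformer("multi-qa-mpnet-base-cos-v1")
client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder corpus; the real pipeline retrieves from the trade data.
documents = [
    "Country A exported $1.2B of copper ore to Country B in 2022.",
    "Country C was the top importer of soybeans in 2021.",
]
doc_emb = encoder.encode(documents, normalize_embeddings=True)

def answer(question: str, k: int = 2) -> str:
    # Cosine similarity reduces to a dot product on normalized embeddings.
    q_emb = encoder.encode(question, normalize_embeddings=True)
    top = np.argsort(doc_emb @ q_emb)[::-1][:k]
    context = "\n".join(documents[i] for i in top)
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer using only this context:\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```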

We're preparing a presentation that gathers the results of all approaches in more detail. Next week I'll be improving the RAG + LLM pipeline and evaluating the previous multi-layer approach with Pippo.

@pippo-sci (Contributor)

Fine-tune MAPE results:

  • Recap: exact-match accuracy is 0% for all fine-tuned versions because the generated numbers are never exactly right
  • Measuring the absolute percentage error by model:
    [image: absolute percentage error by model]

Measuring the median absolute percentage error on value quantity, taking the 1.1B-parameter model trained for one epoch as the baseline (ft_tiny0): the error decreased for the 50-epoch model (ft_tiny2) and increased when scaling model size up to 7B parameters (ft_llama2). This sort of makes sense, since larger models are harder to fine-tune.
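For clarity, this is roughly the metric described above, a minimal sketch assuming aligned arrays of true and predicted trade values; the numbers below are made up for illustration, not the actual model outputs:

```python
import numpy as np

def median_ape(y_true, y_pred):
    """Median absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.median(np.abs((y_pred - y_true) / y_true))) * 100

# Illustrative values only; ft_tiny0 = 1.1B params / 1 epoch (baseline),
# ft_tiny2 = 1.1B params / 50 epochs, ft_llama2 = 7B params.
truth = [1.0e9, 3.0e8, 5.0e7]
predictions = {
    "ft_tiny0": [1.3e9, 3.9e8, 6.2e7],
    "ft_tiny2": [1.1e9, 3.2e8, 5.3e7],
    "ft_llama2": [1.9e9, 1.2e8, 9.8e7],
}
for name, pred in predictions.items():
    print(f"{name}: {median_ape(truth, pred):.0f}% median APE")
```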

Fine-tuning tinyLlama to produce API calls instead, accuracy is 12%, largely because the model struggles with HS code numbers. Setting HS codes aside, query accuracy is 89%. On the HS numbers, the mean percentage error is about 12%, but again, every time the model is queried with the same question it returns a slightly different number.
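A hedged sketch of how that split evaluation could be scored, assuming the fine-tuned model emits calls shaped like get_trade(hs_code=..., year=...); the call format, parser, and example values are assumptions for illustration, not the project's actual harness:

```python
import re

def score(generated: str, expected: str):
    g = dict(re.findall(r"(\w+)=(\w+)", generated))
    e = dict(re.findall(r"(\w+)=(\w+)", expected))
    # Query accuracy: every parameter other than the HS code matches exactly.
    query_ok = all(g.get(k) == v for k, v in e.items() if k != "hs_code")
    # HS code error: relative numeric deviation, since the model returns a
    # slightly different code number each time it is queried.
    hs_err = abs(int(g["hs_code"]) - int(e["hs_code"])) / int(e["hs_code"])
    return query_ok, hs_err

ok, err = score("get_trade(hs_code=260200, year=2022)",
                "get_trade(hs_code=260300, year=2022)")
print(ok, f"{err:.2%}")  # -> True 0.04%
```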
