Best combination tested so far: multi-qa-mpnet-base-cos-v1 (embeddings) + gpt-3.5-turbo (LLM)
Accuracy: 73%
Answers missing data: 9
Answers missing context: 14
Incorrect answers: 4
Average latency: 4.19 s
Out of the wrong answers:
24 were general questions
3 were growth questions (the lowest)
0 were ranking questions
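For context, here's a minimal sketch of how this best-performing combination can be wired together with sentence-transformers and the OpenAI client. The corpus, prompts, and `top_k` below are placeholders for illustration, not our actual evaluation setup:

```python
# Minimal RAG sketch: multi-qa-mpnet-base-cos-v1 for retrieval,
# gpt-3.5-turbo for answer generation.
from sentence_transformers import SentenceTransformer, util
from openai import OpenAI

embedder = SentenceTransformer("multi-qa-mpnet-base-cos-v1")
client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder documents, not the evaluation corpus.
corpus = ["Exports of product X grew 12% in 2022.", "Product Y ranked first by value."]
corpus_emb = embedder.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

def answer(question: str, top_k: int = 3) -> str:
    # Embed the question and retrieve the top-k most similar documents.
    q_emb = embedder.encode(question, convert_to_tensor=True, normalize_embeddings=True)
    hits = util.semantic_search(q_emb, corpus_emb, top_k=top_k)[0]
    context = "\n".join(corpus[h["corpus_id"]] for h in hits)
    # Ask the LLM to answer strictly from the retrieved context.
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer only from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```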
We're preparing a presentation gathering the results of all approaches in more detail. Next week I'll be improving the RAG + LLM and evaluating the previous multi-layer approach with Pippo.
Recap: accuracy is 0% for all fine-tuned versions because the numbers they return aren't right.
Measuring the absolute percentage error by model:
Taking the 1.1B-parameter model trained for one epoch as the baseline (ft_tiny0), the median absolute % error on value qty decreased with the 50-epoch model (ft_tiny2) and increased when moving up to 7B parameters (ft_llama2). That sort of makes sense: larger models are harder to fine-tune.
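For reference, this is how the median absolute percentage error metric can be computed; the arrays below are made-up numbers for illustration, not our measured results:

```python
import numpy as np

def median_ape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Median absolute percentage error over the predicted quantities."""
    ape = np.abs((y_pred - y_true) / y_true) * 100.0
    return float(np.median(ape))

# Hypothetical quantities, for illustration only:
truth = np.array([1200.0, 530.0, 89.0])
preds = np.array([1100.0, 560.0, 95.0])
print(f"median APE: {median_ape(truth, preds):.1f}%")  # -> median APE: 6.7%
```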
Fine-tuning TinyLlama to produce API calls instead, accuracy is 12%, mostly because the model struggles with HS code numbers. Setting the HS codes aside, query accuracy is 89%. On the HS numbers the mean % error is about 12%, but again, every time the model is queried with the same question it returns a slightly different number.
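To make the split between query accuracy and HS-code error concrete, here's one way the scoring could separate the two; the `hs_code=` field name and the call format are hypothetical, not our actual API schema:

```python
import re

def score_call(generated: str, expected: str) -> tuple[bool, float | None]:
    """Score a generated API call against the reference.

    Returns (query_matches, hs_code_pct_error). The HS code is pulled out
    with a regex so that structural accuracy and numeric error on the code
    are measured separately. The hs_code= field name is a made-up example.
    """
    pattern = r"hs_code=(\d+)"
    gen_code = re.search(pattern, generated)
    exp_code = re.search(pattern, expected)
    # Structural match: the calls must be identical once HS codes are blanked out.
    query_matches = (
        re.sub(pattern, "hs_code=?", generated) == re.sub(pattern, "hs_code=?", expected)
    )
    if gen_code and exp_code:
        g, e = int(gen_code.group(1)), int(exp_code.group(1))
        return query_matches, abs(g - e) / e * 100.0
    return query_matches, None

# Example: right query shape, wrong HS code.
print(score_call("trade(hs_code=901, year=2022)", "trade(hs_code=903, year=2022)"))
```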
RAG Evaluation
Types of questions: general, growth, and ranking (results summarized above).