|
18 | 18 | "\n",
|
19 | 19 | "This notebook uses our `RetrieverEvaluator` to evaluate the quality of any Retriever module defined in LlamaIndex.\n",
|
20 | 20 | "\n",
|
21 |
| - "We specify a set of different evaluation metrics: this includes hit-rate and MRR. For any given question, these will compare the quality of retrieved results from the ground-truth context.\n", |
| 21 | + "We specify a set of evaluation metrics: these include hit rate, MRR, and NDCG. For any given question, they compare the quality of retrieved results against the ground-truth context.\n", |
22 | 22 | "\n",
|
23 | 23 | "To ease the burden of creating the eval dataset in the first place, we can rely on synthetic data generation."
|
24 | 24 | ]
|
|
40 | 40 | "metadata": {},
|
41 | 41 | "outputs": [],
|
42 | 42 | "source": [
|
43 |
| - "%pip install llama-index-llms-openai" |
| 43 | + "%pip install llama-index-llms-openai\n", |
| 44 | + "%pip install llama-index-readers-file" |
44 | 45 | ]
|
45 | 46 | },
|
46 | 47 | {
|
47 | 48 | "cell_type": "code",
|
48 | 49 | "execution_count": null,
|
49 |
| - "id": "bb6fecf4-7215-4ae9-b02b-3cb7c6000f2c", |
| 50 | + "id": "285cfab2", |
50 | 51 | "metadata": {},
|
51 | 52 | "outputs": [],
|
52 | 53 | "source": [
|
|
62 | 63 | "metadata": {},
|
63 | 64 | "outputs": [],
|
64 | 65 | "source": [
|
65 |
| - "from llama_index.core.evaluation import generate_question_context_pairs\n", |
66 | 66 | "from llama_index.core import VectorStoreIndex, SimpleDirectoryReader\n",
|
67 | 67 | "from llama_index.core.node_parser import SentenceSplitter\n",
|
68 | 68 | "from llama_index.llms.openai import OpenAI"
|
|
82 | 82 | "execution_count": null,
|
83 | 83 | "id": "589c112d",
|
84 | 84 | "metadata": {},
|
85 |
| - "outputs": [], |
| 85 | + "outputs": [ |
| 86 | + { |
| 87 | + "name": "stdout", |
| 88 | + "output_type": "stream", |
| 89 | + "text": [ |
| 90 | + "--2024-06-12 23:57:02-- https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt\n", |
| 91 | + "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...\n", |
| 92 | + "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.\n", |
| 93 | + "HTTP request sent, awaiting response... 200 OK\n", |
| 94 | + "Length: 75042 (73K) [text/plain]\n", |
| 95 | + "Saving to: ‘data/paul_graham/paul_graham_essay.txt’\n", |
| 96 | + "\n", |
| 97 | + "data/paul_graham/pa 100%[===================>] 73.28K --.-KB/s in 0.08s \n", |
| 98 | + "\n", |
| 99 | + "2024-06-12 23:57:03 (864 KB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]\n", |
| 100 | + "\n" |
| 101 | + ] |
| 102 | + } |
| 103 | + ], |
86 | 104 | "source": [
|
87 | 105 | "!mkdir -p 'data/paul_graham/'\n",
|
88 | 106 | "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'"
|
|
171 | 189 | {
|
172 | 190 | "data": {
|
173 | 191 | "text/markdown": [
|
174 |
| - "**Node ID:** node_0<br>**Similarity:** 0.8181379514114543<br>**Text:** What I Worked On\n", |
175 |
| - "\n", |
176 |
| - "February 2021\n", |
| 192 | + "**Node ID:** node_38<br>**Similarity:** 0.814377909267451<br>**Text:** I also worked on spam filters, and did some more painting. I used to have dinners for a group of friends every thursday night, which taught me how to cook for groups. And I bought another building in Cambridge, a former candy factory (and later, twas said, porn studio), to use as an office.\n", |
177 | 193 | "\n",
|
178 |
| - "Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n", |
179 |
| - "\n", |
180 |
| - "The first programs I tried writing were on the IBM 1401 that our school district used for what was then called \"data processing.\" This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n", |
| 194 | + "One night in October 2003 there was a big party at my house. It was a clever idea of my friend Maria Daniels, who was one of the thursday diners. Three separate hosts would all invite their friends to one party. So for every guest, two thirds of the other guests would be people they didn't know but would probably like. One of the guests was someone I didn't know but would turn out to like a lot: a woman called Jessica Livingston. A couple days later I asked her out.\n", |
181 | 195 | "\n",
|
182 |
| - "The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in ...<br>" |
| 196 | + "Jessica was in charge of marketing at a Boston investment bank. This bank thought it understood startups, but over the next year, as she met friends of mine from the startup world, she was surprised how different reality was. And ho...<br>" |
183 | 197 | ],
|
184 | 198 | "text/plain": [
|
185 | 199 | "<IPython.core.display.Markdown object>"
|
|
191 | 205 | {
|
192 | 206 | "data": {
|
193 | 207 | "text/markdown": [
|
194 |
| - "**Node ID:** node_52<br>**Similarity:** 0.8143530600618721<br>**Text:** It felt like I was doing life right. I remember that because I was slightly dismayed at how novel it felt. The good news is that I had more moments like this over the next few years.\n", |
| 208 | + "**Node ID:** node_0<br>**Similarity:** 0.8122448657654567<br>**Text:** What I Worked On\n", |
195 | 209 | "\n",
|
196 |
| - "In the summer of 2016 we moved to England. We wanted our kids to see what it was like living in another country, and since I was a British citizen by birth, that seemed the obvious choice. We only meant to stay for a year, but we liked it so much that we still live there. So most of Bel was written in England.\n", |
| 210 | + "February 2021\n", |
| 211 | + "\n", |
| 212 | + "Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n", |
197 | 213 | "\n",
|
198 |
| - "In the fall of 2019, Bel was finally finished. Like McCarthy's original Lisp, it's a spec rather than an implementation, although like McCarthy's Lisp it's a spec expressed as code.\n", |
| 214 | + "The first programs I tried writing were on the IBM 1401 that our school district used for what was then called \"data processing.\" This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n", |
199 | 215 | "\n",
|
200 |
| - "Now that I could write essays again, I wrote a bunch about topics I'd had stacked up. I kept writing essays through 2020, but I also started to think about other things I could work on. How should I choose what to do? Well, how had I chosen what to work on in the past? I wrote an essay for myself to answer that ques...<br>" |
| 216 | + "The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in ...<br>" |
201 | 217 | ],
|
202 | 218 | "text/plain": [
|
203 | 219 | "<IPython.core.display.Markdown object>"
|
|
246 | 262 | "execution_count": null,
|
247 | 263 | "id": "2d29a159-9a4f-4d44-9c0d-1cd683f8bb9b",
|
248 | 264 | "metadata": {},
|
249 |
| - "outputs": [], |
| 265 | + "outputs": [ |
| 266 | + { |
| 267 | + "name": "stderr", |
| 268 | + "output_type": "stream", |
| 269 | + "text": [ |
| 270 | + "100%|██████████| 61/61 [04:59<00:00, 4.91s/it]\n" |
| 271 | + ] |
| 272 | + } |
| 273 | + ], |
250 | 274 | "source": [
|
251 | 275 | "qa_dataset = generate_question_context_pairs(\n",
|
252 | 276 | " nodes, llm=llm, num_questions_per_chunk=2\n",
|
|
263 | 287 | "name": "stdout",
|
264 | 288 | "output_type": "stream",
|
265 | 289 | "text": [
|
266 |
| - "\"Describe the transition from using the IBM 1401 to microcomputers, as mentioned in the text. What were the key differences in terms of user interaction and programming capabilities?\"\n" |
| 290 | + "\"Describe the transition from using the IBM 1401 to microcomputers, as mentioned in the text. How did this change impact the way programs were written and executed?\"\n" |
267 | 291 | ]
|
268 | 292 | }
|
269 | 293 | ],
|
|
319 | 343 | "metadata": {},
|
320 | 344 | "outputs": [],
|
321 | 345 | "source": [
|
322 |
| - "include_cohere_rerank = True\n", |
| 346 | + "include_cohere_rerank = False\n", |
323 | 347 | "\n",
|
324 | 348 | "if include_cohere_rerank:\n",
|
325 | 349 | " !pip install cohere -q"
|
|
334 | 358 | "source": [
|
335 | 359 | "from llama_index.core.evaluation import RetrieverEvaluator\n",
|
336 | 360 | "\n",
|
337 |
| - "metrics = [\"mrr\", \"hit_rate\"]\n", |
| 361 | + "metrics = [\"mrr\", \"hit_rate\", \"ndcg\"]\n", |
338 | 362 | "\n",
|
339 | 363 | "if include_cohere_rerank:\n",
|
340 | 364 | " metrics.append(\n",
|
|
356 | 380 | "name": "stdout",
|
357 | 381 | "output_type": "stream",
|
358 | 382 | "text": [
|
359 |
| - "Query: In the context provided, the author describes his early experiences with programming on an IBM 1401. Based on his description, what were some of the limitations and challenges he faced while trying to write programs on this machine?\n", |
360 |
| - "Metrics: {'mrr': 1.0, 'hit_rate': 1.0, 'cohere_rerank_relevancy': 0.99620515}\n", |
| 383 | + "Query: In the context, the author mentions his early experiences with programming on an IBM 1401. Describe the process he used to write and run a program on this machine, and explain why he found it challenging to create meaningful programs on this system.\n", |
| 384 | + "Metrics: {'mrr': 1.0, 'hit_rate': 1.0, 'ndcg': 0.6131471927654584}\n", |
361 | 385 | "\n"
|
362 | 386 | ]
|
363 | 387 | }
|
|
402 | 426 | "\n",
|
403 | 427 | " full_df = pd.DataFrame(metric_dicts)\n",
|
404 | 428 | "\n",
|
405 |
| - " hit_rate = full_df[\"hit_rate\"].mean()\n", |
406 |
| - " mrr = full_df[\"mrr\"].mean()\n", |
407 |
| - " columns = {\"retrievers\": [name], \"hit_rate\": [hit_rate], \"mrr\": [mrr]}\n", |
| 429 | + " columns = {\n", |
| 430 | + " \"retrievers\": [name],\n", |
| 431 | + " **{k: [full_df[k].mean()] for k in metrics},\n", |
| 432 | + " }\n", |
408 | 433 | "\n",
|
409 | 434 | " if include_cohere_rerank:\n",
|
410 | 435 | " crr_relevancy = full_df[\"cohere_rerank_relevancy\"].mean()\n",
|
|
443 | 468 | " <tr style=\"text-align: right;\">\n",
|
444 | 469 | " <th></th>\n",
|
445 | 470 | " <th>retrievers</th>\n",
|
446 |
| - " <th>hit_rate</th>\n", |
447 | 471 | " <th>mrr</th>\n",
|
448 |
| - " <th>cohere_rerank_relevancy</th>\n", |
| 472 | + " <th>hit_rate</th>\n", |
| 473 | + " <th>ndcg</th>\n", |
449 | 474 | " </tr>\n",
|
450 | 475 | " </thead>\n",
|
451 | 476 | " <tbody>\n",
|
452 | 477 | " <tr>\n",
|
453 | 478 | " <th>0</th>\n",
|
454 | 479 | " <td>top-2 eval</td>\n",
|
455 |
| - " <td>0.801724</td>\n", |
456 |
| - " <td>0.685345</td>\n", |
457 |
| - " <td>0.946009</td>\n", |
| 480 | + " <td>0.643443</td>\n", |
| 481 | + " <td>0.745902</td>\n", |
| 482 | + " <td>0.410976</td>\n", |
458 | 483 | " </tr>\n",
|
459 | 484 | " </tbody>\n",
|
460 | 485 | "</table>\n",
|
461 | 486 | "</div>"
|
462 | 487 | ],
|
463 | 488 | "text/plain": [
|
464 |
| - " retrievers hit_rate mrr cohere_rerank_relevancy\n", |
465 |
| - "0 top-2 eval 0.801724 0.685345 0.946009" |
| 489 | + " retrievers mrr hit_rate ndcg\n", |
| 490 | + "0 top-2 eval 0.643443 0.745902 0.410976" |
466 | 491 | ]
|
467 | 492 | },
|
468 | 493 | "execution_count": null,
|
|