1. Discrepancy in Evaluation Scores for HybridQA, WikiSQL, and WikiTQ
When using inference_hitab_tabfact_fetaqa.py (with default parameters) followed by eval_tabfact.py for evaluation, I am not able to reproduce the scores reported in the paper. The results I obtain for HybridQA (5.74), WikiSQL (41.37), and WikiTQ (16.90) are significantly lower.
Is there something specific in the generation or evaluation process for these datasets that needs to be modified? I understand that FEVEROUS requires label replacement, but judging from the outputs for these datasets, they seem far from the levels reported in the paper.
For example, here are some outputs for HybridQA:
{
    "idx": 0,
    "instruction": "This is a hybrid question answering task. The goal of this task is to answer the question given tables and passages.",
    "input_seg": " [TAB] col: | rank | player | team ( s )...(truncated for brevity)",
    "output": "jerry",
    "predict": "payton</s>"
},
{
    "idx": 1,
    "instruction": "This is a hybrid question answering task. The goal of this task is to answer the question given tables and passages.",
    "input_seg": "...",
    "output": "starke rudolf",
    "predict": "rudolf svensson</s>"
},
{
    "idx": 2,
    "instruction": "This is a hybrid question answering task. The goal of this task is to answer the question given tables and passages.",
    "input_seg": "...",
    "output": "british",
    "predict": "german</s>"
}
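Incidentally, the raw predictions all end with the </s> token, so a plain string comparison against the gold answers will essentially always fail. Is some normalization along the lines of the following minimal sketch expected? (This is just my own guess, not the repo's official post-processing.)

import re
import string

def normalize(text):
    """Strip the generation-end token, lowercase, drop punctuation, collapse whitespace."""
    text = text.replace("</s>", " ").lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.sub(r"\s+", " ", text).strip()

def exact_match(prediction, gold):
    return int(normalize(prediction) == normalize(gold))

# A raw comparison "payton</s>" == "payton" is False,
# but exact_match("payton</s>", "payton") -> 1.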
2. Evaluation for KVRET
How should I evaluate KVRET? The paper mentions using Micro F1, but the dataset only provides the full response strings as outputs, and it is not obvious how to split them into the units needed for F1 (see the sketch after the example below).
Example:
{
    "instruction": "This is a dialogue response generation task grounded on tables. The goal of this task is to generate response based on the given dialogue history and the given table. The dialogues are grounded through underlying tables and span three distinct tasks in the in-car personal assistant space: calendar scheduling, weather information retrieval, and point-of-interest navigation.",
    "input_seg": "col : event | time | date | room | agenda | party",
    "question": "The dialogue history is: <remind me to take my pills || >. Please generate the response based on the given table and the given dialogue history.",
    "output": "what time do you need to take your pills ?"
},
{
    "instruction": "This is a dialogue response generation task grounded on tables. The goal of this task is to generate response based on the given dialogue history and the given table. The dialogues are grounded through underlying tables and span three distinct tasks in the in-car personal assistant space: calendar scheduling, weather information retrieval, and point-of-interest navigation.",
    "input_seg": "col : event | time | date | room | agenda | party",
    "question": "The dialogue history is: <i need to take my pills at 7pm || what time do you need to take your pills ? | remind me to take my pills>. Please generate the response based on the given table and the given dialogue history.",
    "output": "ok setting your medicine appointment for 7pm"
}
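To make the question concrete, here is a rough sketch of the micro F1 I imagine is intended, assuming the responses can first be mapped to entity sets; extract_entities mentioned in the comment is purely hypothetical (e.g. matching table/KB values against each response):

def micro_f1(pred_entity_sets, gold_entity_sets):
    """Micro-averaged F1: accumulate TP/FP/FN over all dialogue turns, then score once."""
    tp = fp = fn = 0
    for pred, gold in zip(pred_entity_sets, gold_entity_sets):
        pred, gold = set(pred), set(gold)
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# The entity sets would come from a hypothetical extract_entities() that matches
# table/KB values such as "7pm" against each generated and gold response.

Is that the protocol you used, or is there a dedicated script for it?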
You can use this script https://github.com/OSU-NLP-Group/TableLlama/blob/main/eval_scripts/eval_hitab.py to evaluate WikiSQL and WikiTQ; it includes a post-processing step for evaluation.
For HybridQA and KVRET, we use https://github.com/xlang-ai/UnifiedSKG/tree/main/metrics/hybridqa and https://github.com/xlang-ai/UnifiedSKG/tree/main/metrics/kvret, respectively.
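For reference, the HybridQA metric is, roughly, answer-level exact match plus token-overlap F1. The following is only a self-contained sketch of that F1; the linked UnifiedSKG code should be treated as the reference implementation.

from collections import Counter

def token_f1(prediction, gold):
    """Token-overlap F1 between one predicted and one gold answer string."""
    pred_tokens = prediction.lower().replace("</s>", " ").split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# token_f1("rudolf svensson</s>", "starke rudolf") -> 0.5 (one of two tokens matches).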