
Evaluation Issues for HybridQA, WikiSQL, WikiTQ, and KVRET using TableLLaMA #12

Open
minger-hsxz opened this issue Oct 13, 2024 · 1 comment

@minger-hsxz

1. Discrepancy in Evaluation Scores for HybridQA, WikiSQL, and WikiTQ

When I run inference_hitab_tabfact_fetaqa.py (with default parameters) and then evaluate with eval_tabfact.py, I cannot reproduce the scores reported in the paper. The results I obtain for HybridQA (5.74), WikiSQL (41.37), and WikiTQ (16.90) are significantly lower.

Is there something specific in the generation or evaluation process for these datasets that needs to be modified? I understand that FEVEROUS requires label replacement, but looking at the outputs for these datasets, the results seem far from the levels reported in the paper.

For example, here are some outputs for HybridQA:

{
    "idx": 0,
    "instruction": "This is a hybrid question answering task. The goal of this task is to answer the question given tables and passages.",
    "input_seg": " [TAB] col: | rank | player | team ( s )...(truncated for brevity)",
    "output": "jerry",
    "predict": "payton</s>"
},
{
    "idx": 1,
    "instruction": "This is a hybrid question answering task. The goal of this task is to answer the question given tables and passages.",
    "input_seg": "...",
    "output": "starke rudolf",
    "predict": "rudolf svensson</s>"
},
{
    "idx": 2,
    "instruction": "This is a hybrid question answering task. The goal of this task is to answer the question given tables and passages.",
    "input_seg": "...",
    "output": "british",
    "predict": "german</s>"
}
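Looking at these predictions, one obvious source of spurious mismatches is the trailing `</s>` token (plus casing and whitespace differences), which would make a naive string comparison fail even when the answer is otherwise correct. As a rough illustration only (not the repo's actual eval script), a minimal normalization-plus-exact-match sketch could look like the following; the `normalize` helper and its exact rules are hypothetical:

```python
import re

def normalize(text: str) -> str:
    # Hypothetical post-processing: drop the trailing special token,
    # lowercase, and collapse whitespace before comparing strings.
    text = text.replace("</s>", "")
    text = text.strip().lower()
    return re.sub(r"\s+", " ", text)

def exact_match(predictions, references) -> float:
    # Fraction of examples whose normalized prediction equals the
    # normalized gold answer.
    pairs = list(zip(predictions, references))
    if not pairs:
        return 0.0
    return sum(normalize(p) == normalize(r) for p, r in pairs) / len(pairs)

# e.g. normalize("payton</s>") == "payton", so the score reflects the model's
# answer rather than leftover decoding artifacts.
```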

2. Evaluation for KVRET

How should I evaluate KVRET? The paper mentions using Micro F1, but the dataset only provides free-form response outputs, and it is not clear how to split them into the units over which Micro F1 would be computed.

Example:

{
    "instruction": "This is a dialogue response generation task grounded on tables. The goal of this task is to generate response based on the given dialogue history and the given table. The dialogues are grounded through underlying tables and span three distinct tasks in the in-car personal assistant space: calendar scheduling, weather information retrieval, and point-of-interest navigation.",
    "input_seg": "col : event | time | date | room | agenda | party",
    "question": "The dialogue history is: <remind me to take my pills || >. Please generate the response based on the given table and the given dialogue history.",
    "output": "what time do you need to take your pills ?"
},
{
    "instruction": "This is a dialogue response generation task grounded on tables. The goal of this task is to generate response based on the given dialogue history and the given table. The dialogues are grounded through underlying tables and span three distinct tasks in the in-car personal assistant space: calendar scheduling, weather information retrieval, and point-of-interest navigation.",
    "input_seg": "col : event | time | date | room | agenda | party",
    "question": "The dialogue history is: <i need to take my pills at 7pm || what time do you need to take your pills ? | remind me to take my pills>. Please generate the response based on the given table and the given dialogue history.",
    "output": "ok setting your medicine appointment for 7pm"
}
@zhangtianshu
Collaborator

You can use this script https://github.com/OSU-NLP-Group/TableLlama/blob/main/eval_scripts/eval_hitab.py to evaluate WikiSQL and WikiTQ; it includes a post-processing step for evaluation.
For HybridQA and KVRET, we use https://github.com/xlang-ai/UnifiedSKG/tree/main/metrics/hybridqa and https://github.com/xlang-ai/UnifiedSKG/tree/main/metrics/kvret for evaluation.
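For the KVRET part of the question, Micro F1 is generally computed at the entity level: per-turn true positives, false positives, and false negatives are accumulated over the whole test set before a single precision/recall is derived. The sketch below only shows that aggregation; how the per-turn entity sets are extracted from the free-form responses (typically by matching against the dataset's entity list) is the part the linked kvret metric handles, and the function name here is just illustrative:

```python
from typing import Iterable, Set

def entity_micro_f1(pred_entities: Iterable[Set[str]],
                    gold_entities: Iterable[Set[str]]) -> float:
    # Accumulate entity-level counts over all dialogue turns, then compute
    # a single micro-averaged precision/recall/F1.
    tp = fp = fn = 0
    for pred, gold in zip(pred_entities, gold_entities):
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Toy example:
# entity_micro_f1([{"7pm", "pills"}], [{"7pm"}])
# -> precision 0.5, recall 1.0, F1 ~0.667
```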
