1. Discrepancy in Evaluation Scores for HybridQA, WikiSQL, and WikiTQ
When using inference_hitab_tabfact_fetaqa.py (with default parameters) followed by eval_tabfact.py for evaluation, I am not able to reproduce the scores reported in the paper. The results I obtain for HybridQA (5.74), WikiSQL (41.37), and WikiTQ (16.90) are significantly lower.
Is there something specific in the generation or evaluation process for these datasets that needs to be modified? I understand that FEVEROUS requires label replacement, but judging from the outputs for these datasets, they seem far from the levels reported in the paper.
For example, here are some outputs for HybridQA:
{
    "idx": 0,
    "instruction": "This is a hybrid question answering task. The goal of this task is to answer the question given tables and passages.",
    "input_seg": " [TAB] col: | rank | player | team ( s )...(truncated for brevity)",
    "output": "jerry",
    "predict": "payton</s>"
},
{
    "idx": 1,
    "instruction": "This is a hybrid question answering task. The goal of this task is to answer the question given tables and passages.",
    "input_seg": "...",
    "output": "starke rudolf",
    "predict": "rudolf svensson</s>"
},
{
    "idx": 2,
    "instruction": "This is a hybrid question answering task. The goal of this task is to answer the question given tables and passages.",
    "input_seg": "...",
    "output": "british",
    "predict": "german</s>"
}
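Incidentally, the raw predictions all end with the </s> token, so a plain string comparison against the gold answers will essentially always fail. Is some normalization along the lines of the following minimal sketch expected? (This is just my own guess, not the repo's official post-processing.)

import re
import string

def normalize(text):
    """Strip the generation-end token, lowercase, drop punctuation, collapse whitespace."""
    text = text.replace("</s>", " ").lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.sub(r"\s+", " ", text).strip()

def exact_match(prediction, gold):
    return int(normalize(prediction) == normalize(gold))

# A raw comparison "payton</s>" == "payton" is False,
# but exact_match("payton</s>", "payton") -> 1.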
2. Evaluation for KVRET
How should I evaluate KVRET? The paper mentions using Micro F1, but the dataset only provides the full response strings as outputs, and it is not obvious how to split them into the units needed for F1 (see the sketch after the example below).
Example:
{
    "instruction": "This is a dialogue response generation task grounded on tables. The goal of this task is to generate response based on the given dialogue history and the given table. The dialogues are grounded through underlying tables and span three distinct tasks in the in-car personal assistant space: calendar scheduling, weather information retrieval, and point-of-interest navigation.",
    "input_seg": "col : event | time | date | room | agenda | party",
    "question": "The dialogue history is: <remind me to take my pills || >. Please generate the response based on the given table and the given dialogue history.",
    "output": "what time do you need to take your pills ?"
},
{
    "instruction": "This is a dialogue response generation task grounded on tables. The goal of this task is to generate response based on the given dialogue history and the given table. The dialogues are grounded through underlying tables and span three distinct tasks in the in-car personal assistant space: calendar scheduling, weather information retrieval, and point-of-interest navigation.",
    "input_seg": "col : event | time | date | room | agenda | party",
    "question": "The dialogue history is: <i need to take my pills at 7pm || what time do you need to take your pills ? | remind me to take my pills>. Please generate the response based on the given table and the given dialogue history.",
    "output": "ok setting your medicine appointment for 7pm"
}
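To make the question concrete, here is a rough sketch of the micro F1 I imagine is intended, assuming the responses can first be mapped to entity sets; extract_entities mentioned in the comment is purely hypothetical (e.g. matching table/KB values against each response):

def micro_f1(pred_entity_sets, gold_entity_sets):
    """Micro-averaged F1: accumulate TP/FP/FN over all dialogue turns, then score once."""
    tp = fp = fn = 0
    for pred, gold in zip(pred_entity_sets, gold_entity_sets):
        pred, gold = set(pred), set(gold)
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# The entity sets would come from a hypothetical extract_entities() that matches
# table/KB values such as "7pm" against each generated and gold response.

Is that the protocol you used, or is there a dedicated script for it?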
You can use this script https://github.com/OSU-NLP-Group/TableLlama/blob/main/eval_scripts/eval_hitab.py to evaluate WikiSQL and WikiTQ; it includes a post-processing step for evaluation.
For HybridQA and KVRET, we use https://github.com/xlang-ai/UnifiedSKG/tree/main/metrics/hybridqa and https://github.com/xlang-ai/UnifiedSKG/tree/main/metrics/kvret, respectively.
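For reference, the HybridQA metric is, roughly, answer-level exact match plus token-overlap F1. The following is only a self-contained sketch of that F1; the linked UnifiedSKG code should be treated as the reference implementation.

from collections import Counter

def token_f1(prediction, gold):
    """Token-overlap F1 between one predicted and one gold answer string."""
    pred_tokens = prediction.lower().replace("</s>", " ").split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# token_f1("rudolf svensson</s>", "starke rudolf") -> 0.5 (one of two tokens matches).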