ACL-2021-TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance #362
Labels:
C: Code Implementation
D: New Dataset
Finance(D): Financial Domain
QA(T): Question Answering/Machine Comprehension Task
Summary:
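Tables and text are extracted from annual reports to build a QA dataset. The paper also proposes a new QA model that can reason over both tables and text.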
Resource:
Paper information:
Notes:
The left box of Figure 1 shows a real example from some financial report, where there is a table containing row/column header and numbers inside, and also some paragraphs describing it. We call the hybrid data like this example hybrid context in QA problems, as it contains both tabular and textual content, and call the paragraphs associated paragraphs to the table.
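The so-called hybrid context is about a table together with the descriptive sentences below it. Answering a question requires using those descriptions to reason over the numbers in the table.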
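To make the idea of a hybrid context concrete, here is a minimal Python sketch. The `HybridContext` class, its field names, and all figures are hypothetical and made up for illustration; they are not taken from the paper or from the TAT-QA data format. The example shows a question whose answer needs the table for the numbers and the associated paragraph for the scale.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HybridContext:
    """A table plus the paragraphs that describe it (hypothetical layout)."""
    header: List[str]
    rows: List[List[str]]
    paragraphs: List[str]

    def cell(self, row_name: str, column: str) -> float:
        """Look up a numeric cell by row label and column header."""
        col = self.header.index(column)
        for row in self.rows:
            if row[0] == row_name:
                # Drop thousands separators before converting.
                return float(row[col].replace(",", ""))
        raise KeyError(row_name)


# Made-up figures, for illustration only.
context = HybridContext(
    header=["Item", "2019", "2018"],
    rows=[["Revenue", "1,200", "1,000"], ["Cost of sales", "700", "650"]],
    paragraphs=["Amounts are reported in thousands of dollars."],
)

# "What was the change in revenue, in dollars?"
# The numbers come from the table, the scale comes from the paragraph.
change = context.cell("Revenue", "2019") - context.cell("Revenue", "2018")
scale = 1000 if "thousands" in context.paragraphs[0] else 1
change_in_dollars = change * scale
```

Neither the table nor the paragraph alone is enough here, which is the property the notes above describe.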
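For data construction, 500 reports from the past two years were collected from Annual reports. Tables were detected with the table detection model of (Li et al., 2019), and Apache PDFBox was then used to extract the table contents. Only tables with 3 to 30 rows and 3 to 6 columns were kept. In the end, about 20,000 tables were obtained, none of them in a standard format. These tables may also contain errors, such as too few rows or columns, or missing numbers. During annotation, such tables are picked out by hand and either deleted or corrected.
In the annotation stage, an answer can be a single span or multiple spans extracted from the table or text, as well as a generated answer (usually obtained through numerical reasoning). Annotators have to label which type an answer is. For a generated answer, the derivation also has to be annotated, which makes it easier to extend QA models.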
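The table filter and the answer annotation described above could be sketched roughly as follows. This is an illustration under assumptions, not the authors' pipeline: the size limits are read as 3 to 30 rows and 3 to 6 columns, rejecting ragged tables is an extra assumption, and `keep_table`, `AnswerType`, and `AnnotatedAnswer` are hypothetical names. The extra annotation for generated answers is modeled as a free-text derivation string.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional

# Size limits as read from the notes (assumption: inclusive bounds).
MIN_ROWS, MAX_ROWS = 3, 30
MIN_COLS, MAX_COLS = 3, 6


def keep_table(rows: List[List[str]]) -> bool:
    """Return True if the table has an acceptable number of rows and columns."""
    if not MIN_ROWS <= len(rows) <= MAX_ROWS:
        return False
    widths = {len(row) for row in rows}
    if len(widths) != 1:
        # Assumption: rows of unequal length count as a broken extraction.
        return False
    return MIN_COLS <= widths.pop() <= MAX_COLS


class AnswerType(Enum):
    SINGLE_SPAN = "single_span"
    MULTI_SPAN = "multi_span"
    GENERATED = "generated"


@dataclass
class AnnotatedAnswer:
    """One annotated answer (hypothetical record layout)."""
    answer_type: AnswerType
    answer: List[str]
    derivation: Optional[str] = None

    def __post_init__(self) -> None:
        # Generated answers must carry the reasoning that produced them.
        if self.answer_type is AnswerType.GENERATED and not self.derivation:
            raise ValueError("generated answers need a derivation")
        if self.answer_type is AnswerType.SINGLE_SPAN and len(self.answer) != 1:
            raise ValueError("a single-span answer has exactly one span")


# Made-up example of a generated answer with its derivation.
example = AnnotatedAnswer(AnswerType.GENERATED, ["200"], derivation="1,200 - 1,000")
```

Validating the record at construction time mirrors the manual check in the notes: an annotation that lacks the required information is rejected rather than silently stored.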
2.3 Quality Control
TODO
Model Graph:
Result:
Thoughts:
Next Reading: