SynthTabNet is a dataset of 600k png
images from synthetically generated table layouts with annotations in jsonl
files.
SynthTabNet is a synthetically generated dataset that contains annotated images of data in tabular layouts.
It has been shown that other non-synthetic datasets like PubTabNet, FinTabNet and TableBank suffer from many limitations:
- Their table distributions are skewed towards simpler structures with fewer number of rows/columns.
- There is a very limited variance in the appearance styles.
- The content is sometimes restricted to certain domains.
- The bounding boxes are omitted for non-empty cells or they are completely absent.
SynthTabNet aims to overcome these limitations by providing:
- A broad range of table sizes and richer combinations of row spans /column spans.
- A variety of domain specific styling appearances (e.g. financial data, marketing data, sparse tables etc.)
- Content generated out of the most frequently used terms appearing in non-synthetic datasets (e.g. PubTabNet, FinTabNet, etc.)
- Bounding boxes for all table cells, including the empty ones.
- Rectangular table structures. For each table, every row has the same number of columns after taking into account any row spans /column spans.
SynthTabNet is organized into 4 parts of 150k tables (600k in total). Each part contains tables with different appearances in regard to their size, structure, style and content. All parts are divided into Train, Test and Val splits (80%, 10%, 10%). The tables are delivered as png
images and the annotations are in jsonl
format.
A detailed description of the data synthesis process can be found in the paper.
v2.0.0
Appearance style | Records | Size(GB) | URL v2.0.0 |
---|---|---|---|
Fintabnet | 150k | 10 | SynthTabNet-part1 |
Marketing | 150k | 8 | SynthTabNet-part2 |
PubTabNet | 150k | 6 | SynthTabNet-part3 |
Sparse | 150k | 3 | SynthTatNet-part4 |
v1.0.0
Appearance style | Records | Size(GB) | URL v1.0.0 |
---|---|---|---|
Fintabnet | 150k | 10 | SynthTabNet-part1 |
Marketing | 150k | 8 | SynthTabNet-part2 |
PubTabNet | 150k | 6 | SynthTabNet-part3 |
Sparse | 150k | 3 | SynthTatNet-part4 |
Each part of the dataset corresponds to a top level directory (fintabnet
, marketing
, pubtabnet
, sparse
) and has the following structure:
├── images
│ ├── test
│ ├── train
│ └── val
├── synthetic_data.jsonl
The annotations for each part are in the synthetic_data.jsonl
file. Each line is a json
object that corresponds to a png
image and has the following structure:
"filename": "png image filename inside one of the 'test', 'train', 'val' directories",
"split": "One of 'test', 'train', 'val'",
"html": "Table structure and content",
"cells": "Array with all table cells",
"cell_id": "Zero based cell counter",
"is_header": "true if that cell is part of the table header",
"span": "In case there is a rowspan / columnspan",
"spantype": "One of 'rowspan', 'colspan', '2dspan'. The '2dspan' is used in case there is a rowspan and colspan in the same cell",
"rowspan": "Number of rowspans for this cell",
"colspan": "Number of colspans for this cell"
"tokens": "Array with the tokenized content of the cell",
"bbox": "The bounding bbox and the class of the cell in [x1, y1, x2, y2, class] format"
"structure":
"tokens": "Array with html tags that describe the table structure"
Regarding the bbox
parameter notice that:
- The coordinates origin is the top left corner of the image.
- Each bbox is described by its top left corner
(x1, y1)
and bottom right corner(x2, y2)
. - The bbox
class
can have the values:1
: An empty cell2
: A non-empty cell
The tokens
can be one of:
" colspan=\"10\"", " colspan=\"2\"", " colspan=\"3\"", " colspan=\"4\"", " colspan=\"5\"",
" colspan=\"6\"", " colspan=\"7\"", " colspan=\"8\"", " colspan=\"9\"", " rowspan=\"10\"",
" rowspan=\"2\"", " rowspan=\"3\"", " rowspan=\"4\"", " rowspan=\"5\"", " rowspan=\"6\"",
" rowspan=\"7\"", " rowspan=\"8\"", " rowspan=\"9\"", "</tbody>", "</td>", "</thead>",
"</tr>", "<end>", "<pad>", "<start>", "<tbody>", "<td", "<td>", "<thead>", "<tr>", "<unk>", ">"
Here is a jupyter notebook that demonstrates how to download and use the dataset:
"TableFormer: Table Structure Understanding with Transformers" (CVPR 2022).
- Ahmed Nassar ([email protected])
- Nikolaos Livathinos ([email protected])
- Maksym Lysak ([email protected])
- Peter Staar ([email protected])
ArXiv link: https://arxiv.org/abs/2203.01017
Citation:
@article{nassar2022tableformer,
title={TableFormer: Table Structure Understanding with Transformers},
author={Nassar, Ahmed and Livathinos, Nikolaos and Lysak, Maksym and Staar, Peter},
journal={arXiv preprint arXiv:2203.01017},
year={2022}
}