Code to generate the FigureQA dataset. The dataset is available for download here.
Data generation consists of 3 parts:
- Generate the source numerical data, styles, and question-answer pairs for the figures.
- Generate the figure images and bounding box annotations.
- Aggregate the figure images, questions & answers, annotations, and source data.
All data generation source code lives in the figureqa/generation subpackage:
- The questions subpackage contains code to generate questions:
  - categorical.py for questions for bar graphs and pie charts.
  - lines.py for line plots.
  - utils.py for balancing and question encoding augmentation.
- source_data_generation.py generates source data, questions, and answers.
- figure_generation.py generates figure images and bounding boxes.
- json_combiner.py aggregates the generated data into the documented format. It allows a data split to be generated in multiple batches.
- data_utils.py has miscellaneous utilities for reconciling data formats, placing legends, etc.
- figure.py defines the figure objects in Bokeh.
- generate_dataset.py generates a whole dataset end-to-end.
- show_bounding_boxes.py generates images with bounding boxes visualized.
Each runnable module (script) can have its command line arguments displayed with --help.
There are some additional files used for data generation in these directories:
- config contains .yaml files that configure visual aspects, source data parameters, color splits, and dataset generation.
- resources contains the colors and other miscellaneous resources for data generation.
And docs contains additional documentation on annotations, question format, and file formats.
- Install the FigureQA fork of Bokeh from https://www.github.com/Maluuba/bokeh.
- Run pip install -r requirements.txt.
- Make sure you have enough space. The whole unzipped dataset is > 6GB, plus you need room for intermediate data.
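A quick way to check free disk space before starting is sketched below. The 6 GB figure comes from the note above; the extra headroom for intermediate data is an assumed value, not a measured one, and the function name is purely illustrative.

```python
import shutil

GB = 10 ** 9
DATASET_BYTES = 6 * GB            # the unzipped dataset is > 6GB (from the note above)
ASSUMED_HEADROOM_BYTES = 6 * GB   # assumption: room for intermediate data, adjust as needed


def has_enough_space(path=".", required_bytes=DATASET_BYTES + ASSUMED_HEADROOM_BYTES):
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes


if __name__ == "__main__":
    if has_enough_space("."):
        print("Enough free space to generate the dataset.")
    else:
        print("Not enough free space; free some room or use another disk.")
```

Pass the directory where the output will be written as `path`, since it may live on a different disk than the repository.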
To generate the whole dataset, run the end-to-end script generate_dataset.py. It performs the source data synthesis, figure generation, and aggregation.
This script must be run from the root directory, FigureQA.
The config for the actual dataset is in config/figureqa_generation_config.yaml.
A sample config is provided in config/sample_figureqa_generation_config.yaml.
Note that this does not generate the test sets.
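Because the script has to be started from the FigureQA root directory, a small guard can catch a wrong working directory early. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, and it only invokes the script with --help, since that is the one option this README confirms.

```python
import subprocess
import sys
from pathlib import Path

# Path taken from this README; it is relative to the FigureQA root directory.
SCRIPT = Path("figureqa") / "generation" / "generate_dataset.py"


def in_repo_root(directory="."):
    """Return True if `directory` looks like the FigureQA root (the script is found there)."""
    return (Path(directory) / SCRIPT).is_file()


def show_help(directory="."):
    """Run generate_dataset.py --help from `directory`, failing early if it is not the root."""
    if not in_repo_root(directory):
        raise RuntimeError(f"{SCRIPT} not found; run from the FigureQA root directory")
    # cwd is set so that relative paths inside the script resolve from the root.
    return subprocess.run([sys.executable, str(SCRIPT), "--help"], cwd=directory, check=True)
```

Calling `show_help(".")` from the root prints the script's real arguments, which should then be used in place of --help for an actual run.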
The three steps can also be run separately:

    cd FigureQA
    python figureqa/generation/source_data_generation.py CONFIG_FILE.yaml SOURCE_DATA.json --<figure_type> <N_figures> ...
    python figureqa/generation/figure_generation.py SOURCE_DATA.json RAW_GENERATED_DIR
    python figureqa/generation/json_combiner.py FINAL_AGGREGATE_DIR RAW_GENERATED_DIR1 RAW_GENERATED_DIR2 ...
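The three commands above can be chained from Python, as in the minimal sketch below. It assumes a single batch (one raw output directory), and the helper names are hypothetical. The figure type names are not listed in this section, so the caller supplies them; check --help on source_data_generation.py for the accepted flags.

```python
import subprocess
import sys

GENERATION_DIR = "figureqa/generation"


def build_commands(config_file, source_json, raw_dir, final_dir, figure_counts):
    """Build the three step commands as argument lists.

    `figure_counts` maps a figure type name to a number of figures; each entry
    becomes `--<figure_type> <N_figures>` on the first command.
    """
    source_cmd = [sys.executable, f"{GENERATION_DIR}/source_data_generation.py",
                  config_file, source_json]
    for figure_type, n_figures in figure_counts.items():
        source_cmd += [f"--{figure_type}", str(n_figures)]
    figure_cmd = [sys.executable, f"{GENERATION_DIR}/figure_generation.py",
                  source_json, raw_dir]
    combine_cmd = [sys.executable, f"{GENERATION_DIR}/json_combiner.py",
                   final_dir, raw_dir]
    return [source_cmd, figure_cmd, combine_cmd]


def run_pipeline(commands, repo_root="."):
    """Run each step in order from the repository root, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, cwd=repo_root, check=True)
```

Building the argument lists separately from running them makes it easy to print and inspect the commands first. To combine several batches, append the additional raw directories to the last command, as json_combiner.py accepts more than one.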