Generate an htap 10k warehouse dataset.

Rod Butters edited this page Oct 30, 2018 · 2 revisions

Many data stores for big data analytics or hybrid transactional/analytical processing (HTAP) include utilities and APIs for rapid data loading. Splice Machine includes utilities for rapidly loading CSV files in plain and compressed formats, utilities for loading from HDFS, and APIs for streaming interfaces such as Kafka.

The scripts here create and load an htap-10k (10,000 warehouses) data set for CH-benCHmark by first generating CSV files and then loading them into Splice Machine via the IMPORT_DATA system utility.

These scripts are configured to write the CSVs in batches of 1,000 warehouses, each batch in its own directory, so that you can load subsets of the 10,000-warehouse data set. The batch size is set through the command-line arguments to oltpbenchmark, so you can instead generate larger or smaller data sets per directory as desired. Data sets for 2, 25, 250, 1,000, and 10,000 warehouses are already generated and publicly available at s3://splice-benchmark-data/flat/HTAP/.

The CSVs generated for the first batch of warehouses include the files that are the same regardless of the total number of warehouses: ITEM, REGION, NATION, and SUPPLIER.
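The batching described above can be sketched in a few lines of shell. This is a minimal illustration, not the contents of gen-htap-bz2.sh: blocksize is the variable name the generation script is described as using, while the function name batch_ranges, the total argument, and the htap-batch-N directory names are hypothetical. It only prints which warehouse range each directory would cover and marks the first batch as the one carrying the common files.

```shell
#!/usr/bin/env bash
# Sketch only: prints the warehouse range each batch directory would cover.
# batch_ranges, total, and the htap-batch-N names are hypothetical.

batch_ranges() {
  local total=$1 blocksize=$2
  local start=1 batch=1 end
  while [ "$start" -le "$total" ]; do
    end=$(( start + blocksize - 1 ))
    # Clamp the last batch when total is not a multiple of blocksize.
    if [ "$end" -gt "$total" ]; then end=$total; fi
    if [ "$batch" -eq 1 ]; then
      # Only the first batch carries the warehouse-independent tables.
      echo "htap-batch-$batch warehouses $start-$end +ITEM,REGION,NATION,SUPPLIER"
    else
      echo "htap-batch-$batch warehouses $start-$end"
    fi
    start=$(( end + 1 ))
    batch=$(( batch + 1 ))
  done
}

# Ten batches of 1,000 warehouses, matching the configuration described here.
batch_ranges 10000 1000
```

Calling batch_ranges with a smaller blocksize yields more, smaller directories; a total that is not a multiple of blocksize produces a shorter final batch.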

To run these scripts, generating CSVs for 10K warehouses and loading the data set into the database, do the following:

  • Run the script gen-htap-bz2.sh, which creates ten directories, each holding the data set for 1,000 warehouses. The first directory also includes the common files. In this script, the variable blocksize sets the number of warehouses per directory. The CSV files are compressed with bzip2; Splice Machine can import bzip2 files with parallel reads for faster import.

$ ./gen-htap-bz2.sh

  • From the Splice Machine CLI, sqlshell.sh, run the SQL script load-htap-bz2.sql. As written, this script loads the data set already staged in the AWS S3 bucket s3://splice-benchmark-data/flat/HTAP/htap-10k. The script creates the schema for the htap benchmark and completes the data load.

$ /usr/local/splicemachine/bin/sqlshell.sh
splice> run 'load-htap-bz2.sql';
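To give a sense of what a load through IMPORT_DATA involves, the sketch below prints one SYSCS_UTIL.IMPORT_DATA call per table so the statements can be reviewed before being run in sqlshell.sh. It is an illustration under stated assumptions, not an excerpt from load-htap-bz2.sql: the schema name HTAP, the per-table directory layout under the bucket, the bad-record directory, and the helper name import_stmt are all hypothetical, and the argument order follows the Splice Machine documentation for this procedure as best understood here. Check load-htap-bz2.sql for the statements actually used.

```shell
#!/usr/bin/env bash
# Sketch only: prints IMPORT_DATA calls for review; it does not run them.
# Schema name, directory layout, and bad-record directory are hypothetical.

base='s3://splice-benchmark-data/flat/HTAP/htap-10k'  # bucket path from the text
schema='HTAP'                                          # hypothetical schema name
bad_dir='/tmp/import-bad-records'                      # hypothetical location

import_stmt() {
  local table=$1 source_dir=$2
  # Argument order: schema, table, insert column list, file or directory,
  # column delimiter, character delimiter, timestamp/date/time formats,
  # bad records allowed, bad record directory, one-line records, charset.
  # A null argument selects the procedure's default.
  printf "call SYSCS_UTIL.IMPORT_DATA('%s', '%s', null, '%s', null, null, null, null, null, 0, '%s', true, null);\n" \
    "$schema" "$table" "$source_dir" "$bad_dir"
}

# Three of the CH-benCHmark tables; the "$base/<TABLE>" layout is assumed.
for table in WAREHOUSE DISTRICT CUSTOMER; do
  import_stmt "$table" "$base/$table"
done
```

Printing the statements rather than executing them keeps the sketch safe to run anywhere and makes it easy to compare the output against the real load script.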

Please review the scripts themselves for additional details.
