Python application for simple and automatic execution of TPC-DS on Snowflake and Databricks. It is intended to work with the data generated with dsdgen and dsqgen (see CLADE documentation), and is divided into the following main modules:
- load_data.py: load data from an Azure Blob into the desired platform.
- execute_queries.py: execute the TPC-DS Benchmark queries.
- data_maintenance.py: execute the TPC-DS Data Maintenance phase.
- metrics/: folder with Python scripts and notebooks to obtain metrics, graphics, etc.
load_data.py, execute_queries.py and data_maintenance.py all require command-line parameters and configuration files. The main parameters are --platform, which must be snowflake or databricks, and --experiment_config/--connection_config, which take the paths of the required configuration files. Some options of these files can be overridden at each execution through the corresponding command-line parameter, which allows the user to automate the experimentation process. All of this is explained in the rest of this README file.
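For example, and purely as an illustrative sketch (the configuration file names and the warehouse size below are placeholders, not files or values shipped with the repository), the same experiment configuration could be reused across runs while overriding the warehouse size from the command line:

python load_data.py --platform snowflake --experiment_config config/my_experiment.json --connection_config config/my_connection.json --warehouse SMALL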
Configuration files must be in JSON format and are validated against the JSON schemas in the config/schemas/ folder. Specifically, the following schemas are available:
- db_conn_schema.json: Databricks connection parameters.
- db_exp_schema.json: Databricks experiment parameters.
- sf_conn_schema.json: Snowflake connection parameters.
- sf_exp_schema.json: Snowflake experiment parameters.
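As a minimal sketch of what such a schema check could look like (the config file name is a placeholder, and whether the project uses the jsonschema package internally is an assumption; the schemas themselves are the authoritative reference):

import json
from pathlib import Path

from jsonschema import validate  # third-party package, used here only for illustration

# Load a hypothetical Snowflake connection config and the corresponding schema.
config = json.loads(Path("config/my_sf_connection.json").read_text())
schema = json.loads(Path("config/schemas/sf_conn_schema.json").read_text())

# Raises jsonschema.ValidationError if the file does not match the schema.
validate(instance=config, schema=schema)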
Before running the benchmark in Snowflake, the Azure Blob (whose URL is a parameter of the experiment config file) must contain all the CSV files to be used (generated using dsdgen). The rest of the components, such as the warehouse, database, schema or stage, are created automatically by the code (and they are not overwritten if they already exist).
In Databricks, as in Snowflake, the Azure Blob (whose URL is a parameter of the experiment config file) must contain all the CSV files to be used before running the benchmark. It is also necessary to create beforehand the SQL Warehouse that will be used to run the benchmark, as its path is one of the mandatory parameters of the connection configuration file.
The rest of the components (database, schema, etc.) will be created automatically if they do not already exist (if they do, they will not be overwritten).
load_data.py [-h] [--scale_factor SCALE_FACTOR] [--warehouse WAREHOUSE] --platform {databricks,snowflake} --experiment_config EXPERIMENT_CONFIG --connection_config CONNECTION_CONFIG
options:
-h, --help show this help message and exit
--scale_factor SCALE_FACTOR, -s SCALE_FACTOR
Scale Factor for the dataset (default: None)
--warehouse WAREHOUSE, -w WAREHOUSE
Size of warehouse to use (default: None)
--platform {databricks,snowflake}, -p {databricks,snowflake}
Platform to use (databricks or snowflake) (default: None)
--experiment_config EXPERIMENT_CONFIG, -e EXPERIMENT_CONFIG
Path to the experiment configuration file (default: None)
--connection_config CONNECTION_CONFIG, -c CONNECTION_CONFIG
Path to the connection configuration file (default: None)
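For instance, a load of a 100 GB dataset into Databricks could be launched as follows (the configuration file paths are placeholders, not files shipped with the repository):

python load_data.py -p databricks -e config/db_exp.json -c config/db_conn.json -s 100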
execute_queries.py [-h] [--scale_factor SCALE_FACTOR] [--warehouse WAREHOUSE] --platform {databricks,snowflake} --experiment_config EXPERIMENT_CONFIG --connection_config CONNECTION_CONFIG [--queries QUERIES] [--times TIMES] [--streams STREAMS]
options:
-h, --help show this help message and exit
--scale_factor SCALE_FACTOR, -sf SCALE_FACTOR
Scale Factor for the dataset (default: None)
--warehouse WAREHOUSE, -w WAREHOUSE
Size of warehouse to use (default: None)
--platform {databricks,snowflake}, -p {databricks,snowflake}
Platform to use (databricks or snowflake) (default: None)
--experiment_config EXPERIMENT_CONFIG, -e EXPERIMENT_CONFIG
Path to the experiment configuration file (default: None)
--connection_config CONNECTION_CONFIG, -c CONNECTION_CONFIG
Path to the connection configuration file (default: None)
--queries QUERIES, -q QUERIES
Specify "all" or a specific query number (default: all)
--times TIMES, -t TIMES
Number of times to execute the queries (default: 5)
--streams STREAMS, -s STREAMS
Number of streams to use to execute the queries (default: 1)
Note that the queries must have been previously generated using the dsqgen tool.
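For example, query 14 could be executed three times over two concurrent streams as follows (the configuration file paths are placeholders):

python execute_queries.py -p snowflake -e config/sf_exp.json -c config/sf_conn.json -q 14 -t 3 -s 2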
data_maintenance.py [-h] [--scale_factor SCALE_FACTOR] [--warehouse WAREHOUSE] --platform {databricks,snowflake} --experiment_config EXPERIMENT_CONFIG --connection_config CONNECTION_CONFIG [--format {csv,parquet}]
options:
-h, --help show this help message and exit
--scale_factor SCALE_FACTOR, -sf SCALE_FACTOR
Scale Factor for the dataset (default: None)
--warehouse WAREHOUSE, -w WAREHOUSE
Size of warehouse to use (default: None)
--platform {databricks,snowflake}, -p {databricks,snowflake}
Platform to use (databricks or snowflake) (default: None)
--experiment_config EXPERIMENT_CONFIG, -e EXPERIMENT_CONFIG
Path to the experiment configuration file (default: None)
--connection_config CONNECTION_CONFIG, -c CONNECTION_CONFIG
Path to the connection configuration file (default: None)
--format {csv,parquet}
Format of the data to load (default: csv)
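As an illustrative example (the configuration file paths are placeholders), the Data Maintenance phase could be run over Parquet data as follows:

python data_maintenance.py -p databricks -e config/db_exp.json -c config/db_conn.json --format parquet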
The usage of the metrics module is described in the README.md file in that folder.
Project funded by the Ministry of Science, Innovation and Universities of Spain (ref. TED2021-129177B-I00/MCIN/AEI/10.13039/501100011033) and by the European Union “NextGenerationEU”/PRTR.