
CLADE TPC-DS Benchmark


Python application for the simple and automatic execution of TPC-DS on Snowflake and Databricks. It is intended to work with the data generated by dsdgen and dsqgen (see the CLADE documentation) and is divided into the following main modules:

  • load_data.py: load data from an Azure Blob into the desired platform.
  • execute_queries.py: execute the TPC-DS Benchmark queries.
  • data_maintenance.py: execute the TPC-DS Data Maintenance phase.
  • metrics/: folder with Python scripts and notebooks to obtain metrics, graphics, etc.

load_data.py, execute_queries.py and data_maintenance.py all take command-line parameters and configuration files. The main parameters are --platform, which must be snowflake or databricks, and --experiment_config/--connection_config, which take the paths of the required configuration files. Some options from those files can be overridden on each execution through the corresponding command-line parameter, which lets the user automate the experimentation process (see the example below). Everything is explained in the rest of this README.
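
For instance, assuming Snowflake connection and experiment files already exist (the paths and warehouse size names below are only placeholders, not values prescribed by this repository), several warehouse sizes could be benchmarked one after another by overriding the warehouse option:

  for wh in XSMALL SMALL MEDIUM; do
    python execute_queries.py --platform snowflake \
      --experiment_config config/sf_experiment.json \
      --connection_config config/sf_connection.json \
      --warehouse "$wh"
  done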

Usage

Configuration files and schemas

Configuration files must be in JSON format and are validated against the JSON schemas in the config/schemas/ folder. Specifically, the following schemas are available:

  • db_conn_schema.json: Databricks connection parameters.
  • db_exp_schema.json: Databricks experiments parameters.
  • sf_conn_schema.json: Snowflake connection parameters.
  • sf_exp_schema.json: Snowflake experiments parameters.
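
A configuration file can also be checked against its schema by hand before launching an experiment, for example with the third-party check-jsonschema tool (not part of this repository; the connection file name below is a placeholder):

  pip install check-jsonschema
  check-jsonschema --schemafile config/schemas/sf_conn_schema.json sf_connection.json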

Set up Snowflake

Before running the benchmark on Snowflake, the Azure Blob (whose URL is a parameter of the experiment configuration file) must contain all the CSV files to be used (generated with dsdgen). The rest of the components, such as the warehouse, database, schema and stage, are created automatically by the code (and are not overwritten if they already exist).
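
One possible way to upload the dsdgen output to that Blob container is the Azure CLI; this is only an illustration, and the storage account, container and local directory below are placeholders:

  az storage blob upload-batch \
    --account-name <storage_account> \
    --destination <container> \
    --source ./tpcds_csv \
    --pattern "*.csv"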

Set up Databricks

In Databricks, as in Snowflake, the Azure Blob (whose URL is a parameter of the experiment configuration file) must contain all the CSV files to be used before running the benchmark. It is also necessary to create the SQL Warehouse that will run the benchmark beforehand, since one of the mandatory parameters of the connection configuration file is the path to that Warehouse.

The rest of the components (database, schema, etc.) will be created automatically if they do not already exist (if they do, they will not be overwritten).

Load data

load_data.py [-h] [--scale_factor SCALE_FACTOR] [--warehouse WAREHOUSE] --platform {databricks,snowflake} --experiment_config EXPERIMENT_CONFIG --connection_config CONNECTION_CONFIG

options:
  -h, --help            show this help message and exit
  --scale_factor SCALE_FACTOR, -s SCALE_FACTOR
                        Scale Factor for the dataset (default: None)
  --warehouse WAREHOUSE, -w WAREHOUSE
                        Size of warehouse to use (default: None)
  --platform {databricks,snowflake}, -p {databricks,snowflake}
                        Platform to use (databricks or snowflake) (default: None)
  --experiment_config EXPERIMENT_CONFIG, -e EXPERIMENT_CONFIG
                        Path to the experiment configuration file (default: None)
  --connection_config CONNECTION_CONFIG, -c CONNECTION_CONFIG
                        Path to the connection configuration file (default: None)

Execute queries

execute_queries.py [-h] [--scale_factor SCALE_FACTOR] [--warehouse WAREHOUSE] --platform {databricks,snowflake} --experiment_config EXPERIMENT_CONFIG --connection_config CONNECTION_CONFIG [--queries QUERIES] [--times TIMES] [--streams STREAMS]

options:
  -h, --help            show this help message and exit
  --scale_factor SCALE_FACTOR, -sf SCALE_FACTOR
                        Scale Factor for the dataset (default: None)
  --warehouse WAREHOUSE, -w WAREHOUSE
                        Size of warehouse to use (default: None)
  --platform {databricks,snowflake}, -p {databricks,snowflake}
                        Platform to use (databricks or snowflake) (default: None)
  --experiment_config EXPERIMENT_CONFIG, -e EXPERIMENT_CONFIG
                        Path to the experiment configuration file (default: None)
  --connection_config CONNECTION_CONFIG, -c CONNECTION_CONFIG
                        Path to the connection configuration file (default: None)
  --queries QUERIES, -q QUERIES
                        Specify "all" or a specific query number (default: all)
  --times TIMES, -t TIMES
                        Number of times to execute the queries (default: 5)
  --streams STREAMS, -s STREAMS
                        Number of streams to use to execute the queries (default: 1)

Note that the queries must have been previously generated with the dsqgen tool.
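
As an illustration (the configuration paths below are placeholders), the following would run only query 72 on Databricks, three times, using two streams:

  python execute_queries.py --platform databricks \
    --experiment_config config/db_experiment.json \
    --connection_config config/db_connection.json \
    --queries 72 --times 3 --streams 2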

Data Maintenance

data_maintenance.py [-h] [--scale_factor SCALE_FACTOR] [--warehouse WAREHOUSE] --platform {databricks,snowflake} --experiment_config EXPERIMENT_CONFIG --connection_config CONNECTION_CONFIG [--format {csv,parquet}]

options:
  -h, --help            show this help message and exit
  --scale_factor SCALE_FACTOR, -sf SCALE_FACTOR
                        Scale Factor for the dataset (default: None)
  --warehouse WAREHOUSE, -w WAREHOUSE
                        Size of warehouse to use (default: None)
  --platform {databricks,snowflake}, -p {databricks,snowflake}
                        Platform to use (databricks or snowflake) (default: None)
  --experiment_config EXPERIMENT_CONFIG, -e EXPERIMENT_CONFIG
                        Path to the experiment configuration file (default: None)
  --connection_config CONNECTION_CONFIG, -c CONNECTION_CONFIG
                        Path to the connection configuration file (default: None)
  --format {csv,parquet}
                        Format of the data to load (default: csv)
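
For example, to run the Data Maintenance phase on Snowflake using Parquet files instead of the default CSV format (the configuration paths below are placeholders):

  python data_maintenance.py --platform snowflake \
    --experiment_config config/sf_experiment.json \
    --connection_config config/sf_connection.json \
    --format parquet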

Metrics

The usage of the metrics module is described in the README.md file inside that folder.

Funding

Project funded by the Ministry of Science, Innovation and Universities of Spain (ref. TED2021-129177B-I00/MCIN/AEI/10.13039/501100011033) and by the European Union “NextGenerationEU”/PRTR.

