Python application for simple and automatic execution of TPC-DS on Snowflake and Databricks. It is intended to work with the data generated with dsdgen and dsqgen (see CLADE documentation), and is divided into the following main modules:
- load_data.py: load data from an Azure Blob into the desired platform.
- execute_queries.py: execute the TPC-DS Benchmark queries.
- data_maintenance.py: execute the TPC-DS Data Maintenance phase.
- metrics/: folder with Python scripts and notebooks to obtain metrics, graphics, etc.
load_data.py, execute_queries.py and data_maintenance.py all require command-line parameters and configuration files. The main parameters are --platform, which must be snowflake or databricks, and --experiment_config/--connection_config, which take the paths of the required configuration files. Some options of these files can be overridden at each execution through the corresponding command-line parameter, which allows the user to automate the experimentation process. All of this is explained in the rest of this README file.
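For example, and purely as an illustrative sketch (the configuration file names and the warehouse size below are placeholders, not files or values shipped with the repository), the same experiment configuration could be reused across runs while overriding the warehouse size from the command line:

python load_data.py --platform snowflake --experiment_config config/my_experiment.json --connection_config config/my_connection.json --warehouse SMALL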
Configuration files must be in JSON format and are validated against the JSON schemas in the config/schemas/ folder. Specifically, the following schemas are available:
- db_conn_schema.json: Databricks connection parameters.
- db_exp_schema.json: Databricks experiment parameters.
- sf_conn_schema.json: Snowflake connection parameters.
- sf_exp_schema.json: Snowflake experiment parameters.
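As a minimal sketch of what such a schema check could look like (the config file name is a placeholder, and whether the project uses the jsonschema package internally is an assumption; the schemas themselves are the authoritative reference):

import json
from pathlib import Path

from jsonschema import validate  # third-party package, used here only for illustration

# Load a hypothetical Snowflake connection config and the corresponding schema.
config = json.loads(Path("config/my_sf_connection.json").read_text())
schema = json.loads(Path("config/schemas/sf_conn_schema.json").read_text())

# Raises jsonschema.ValidationError if the file does not match the schema.
validate(instance=config, schema=schema)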
Before running the benchmark in Snowflake, the Azure Blob (whose URL is a parameter of the experiment config file) must contain all the CSV files to be used (generated using dsdgen). The rest of the components, such as the warehouse, database, schema or stage, are created automatically by the code (and they are not overwritten if they already exist).
In Databricks, as in Snowflake, the Azure Blob (whose URL is a parameter of the experiment config file) must contain all the CSV files to be used before running the benchmark. It is also necessary to create beforehand the SQL Warehouse that will be used to run the benchmark, as its path is one of the mandatory parameters of the connection configuration file.
The rest of the components (database, schema, etc.) will be created automatically if they do not already exist (if they do, they will not be overwritten).
load_data.py [-h] [--scale_factor SCALE_FACTOR] [--warehouse WAREHOUSE] --platform {databricks,snowflake} --experiment_config EXPERIMENT_CONFIG --connection_config CONNECTION_CONFIG
options:
-h, --help show this help message and exit
--scale_factor SCALE_FACTOR, -s SCALE_FACTOR
Scale Factor for the dataset (default: None)
--warehouse WAREHOUSE, -w WAREHOUSE
Size of warehouse to use (default: None)
--platform {databricks,snowflake}, -p {databricks,snowflake}
Platform to use (databricks or snowflake) (default: None)
--experiment_config EXPERIMENT_CONFIG, -e EXPERIMENT_CONFIG
Path to the experiment configuration file (default: None)
--connection_config CONNECTION_CONFIG, -c CONNECTION_CONFIG
Path to the connection configuration file (default: None)
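For instance, a load of a 100 GB dataset into Databricks could be launched as follows (the configuration file paths are placeholders, not files shipped with the repository):

python load_data.py -p databricks -e config/db_exp.json -c config/db_conn.json -s 100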
execute_queries.py [-h] [--scale_factor SCALE_FACTOR] [--warehouse WAREHOUSE] --platform {databricks,snowflake} --experiment_config EXPERIMENT_CONFIG --connection_config CONNECTION_CONFIG [--queries QUERIES] [--times TIMES] [--streams STREAMS]
options:
-h, --help show this help message and exit
--scale_factor SCALE_FACTOR, -sf SCALE_FACTOR
Scale Factor for the dataset (default: None)
--warehouse WAREHOUSE, -w WAREHOUSE
Size of warehouse to use (default: None)
--platform {databricks,snowflake}, -p {databricks,snowflake}
Platform to use (databricks or snowflake) (default: None)
--experiment_config EXPERIMENT_CONFIG, -e EXPERIMENT_CONFIG
Path to the experiment configuration file (default: None)
--connection_config CONNECTION_CONFIG, -c CONNECTION_CONFIG
Path to the connection configuration file (default: None)
--queries QUERIES, -q QUERIES
Specify "all" or a specific query number (default: all)
--times TIMES, -t TIMES
Number of times to execute the queries (default: 5)
--streams STREAMS, -s STREAMS
Number of streams to use to execute the queries (default: 1)
Note that the queries must have been previously generated using the dsqgen tool.
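For example, query 14 could be executed three times over two concurrent streams as follows (the configuration file paths are placeholders):

python execute_queries.py -p snowflake -e config/sf_exp.json -c config/sf_conn.json -q 14 -t 3 -s 2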
data_maintenance.py [-h] [--scale_factor SCALE_FACTOR] [--warehouse WAREHOUSE] --platform {databricks,snowflake} --experiment_config EXPERIMENT_CONFIG --connection_config CONNECTION_CONFIG [--format {csv,parquet}]
options:
-h, --help show this help message and exit
--scale_factor SCALE_FACTOR, -sf SCALE_FACTOR
Scale Factor for the dataset (default: None)
--warehouse WAREHOUSE, -w WAREHOUSE
Size of warehouse to use (default: None)
--platform {databricks,snowflake}, -p {databricks,snowflake}
Platform to use (databricks or snowflake) (default: None)
--experiment_config EXPERIMENT_CONFIG, -e EXPERIMENT_CONFIG
Path to the experiment configuration file (default: None)
--connection_config CONNECTION_CONFIG, -c CONNECTION_CONFIG
Path to the connection configuration file (default: None)
--format {csv,parquet}
Format of the data to load (default: csv)
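As an illustrative example (the configuration file paths are placeholders), the Data Maintenance phase could be run over Parquet data as follows:

python data_maintenance.py -p databricks -e config/db_exp.json -c config/db_conn.json --format parquet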
The usage of the metrics module is described in the README.md file in that folder.
Project funded by the Ministry of Science, Innovation and Universities of Spain (ref. TED2021-129177B-I00/MCIN/AEI/10.13039/501100011033) and by the European Union “NextGenerationEU”/PRTR.