Databricks Labs Data Generator (`dbldatagen`)

Documentation | Release Notes | Examples | Tutorial

Project Description

The dbldatagen Databricks Labs project is a Python library for generating synthetic data within the Databricks environment using Spark. The generated data may be used for testing, benchmarking, demos, and many other uses.

It operates by defining a data generation specification in code that controls how the synthetic data is generated. The specification may incorporate the use of existing schemas or create data in an ad-hoc fashion.

It has no dependencies on any libraries that are not already installed in the Databricks runtime, and you can use it from Scala, R or other languages by defining a view over the generated data.
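One way to make the generated data available to Scala, R, or SQL is to register the Spark data frame as a temporary view. The following is a minimal sketch, assuming `df` is a data frame already produced by the generator; the helper name `publish_as_view` and the view name `synthetic_data` are illustrative and not part of dbldatagen.

```python
def publish_as_view(df, view_name="synthetic_data"):
    """Register a Spark data frame as a temporary view and return the view name.

    `df` is assumed to be a pyspark.sql.DataFrame, for example the result of
    calling build() on a data generation specification.
    """
    # createOrReplaceTempView is a standard PySpark DataFrame method
    df.createOrReplaceTempView(view_name)
    return view_name
```

After calling `publish_as_view(df)`, a Scala cell attached to the same Spark session could read the data with `spark.table("synthetic_data")`, and a SQL cell with `SELECT * FROM synthetic_data`.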

Feature Summary

It supports:

Generating synthetic data at scale up to billions of rows within minutes using appropriately sized clusters
Generating repeatable, predictable data supporting the need for producing multiple tables, Change Data Capture, merge and join scenarios with consistency between primary and foreign keys
Generating synthetic data for all of the Spark SQL supported primitive types as a Spark data frame which may be persisted, saved to external storage or used in other computations
Generating ranges of dates, timestamps, and numeric values
Generation of discrete values - both numeric and text
Generation of values at random and based on the values of other fields (either based on the hash of the underlying values or the values themselves)
Ability to specify a distribution for random data generation
Generating arrays of values for ML-style feature arrays
Applying weights to the occurrence of values
Generating values to conform to a schema or independent of an existing schema
Use of SQL expressions in synthetic data generation
Plugin mechanism to allow use of third-party libraries such as Faker
Use within a Databricks Delta Live Tables pipeline as a synthetic data generation source
Generate synthetic data generation code from existing schema or data (experimental)
Use of standard datasets for quick generation of synthetic data

Details of these features can be found in the online documentation.
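Several of the listed features can be combined in a single specification. The sketch below is illustrative only: the function name, column names, and value lists are hypothetical, and the option names (`minValue`, `maxValue`, `values`, `weights`, `baseColumn`, `baseColumnType`, `begin`, `end`, `interval`, `expr`) are used as understood from the dbldatagen documentation, so check them against the online documentation for your installed release.

```python
def build_event_spec(spark, rows=100_000):
    """Return a data generation spec; call .build() on the result to get a data frame.

    `spark` is assumed to be an active SparkSession. All column names below
    are hypothetical examples.
    """
    import dbldatagen as dg  # imported lazily; requires dbldatagen to be installed

    return (
        dg.DataGenerator(spark, name="event_data", rows=rows, partitions=4)
        # random numeric value within a range
        .withColumn("customer_id", "long", minValue=1000, maxValue=9999, random=True)
        # value derived from the hash of another column, so a given
        # customer_id always maps to the same region
        .withColumn("region", "string", values=["emea", "amer", "apac"],
                    baseColumn="customer_id", baseColumnType="hash")
        # random timestamp within a date range
        .withColumn("event_ts", "timestamp",
                    begin="2024-01-01 00:00:00", end="2024-12-31 23:59:00",
                    interval="1 minute", random=True)
        # weighted discrete values
        .withColumn("status", "string", values=["ok", "warn", "error"],
                    weights=[8, 1, 1], random=True)
        # random amount plus a column computed from it with a SQL expression
        .withColumn("amount", "double", minValue=1.0, maxValue=500.0, random=True)
        .withColumn("amount_with_tax", "double", expr="amount * 1.2",
                    baseColumn="amount")
    )
```

In a notebook, `df = build_event_spec(spark).build()` would then produce the data frame.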

Documentation

Please refer to the online documentation for details of use and many examples.

Release notes and details of the latest changes for this release can be found in the GitHub repository.

Installation

Use pip install dbldatagen to install the PyPI package.

Within a Databricks notebook, invoke the following in a notebook cell:

%pip install dbldatagen

The pip install command can be invoked within a Databricks notebook or a Delta Live Tables pipeline, and it also works on the Databricks Community Edition.

The installation notes in the documentation contain details of installation using alternative mechanisms.

Compatibility

The Databricks Labs Data Generator framework can be used with PySpark 3.3.0 and Python 3.9.21 or later. These are compatible with Databricks runtime 11.3 LTS and later releases. For full Unity Catalog support, we recommend using Databricks runtime 13.2 or later (Databricks 13.3 LTS or above preferred).
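The minimum versions stated above can be checked programmatically before installing. This is a minimal sketch using only the standard library; the helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical, and the thresholds are copied from the compatibility statement above.

```python
import re
import sys

MIN_PYSPARK = (3, 3, 0)   # minimum PySpark version from the compatibility statement
MIN_PYTHON = (3, 9, 21)   # minimum Python version from the compatibility statement


def parse_version(text, width=3):
    """Convert a version string such as '3.5.0' or '3.4.1rc1' to a tuple of ints."""
    numbers = []
    for piece in text.split(".")[:width]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    # pad short versions such as '3.3' to (3, 3, 0) so comparisons behave as expected
    numbers += [0] * (width - len(numbers))
    return tuple(numbers)


def meets_minimum(version_text, minimum):
    """Return True if version_text is at least the minimum version tuple."""
    return parse_version(version_text) >= minimum


def check_environment():
    """Report whether the current Python and PySpark meet the stated minimums."""
    python_ok = tuple(sys.version_info[:3]) >= MIN_PYTHON
    try:
        import pyspark
        pyspark_ok = meets_minimum(pyspark.__version__, MIN_PYSPARK)
    except ImportError:
        pyspark_ok = None  # PySpark is not installed in this environment
    return {"python": python_ok, "pyspark": pyspark_ok}
```

Calling `check_environment()` in a notebook cell returns a dictionary with one entry per component, where `None` means the component could not be imported.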

For full library compatibility for a specific Databricks Spark release, see the Databricks release notes for library compatibility:

https://docs.databricks.com/release-notes/runtime/releases.html

On Unity Catalog enabled Databricks environments, the Data Generator requires the Single User or No Isolation Shared access mode on Databricks runtimes prior to release 13.2. This is because some required features (for example, use of third-party libraries and Python UDFs) are not available in Shared mode in these releases. Depending on settings, the Custom access mode may be supported.

The use of Unity Catalog Shared access mode is supported from Databricks runtime release 13.2 onwards.

See the following documentation for more information:

https://docs.databricks.com/data-governance/unity-catalog/compute.html

Using the Data Generator

To use the data generator, install the library using the %pip install method or install the Python wheel directly in your environment.

Once the library has been installed, you can use it to generate a data frame composed of synthetic data.

The easiest way to use the data generator is to use one of the standard datasets which can be further customized for your use case.

import dbldatagen as dg

# Build one million rows from the "basic/user" standard dataset
df = dg.Datasets(spark, "basic/user").get(rows=1_000_000).build()
num_rows = df.count()
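As a sketch of how a standard dataset might be customized, the function below adds one extra column before building. It assumes that the object returned by `get(...)` accepts further `withColumn` calls before `build()`; the function name, the `account_tier` column, and its values are hypothetical.

```python
def build_custom_users(spark, rows=100_000):
    """Build the "basic/user" standard dataset with one additional column.

    `spark` is assumed to be an active SparkSession.
    """
    import dbldatagen as dg  # imported lazily; requires dbldatagen to be installed

    # Start from the standard dataset definition
    spec = dg.Datasets(spark, "basic/user").get(rows=rows)

    # Assumption: the returned spec can be extended like any DataGenerator spec
    spec = spec.withColumn("account_tier", "string",
                           values=["free", "pro", "enterprise"],
                           weights=[8, 1, 1], random=True)
    return spec.build()
```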

You can also define fully custom data sets using the DataGenerator class.

For example:

import dbldatagen as dg
from pyspark.sql.types import IntegerType, FloatType, StringType

column_count = 10
data_rows = 1000 * 1000

# Define the specification: one million rows across 4 Spark partitions
df_spec = (
    dg.DataGenerator(spark, name="test_data_set1", rows=data_rows, partitions=4)
    .withIdOutput()  # include the generated id column in the output
    # numColumns repeats this column definition column_count times
    .withColumn("r", FloatType(),
                expr="floor(rand() * 350) * (86400 + 3600)",
                numColumns=column_count)
    # integer ranges
    .withColumn("code1", IntegerType(), minValue=100, maxValue=200)
    .withColumn("code2", IntegerType(), minValue=0, maxValue=10)
    # discrete text values: default selection, random, and weighted random
    .withColumn("code3", StringType(), values=['a', 'b', 'c'])
    .withColumn("code4", StringType(), values=['a', 'b', 'c'], random=True)
    .withColumn("code5", StringType(), values=['a', 'b', 'c'],
                random=True, weights=[9, 1, 1])
)

# Build the Spark data frame and count the generated rows
df = df_spec.build()
num_rows = df.count()
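The `weights=[9, 1, 1]` setting on `code5` can be read as relative frequencies. The following sketch, which assumes the weights are normalized by their sum, computes the share of rows each value would be expected to receive; the helper name `expected_shares` is hypothetical and not part of dbldatagen.

```python
def expected_shares(values, weights):
    """Map each value to its expected share of rows, given relative weights."""
    total = sum(weights)
    return {value: weight / total for value, weight in zip(values, weights)}


# Under this assumption, 'a' is expected in 9/11 of rows, 'b' and 'c' in 1/11 each
shares = expected_shares(['a', 'b', 'c'], [9, 1, 1])
```

The observed proportions in a generated data frame can be compared against these shares with a standard PySpark aggregation such as `df.groupBy("code5").count()`.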

Refer to the online documentation for further examples.

The GitHub repository also contains further examples in the examples directory.

Spark and Databricks Runtime Compatibility

The dbldatagen package is intended to be compatible with recent LTS versions of the Databricks runtime, including older LTS versions from 11.3 LTS onwards. It also aims to be compatible with Delta Live Tables runtimes, including current and preview.

While we don't specifically drop support for older runtimes, changes in PySpark APIs or APIs from dependent packages such as numpy, pandas, pyarrow, and pyparsing may cause issues with older runtimes.

By design, installing dbldatagen does not install releases of dependent packages in order to preserve the curated set of packages pre-installed in any Databricks runtime environment.
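Because dependent packages are not installed or upgraded by dbldatagen, it can be useful to see which versions the runtime already provides. This is a minimal sketch using the standard library `importlib.metadata` module; the helper name `installed_versions` is hypothetical, and the default package list is taken from the packages named above.

```python
from importlib import metadata


def installed_versions(packages=("numpy", "pandas", "pyarrow", "pyparsing")):
    """Return a dict of package name to installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return versions
```

Running `installed_versions()` in a notebook cell reports the versions supplied by the current runtime, which can help when diagnosing issues on older runtimes.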

When building on local environments, the build process uses the Pipfile and requirements files to determine the package versions for releases and unit tests.

Project Support

Please note that all projects released under Databricks Labs are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS, and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.

Any issues discovered through the use of this project should be filed as issues on the GitHub Repo.
They will be reviewed as time permits, but there are no formal SLAs for support.

Feedback

Issues with the application? Found a bug? Have a great idea for an addition? Feel free to file an issue.
