A robust, scalable Python tool for analysing BigQuery environments and automatically generating Dataform configurations. It helps migrate existing BigQuery environments and workflows to Dataform by analysing table metadata and query history, then generating the corresponding Dataform configurations.
This project is designed to bring the power of Dataform to existing BigQuery environments. It is neither a replacement for Dataform nor a standalone tool; it is meant to be used alongside Dataform to streamline migration and adoption. The tool captures metadata and SQL queries from BigQuery and generates Dataform configurations based on that analysis.
It will get you 90% of the way there, but you will still need to manually review and refine the generated Dataform configurations to ensure they meet your specific requirements.
This project is still in the early stages of development, and as such, there may be bugs, incomplete features, and other issues. Please use with caution and report any issues you encounter.
- Dataform Bootstrap
- Analyses BigQuery table metadata, including schema, partitioning, clustering, and labels.
- Examines query history to understand data dependencies and transformation logic.
- Collects job metadata such as creation time, query details, and referenced tables.
- Supports multi-project and multi-location migrations, generating separate configurations for each unique pair.
- Constructs `workflow_config.yaml` with project-level settings for Dataform at each location level.
- Creates SQL files representing the transformation logic for each table.
- Generates an `actions.yaml` file defining Dataform actions for each table and view, which contains metadata and relevant SQL query paths.
- Identifies similar queries using a configurable similarity threshold.
- Deduplicates queries to minimise redundancy in generated Dataform actions.
- Logs deduplication decisions for transparency and review.
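
The deduplication idea in the last three points can be sketched roughly as follows. This is only an illustration, not the tool's implementation: the README does not say which similarity metric is used, so `difflib.SequenceMatcher` from the standard library stands in for it, and the function names `normalise` and `deduplicate` are hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalise(sql: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return re.sub(r"\s+", " ", sql).strip().lower()


def deduplicate(queries: list[str], threshold: float = 0.9):
    """Keep the first query of each group of near-duplicates.

    Returns (kept, decisions). Each decision records which query was dropped,
    which kept query it matched, and the similarity score, so the choices
    can be logged and reviewed later.
    """
    kept: list[str] = []
    decisions: list[dict] = []
    for query in queries:
        candidate = normalise(query)
        for existing in kept:
            # Assumption: character-level ratio as the similarity measure.
            score = SequenceMatcher(None, candidate, normalise(existing)).ratio()
            if score >= threshold:
                decisions.append(
                    {"dropped": query, "kept": existing, "score": round(score, 3)}
                )
                break
        else:
            # No kept query was similar enough, so this one is new.
            kept.append(query)
    return kept, decisions
```

Raising the threshold towards 1.0 keeps more queries as distinct; lowering it merges more aggressively, which makes reviewing the logged decisions more important.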
.
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── ROADMAP.md
├── requirements.txt
└── src
    ├── cli           # CLI implementation
    ├── collectors    # Data collection modules (BigQuery only for now)
    ├── generators    # Dataform config and SQL generators
    ├── models        # Core data models
    └── utils         # Utility functions
- Have an active Google Cloud BigQuery project
- Have the necessary permissions to access the BigQuery API
- Install Python (v3.10 or higher)
- Install Node.js (v20 or higher)
- Install Google Cloud SDK
- Install Dataform CLI (v3.0.8 or higher; installing it globally is recommended)
- Authenticate with Google Cloud SDK using the following commands:
  - `gcloud auth login` (opens a browser window to authenticate with your Google Account)
  - `gcloud config set project <PROJECT_ID>` (replace `<PROJECT_ID>` with the ID of the Google Cloud project you created earlier)
  - `gcloud auth application-default login` (sets up the application default credentials)
  - `gcloud auth application-default set-quota-project <PROJECT_ID>` (sets the quota project used by the application default credentials)
Configuration can be provided through either command-line arguments or environment variables.
python -m src.cli.main [OPTIONS]
| Argument | Description | Type | Default | Environment Variable |
|---|---|---|---|---|
| `--project` | Single project ID or comma-separated list | str | Required | `DATAFORM_PROJECTS` |

| Argument | Description | Type | Default | Environment Variable |
|---|---|---|---|---|
| `--location` | BigQuery location(s) | str | "US" | `DATAFORM_LOCATIONS` |
| `--days` | Days of history to analyse | int | 30 | `DATAFORM_HISTORY_DAYS` |
| `--similarity-threshold` | Query similarity threshold | float | 0.9 | `DATAFORM_SIMILARITY_THRESHOLD` |
| `--output-dir` | Output directory path | Path | "output" | `DATAFORM_OUTPUT_DIR` |
| `--disable-incremental` | Disable incremental detection | flag | False | `DATAFORM_ENABLE_INCREMENTAL` |
| `--output-mode` | Output verbosity (minimal/detailed/json) | str | "detailed" | `DATAFORM_OUTPUT_MODE` |
The tool supports three output modes (eventually):
- `minimal`: Single character status (✓/✗)
- `detailed`: Comprehensive report with per-project status
- `json`: JSON-formatted output with complete status information
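
To make the difference between the three modes concrete, here is a small sketch of how one set of per-project results might be rendered in each mode. Everything here is assumed for illustration: the function `render_status`, the shape of `results`, the JSON keys, and the exact status characters are not taken from the tool, whose real output may differ.

```python
import json


def render_status(results: dict[str, bool], mode: str = "detailed") -> str:
    """Format per-project success flags in one of the three documented modes."""
    overall = all(results.values())
    if mode == "minimal":
        # A single character for the whole run (assumed check/cross marks).
        return "\u2713" if overall else "\u2717"
    if mode == "json":
        # Hypothetical JSON shape: per-project flags plus an overall flag.
        return json.dumps({"projects": results, "success": overall})
    if mode == "detailed":
        # One line per project.
        return "\n".join(
            f"{project}: {'ok' if ok else 'failed'}"
            for project, ok in results.items()
        )
    raise ValueError(f"unknown output mode: {mode}")
```

For example, `render_status({"project-1": True, "project-2": False}, "minimal")` returns the failure mark because one project did not succeed.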
- Command-line arguments (highest priority)
- Environment variables
- Default values (lowest priority)
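
The precedence order above can be sketched for a single option, `--days`. This is a minimal illustration under assumptions, not the tool's own code: the helper `resolve_days` is hypothetical, and only the option name, the environment variable `DATAFORM_HISTORY_DAYS`, and the default of 30 come from the table above.

```python
import argparse
from collections.abc import Mapping


def resolve_days(argv: list[str], env: Mapping[str, str]) -> int:
    """Resolve --days: command line first, then environment, then the default."""
    parser = argparse.ArgumentParser()
    # default=None lets us distinguish "not passed" from an explicit value.
    parser.add_argument("--days", type=int, default=None)
    # parse_known_args ignores the other options this sketch does not define.
    args, _unknown = parser.parse_known_args(argv)

    if args.days is not None:
        return args.days  # highest priority: command-line argument
    if "DATAFORM_HISTORY_DAYS" in env:
        return int(env["DATAFORM_HISTORY_DAYS"])  # next: environment variable
    return 30  # lowest priority: documented default
```

In a real entry point this would be called as `resolve_days(sys.argv[1:], os.environ)`; passing the environment in explicitly keeps the function easy to test.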
The following example scripts demonstrate different usage patterns:
"""
Example of migrating a single project in a single location.
"""
from pathlib import Path

from src.cli.main import run_cli


def main():
    args = [
        "--project", "my-project-id",
        "--location", "US",
        "--days", "30",
        "--similarity-threshold", "0.9",
        "--output-dir", str(Path("output/single_single")),
        "--output-mode", "detailed",
    ]
    return run_cli(args)


if __name__ == "__main__":
    main()

"""
Example of migrating a single project across multiple locations.
"""
from pathlib import Path

from src.cli.main import run_cli


def main():
    args = [
        "--project", "my-project-id",
        "--location", "US,EU,ASIA",
        "--days", "30",
        "--output-dir", str(Path("output/single_multi")),
    ]
    return run_cli(args)


if __name__ == "__main__":
    main()

"""
Example of migrating multiple projects across multiple locations.
"""
from pathlib import Path

from src.cli.main import run_cli


def main():
    args = [
        "--project", "project-1,project-2,project-3",
        "--location", "US,EU,ASIA",
        "--output-dir", str(Path("output/multi_multi")),
        "--output-mode", "json",
    ]
    return run_cli(args)


if __name__ == "__main__":
    main()

#!/bin/bash
python -m src.cli.main \
    --project "my-project-id" \
    --location "US" \
    --days 30 \
    --similarity-threshold 0.9 \
    --output-dir "output/single_single" \
    --output-mode detailed

#!/bin/bash
python -m src.cli.main \
    --project "my-project-id" \
    --location "US,EU,ASIA" \
    --days 30 \
    --output-dir "output/single_multi"

#!/bin/bash
python -m src.cli.main \
    --project "project-1,project-2,project-3" \
    --location "US,EU,ASIA" \
    --output-dir "output/multi_multi" \
    --output-mode json

#!/bin/bash
export DATAFORM_PROJECTS="project-1,project-2,project-3"
export DATAFORM_LOCATIONS="US,EU,ASIA"
export DATAFORM_HISTORY_DAYS="30"
export DATAFORM_SIMILARITY_THRESHOLD="0.9"
export DATAFORM_OUTPUT_DIR="output/env_vars"
export DATAFORM_ENABLE_INCREMENTAL="true"
export DATAFORM_OUTPUT_MODE="detailed"
python -m src.cli.main

.
├── output                                  # default output directory
└── <output-dir>                            # specified output directory
    └── my-project-id                       # repeats for each project specified
        └── <location>                      # repeats for each location in project specified
            ├── definitions
            │   ├── actions.yaml
            │   └── <dataset>               # repeats for each dataset in project identified
            │       └── <table>.sql         # repeats for each table in dataset identified
            ├── logs
            │   └── <table_log>.ndjson      # repeats for each table in project identified
            ├── raw
            │   ├── jobs_US.ndjson
            │   └── tables_US.ndjson
            └── workflow_config.yaml        # project-level configuration file

Each project and location combination has its own `workflow_config.yaml`, because Dataform does not support multi-location projects.
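
As a rough illustration of the layout above, the following sketch lists the directory that each unique project and location pair would map to. The helper `planned_output_dirs` is hypothetical and only mirrors the `<output-dir>/<project>/<location>` structure shown in the tree; it does not call the tool or touch the filesystem.

```python
from itertools import product
from pathlib import Path


def planned_output_dirs(
    projects: str, locations: str, output_dir: str = "output"
) -> list[Path]:
    """Return one <output-dir>/<project>/<location> path per unique pair."""
    # Split the comma-separated values the CLI accepts and drop empty items.
    project_ids = [p.strip() for p in projects.split(",") if p.strip()]
    location_ids = [loc.strip() for loc in locations.split(",") if loc.strip()]
    # dict.fromkeys removes repeated pairs while preserving order.
    pairs = dict.fromkeys(product(project_ids, location_ids))
    return [Path(output_dir) / project / location for project, location in pairs]
```

With three projects and three locations this yields nine directories, each of which would hold its own `workflow_config.yaml`.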
- Documentation
- Best Practices
- Troubleshooting
- Core GitHub
- API Reference
- Core Reference
- CLI Reference
- Dataform Core - VSCode Extension
This repository is licensed under the MIT License - see the LICENSE file for details.
Please read CONTRIBUTING.md for details on my code of conduct, and the process for submitting pull requests.
Please read ROADMAP.md for a list of planned features and enhancements.