This repository holds the business logic for building and managing the data pipelines used to power various data services at MIT Open Learning. The core framework is Dagster, which provides a flexible and well-structured approach to building data applications.
- Ensure that you have the latest version of Docker installed. https://www.docker.com/products/docker-desktop/
- Install docker compose. Check the documentation and requirements for your specific machine. https://docs.docker.com/compose/install/
- Ensure you are able to authenticate into GitHub + Vault (https://github.com/mitodl/ol-data-platform/tree/main)
  - QA (https://vault-qa.odl.mit.edu/v1/auth/github/login): vault login -address=https://vault-qa.odl.mit.edu -method=github
  - Production (https://vault-production.odl.mit.edu/v1/auth/github/login): vault login -address=https://vault-production.odl.mit.edu -method=github
- Ensure you create your .env file and populate it with the environment variables.
cp .env.example .env
- Build and start the services with Docker Compose
docker compose up --build
- Navigate to localhost:3000 to access the Dagster UI
This repository includes a script for automatically generating dbt source definitions and staging models from database tables. The script is located at bin/dbt-create-staging-models.py.
- Python environment with required dependencies (see pyproject.toml)
- dbt environment configured with appropriate credentials
- Access to the target database/warehouse
The script provides three main commands:
Generate source definitions:
uv run python bin/dbt-create-staging-models.py generate-sources \
--schema ol_warehouse_production_raw \
--prefix raw__mitlearn__app__postgres__user \
--target production
Generate staging models:
uv run python bin/dbt-create-staging-models.py generate-staging-models \
--schema ol_warehouse_production_raw \
--prefix raw__mitlearn__app__postgres__user \
--target production
Generate both sources and staging models:
uv run python bin/dbt-create-staging-models.py generate-all \
--schema ol_warehouse_production_raw \
--prefix raw__mitlearn__app__postgres__user \
--target production
The commands accept the following options (see the sketch after this list):
- --schema: The database schema to scan for tables (e.g., ol_warehouse_production_raw)
- --prefix: The table prefix to filter by (e.g., raw__mitlearn__app__postgres__user)
- --target: The dbt target environment to use (production, qa, dev, etc.)
- --database: (Optional) Specify the database name if different from the target default
- --directory: (Optional) Override the subdirectory within models/staging/
- --apply-transformations: (Optional) Apply semantic transformations (default: True; use --no-apply-transformations to disable)
- --entity-type: (Optional) Override auto-detection of the entity type (user, course, courserun, etc.)
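To make the flag surface concrete, the following is a hypothetical sketch of how these options could be declared with argparse. The actual bin/dbt-create-staging-models.py may use a different CLI library and option handling; only the flags themselves come from the documentation above.

```python
# Hypothetical sketch only: how the documented options could be declared with
# argparse. The real script may use a different CLI library and structure.
import argparse

parser = argparse.ArgumentParser(prog="dbt-create-staging-models.py")
subparsers = parser.add_subparsers(dest="command", required=True)

for name in ("generate-sources", "generate-staging-models", "generate-all"):
    sub = subparsers.add_parser(name)
    sub.add_argument("--schema", required=True, help="Database schema to scan for tables")
    sub.add_argument("--prefix", required=True, help="Table prefix to filter by")
    sub.add_argument("--target", required=True, help="dbt target environment (production, qa, dev, ...)")
    sub.add_argument("--database", help="Database name if different from the target default")
    sub.add_argument("--directory", help="Override the subdirectory within models/staging/")
    sub.add_argument(
        "--apply-transformations",
        action=argparse.BooleanOptionalAction,  # also generates --no-apply-transformations
        default=True,
        help="Apply semantic transformations",
    )
    sub.add_argument("--entity-type", help="Override entity-type auto-detection (user, course, ...)")

args = parser.parse_args()
```

Declaring --apply-transformations with a boolean-optional action is what would make the --no-apply-transformations form, used in a later example, available alongside the positive flag.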
How it works:
- Domain Detection: Extracts the domain from the prefix (e.g., mitlearn from raw__mitlearn__app__postgres__); see the sketch below
- Entity Detection: Automatically detects the entity type from the table name for semantic transformations
- File Organization: Creates files in src/ol_dbt/models/staging/{domain}/
- Source Generation: Uses dbt-codegen to discover matching tables and generate source definitions
- Enhanced Staging Models: Creates SQL and YAML files with automatic transformations applied
- Merging: Automatically merges new tables with existing source files
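As an illustration of the domain-detection and file-layout conventions above, the sketch below shows the intended behavior for the documented example prefix; the helper names are hypothetical and may not match the script's internals.

```python
# Illustrative sketch of domain detection and file layout; helper names are
# hypothetical and may not match the script's internals.
from pathlib import Path


def extract_domain(prefix: str) -> str:
    """'raw__mitlearn__app__postgres__user' -> 'mitlearn'."""
    return prefix.split("__")[1]


def staging_dir(prefix: str) -> Path:
    """Generated files land in src/ol_dbt/models/staging/{domain}/."""
    return Path("src/ol_dbt/models/staging") / extract_domain(prefix)


assert extract_domain("raw__mitlearn__app__postgres__user") == "mitlearn"
assert staging_dir("raw__mitlearn__app__postgres__user") == Path(
    "src/ol_dbt/models/staging/mitlearn"
)
```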
The script includes an enhanced macro that automatically applies common transformation patterns:
- Semantic Column Renaming: id → {entity}_id, title → {entity}_title (illustrated in the sketch after this list)
- Timestamp Standardization: Converts all timestamps to ISO8601 format
- Boolean Normalization: Ensures consistent boolean field naming
- Data Quality: Automatic deduplication for Airbyte sync issues
- String Cleaning: Handles multiple spaces in user names
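The transformations themselves are applied by the dbt macro; the column-renaming rule can be pictured as a simple mapping. The snippet below is purely illustrative and only covers the two renames listed above.

```python
# Illustration of the semantic column-renaming pattern (id -> {entity}_id,
# title -> {entity}_title). The actual renaming happens inside a dbt macro.
def semantic_renames(entity: str) -> dict[str, str]:
    return {
        "id": f"{entity}_id",
        "title": f"{entity}_title",
    }


assert semantic_renames("course") == {"id": "course_id", "title": "course_title"}
```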
The system auto-detects entity types from table names (a sketch of this detection follows the list):
- user tables → User-specific transformations
- course tables → Course-specific transformations
- courserun tables → Course run transformations
- video, program, website tables → Respective entity transformations
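A rough approximation of that auto-detection is sketched below; the real rules in the script are likely more nuanced, so treat this as an approximation rather than the actual implementation.

```python
# Approximate sketch of entity-type detection from a table name; the real
# rules in bin/dbt-create-staging-models.py are likely more involved.
KNOWN_ENTITIES = ("courserun", "course", "user", "video", "program", "website")


def detect_entity(table_name: str) -> str | None:
    """Return the first known entity type found in the table name, if any."""
    for entity in KNOWN_ENTITIES:  # check 'courserun' before 'course'
        if entity in table_name:
            return entity
    return None  # no entity-specific transformations apply


assert detect_entity("raw__mitlearn__app__postgres__users_user") == "user"
assert detect_entity("courses_courserun") == "courserun"  # hypothetical table name
```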
In summary, the script:
- File Organization: Creates files in src/ol_dbt/models/staging/{domain}/
- Source Generation: Uses dbt-codegen to discover matching tables and generate source definitions
- Staging Models: Creates SQL and YAML files for each discovered table
- Merging: Automatically merges new tables with existing source files
Generated source files:
- Location: src/ol_dbt/models/staging/{domain}/_{domain}__sources.yml
- Format: Standard dbt sources configuration with dynamic schema references
- Merging: Automatically merges with existing source definitions

Generated staging models:
- SQL Files: stg_{domain}__{table_name}.sql - generated base models with enhanced transformations and explicit column selections
- YAML File: _stg_{domain}__models.yml - consolidated model schema definitions for all staging models in the domain
Generate sources and staging models with the default transformations:
python bin/dbt-create-staging-models.py generate-all \
--schema ol_warehouse_production_raw \
--prefix raw__mitlearn__app__postgres__user \
--target production
Generate everything without applying the semantic transformations:
python bin/dbt-create-staging-models.py generate-all \
--schema ol_warehouse_production_raw \
--prefix raw__mitlearn__app__postgres__user \
--target production \
--no-apply-transformations
Generate everything with an explicitly specified entity type instead of auto-detection:
python bin/dbt-create-staging-models.py generate-all \
--schema ol_warehouse_production_raw \
--prefix raw__mitlearn__app__postgres__user \
--target production \
--entity-type user
A complete run for the mitlearn user tables:
python bin/dbt-create-staging-models.py generate-all \
--schema ol_warehouse_production_raw \
--prefix raw__mitlearn__app__postgres__user \
--target production
This creates:
- src/ol_dbt/models/staging/mitlearn/_mitlearn__sources.yml - Source definitions
- src/ol_dbt/models/staging/mitlearn/_stg_mitlearn__models.yml - Consolidated model definitions
- src/ol_dbt/models/staging/mitlearn/stg_mitlearn__raw__mitlearn__app__postgres__users_user.sql - Individual SQL files
- Additional SQL files for other discovered user-related tables
To add more tables to the same domain later, run generate-sources with a different prefix:
python bin/dbt-create-staging-models.py generate-sources \
--schema ol_warehouse_production_raw \
--prefix raw__mitlearn__app__postgres__auth \
--target production
This merges auth-related tables into the existing _mitlearn__sources.yml file.
- The script follows existing dbt project conventions and naming patterns
- Source files use the standard ol_warehouse_raw_data source with dynamic schema configuration
- Generated staging models reference the correct source and include all discovered columns
- The script handles YAML merging to avoid duplicating source definitions (a sketch of this merge logic follows)
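As a rough picture of that merge step, the sketch below adds newly discovered tables to an existing sources file without duplicating entries. It assumes PyYAML and a simplified file structure; the real script may handle ordering, descriptions, and formatting differently, and the example table name is hypothetical.

```python
# Simplified sketch of merging new tables into an existing dbt sources file.
# Assumes PyYAML; the real script may preserve formatting/comments differently.
from pathlib import Path

import yaml


def merge_source_tables(sources_path: Path, source_name: str, new_tables: list[str]) -> None:
    doc = yaml.safe_load(sources_path.read_text())
    for source in doc.get("sources", []):
        if source.get("name") != source_name:
            continue
        existing = {table["name"] for table in source.get("tables", [])}
        for table_name in new_tables:
            if table_name not in existing:
                source.setdefault("tables", []).append({"name": table_name})
    sources_path.write_text(yaml.safe_dump(doc, sort_keys=False))


# Hypothetical usage:
# merge_source_tables(
#     Path("src/ol_dbt/models/staging/mitlearn/_mitlearn__sources.yml"),
#     "ol_warehouse_raw_data",
#     ["raw__mitlearn__app__postgres__auth_user"],  # hypothetical table name
# )
```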
This repository includes a utility script for running uv commands across all code locations in the dg_deployment/code_locations directory. The script is located at bin/uv-operations.py.
The uv-operations.py script automatically discovers all directories containing a pyproject.toml file in the code locations directory and executes the specified uv command on each one (a sketch of this loop appears at the end of this section). This is useful for operations like:
- Synchronizing dependencies across all code locations (uv sync)
- Upgrading lock files (uv lock --upgrade)
- Building packages (uv build)
- Listing installed packages (uv pip list)
python bin/uv-operations.py <uv-command> [args...]
Or run it directly as an executable:
./bin/uv-operations.py <uv-command> [args...]
For example:
python bin/uv-operations.py sync
python bin/uv-operations.py lock --upgrade
python bin/uv-operations.py pip list
By default, the script stops at the first failure. To continue processing all locations even if some fail:
python bin/uv-operations.py sync --continue-on-error
For detailed output showing the exact commands being run:
python bin/uv-operations.py sync --verbose
The script accepts the following options:
- --code-locations-dir: Base directory containing code locations (default: dg_deployment/code_locations)
- --continue-on-error: Continue running even if some locations fail
- --verbose: Print verbose output including the full command being executed
The script provides:
- A list of discovered code locations
- Progress indicators for each location being processed
- Success (✓) or failure (✗) markers for each location
- A summary at the end showing successful and failed operations
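For orientation, here is a minimal sketch of the discover-and-run loop described above, assuming the script shells out to uv in each discovered location; the real bin/uv-operations.py formats its output and handles errors (including --verbose) in its own way.

```python
# Minimal sketch of the discover-and-run loop; output format and error
# handling in the real bin/uv-operations.py may differ.
import subprocess
import sys
from pathlib import Path


def run_everywhere(
    uv_args: list[str],
    base_dir: str = "dg_deployment/code_locations",
    continue_on_error: bool = False,
) -> int:
    # Discover code locations: immediate subdirectories with a pyproject.toml
    # (an assumption; the real discovery may search more deeply).
    locations = sorted(path.parent for path in Path(base_dir).glob("*/pyproject.toml"))
    failures: list[Path] = []
    for location in locations:
        print(f"Running 'uv {' '.join(uv_args)}' in {location}")
        result = subprocess.run(["uv", *uv_args], cwd=location)
        if result.returncode != 0:
            failures.append(location)
            if not continue_on_error:
                break
    print(f"{len(locations) - len(failures)} succeeded, {len(failures)} failed")
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(run_everywhere(sys.argv[1:]))
```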