Skip to content

Production-grade ELT pipeline classifying VIX volatility regimes. Features incremental models, SCD-2 snapshots, recursive CTEs, and Slim CI/CD. Stack: dbt, DuckDB/MotherDuck, Python.

License

Notifications You must be signed in to change notification settings

DavidKingGH/dbt_portfolio_project_2

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

39 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

VIX Volatility Regime Classification Pipeline

A production-grade dbt project that classifies stock market volatility regimes using the CBOE Volatility Index (VIX) and analyzes stock performance across different market conditions.

dbt DuckDB Tests CI CD License: MIT


πŸ“Š Project Overview

This project implements a multi-layered data pipeline that:

  1. Classifies market volatility into four distinct regimes based on VIX levels
  2. Tracks regime periods over time using temporal joins and window functions
  3. Enriches stock price data with volatility classifications
  4. Calculates RSI technical indicators using both SMA and Wilder smoothing methods
  5. Maintains historical versions of data using SCD-2 snapshots

Business Value

  • Risk Assessment: Identify market stress periods for portfolio adjustments
  • Performance Analysis: Understand how stocks behave during different volatility regimes
  • Technical Analysis: Generate trading signals using RSI indicators calibrated to market conditions
  • Historical Tracking: Maintain complete audit trail of regime classifications and price changes

πŸ—οΈ Architecture

Tech Stack

Pipeline Flow:

  1. Python (yfinance/FRED): Extract stock and VIX data via APIs
  2. Data Sources: Yahoo Finance for stocks, FRED for VIX
  3. dbt + MotherDuck: Transform raw data into analytics-ready models
  4. GitHub Actions: Automated testing and deployment on every commit

Data Flow

Raw Sources (Yahoo Finance, FRED)
         ↓
    Bronze Layer (src_*)
         ↓
    Silver Layer (stg_*, int_*)
         ↓
     Gold Layer (fct_*, mrt_*)
         ↓
    Snapshots (SCD-2)

Pipeline Components

VIX Regime Classification:

src_vix β†’ stg__vix_with_versions β†’ int__regime_classification β†’ int__regime_changes β†’ mrt__stocks_with_regimes

RSI Calculation:

src_stocks β†’ stg__stocks β†’ int__price_changes β†’ int__rsi_sma β†’ fct__rsi
                                              β†’ int__rsi_wilder β†—

🎯 Key Features

βœ… Advanced SQL Patterns

  • Temporal Joins: Version-aware threshold selection using effective dates
  • Window Functions: Running sum grouping for regime period detection
  • Recursive CTEs: Wilder smoothing implementation for RSI calculation
  • Incremental Models: Efficient daily updates with merge strategy

βœ… Data Quality

  • 111 Automated Tests: Comprehensive validation across all layers
  • Contract Enforcement: Type-safe schemas with explicit constraints
  • Business Logic Validation: OHLC consistency, RSI bounds, threshold ordering
  • Referential Integrity: Cross-model relationship testing

βœ… Production Features

  • SCD-2 Snapshots: Historical tracking of regime classifications and stock prices
  • Versioned Configuration: Reproducible regime definitions via seed files
  • Incremental Processing: 15% performance improvement over full refresh
  • Comprehensive Documentation: Model descriptions, column definitions, grain specifications

πŸ“ Project Structure

portfolio_project_2/
β”œβ”€β”€ assets/                  
β”‚   └── data_ingestion.ipynb # Python scripts for fetching Yahoo/FRED data
β”œβ”€β”€ models/
β”‚   β”œβ”€β”€ bronze/              # Source models (raw data views)
β”‚   β”‚   β”œβ”€β”€ src_stocks.sql
β”‚   β”‚   └── src_vix.sql
β”‚   β”œβ”€β”€ silver/
β”‚   β”‚   β”œβ”€β”€ stg/             # Staging models (cleaned, typed)
β”‚   β”‚   β”‚   β”œβ”€β”€ stg__stocks.sql
β”‚   β”‚   β”‚   └── stg__vix_with_versions.sql
β”‚   β”‚   └── int/             # Intermediate models (business logic)
β”‚   β”‚       β”œβ”€β”€ int__regime_classification.sql
β”‚   β”‚       β”œβ”€β”€ int__regime_changes.sql
β”‚   β”‚       β”œβ”€β”€ int__price_changes.sql
β”‚   β”‚       β”œβ”€β”€ int__rsi_sma.sql
β”‚   β”‚       └── int__rsi_wilder.sql
β”‚   └── gold/                # Mart models (analytics-ready)
β”‚       β”œβ”€β”€ fct__rsi.sql
β”‚       └── mrt__stocks_with_regimes.sql
β”œβ”€β”€ seeds/
β”‚   └── regime_thresholds.csv    # Versioned VIX thresholds
β”œβ”€β”€ snapshots/
β”‚   └── snapshots.yml            # SCD-2 configurations
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ assert_regime_threshold_ordering.sql
β”‚   β”œβ”€β”€ wilder_seed_equals_sma.sql
β”‚   └── TESTING_RESULTS.md       # Detailed test documentation
β”œβ”€β”€ macros/
β”‚   └── ingest_raw.sql           # Data loading utilities
└── README.md

πŸ“ˆ Data Models

Regime Classification Models

int__regime_classification

Assigns volatility regime to each VIX trading day using versioned thresholds.

Grain: One row per trading date

Regime Definitions:

Regime VIX Range Market Condition
Complacent < 12 Very low volatility
Normal 12-20 Typical conditions
Elevated 20-30 Increased uncertainty
Crisis 30+ Extreme volatility

int__regime_changes

Tracks consecutive trading day periods where VIX maintained the same regime.

Grain: One row per regime period

Key Columns:

  • regime_period_id: Sequential period identifier
  • start_date / end_date: Period boundaries
  • days_in_regime: Duration of regime period

Stock Analysis Models

mrt__stocks_with_regimes

Fact table enriching daily stock prices with VIX regime classifications.

Grain: One row per ticker per trading date

Materialization: Incremental (merge strategy)

Key Features:

  • OHLC price data
  • Volume metrics
  • VIX regime classification
  • Regime period identifier

fct__rsi

Relative Strength Index calculations using both SMA and Wilder smoothing methods.

Grain: One row per ticker per trading date

Indicators:

  • sma_rsi: 14-day RSI using Simple Moving Average
  • wilder_rsi: 14-day RSI using Wilder's exponential smoothing
  • sma_rs / wilder_rs: Relative strength ratios

πŸ§ͺ Testing

Test Coverage Summary

Category Count Purpose
Data Quality 60 not_null, unique, relationships
Business Logic 51 OHLC consistency, RSI validation, threshold ordering
Constraints 9 Contract enforcement on mart models
Snapshots 2 SCD-2 temporal tracking
Total 122 Comprehensive validation across all layers

Key Validation Areas

  • Financial Data Integrity: OHLC consistency, positive prices, volume validation
  • RSI Calculation Accuracy: Gain/loss logic, precision checks (1e-9 tolerance), Wilder initialization
  • Regime Classification: Threshold ordering, accepted values, temporal versioning
  • Referential Integrity: Cross-model relationships, date consistency
  • Temporal Accuracy: Date bounds, timestamp validation, period tracking

See Testing Documentation for detailed test scenarios and results.


πŸš€ Getting Started

Prerequisites

  • dbt Core 1.10.13 or higher
  • DuckDB with MotherDuck connection
  • Python 3.8+ (for data ingestion scripts)
    • Required libraries: yfinance, pandas (see assets/data_ingestion.ipynb)

Installation

# Clone the repository
git clone <repository-url>
cd portfolio_project_2

# Install dbt dependencies
dbt deps

# Configure profiles.yml with your MotherDuck credentials
# See: https://docs.getdbt.com/docs/core/connect-data-platform/motherduck-setup

🐍 Data Ingestion

While dbt handles the transformation layer, the data extraction is performed using Python. I have included a Jupyter Notebook that demonstrates how raw financial data is fetched, cleaned, and prepared for the bronze layer.

File Location: assets/data_ingestion.ipynb

Workflow:

  1. Extract: Fetches daily OHLC data using the yfinance library and VIX data from CSV/FRED.
  2. Format: Stacks and renames columns to match the target schema for DuckDB/MotherDuck.
  3. Load: Exports standardized CSVs to the raw data directory for dbt seed or direct ingestion.

Running the Project

# Load seed data
dbt seed

# Run all models
dbt run

# Run tests
dbt test

# Generate documentation
dbt docs generate
dbt docs serve

Incremental Refresh

# Daily incremental update
dbt run --select mrt__stocks_with_regimes fct__rsi

# Run snapshots to capture SCD-2 history
dbt snapshot

πŸ”„ CI/CD Automation

This project uses GitHub Actions for continuous integration and deployment of dbt models.

πŸ§ͺ CI Workflow β€” ci.yml

Triggered on pushes to dev or any feature/** branch.
Validates code changes by:

  • Spinning up a temporary DuckDB/MotherDuck dev environment
  • Running dbt build --defer --state prod_state --select "state:modified+"
    (only builds modified models vs production state)
  • Downloads production manifest.json from latest successful CD run for state comparison
  • Cost Optimization: Utilizing state:modified+ and --defer ensures only changed models and their downstream dependencies are processed, significantly reducing compute costs and build time compared to full refreshes.
  • Ensures no regressions or test failures before merge.

πŸš€ CD Workflow β€” cd.yml

Triggered on merge/push to main.
Performs a full production build and artifact upload:

  • Installs dependencies and configures MotherDuck prod profile
  • Runs dbt build --target prod to rebuild analytics marts
  • Uploads manifest.json and run_results.json as build artifacts (prod-state) for CI reuse

Together, these workflows enforce deployment hygiene, schema consistency, and safe promotion from dev β†’ prod.


πŸ“Š Sample Queries

Analyze Stock Performance by Regime

SELECT 
    regime,
    ticker,
    AVG(close) as avg_price,
    STDDEV(close) as price_volatility,
    COUNT(*) as trading_days
FROM analytics_dev.mrt__stocks_with_regimes
GROUP BY regime, ticker
ORDER BY ticker, regime;

Identify Regime Transitions

SELECT 
    regime_period_id,
    regime,
    start_date,
    end_date,
    days_in_regime
FROM analytics_dev.int__regime_changes
WHERE regime IN ('elevated', 'crisis')
ORDER BY start_date DESC
LIMIT 10;

RSI Signals During High Volatility

SELECT 
    s.trade_date,
    s.ticker,
    s.close,
    s.regime,
    r.sma_rsi,
    r.wilder_rsi,
    CASE 
        WHEN r.sma_rsi > 70 THEN 'Overbought'
        WHEN r.sma_rsi < 30 THEN 'Oversold'
        ELSE 'Neutral'
    END as signal
FROM analytics_dev.mrt__stocks_with_regimes s
JOIN analytics_dev.fct__rsi r 
    ON s.trade_date = r.trade_date 
    AND s.ticker = r.ticker
WHERE s.regime IN ('elevated', 'crisis')
ORDER BY s.trade_date DESC;

πŸ”§ Configuration

Versioned Thresholds

Regime thresholds are managed via seed file for reproducibility:

version_id,version_name,complacent_max,normal_max,elevated_max,effective_date,notes
1,initial,12,20,30,2024-11-01,Initial thresholds based on historical VIX averages

To update thresholds:

  1. Add new row to seeds/regime_thresholds.csv
  2. Run dbt seed --full-refresh
  3. Run dbt run --select stg__vix_with_versions+

Historical data will automatically reclassify using appropriate version based on effective_date.


πŸ“ Data Sources

  • Stock Prices: Yahoo Finance via yFinance Python package
  • VIX Data: Federal Reserve Economic Data (FRED) API
  • Time Period: January 2024 - November 2025
  • Tickers: AAPL, GOOGL, MSFT, JNJ, NVDA, SPY, TSLA, XLP

πŸŽ“ Learning Outcomes

This project demonstrates proficiency in:

  • dbt Core: Incremental models, snapshots, seeds, tests, contracts
  • Advanced SQL: Window functions, CTEs, recursive queries, temporal joins
  • Data Modeling: Dimensional modeling, SCD-2, fact/dimension tables
  • Data Quality: Comprehensive testing strategy, business logic validation
  • Production Patterns: Versioning, incremental processing, performance optimization
  • Technical Writing: Documentation, testing methodology, runbooks

πŸ“š Additional Resources


πŸ“§ Contact

David - Analytics Engineer
LinkedIn | GitHub | Email


πŸ“„ License

This project is part of a personal portfolio and is available for review and educational purposes under the MIT License.

MIT

Note: The underlying dataset used in this project is derived from publicly available sources including the Yahoo Finance API and FRED CBOE Volatility (VIXCLS) data. This repository is for educational and demonstration purposes only and does not distribute or claim ownership over any Yahoo or FRED data.


πŸ™ Acknowledgments

  • dbt Labs for the incredible dbt framework
  • MotherDuck for cloud DuckDB capabilities
  • Yahoo Finance and FRED for financial data

Built with ❀️ using dbt, DuckDB, and MotherDuck

About

Production-grade ELT pipeline classifying VIX volatility regimes. Features incremental models, SCD-2 snapshots, recursive CTEs, and Slim CI/CD. Stack: dbt, DuckDB/MotherDuck, Python.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published