A modern data stack project demonstrating full pipeline ownership — from raw GA4 event ingestion (PySpark) to BI-ready marts (dbt + MotherDuck), automated with GitHub Actions and visualized in Power BI.
This project implements a modern analytics pipeline for GA4 e-commerce event data.
Data is extracted via PySpark, modeled and tested in dbt, and continuously integrated and deployed through GitHub Actions.
Final marts are stored in MotherDuck and visualized in Power BI, demonstrating full data lifecycle coverage — from raw ingestion to BI-ready gold layers.
| Layer | Tool | Purpose |
|---|---|---|
| Extraction | PySpark + BigQuery Connector | Reads GA4 raw events and writes structured Parquet outputs to GCS / local for dbt ingestion |
| Transformation | dbt Core (DuckDB/MotherDuck) | Builds modular SQL pipelines with schema contracts, data tests, and lineage |
| Storage | MotherDuck (DuckDB Cloud) | Analytical warehouse for dev/prod environments |
| Orchestration | GitHub Actions | Automated CI/CD for dbt builds and promotion |
| Visualization | Power BI | Validates model accessibility and enables exploratory analytics |
File: `ga4-ecommerce-project-1-pyspark-etl.py`
- Connects to BigQuery: `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
- Transforms data into structured DataFrames (e.g., `dim_users`, `dim_sessions`, `fact_events`)
- Runs built-in data quality checks via `run_dq_check()`
- Outputs Parquet datasets to both:
  - Google Cloud Storage (`write_to_gcs_for_dbt`)
  - Local directory (`write_local_for_dbt`) for dbt consumption
- Uses Spark connectors:
  - `spark-bigquery-with-dependencies_2.12:0.37.0`
  - `gcs-connector:hadoop3-2.2.20`
  - `javax.inject:1`
This ensures reproducible, schema-consistent ingestion for downstream dbt modeling.
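For orientation, here is a minimal sketch of that extraction pattern; the daily table suffix, bucket path, and `dim_users` aggregation are illustrative assumptions, not the script's exact contents.

```python
# Minimal sketch of the extraction flow described above (not the exact script).
# The daily table suffix, bucket path, and dim_users shaping are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("ga4-ecommerce-etl-sketch")
    # BigQuery/GCS connector jars are supplied via --jars at spark-submit time
    .getOrCreate()
)

# Read one daily table from the public GA4 sample via the BigQuery connector
events = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131")
    .load()
)

# Shape a dim_users-style output: one row per user with first-seen metadata
dim_users = (
    events.groupBy("user_pseudo_id")
    .agg(
        F.min("event_timestamp").alias("first_seen_ts_micros"),
        F.first(F.col("device.category"), ignorenulls=True).alias("first_device_category"),
    )
)

# Persist Parquet for dbt: GCS for the cloud path, local for development
dim_users.write.mode("overwrite").parquet("gs://your-staging-bucket/dbt_inputs/dim_users/")
dim_users.write.mode("overwrite").parquet("output/dbt_inputs/dim_users/")
```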
Transforms raw PySpark exports into standardized intermediate tables:
- Normalizes timestamps, identifiers, and session keys
- Produces reusable models like `stg__conversion_events` and `int__acquisition_touchpoints` (see the staging sketch below)
- Enforces data contracts and tests with `dbt_utils`
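As a rough illustration of this layer, a model like `stg__conversion_events` could take the following shape; the source definition, column names, and event filter are assumptions, not the repository's actual SQL.

```sql
-- Hypothetical sketch of a staging model (e.g. stg__conversion_events).
-- Source name, columns, and the event filter are assumptions.
with source as (

    select * from {{ source('ga4_raw', 'fact_events') }}

),

renamed as (

    select
        user_pseudo_id,
        -- GA4 exports timestamps as microseconds since the Unix epoch
        to_timestamp(event_timestamp / 1000000)                    as event_at,
        lower(event_name)                                          as event_name,
        user_pseudo_id || '-' || cast(ga_session_id as varchar)    as session_key
    from source
    where event_name in ('view_item', 'add_to_cart', 'begin_checkout', 'purchase')

)

select * from renamed
```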
Final business-ready models designed for marketing, attribution, and behavioral analytics.
Each enforces dbt contracts, column-level constraints, and data integrity tests.
| Model | Theme | Description |
|---|---|---|
| `mrt__conversion_funnel` | Conversion | Session-level aggregation showing drop-off at each funnel stage (view → cart → checkout → purchase). Validated via monotonic funnel logic test. |
| `mrt__first_touch_attribution` | Marketing Attribution | Attributes purchase revenue to the first recorded marketing touchpoint per user. Ensures causal timestamp ordering. |
| `mrt__last_touch_attribution` | Marketing Attribution | Attributes purchase revenue to the last marketing touch before conversion. Complements first-touch for channel comparison. |
| `mrt__first_touch_ltv` | LTV Analysis | Computes average and cumulative lifetime value segmented by the user’s first acquisition channel. |
| `mrt__campaign_performance_rev_metrics` | Campaign Performance | Aggregates revenue, users, and conversion metrics by campaign source/medium for marketing optimization. |
| `mrt__acquisition_device_cohorts` | Behavioral Cohorts | Groups users by device type and acquisition month to track engagement, retention, and spend trends. |
| `mrt__user_device_segments` | Behavioral Segmentation | Classifies users into device and spend quartiles for downstream personalization analysis. |
| `mrt__rfm` | Customer Value | Assigns recency, frequency, and monetary scores to users; supports customer segmentation and lifecycle modeling. |
| `mrt_engagement_intensity_last_30d` | Predictive Features | Computes rolling 30-day engagement metrics per user (e.g., inter-session gaps, active days, session duration) to support churn prediction and propensity modeling. |
| `mrt__funnel_steps_last_30d` | Funnel Health | Monitoring dashboard for rolling 30-day conversion rates across checkout steps (Cart → Shipping → Payment), segmented by new vs. returning users. |
| `mrt__item_interactions_last_30d` | Product Interest | Aggregates user-level product behaviors over 30 days, capturing high-value signals like abandoned cart value, price sensitivity, and category affinity. |
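To make the table above concrete, a simplified shape for `mrt__conversion_funnel` might look like the sketch below; the upstream `ref` is a model from this project, but the event names and column list are illustrative assumptions rather than the actual mart definition.

```sql
-- Simplified, hypothetical sketch of the conversion funnel mart.
with events as (

    select * from {{ ref('stg__conversion_events') }}

),

session_funnel as (

    select
        session_key,
        max(case when event_name = 'view_item'      then 1 else 0 end) as reached_view,
        max(case when event_name = 'add_to_cart'    then 1 else 0 end) as reached_cart,
        max(case when event_name = 'begin_checkout' then 1 else 0 end) as reached_checkout,
        max(case when event_name = 'purchase'       then 1 else 0 end) as reached_purchase
    from events
    group by session_key

)

-- One row per session; downstream tests can assert the monotonic funnel property
-- (reached_purchase <= reached_checkout <= reached_cart <= reached_view).
select * from session_funnel
```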
This project uses GitHub Actions for continuous integration and deployment of dbt models.
Triggered on pushes to `dev` or any `feature/**` branch.
Validates code changes by:
- Spinning up a temporary DuckDB/MotherDuck dev environment
- Running `dbt build --defer --state prod_state --select "state:modified+"` (only builds modified models vs. the production state)
- Downloading the production `manifest.json` from the latest successful CD run for state comparison
- Ensuring no regressions or test failures before merge (see the workflow sketch below)
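A minimal CI sketch under stated assumptions (the CD workflow file name `cd.yml`, the artifact name `prod-state`, and a checked-in MotherDuck `profiles.yml` are all illustrative; the repository's actual workflow may differ):

```yaml
# Illustrative CI sketch; workflow/artifact names and action versions are assumptions.
name: dbt CI

on:
  push:
    branches: [dev, 'feature/**']

jobs:
  dbt-ci:
    runs-on: ubuntu-latest
    env:
      MOTHERDUCK_TOKEN: ${{ secrets.MOTHERDUCK_TOKEN }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install dbt-core dbt-duckdb
        # A profiles.yml pointing dbt-duckdb at MotherDuck is assumed to exist in the repo.

      # Fetch the production manifest from the latest successful CD run
      - name: Download prod state
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          run_id=$(gh run list --workflow cd.yml --status success --limit 1 \
            --json databaseId --jq '.[0].databaseId')
          gh run download "$run_id" --name prod-state --dir prod_state

      # Build and test only models modified relative to production state
      - run: dbt build --defer --state prod_state --select "state:modified+"
```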
Triggered on merge/push to `main`.
Performs a full production build and artifact upload:
- Installs dependencies and configures MotherDuck prod profile
- Runs `dbt build --target prod` to rebuild analytics marts
- Uploads `manifest.json` and `run_results.json` as build artifacts (`prod-state`) for CI reuse (see the sketch below)
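A corresponding CD sketch under the same assumptions:

```yaml
# Illustrative CD sketch; job and artifact names are assumptions.
name: dbt CD

on:
  push:
    branches: [main]

jobs:
  dbt-cd:
    runs-on: ubuntu-latest
    env:
      MOTHERDUCK_TOKEN: ${{ secrets.MOTHERDUCK_TOKEN }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install dbt-core dbt-duckdb

      # Full production rebuild of the analytics marts
      - run: dbt build --target prod

      # Publish state files so CI can defer against the latest prod build
      - uses: actions/upload-artifact@v4
        with:
          name: prod-state
          path: |
            target/manifest.json
            target/run_results.json
```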
Together, these workflows enforce deployment hygiene, schema consistency, and safe promotion from dev → prod.
- PySpark: `run_dq_check()` to flag nulls, duplicates, or schema drift pre-dbt (a hypothetical sketch follows this list)
- dbt: model-level tests (`unique`, `not_null`, `expression_is_true`)
- Power BI: manual validation confirming marts are queryable via an ODBC connection
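The PySpark check referenced above could look roughly like this; the function body is a hypothetical sketch, not the script's actual implementation.

```python
# Hypothetical sketch of a run_dq_check()-style helper; the real function in
# ga4-ecommerce-project-1-pyspark-etl.py may check more (e.g. schema drift).
from pyspark.sql import DataFrame, functions as F

def run_dq_check(df: DataFrame, name: str, key_cols: list[str]) -> None:
    total = df.count()

    # Null counts per key column
    null_counts = df.select(
        [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in key_cols]
    ).first().asDict()

    # Rows that duplicate the declared key
    duplicate_keys = total - df.dropDuplicates(key_cols).count()

    print(f"[DQ] {name}: rows={total}, nulls={null_counts}, duplicate_keys={duplicate_keys}")

# Example: run_dq_check(dim_users, "dim_users", ["user_pseudo_id"])
```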
A sample visualization (`purchase_revenue` by `touchpoint_medium` and `touchpoint_source`) verifies:
- dbt gold models are BI-ready
- data lineage holds end-to-end from PySpark extraction through Power BI consumption
Example Power BI chart pulling purchase_revenue by source/medium from dbt gold marts.
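Outside Power BI, the same check can be done with a quick query against MotherDuck; the database name and column placement below are assumptions based on the chart described above, and a MotherDuck token is assumed to be available in the environment.

```python
# Quick queryability check against a gold mart in MotherDuck
# (database name "ga4_analytics" is illustrative).
import duckdb

con = duckdb.connect("md:ga4_analytics")
print(
    con.sql(
        """
        select touchpoint_source, touchpoint_medium, sum(purchase_revenue) as revenue
        from mrt__first_touch_attribution
        group by 1, 2
        order by revenue desc
        limit 10
        """
    ).df()
)
```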
---
⚠️ Prerequisite: This ETL assumes access to the GA4 sample dataset from BigQuery (bigquery-public-data.ga4_obfuscated_sample_ecommerce) and a configured local or remote GCS bucket for staging Parquet outputs.
This repository’s codebase (including dbt models, PySpark ETL scripts, and CI/CD configuration) is released under the MIT License.
Note: The underlying dataset used in this project is derived from the publicly available Google Analytics 4 (GA4) Demo Ecommerce dataset hosted by Google. All trademarks and data remain the property of Google LLC. This repository is for educational and demonstration purposes only and does not distribute or claim ownership over any Google data.
Set up your local `.env` or environment variables with the following keys:

```bash
# Google Cloud project and bucket configuration
GCP_PROJECT_ID="your-project-id"
GCS_BUCKET_PATH="gs://your-staging-bucket/dbt_inputs/"

# Spark BigQuery connector jar locations (if running locally)
SPARK_JARS="spark-bigquery-with-dependencies_2.12-0.37.0.jar,gcs-connector-hadoop3-2.2.20.jar"
```
```bash
# 2️⃣ Run PySpark ETL
spark-submit \
  --jars $SPARK_JARS \
  ga4-ecommerce-project-1-pyspark-etl.py

# 3️⃣ Build dbt models
dbt deps
dbt build

# 4️⃣ Run dbt tests and docs
dbt test
dbt docs generate && dbt docs serve
```
| Model | Theme | Description |
|---|---|---|
| `mrt__multi_touch_pathing` | Attribution Pathing | Maps multi-touch journeys (source sequences) leading to purchase for advanced attribution modeling. |
| `mrt__purchase_summary_daily` | Revenue Summary | Daily revenue and purchase metrics rolled up by date, channel, and device; used for time-series visualizations. |
| `mrt__marketing_channel_efficiency` | ROI Metrics | Combines spend and attributed revenue to compute ROAS, CPA, and efficiency KPIs by channel. |
