LakePulse

A production-grade, fully open-source data lakehouse pipeline implementing modern data engineering patterns

LakePulse is a comprehensive data lakehouse solution that demonstrates real-world data engineering practices using industry-standard technologies. Built for local development and learning, it implements Change Data Capture (CDC) simulation, real-time data processing, and the modern Bronze-Silver-Gold medallion architecture.

Note: This project is still under development. The documentation here reflects the intended design and may change.

Overview

LakePulse showcases a complete end-to-end data pipeline that:

  • Ingests data from OLTP databases using simulated CDC (see the Bronze ingestion sketch after this list)
  • Processes data through multiple layers (Bronze, Silver, Gold)
  • Stores data in Delta Lake format for ACID transactions
  • Orchestrates workflows with Apache Airflow
  • Enables analytics through SQL interfaces
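
For a concrete picture of the first step, here is a minimal sketch of a Bronze ingestion job. The topic name, bucket layout, and file name are illustrative assumptions, not the repository's actual code.

```python
# bronze_ingest.py — hypothetical Bronze ingestion job; all names are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("bronze-cdc-ingest")
    # Delta Lake support (requires the delta-spark package on the classpath)
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Land raw CDC events as-is; decoding the Avro payload is deferred to Silver.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   # assumed service name
    .option("subscribe", "wwi.sales.orders")           # hypothetical CDC topic
    .option("startingOffsets", "earliest")
    .load()
    .selectExpr("key", "value", "topic", "partition", "offset", "timestamp")
)

query = (
    raw.writeStream
    .format("delta")
    .option("checkpointLocation", "s3a://lakehouse/_checkpoints/bronze/orders")
    .outputMode("append")
    .start("s3a://lakehouse/bronze/orders")            # hypothetical Bronze path
)
query.awaitTermination()
```

Landing the payload unparsed keeps Bronze a faithful, replayable record of the stream; schema-aware decoding against the Schema Registry belongs in the Silver layer.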

Architecture

LakePulse Architecture

| Layer | Technology | Purpose |
| --- | --- | --- |
| Storage | MinIO | S3-compatible object storage |
| Data Format | Delta Lake | ACID transactions, time travel |
| Processing | Apache Spark | Distributed data processing |
| Orchestration | Apache Airflow | Workflow management |
| Streaming | Apache Kafka + Schema Registry (Avro) | CDC simulation with schema evolution |
| Analytics | Trino | Distributed SQL query engine |
| Source DB | PostgreSQL | OLTP database (Wide World Importers) |
| Continuous Integration | GitHub Actions | Automated testing and builds |
| Monitoring & Alerting | Prometheus + Grafana | Metrics collection and dashboards |
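
To show how the storage, format, and processing rows fit together, here is a hedged sketch of a Spark session configured for Delta Lake on MinIO. The endpoint and credentials are common docker-compose defaults, assumed rather than taken from this repo.

```python
# Illustrative Spark session wired for MinIO (S3A) and Delta Lake.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakepulse")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Point S3A at the local MinIO endpoint with path-style access
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")   # assumed
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")        # placeholder
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")        # placeholder
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Any s3a:// path now resolves against MinIO, e.g.:
spark.read.format("delta").load("s3a://lakehouse/silver/orders").show()
```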

Project Structure

lakepulse/
├── 📁 airflow/                 # Airflow DAGs and configuration
│   ├── config/                 # Airflow configuration
│   ├── dags/                   # Pipeline definitions (DAG sketch below)
│   └── plugins/                # Airflow plugins
├── 📁 docker/                  # Service configurations
│   ├── postgres/               # Postgres database init
│   ├── airflow/                # Airflow Docker setup
│   ├── spark/                  # Spark Docker setup
│   └── kafka-connect/          # Kafka connect Docker setup
├── 📁 spark_jobs/              # Data processing jobs
│   ├── bronze/                 # Raw data ingestion
│   ├── silver/                 # Data cleaning & validation
│   ├── gold/                   # Business aggregations
│   └── utils/                  # Shared utilities
├── 📁 data/                    # Sample datasets
├── 📁 great_expectations/      # GE project config, expectation suites, checkpoints
├── 📁 trinio/                  # Trino catalog configs (e.g. catalog/delta.properties)
├── 📁 scripts/                 # Utility scripts
├── 🐳 docker-compose.yml       # Multi-service orchestration
├── 📋 Makefile                 # Common commands
└── 📖 README.md                # This file
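
As flagged in the tree above, `airflow/dags/` holds the pipeline definitions. A minimal sketch of what such a DAG could look like follows; the script paths, schedule, and task names are assumptions, not the project's actual code.

```python
# dags/medallion_pipeline.py — hypothetical DAG chaining the medallion layers.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="lakepulse_medallion",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",   # illustrative cadence
    catchup=False,
) as dag:
    bronze = BashOperator(
        task_id="bronze_ingest",
        bash_command="spark-submit spark_jobs/bronze/ingest.py",   # assumed path
    )
    silver = BashOperator(
        task_id="silver_clean",
        bash_command="spark-submit spark_jobs/silver/clean.py",    # assumed path
    )
    gold = BashOperator(
        task_id="gold_aggregate",
        bash_command="spark-submit spark_jobs/gold/aggregate.py",  # assumed path
    )

    bronze >> silver >> gold  # Bronze feeds Silver feeds Gold
```

Running the layers as sequential `spark-submit` tasks keeps each hop independently retryable from the Airflow UI.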

Learning Objectives

This project demonstrates:

  1. Modern Data Architecture: Medallion architecture with Bronze-Silver-Gold layers
  2. Real-time Processing: Kafka-based streaming and CDC patterns
  3. Data Lake Technologies: Delta Lake, Parquet, and S3-compatible storage
  4. Workflow Orchestration: Airflow DAGs for complex data pipelines
  5. Data Quality: Validation, deduplication, and monitoring techniques (see the Silver-layer sketch after this list)
  6. Performance Optimization: Partitioning, indexing, and query optimization
  7. DevOps Practices: Containerization, infrastructure as code
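
Objectives 5 and 6 are easiest to see in code. The following Silver-layer sketch deduplicates CDC events and writes a date-partitioned Delta table; the column names (`order_id`, `event_ts`) and paths are hypothetical.

```python
# Hypothetical Silver-layer job: dedupe CDC events, write a partitioned Delta table.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("silver-orders").getOrCreate()

bronze = spark.read.format("delta").load("s3a://lakehouse/bronze/orders")

# Keep only the latest CDC event per business key (objective 5: deduplication).
latest = Window.partitionBy("order_id").orderBy(F.col("event_ts").desc())
silver = (
    bronze
    .withColumn("rn", F.row_number().over(latest))
    .filter("rn = 1")
    .drop("rn")
)

# Partition by date so queries prune files (objective 6: performance optimization).
(
    silver
    .withColumn("order_date", F.to_date("event_ts"))
    .write.format("delta")
    .mode("overwrite")
    .partitionBy("order_date")
    .save("s3a://lakehouse/silver/orders")
)
```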

Star this repository if you find it helpful!
