Ecommerce Data Pipeline (Kafka + PostgreSQL + PySpark)

A simple data pipeline that simulates Kafka producers generating e-commerce events, stores the resulting messages in PostgreSQL, and makes the data available to PySpark for further analysis.


πŸš€ Features

  • βœ… Kafka Producer generates fake customer & order data at regular intervals.
  • βœ… Kafka Broker holds messages in topics (customers, orders).
  • βœ… Kafka Consumer writes data from Kafka to PostgreSQL.
  • βœ… Jupyter Notebook + PySpark for data analysis.
  • βœ… Kafdrop (Kafka UI) for monitoring Kafka topics.

πŸ›  Setup Instructions

1️⃣ Start All Services

To start Kafka, Zookeeper, PostgreSQL, and Jupyter Notebook, run:

docker-compose up -d --build

πŸ“Œ After running this:

  • Kafka & Zookeeper should be running.
  • PostgreSQL should be accessible.
  • Jupyter Notebook will be available at localhost:8888.

2️⃣ Verify Kafka Broker & Producer Logs

Check if Kafka is running fine:

docker logs kafka

Check if the producer is running:

docker logs kafka-producer

πŸ“© Consuming Messages from Kafka

To check messages inside Kafka topics, use:

πŸ“Œ Read customers topic messages:

docker exec -it kafka kafka-console-consumer.sh --bootstrap-server kafka:9092 --topic customers --from-beginning

πŸ“Œ Read orders topic messages:

docker exec -it kafka kafka-console-consumer.sh --bootstrap-server kafka:9092 --topic orders --from-beginning
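
To inspect the topics from Python instead of the console consumer, a minimal kafka-python sketch would look like the following (assuming the kafka-python package is installed, e.g. inside the producer container):

import json

from kafka import KafkaConsumer

# Read the customers topic from the beginning and print each record.
consumer = KafkaConsumer(
    "customers",
    bootstrap_servers="kafka:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    print(message.offset, message.value)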

πŸ“₯ Writing Kafka Messages to PostgreSQL

The Kafka Consumer stores messages in PostgreSQL.

To run the consumer that writes Kafka messages to PostgreSQL, execute:

docker-compose run --rm kafka-producer python app/kafka_to_postgres.py

πŸ“Œ Data is written in real-time to PostgreSQL!
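
The consumer logic lives in app/kafka_to_postgres.py. Its core loop is presumably along these lines β€” a sketch only, assuming kafka-python and psycopg2-binary, with guessed column names:

import json

import psycopg2
from kafka import KafkaConsumer

# Connection details match the psql step and the Jupyter example later in this README.
conn = psycopg2.connect(dbname="kafka_data", user="myuser",
                        password="mypassword", host="postgres", port=5432)
cur = conn.cursor()

consumer = KafkaConsumer(
    "customers",
    bootstrap_servers="kafka:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    record = message.value
    # Column names are illustrative; the real table schema may differ.
    cur.execute(
        "INSERT INTO customers (id, name, email) VALUES (%s, %s, %s)",
        (record["id"], record["name"], record["email"]),
    )
    conn.commit()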

To verify, connect to PostgreSQL and check records:

docker exec -it postgres psql -U myuser -d kafka_data

Once inside the psql shell, run:

SELECT * FROM customers LIMIT 10;
SELECT * FROM orders LIMIT 10;

πŸ“Š Analyzing Data Using PySpark

To analyze the data from the Jupyter Notebook (with pandas or PySpark), follow these steps:

  1. Open Jupyter Notebook at localhost:8888
  2. Install PostgreSQL connector inside Jupyter:
!pip install psycopg2-binary
  3. Run SQL queries from Jupyter:
import psycopg2
import pandas as pd

# Connect to the postgres service with the same credentials used in the psql step above.
conn = psycopg2.connect(
    dbname="kafka_data",
    user="myuser",
    password="mypassword",
    host="postgres",
    port=5432
)

# Load the customers table into a pandas DataFrame for a quick look.
df = pd.read_sql("SELECT * FROM customers", conn)
df.head()
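
The snippet above pulls the data into pandas; to load the same tables into Spark DataFrames, a JDBC read along the following lines should work. Note that Spark needs the PostgreSQL JDBC driver on its classpath β€” the spark.jars.packages coordinate and version below are an assumption, not something this repository pins:

from pyspark.sql import SparkSession

# Pull in the PostgreSQL JDBC driver if the Jupyter image does not already ship it.
spark = (
    SparkSession.builder
    .appName("ecommerce-analysis")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
    .getOrCreate()
)

# Read the orders table straight from PostgreSQL into a Spark DataFrame.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://postgres:5432/kafka_data")
    .option("dbtable", "orders")
    .option("user", "myuser")
    .option("password", "mypassword")
    .option("driver", "org.postgresql.Driver")
    .load()
)

orders.printSchema()
orders.show(10)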

πŸ“‘ Kafka UI - Kafdrop

To monitor Kafka topics in a web UI, access Kafdrop: πŸ”— http://localhost:9000

  • View Kafka topics (customers, orders).
  • Inspect messages, partitions, and offsets.

πŸ›‘ Stopping the Services

To stop all running containers, use:

docker-compose down

πŸ“Œ This will stop Kafka, PostgreSQL, and all dependent services.


πŸ›  Debugging Issues

πŸ”Ή Check running containers:

docker ps

πŸ”Ή Check logs for errors:

docker logs kafka
docker logs kafka-producer
docker logs postgres

πŸ”Ή Restart services:

docker-compose down
docker-compose up -d --build

πŸ“Œ Future Enhancements

  • Add Kafka Streams for real-time processing.
  • Implement Docker networking for external connections.
  • Store processed data in a data lake (S3, HDFS, etc.).

πŸ‘¨β€πŸ’» Author

Maintained by Siddharth. Contributions welcome! πŸš€
