A simple data pipeline that simulates Kafka producers, stores messages in PostgreSQL, and exposes the data to PySpark for further analysis.
- ✅ Kafka Producer generates fake customer & order data at regular intervals (sketched below).
- ✅ Kafka Broker holds messages in topics (`customers`, `orders`).
- ✅ Kafka Consumer writes data from Kafka to PostgreSQL.
- ✅ Jupyter Notebook + PySpark for data analysis.
- ✅ Kafdrop (Kafka UI) for monitoring Kafka topics.
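For reference, a minimal sketch of what such a producer can look like, assuming the `kafka-python` and `Faker` packages; the actual producer script lives in this repo's `app/` directory, and the field names below are illustrative, not the repo's exact schema:

```python
# Hypothetical producer loop (illustrative, not the repo's exact script).
# Assumes kafka-python and Faker are installed and the broker is at kafka:9092.
import json
import time
import uuid

from faker import Faker
from kafka import KafkaProducer

fake = Faker()
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    customer_id = str(uuid.uuid4())
    # Send a fake customer record to the "customers" topic.
    producer.send("customers", {
        "customer_id": customer_id,
        "name": fake.name(),
        "email": fake.email(),
    })
    # Send a matching fake order to the "orders" topic.
    producer.send("orders", {
        "order_id": str(uuid.uuid4()),
        "customer_id": customer_id,
        "amount": round(fake.pyfloat(min_value=1, max_value=500), 2),
    })
    time.sleep(1)  # produce at regular intervals
```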
To start Kafka, Zookeeper, PostgreSQL, and Jupyter Notebook, run:
```bash
docker-compose up -d --build
```
📌 After running this:
- Kafka & Zookeeper should be running.
- PostgreSQL should be accessible.
- Jupyter Notebook will be available at localhost:8888.
Check that Kafka is running:

```bash
docker logs kafka
```

Check that the producer is running:

```bash
docker logs kafka-producer
```
To check messages inside Kafka topics, use:

📌 Read `customers` topic messages:

```bash
docker exec -it kafka kafka-console-consumer.sh --bootstrap-server kafka:9092 --topic customers --from-beginning
```

📌 Read `orders` topic messages:

```bash
docker exec -it kafka kafka-console-consumer.sh --bootstrap-server kafka:9092 --topic orders --from-beginning
```
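The same topics can also be read programmatically. A minimal sketch with `kafka-python` (assumed to be installed wherever you run it, e.g. inside the Jupyter container):

```python
# Sketch: read the "customers" topic from the beginning with kafka-python.
# Assumes network access to the broker at kafka:9092.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "customers",
    bootstrap_servers="kafka:9092",
    auto_offset_reset="earliest",   # same effect as --from-beginning
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,       # stop iterating once the topic is idle
)

for message in consumer:
    print(message.offset, message.value)
```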
The Kafka Consumer stores messages in PostgreSQL.
To run the consumer that writes Kafka messages to PostgreSQL, execute:

```bash
docker-compose run --rm kafka-producer python app/kafka_to_postgres.py
```

📌 Data is written in real-time to PostgreSQL!
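The core pattern inside `app/kafka_to_postgres.py` is a consume-and-insert loop. A hedged sketch of that pattern follows; the table and column names here are assumptions, so check the actual script for the real schema:

```python
# Sketch of a Kafka-to-PostgreSQL consume-and-insert loop.
# Column names are assumptions; the repo's script defines the real schema.
import json

import psycopg2
from kafka import KafkaConsumer

conn = psycopg2.connect(
    dbname="kafka_data", user="myuser", password="mypassword",
    host="postgres", port=5432,
)
cur = conn.cursor()

consumer = KafkaConsumer(
    "customers",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    record = message.value
    # Parameterized insert keeps message contents out of the SQL string.
    cur.execute(
        "INSERT INTO customers (customer_id, name, email) VALUES (%s, %s, %s)",
        (record["customer_id"], record["name"], record["email"]),
    )
    conn.commit()  # committing per message keeps the example simple
```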
To verify, connect to PostgreSQL and check records:

```bash
docker exec -it postgres psql -U myuser -d kafka_data
```

Once inside the `psql` shell, run:

```sql
SELECT * FROM customers LIMIT 10;
SELECT * FROM orders LIMIT 10;
```
To analyze the data with PySpark inside Jupyter Notebook, follow these steps:

- Open Jupyter Notebook at localhost:8888
- Install the PostgreSQL connector inside Jupyter:

  ```python
  !pip install psycopg2-binary
  ```

- Run SQL queries from Jupyter:

  ```python
  import psycopg2
  import pandas as pd

  # Connect to the PostgreSQL container, using the service name as the host.
  conn = psycopg2.connect(
      dbname="kafka_data",
      user="myuser",
      password="mypassword",
      host="postgres",
      port=5432,
  )

  # Load the customers table into a pandas DataFrame and preview it.
  df = pd.read_sql("SELECT * FROM customers", conn)
  df.head()
  ```
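The snippet above uses psycopg2 and pandas; to query the same table through PySpark itself, you can read it over JDBC. A sketch, assuming the PostgreSQL JDBC driver is available to Spark (the `spark.jars.packages` coordinate below is one way to pull it in; the version is an example):

```python
# Sketch: load the customers table into a Spark DataFrame over JDBC.
# Assumes the PostgreSQL JDBC driver can be fetched via spark.jars.packages.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("kafka-pipeline-analysis")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
    .getOrCreate()
)

customers = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://postgres:5432/kafka_data")
    .option("dbtable", "customers")
    .option("user", "myuser")
    .option("password", "mypassword")
    .option("driver", "org.postgresql.Driver")
    .load()
)

customers.show(10)
```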
To monitor Kafka topics in a web UI, access Kafdrop: 📌 http://localhost:9000

- View Kafka topics (`customers`, `orders`).
- Inspect messages, partitions, and offsets.
To stop all running containers, use:

```bash
docker-compose down
```

📌 This will stop Kafka, PostgreSQL, and all dependent services.
🔹 Check running containers:

```bash
docker ps
```

🔹 Check logs for errors:

```bash
docker logs kafka
docker logs kafka-producer
docker logs postgres
```

🔹 Restart services:

```bash
docker-compose down
docker-compose up -d --build
```
- Add Kafka Streams for real-time processing.
- Implement Docker networking for external connections.
- Store processed data in a data lake (S3, HDFS, etc.).
Maintained by Siddharth. Contributions welcome!