#Table of Contents
- Introduction
- Data Set
- Data Transformations
- Live Demo
- Presentation Deck
- Instructions to Run this Pipeline
#Introduction
This is a data engineering project at Insight Data Science. The project aims to accomplish two goals:
- Provide an API for drivers, city planners, and data scientists to analyze long-term trends in traffic patterns with respect to metrics such as average car volume and speed.
- Enable a framework for real-time monitoring of traffic information, so that a user can find the best route to take and select a specific road to view its historical data.
#Data Set
Historical: The project is based on historical traffic volume data for nearly 60,000 major roads in New York State, collected over 10 years. The data is available as a time series. Each record in the raw data set has the following fields:

| roadID | timestamp | car count |
| ------ | --------- | --------- |
Real-Time: The historical dataset is played back to simulate real-time behavior.
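As an illustration, a minimal playback producer could look like the sketch below, built on kafka-python's KafkaProducer. The file path (traffic_history.csv), topic name (traffic), broker address, and sleep interval are assumptions made for the sketch; the project's actual logic lives in producer.py.

```python
import time
from kafka import KafkaProducer

# Assumed broker address; the real producer (producer.py) may differ.
producer = KafkaProducer(bootstrap_servers='localhost:9092')

# Assumed input file with rows of the form: roadID, timestamp, car count
with open('traffic_history.csv') as history:
    for line in history:
        # Publish each historical record to an assumed 'traffic' topic.
        producer.send('traffic', line.strip().encode('utf-8'))
        # Throttle the replay so the stream behaves like a live feed.
        time.sleep(0.01)

producer.flush()
```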
AWS Cluster: The project runs on a distributed AWS cluster of four EC2 machines. All the components (ingestion, batch, and real-time processing) are configured and run in distributed mode, with one master and three workers. The pipeline consists of the following layers:
- Ingestion Layer (Kafka): The raw data is routed through a message broker configured in publish-subscribe mode. Related files: producer.py, consumer.py.
- Batch Layer (HDFS, Spark): A Kafka consumer stores the data in HDFS. Additional columns are added to the dataset to generate the metrics described in the Data Transformations section below. Following this, tables representing the aggregate views for serving queries at the user end are generated using Spark. Related files: myBatch.py.
- Speed Layer (Storm): The topology for processing real-time data consists of a Kafka spout and a bolt (with a tick frequency of 2.5 seconds). The data is filtered so that only clean (uncorrupted) entries are stored; a sketch of such a bolt appears after this list. Related files: stormBolt.py, topology.yaml.
- Front-end (Flask): The car volume information for each road is rendered on Google Maps using four colors and updated at a 1-second interval via Flask; a sketch of such an endpoint also appears after this list. Historical data is represented as a line plot via Highcharts. Related files: views.py, map.js.
- Libraries and APIs: Cassandra, Pyleus, kafka-python, Google Maps
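For reference, a bolt that buffers clean entries and flushes them on each tick could look roughly like this Pyleus sketch. The tuple layout, buffer handling, and emitted fields are assumptions; the project's real bolt is stormBolt.py, and the 2.5-second tick frequency is set in topology.yaml.

```python
from pyleus.storm import SimpleBolt


class FilterBolt(SimpleBolt):
    """Buffers clean (uncorrupted) records and flushes them on tick tuples."""

    def initialize(self):
        self.buffer = []

    def process_tuple(self, tup):
        # Assumed tuple layout: (roadID, timestamp, car_count)
        road_id, timestamp, car_count = tup.values
        try:
            self.buffer.append((road_id, timestamp, int(car_count)))
        except (TypeError, ValueError):
            pass  # drop corrupted entries

    def process_tick(self):
        # Called at the tick frequency configured in topology.yaml.
        for row in self.buffer:
            self.emit(row)
        self.buffer = []


if __name__ == '__main__':
    FilterBolt().run()
```

Likewise, here is a hedged sketch of the kind of Flask endpoint map.js could poll every second. The route, keyspace, table, and column names are assumptions; the project's actual handlers are in views.py.

```python
from flask import Flask, jsonify
from cassandra.cluster import Cluster

app = Flask(__name__)
# Assumed Cassandra contact point, keyspace, and table names.
session = Cluster(['127.0.0.1']).connect('traffic')


@app.route('/volume/<road_id>')
def volume(road_id):
    """Return the latest car volume for one road as JSON."""
    rows = session.execute(
        "SELECT car_count FROM realtime_volume WHERE road_id = %s LIMIT 1",
        (road_id,))
    for row in rows:
        return jsonify({'road_id': road_id, 'car_count': row.car_count})
    return jsonify({'road_id': road_id, 'car_count': None})


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```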
#Data Transformations
The following metrics are computed via a MapReduce operation on the raw dataset (Spark):
- Total car count in a month
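As an illustration, a minimal PySpark sketch of this kind of aggregation is shown below. The HDFS paths, the ISO timestamp format, and the assumption that rows are already clean are assumptions made for the sketch; the project's actual batch job is myBatch.py, which also writes the aggregate tables to Cassandra.

```python
from pyspark import SparkContext

sc = SparkContext(appName="monthly-car-count")

# Assumed HDFS input path; rows have the form: roadID, timestamp, car count
raw = sc.textFile("hdfs:///traffic/history/*.csv")

# Classic map/reduce: key each record by (roadID, month), then sum the counts.
monthly_totals = (
    raw.map(lambda line: line.split(','))
       .filter(lambda fields: len(fields) == 3)
       .map(lambda fields: ((fields[0].strip(),
                             fields[1].strip()[:7]),  # e.g. '2014-06' from an ISO timestamp
                            int(fields[2])))
       .reduceByKey(lambda a, b: a + b)
)

# The real job writes aggregate tables to Cassandra; here we just dump to HDFS.
monthly_totals.saveAsTextFile("hdfs:///traffic/monthly_totals")
```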
#Live Demo
A live demo of the project is available here: trafficjam.today or trafficjam.online. A snapshot of the map with highlighted roads:
#Presentation Deck
The presentation slides are available here: trafficjam.today/slideshare
#Instructions to Run this Pipeline
Install Python packages:

    sudo pip install kafka-python cassandra-driver pyleus

Run the Kafka producer:

    python kafka/producer.py

Run the Kafka consumer:

    python kafka/kafka_consumer.py

Run the Spark batch job:

    spark-submit --packages TargetHolding/pyspark-cassandra:0.1.5 ~/Insight-TrafficJam/spark/myBatch.py

Build the Storm topology:

    pyleus build topology.yaml

Submit the Pyleus topology:

    pyleus submit -n 54.174.177.48 topology.jar