
Parking Ticket Analysis using Machine Learning

This work was done for Santa Clara University's Data Mining course, taught by Professor Anastasiu.

Introduction

Three different machine learning models were built to predict the type of traffic violation written in any given scenario. We used an open-source dataset made available on Kaggle by the NYC Department of Finance. We found our Deep Feedforward Neural Network to have the best performance, with 45.4% accuracy. Although this is low, it suggests that there is a correlation between the cleaned features we generated and the violation code itself. This information could then be used to create a revenue optimization strategy so that law enforcement agencies can allocate resources appropriately.

Problem Statement

In fiscal year 2016, New York City generated $993 million in fines, a 35 percent increase over the last decade and a 16 percent rise over the last four years. By far the largest category of fines continues to be parking violations, accounting for over 50% of total fine revenue year after year. This income comes from more than 10 million parking tickets issued annually. As one can imagine, the city relies heavily on this annual income and therefore dedicates many law enforcement resources to maintaining it. This project seeks to understand how well parking tickets can be predicted based on the available statistics of each violation. A prediction algorithm of this nature would enable law enforcement agencies to create a revenue optimization strategy such that the same revenue can be achieved while freeing resources for more serious crimes.

Dataset

The NYC Department of Finance collects data on every parking ticket issued in NYC (~10M per year). This data is made publicly available to aid in ticket resolution and to guide policymakers. The dataset (found on Kaggle) consists of four files covering August 2013 through June 2017. The files are roughly organized by fiscal year (July 1 - June 30), with the exception of the first file. The first file also lacks 8 columns that are included in the other three, but those columns are used sparingly. Each column contains information about the violation, such as the vehicle ticketed, the type of ticket issued, the location, and the time. In total the dataset is almost 9 GB, which was too large for the hardware resources we had access to, so we distilled the data down by sampling 500k records from each of the four fiscal years.

Data                   Number of Samples   Number of Features   Number of non-zero values
Raw Data (8.97 GB)     42.3 million        51                   N/A
Before Preprocessing   2 million           51                   N/A
After Preprocessing    1.4 million         12                   18.2 million

Preprocessing Efforts

Violation Code Distribution

Since our objective was to predict the type of violation, we first analyzed the violation code distribution. Of the 99 codes that exist, we found that 50% of the samples are explained by just 6 codes. We therefore applied a threshold to the data, removing any violation code with fewer than 100 samples. In doing this, we kept 99.9% of the initial two million samples while reducing the number of violation codes from 99 to 63. The bar plot below demonstrates the lopsided nature of this distribution.
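As a concrete illustration, this thresholding step can be sketched in pandas as follows; the DataFrame name df and the column name "Violation Code" are assumptions, not necessarily the code used in the repository's preprocessing scripts:

    import pandas as pd

    # Count how many tickets fall under each violation code.
    code_counts = df["Violation Code"].value_counts()

    # Keep only the codes that appear in at least 100 samples (99 codes -> 63).
    frequent_codes = code_counts[code_counts >= 100].index
    df = df[df["Violation Code"].isin(frequent_codes)]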

Feature Selection

Of the 51 features available, many contained little to no data. These sparse features ranged from having fewer than 20% non-null values all the way down to less than 1%. There were also features that showed little to no correlation with the violation code or were disregarded due to their similarity with other features. We therefore narrowed the features down to the 12 below.

Selected Features

Registration State   Plate Type       Issue Date           Vehicle Body Type
Vehicle Make         Issuing Agency   Violation Precinct   Violation Time
Issuer Precinct      Vehicle Color    Street Name          Violation County

Feature Cleaning

Many of the features in the list above required cleaning because some samples had unusable representations or sub-categories that did not match the parent feature. The list below details how each was addressed.

  • Registration State: Contained a ‘99’ category that didn’t match the parent feature. All samples that had this value were dropped.
  • Plate Type: Contained a ‘999’ category that didn’t match the parent feature. All samples that had this value were dropped.
  • Issue Date: This feature was recorded as Year/Month/Day. Each date was converted to the corresponding day of the week (Monday-Sunday) which was believed to be a better representation of the data.
  • Vehicle Color: There was quite a bit of inconsistency in the way vehicle colors were transcribed. For example, the color silver was recorded as ‘sl’, ‘silvr’, ‘sil’, ‘silve’, and ‘silver’, and as a result we decided to combine all these together into a unified color labeled ‘silver’. There were also colors such as ‘dark blue’ and ‘dark green’ which were binned into ‘blue’ and ‘green’ respectively.
  • Violation Time: This feature was recorded as hh:mm (AM/PM). To make better use of this feature we divided the day into three periods: morning, afternoon, and evening. The time was first converted to a 24-hour clock and then binned, with morning running from 5:00 to 11:00, afternoon from 11:00 to 17:00, and evening from 17:00 to 5:00 (a sketch of this conversion follows the list).
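For the two time-related features, the conversion can be sketched with pandas roughly as follows; the column names and the exact "hh:mm AM/PM" string format are assumptions:

    import pandas as pd

    # Issue Date: Year/Month/Day -> day of the week (Monday-Sunday).
    df["Issue Date"] = pd.to_datetime(df["Issue Date"]).dt.day_name()

    # Violation Time: "hh:mm AM/PM" -> hour on a 24-hour clock.
    hours = pd.to_datetime(df["Violation Time"], format="%I:%M %p").dt.hour

    def bin_hour(hour):
        if 5 <= hour < 11:
            return "morning"
        if 11 <= hour < 17:
            return "afternoon"
        return "evening"  # 17:00 through 04:59

    df["Violation Time"] = hours.apply(bin_hour)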

Feature Thresholding

Similar to the violation code, we applied a threshold to each feature to ensure that each sub-category contained enough samples to support a meaningful prediction. Not all features share the same threshold, because the threshold is a function of the sub-category distribution within each feature.

Feature               Threshold   Feature              Threshold
Registration State    100         Plate Type           100
Issue Date            100         Vehicle Body Type    5000
Vehicle Make          220         Issuing Agency       1000
Violation Precinct    100         Violation Time       20
Issuer Precinct       100         Vehicle Color        100
Street Name           100         Violation County     2000
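Applying these per-feature thresholds can be sketched as a simple loop over the table above; df and the exact column names are assumptions:

    # Minimum samples per sub-category, taken from the table above.
    feature_thresholds = {
        "Registration State": 100, "Plate Type": 100,
        "Issue Date": 100,         "Vehicle Body Type": 5000,
        "Vehicle Make": 220,       "Issuing Agency": 1000,
        "Violation Precinct": 100, "Violation Time": 20,
        "Issuer Precinct": 100,    "Vehicle Color": 100,
        "Street Name": 100,        "Violation County": 2000,
    }

    # Drop rows whose sub-category falls below the feature's threshold.
    for feature, threshold in feature_thresholds.items():
        counts = df[feature].value_counts()
        keep = counts[counts >= threshold].index
        df = df[df[feature].isin(keep)]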

Models

A deep feedforward neural network with 5 hidden fully-connected layers was built. Each layer contains 128 neurons. The activation function of every layer is LeakyReLU, except the output layer, which uses a softmax. To preserve the best checkpoint during training, EarlyStopping is used to monitor validation categorical accuracy at every epoch. EarlyStopping is configured to maximize validation categorical accuracy with a patience of 30, meaning training stops once validation categorical accuracy has failed to improve for 30 consecutive epochs (i.e. the network has started to overfit). The Checkpoint callback saves the best weights after every epoch. The network uses the Adam optimizer and a categorical cross-entropy loss function. The batch size is 32 and the model was trained for up to 100 epochs. The model was built using Keras.
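The architecture and training setup described above can be sketched in Keras roughly as follows; the input width, class count, checkpoint path, and variable names are placeholders rather than the exact contents of nn_main.py:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, LeakyReLU, Input
    from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

    num_features = X_train.shape[1]  # width of the one-hot encoded input
    num_classes = y_train.shape[1]   # remaining violation codes

    # Five hidden fully-connected layers of 128 neurons with LeakyReLU activations.
    model = Sequential()
    model.add(Input(shape=(num_features,)))
    for _ in range(5):
        model.add(Dense(128))
        model.add(LeakyReLU())
    model.add(Dense(num_classes, activation="softmax"))

    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["categorical_accuracy"])

    callbacks = [
        # Stop once validation accuracy has not improved for 30 epochs.
        EarlyStopping(monitor="val_categorical_accuracy", mode="max", patience=30),
        # Save the model from the best epoch seen so far.
        ModelCheckpoint("best_model.keras", monitor="val_categorical_accuracy",
                        mode="max", save_best_only=True),
    ]

    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              batch_size=32, epochs=100, callbacks=callbacks)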

The data fed into the KNN model was initially a numpy array, but since the data is inherently very sparse (the features are one-hot encoded), a sparse matrix proved to be a more efficient data structure. The comparison below demonstrates the runtime speed-up achieved from this change. The execution times were estimated by clocking how long it took to classify 100 samples and then linearly scaling to project the overall runtime.
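The conversion itself is a one-liner with SciPy; X here stands in for the one-hot encoded feature matrix:

    from scipy.sparse import csr_matrix

    # A compressed sparse row matrix stores only the non-zero entries,
    # which is far smaller and faster for one-hot encoded data.
    X_sparse = csr_matrix(X)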

In order to sweep through the KNN hyperparameters in a reasonable amount of time, the data was scaled back to 250k samples to find which combination of parameters yielded the best performance. We believe the same parameters are optimal for the larger dataset, since the smaller subset contained 99% of the post-one-hot-encoding features. The final hyperparameters were k = 50, Euclidean distance as the similarity metric, and uniform weights, so that all points in each neighborhood are weighted equally.
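With scikit-learn, that final configuration looks roughly like this (variable names are assumptions):

    from sklearn.neighbors import KNeighborsClassifier

    # k = 50, Euclidean distance, uniform neighbor weights.
    knn = KNeighborsClassifier(n_neighbors=50, metric="euclidean", weights="uniform")
    knn.fit(X_train_sparse, y_train)
    accuracy = knn.score(X_test_sparse, y_test)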

Most of the decision tree hyperparameters were left at their default values; the ones that were tuned are listed below, with a configuration sketch after the list.

  • min_samples_split: The minimum number of samples required to split an internal node. Value chosen was 2
  • min_samples_leaf: The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least this number of training samples in each of the left and right branches. Value chosen was 1
  • max_features: The number of features to consider when looking for the best split. The value chosen was "None" which indicates that the max number of features is the number of features in the input data
  • class_weight: The weights associated with the classes. The "balanced" mode was used so that the weights are inversely proportional to class frequencies.
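In scikit-learn this configuration is a short sketch (not necessarily the exact contents of decision_tree.py):

    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier(
        min_samples_split=2,      # minimum samples to split an internal node
        min_samples_leaf=1,       # minimum samples required at a leaf
        max_features=None,        # consider every feature at each split
        class_weight="balanced",  # weights inversely proportional to class frequency
    )
    tree.fit(X_train, y_train)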

Training

We used an 80/20 train/test split to break up the data. The loss plots for the neural network are shown below.
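A minimal sketch of the split using scikit-learn (the random seed is an assumption):

    from sklearn.model_selection import train_test_split

    # Hold out 20% of the samples for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)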

Note that the reported runtime is the measured time to train and predict, ignoring all other computations. The models were run on a standard laptop with an Intel i7 CPU and 32 GB of RAM.

Model                             Train and Test Runtime
Deep Feedforward Neural Network   90 min
K-Nearest Neighbor (KNN)          226 min
Decision Tree Classifier          7.75 min

Results

Model                             Accuracy
Deep Feedforward Neural Network   45.40%
K-Nearest Neighbor (KNN)          38.80%
Decision Tree Classifier          33.98%

We found the decision tree to have the fastest training time and the neural network to have the highest accuracy. If this were used in industry to optimize law enforcement strategies, we believe the neural network would be the best choice, because online training could be employed as new samples are published, and the network's weights can be saved and loaded for near-instantaneous classification. We suspect that using a larger portion of the dataset would further improve model accuracy.

Usage

File Description (main files only)

  • decision_tree.py - Decision Tree Classifier model
  • knn.py - KNN model
  • knn_sweep.py - Used to find optimal hyper parameters for KNN model
  • nn_main.py - Neural network model
  • preprocessing_1.py - Applies thresholds to features
  • preprocessing_2.py - Performs feature cleaning
  • clean_data.csv - Output of preprocessing_1.py
  • clean_data2.csv - Output of preprocessing_2.py

The raw dataset of 2 million samples was not included in this repository due to size limitations. However, the preprocessed data, clean_data2.csv, which the models use for training and validation, is included here.

To re-train and verify the accuracy of each model, run the respective Python script. Note that the data needs to be moved to the src directory, or the file location needs to be edited in the model script. Each model can be run like this:

    python nn_main.py
