Hichemchir/taxi_demand_predictor

NYC Taxi Demand Predictor

A machine learning solution for predicting taxi demand in New York City, designed to optimize fleet operations by matching driver supply with passenger demand.

Overview

This project implements an end-to-end machine learning pipeline to forecast taxi demand patterns. By analyzing historical trip data, the system helps operations teams maintain optimal fleet utilization, reducing both driver idle time and passenger wait times.

Key Features

  • Automated data ingestion from NYC TLC Trip Record Data
  • Time series data transformation and feature engineering
  • Multiple ML models: Baseline, XGBoost, and LightGBM
  • Comprehensive data validation and preprocessing pipeline
  • Interactive visualization and exploratory data analysis

Project Structure

taxi_demand_predictor/
├── data/
│   ├── raw/                    # Raw parquet files from NYC TLC
│   └── transformed/            # Processed time series data
├── notebooks/                  # Jupyter notebooks for analysis
│   ├── 01_load_and_validate_raw_data.ipynb
│   ├── 02_transform_raw_data_into_ts_data.ipynb
│   ├── 03_transform_ts_data_into_features_and_target.ipynb
│   ├── 04_transform_raw_data_into_features_and_targets.ipynb
│   ├── 05_visualize_training_data.ipynb
│   ├── 06_baseline_model.ipynb
│   ├── 07_xgboost_model.ipynb
│   ├── 08_lightgbm_model.ipynb
│   └── 09_lightgbm_model_with_feature_engineering.ipynb
├── src/                        # Source code modules
│   ├── data.py                 # Data loading and validation
│   ├── data_split.py           # Train/test splitting utilities
│   ├── paths.py                # Path configurations
│   └── plot.py                 # Visualization utilities
├── pyproject.toml              # Poetry dependencies
└── README.md
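
The tree above lists src/data_split.py for train/test splitting. For time-ordered data, one common approach is to split on a cutoff date rather than at random, so the model is never evaluated on hours that precede its training data. The sketch below illustrates that idea only; the function name and the column names (`pickup_hour`, `rides_previous_1_hour`, `target_rides_next_hour`) are hypothetical and may not match the repository's code.

```python
import pandas as pd


def train_test_split_by_date(df, cutoff, target_col):
    """Split rows into train (before cutoff) and test (at or after cutoff).

    Assumes a datetime column named 'pickup_hour' (hypothetical name).
    """
    train = df[df["pickup_hour"] < cutoff]
    test = df[df["pickup_hour"] >= cutoff]
    X_train, y_train = train.drop(columns=[target_col]), train[target_col]
    X_test, y_test = test.drop(columns=[target_col]), test[target_col]
    return X_train, y_train, X_test, y_test


# Tiny synthetic example: six hourly rows for a single location.
demo = pd.DataFrame({
    "pickup_hour": pd.date_range("2022-01-01", periods=6, freq="60min"),
    "rides_previous_1_hour": [3, 5, 4, 6, 2, 7],
    "target_rides_next_hour": [5, 4, 6, 2, 7, 1],
})
X_train, y_train, X_test, y_test = train_test_split_by_date(
    demo,
    cutoff=pd.Timestamp("2022-01-01 04:00"),
    target_col="target_rides_next_hour",
)
```

With this synthetic frame, the first four hours land in the training set and the last two in the test set.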

Installation

Prerequisites

  • Python 3.10 or higher
  • Poetry (for dependency management)

Setup

  1. Clone the repository:
     git clone https://github.com/Hichemchir/taxi_demand_predictor.git
     cd taxi_demand_predictor
  2. Install dependencies using Poetry:
     poetry install
  3. Activate the virtual environment:
     poetry shell

Usage

The project follows a sequential workflow through Jupyter notebooks:

  1. Data Loading: Load and validate raw data from NYC TLC
  2. Time Series Transformation: Convert raw data into time series format
  3. Feature Engineering: Generate features and targets for ML models
  4. Visualization: Explore and visualize training data patterns
  5. Modeling: Train and evaluate baseline, XGBoost, and LightGBM models
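
As an illustration of step 2, the sketch below shows one way raw trip records could be turned into an hourly time series per pickup location, with hours that had no rides filled in as zero. It is a minimal example under stated assumptions, not the code from the notebooks: the column names `pickup_datetime` and `pickup_location_id` and the function name are hypothetical.

```python
import pandas as pd


def trips_to_hourly_counts(trips):
    """Count rides per (hour, location); hours without rides get 0."""
    trips = trips.copy()
    # Truncate each pickup timestamp to the start of its hour.
    trips["pickup_hour"] = trips["pickup_datetime"].dt.floor("60min")
    counts = (
        trips.groupby(["pickup_hour", "pickup_location_id"])
        .size()
        .rename("rides")
    )
    # Build the full hour x location grid so missing combinations show up.
    hours = pd.date_range(
        trips["pickup_hour"].min(), trips["pickup_hour"].max(), freq="60min"
    )
    locations = sorted(trips["pickup_location_id"].unique())
    full_index = pd.MultiIndex.from_product(
        [hours, locations], names=["pickup_hour", "pickup_location_id"]
    )
    return counts.reindex(full_index, fill_value=0).reset_index()


# Four synthetic trips across two locations and three hours.
raw = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2022-01-01 00:05",
        "2022-01-01 00:40",
        "2022-01-01 02:15",
        "2022-01-01 00:10",
    ]),
    "pickup_location_id": [1, 1, 1, 2],
})
ts = trips_to_hourly_counts(raw)
```

Filling the gaps matters because a plain group-by silently drops hours with zero rides, which would distort any lag-based features built on top of the series.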

To run the notebooks:

jupyter notebook

Navigate to the notebooks/ directory and execute them in numerical order.

Data Source

Historical taxi trip data is sourced from the NYC Taxi and Limousine Commission (TLC).
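
A minimal sketch of how a monthly file might be fetched into data/raw/ follows. The base URL and the yellow_tripdata file naming reflect the pattern TLC has used for its public parquet files, but treat both as assumptions to verify against the TLC Trip Record Data page; the function names and the local file name are hypothetical and not taken from src/data.py.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumed public location of the TLC parquet files; verify before relying on it.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def raw_file_url(year, month):
    """Build the URL for one month of yellow taxi trips (assumed naming)."""
    return f"{BASE_URL}/yellow_tripdata_{year}-{month:02d}.parquet"


def download_raw_file(year, month, raw_dir=Path("data/raw")):
    """Download one monthly file unless it is already on disk."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    target = raw_dir / f"rides_{year}-{month:02d}.parquet"
    if not target.exists():
        urlretrieve(raw_file_url(year, month), target)
    return target
```

Calling `download_raw_file(2022, 1)` would then place the January 2022 file under data/raw/ and skip the download on later runs.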

Models

The project implements and compares three modeling approaches:

  • Baseline Model: Simple statistical baseline for benchmarking
  • XGBoost: Gradient boosting with optimized hyperparameters
  • LightGBM: Efficient gradient boosting with feature engineering
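
To make the benchmarking idea concrete, the sketch below builds lag features with a sliding window and scores a naive baseline that predicts the next hour's demand as the last observed hour. The README does not say which statistic the project's baseline uses, so this particular rule, the function names, and the sample numbers are illustrative assumptions.

```python
import numpy as np


def make_lag_features(series, n_lags):
    """Slide a window over the series.

    Each row of X holds n_lags consecutive values; y is the value that follows.
    """
    series = np.asarray(series, dtype=float)
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = series[n_lags:]
    return X, y


def predict_last_value(X):
    """Naive baseline: next hour's demand equals the most recent hour's."""
    return X[:, -1]


def mean_absolute_error(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))


# Synthetic hourly ride counts for one location.
hourly_rides = [10, 12, 15, 11, 9, 14]
X, y = make_lag_features(hourly_rides, n_lags=3)
mae = mean_absolute_error(y, predict_last_value(X))
```

A gradient boosting model trained on the same `X` and `y` is only worth keeping if its error on held-out hours is lower than this baseline's.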

Dependencies

Core dependencies include:

  • pandas & numpy: Data manipulation
  • scikit-learn: ML utilities
  • xgboost & lightgbm: Gradient boosting models
  • plotly: Interactive visualizations
  • optuna: Hyperparameter optimization

See pyproject.toml for the complete dependency list.

License

This project is available for educational purposes.
