A machine learning solution for predicting taxi demand in New York City, designed to optimize fleet operations by matching driver supply with passenger demand.
This project implements an end-to-end machine learning pipeline to forecast taxi demand patterns. By analyzing historical trip data, the system helps operations teams maintain optimal fleet utilization, reducing both driver idle time and passenger wait times.
- Automated data ingestion from NYC TLC Trip Record Data
- Time series data transformation and feature engineering
- Multiple ML models: Baseline, XGBoost, and LightGBM
- Comprehensive data validation and preprocessing pipeline
- Interactive visualization and exploratory data analysis
taxi_demand_predictor/
├── data/
│ ├── raw/ # Raw parquet files from NYC TLC
│ └── transformed/ # Processed time series data
├── notebooks/ # Jupyter notebooks for analysis
│ ├── 01_load_and_validate_raw_data.ipynb
│ ├── 02_transform_raw_data_into_ts_data.ipynb
│ ├── 03_transform_ts_data_into_features_and_target.ipynb
│ ├── 04_transform_raw_data_into_features_and_targets.ipynb
│ ├── 05_visualize_training_data.ipynb
│ ├── 06_baseline_model.ipynb
│ ├── 07_xgboost_model.ipynb
│ ├── 08_lightgbm_model.ipynb
│ └── 09_lightgbm_model_with_feature_engineering.ipynb
├── src/ # Source code modules
│ ├── data.py # Data loading and validation
│ ├── data_split.py # Train/test splitting utilities
│ ├── paths.py # Path configurations
│ └── plot.py # Visualization utilities
├── pyproject.toml # Poetry dependencies
└── README.md
- Python 3.10 or higher
- Poetry (for dependency management)
- Clone the repository:
    git clone https://github.com/Hichemchir/taxi_demand_predictor.git
    cd taxi_demand_predictor
- Install dependencies using Poetry:
    poetry install
- Activate the virtual environment:
    poetry shell

The project follows a sequential workflow through Jupyter notebooks:
- Data Loading: Load and validate raw data from NYC TLC
- Time Series Transformation: Convert raw data into time series format
- Feature Engineering: Generate features and targets for ML models
- Visualization: Explore and visualize training data patterns
- Modeling: Train and evaluate baseline, XGBoost, and LightGBM models
To run the notebooks:
    jupyter notebook
Navigate to the notebooks/ directory and execute them in numerical order.
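
As a rough illustration of the time series transformation step, the sketch below aggregates individual trips into hourly ride counts per pickup location and fills hours without rides with zero. It is not taken from the notebooks: the function name `trips_to_hourly_ts` and the column names `pickup_datetime`, `pickup_location_id`, `pickup_hour`, and `rides` are assumptions made for this example.

```python
import pandas as pd


def trips_to_hourly_ts(trips: pd.DataFrame) -> pd.DataFrame:
    """Count rides per (pickup_hour, pickup_location_id), filling gaps with 0.

    Column names are hypothetical; adapt them to the validated raw data.
    """
    trips = trips.copy()
    # Truncate each pickup timestamp to the start of its hour.
    trips["pickup_hour"] = trips["pickup_datetime"].dt.floor("h")

    # Number of rides observed in each (hour, location) pair.
    counts = (
        trips.groupby(["pickup_hour", "pickup_location_id"])
        .size()
        .rename("rides")
    )

    # Build the full hour x location grid so that hours with no rides
    # show up explicitly as 0 instead of being missing.
    hours = pd.date_range(
        trips["pickup_hour"].min(), trips["pickup_hour"].max(), freq="h"
    )
    locations = sorted(trips["pickup_location_id"].unique())
    grid = pd.MultiIndex.from_product(
        [hours, locations], names=["pickup_hour", "pickup_location_id"]
    )
    return counts.reindex(grid, fill_value=0).reset_index()


# Tiny usage example with made-up trips.
example = pd.DataFrame(
    {
        "pickup_datetime": pd.to_datetime(
            ["2022-01-01 00:10", "2022-01-01 00:50", "2022-01-01 02:05"]
        ),
        "pickup_location_id": [1, 1, 2],
    }
)
hourly = trips_to_hourly_ts(example)
```

Filling the empty hours matters because a gradient boosting model trained on lagged values would otherwise see consecutive rows that are not actually consecutive in time.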
Historical taxi trip data is sourced from the NYC Taxi and Limousine Commission (TLC).
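
Raw monthly files may contain trips whose timestamps fall outside the month they are filed under, so a validation pass is worth having before any aggregation. The following is a minimal sketch of such a check, not the logic in src/data.py: the function name `validate_month` and the column name `pickup_datetime` are assumptions, and the raw files may use a different column name that would need renaming first.

```python
import pandas as pd


def validate_month(trips: pd.DataFrame, year: int, month: int) -> pd.DataFrame:
    """Keep only trips whose pickup time falls inside the given month.

    `pickup_datetime` is a hypothetical column name used for illustration.
    """
    start = pd.Timestamp(year=year, month=month, day=1)
    end = start + pd.DateOffset(months=1)  # first instant of the next month
    in_range = (trips["pickup_datetime"] >= start) & (trips["pickup_datetime"] < end)
    return trips.loc[in_range].reset_index(drop=True)


# Usage with made-up rows: only the middle trip belongs to January 2022.
raw = pd.DataFrame(
    {
        "pickup_datetime": pd.to_datetime(
            ["2021-12-31 23:59", "2022-01-15 08:30", "2022-02-01 00:00"]
        )
    }
)
clean = validate_month(raw, year=2022, month=1)
```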
The project implements and compares three modeling approaches:
- Baseline Model: Simple statistical baseline for benchmarking
- XGBoost: Gradient boosting with optimized hyperparameters
- LightGBM: Efficient gradient boosting with feature engineering
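
To make the comparison concrete, here is a small sketch of how a demand series can be turned into lagged features and a target, and how a naive baseline can be scored against it. The README does not specify which baseline the notebooks use; predicting the most recent observed value is an assumption chosen for illustration, and `make_windows`, `baseline_predict`, and `mean_absolute_error` are hypothetical helper names.

```python
import numpy as np


def make_windows(series, n_lags: int) -> tuple[np.ndarray, np.ndarray]:
    """Slice a 1-D series into (features, target) pairs.

    Row i of the features holds series[i : i + n_lags];
    its target is the next value, series[i + n_lags].
    """
    series = np.asarray(series, dtype=float)
    n_rows = len(series) - n_lags
    if n_rows <= 0:
        raise ValueError("series must be longer than n_lags")
    X = np.stack([series[i : i + n_lags] for i in range(n_rows)])
    y = series[n_lags:]
    return X, y


def baseline_predict(X: np.ndarray) -> np.ndarray:
    """Naive baseline (assumed): next value equals the last observed value."""
    return X[:, -1]


def mean_absolute_error(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Average absolute difference between targets and predictions."""
    return float(np.mean(np.abs(y_true - y_pred)))


# Made-up hourly ride counts for a single location.
rides = [10, 12, 11, 15, 14, 13]
X, y = make_windows(rides, n_lags=3)
mae = mean_absolute_error(y, baseline_predict(X))  # 2.0 for this toy series
```

The same `X` and `y` arrays could be passed to an XGBoost or LightGBM regressor, which makes the baseline error a direct reference point: a boosted model is only worth its complexity if it beats this number on held-out, later-in-time data.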
Core dependencies include:
- pandas & numpy: Data manipulation
- scikit-learn: ML utilities
- xgboost & lightgbm: Gradient boosting models
- plotly: Interactive visualizations
- optuna: Hyperparameter optimization
See pyproject.toml for the complete dependency list.
This project is available for educational purposes.