This project clusters retail transaction data to segment customers by purchasing behavior. The analysis combines RFM analysis (Recency, Frequency, Monetary value) with K-Means clustering. The code is adapted from the tutorial "Clustering Retail Data" and extended with a reusable Python package and a command-line interface.
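To make the RFM step concrete, here is a minimal sketch of how the three features might be derived from transaction rows. It is not the package's actual implementation: the function name `compute_rfm` is hypothetical, and the column names (`Customer ID`, `Invoice`, `InvoiceDate`, `Quantity`, `Price`) are assumed to match the Online Retail II spreadsheet.

```python
import pandas as pd


def compute_rfm(df, snapshot_date=None):
    """Hypothetical helper: build one RFM row per customer.

    Assumes columns 'Customer ID', 'Invoice', 'InvoiceDate',
    'Quantity' and 'Price' (an assumption about the input file).
    """
    # Rows without a customer cannot be attributed to anyone.
    df = df.dropna(subset=["Customer ID"]).copy()
    df["Amount"] = df["Quantity"] * df["Price"]

    # Measure recency against the day after the last recorded purchase.
    if snapshot_date is None:
        snapshot_date = df["InvoiceDate"].max() + pd.Timedelta(days=1)

    return df.groupby("Customer ID").agg(
        recency=("InvoiceDate", lambda s: (snapshot_date - s.max()).days),
        frequency=("Invoice", "nunique"),
        monetary=("Amount", "sum"),
    )
```

Recency is the number of days since a customer's last purchase, frequency counts distinct invoices, and monetary sums the line totals. The resulting table is the kind of input K-Means would then cluster, usually after scaling.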
- Install dependencies (preferably in a virtual environment):
pip install pandas scikit-learn matplotlib seaborn
- Run the pipeline on your retail Excel file:
python -m retail_clustering.pipeline data/online_retail_II.xlsx --out results.csv
The script automatically chooses the number of clusters by comparing silhouette scores across candidate values.
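Silhouette-based selection of the cluster count can be sketched roughly as follows. This is an illustration under assumptions, not the pipeline's actual code: the function name `choose_k`, the candidate range of 2 to 10, and the use of `StandardScaler` are all choices made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def choose_k(features, k_range=range(2, 11), random_state=42):
    """Hypothetical helper: return the k with the highest silhouette score.

    `features` is an array-like of shape (n_samples, n_features),
    for example the recency/frequency/monetary columns.
    """
    # K-Means is distance based, so put all features on a common scale.
    scaled = StandardScaler().fit_transform(features)

    scores = {}
    for k in k_range:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(scaled)
        # Mean silhouette over all samples; higher means better separation.
        scores[k] = silhouette_score(scaled, labels)

    best_k = max(scores, key=scores.get)
    return best_k, scores
```

The silhouette score is only defined for at least two clusters and fewer clusters than samples, so the candidate range should stay within those bounds. Returning the full `scores` mapping alongside `best_k` makes it easy to plot the curve and check whether the maximum is clear-cut.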
Unit tests are provided to validate the RFM calculations and clustering logic:
pytest -q