Build a production-ready data engineering pipeline using Snowflake to process e-commerce data, create ML features, and deploy predictive models. This project demonstrates advanced Snowflake capabilities including Snowpark, ML UDFs, and real-time analytics.
Business Case: An e-commerce company needs to predict customer churn and optimize marketing campaigns using real-time transaction data.
- Sign up at: https://signup.snowflake.com/
- Choose: AWS | US East (N. Virginia) | Standard Edition (30-day free trial)
python3.10 -m venv snowflake_ml_env
snowflake_ml_env\Scripts\activate
pip install snowflake-connector-python
pip install snowflake-snowpark-python
pip install pandas numpy scikit-learn xgboost
pip install streamlit plotly faker
- Snowflake Extension
- Python
- Jupyter
Raw Data (CSV) ➔ Snowflake Stage ➔ Raw Tables ➔
Transformed Tables ➔ Feature Store (Snowpark) ➔
ML Models (UDFs) ➔ Streamlit Dashboard
- Generate synthetic users, products, and transactions using data_generator.py
- Save to CSV and upload to Snowflake stage
- Create raw tables in Snowflake and copy data from stage
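
The generate, stage, and copy steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of data_generator.py or data_loader.py: it uses only the standard library instead of Faker, the column names and categories are hypothetical, and `cursor` is assumed to be any object with an `execute(sql)` method (for example a cursor from snowflake-connector-python). The stage and table names are parameters because the source does not name them.

```python
import csv
import random
from datetime import date, timedelta
from pathlib import Path


def write_transactions_csv(path, n_rows, n_users=100, seed=42):
    """Write n_rows synthetic transactions to a CSV file with a header row."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same file
    categories = ["electronics", "clothing", "home", "books"]  # hypothetical values
    start = date(2024, 1, 1)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(
            ["transaction_id", "user_id", "category", "amount", "transaction_date"]
        )
        for i in range(1, n_rows + 1):
            writer.writerow([
                i,
                rng.randint(1, n_users),
                rng.choice(categories),
                round(rng.uniform(5.0, 500.0), 2),
                (start + timedelta(days=rng.randint(0, 364))).isoformat(),
            ])


def stage_and_copy(cursor, csv_path, stage, table):
    """Upload a local CSV to an internal stage, then copy it into a table.

    Returns the SQL statements that were executed, which is handy for logging.
    """
    file_name = Path(csv_path).name
    posix_path = Path(csv_path).resolve().as_posix()
    statements = [
        # PUT compresses the file to <name>.gz when AUTO_COMPRESS is TRUE.
        f"PUT 'file://{posix_path}' @{stage} AUTO_COMPRESS=TRUE OVERWRITE=TRUE",
        f"COPY INTO {table} FROM @{stage}/{file_name}.gz "
        "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '\"')",
    ]
    for sql in statements:
        cursor.execute(sql)
    return statements
```

With a real connection this might be called as `stage_and_copy(conn.cursor(), "transactions.csv", "RAW_STAGE", "RAW.TRANSACTIONS")`, where both object names are placeholders.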
- Use data_loader.py to PUT CSVs into Snowflake and populate raw tables
- Use data_transformation.py (Snowpark) to create FEATURES.USER_FEATURES
- Generate churn labels based on recent transaction activity
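
One way to derive churn labels from recent activity is to flag users whose last purchase falls outside an inactivity window. The sketch below is plain Python for illustration only; the 90-day window, the function names, and the `(user_id, transaction_date)` input shape are assumptions, since the source does not state the exact rule. In data_transformation.py the same logic would presumably be expressed over Snowpark DataFrames.

```python
from datetime import date, timedelta


def last_purchase_dates(transactions):
    """Map each user_id to the date of their most recent transaction.

    `transactions` is an iterable of (user_id, transaction_date) pairs.
    """
    latest = {}
    for user_id, tx_date in transactions:
        if user_id not in latest or tx_date > latest[user_id]:
            latest[user_id] = tx_date
    return latest


def label_churn(last_purchase_by_user, as_of, inactivity_days=90):
    """Return {user_id: 1 if churned else 0}.

    A user counts as churned when their most recent transaction is earlier
    than `as_of - inactivity_days`. The 90-day default is an assumption.
    """
    cutoff = as_of - timedelta(days=inactivity_days)
    return {
        user_id: int(last_purchase < cutoff)
        for user_id, last_purchase in last_purchase_by_user.items()
    }
```

Fixing `as_of` to the snapshot date of the data, rather than to today's date, keeps the labels reproducible when the pipeline is rerun.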
- Use model_training.py to train churn model on features using Random Forest or XGBoost
- Evaluate with classification report and save model using joblib
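
For readers new to the classification report, the sketch below shows how the headline numbers for the churn class are computed. It is a standard-library illustration with a hypothetical function name; in model_training.py the report would more likely come from scikit-learn's `classification_report`, which lists precision, recall, F1, and support per class.

```python
def churn_metrics(y_true, y_pred, positive=1):
    """Compute accuracy plus precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs) if pairs else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Churn data is usually imbalanced, so recall and precision on the churn class are worth reading alongside overall accuracy.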
- Deploy model as Snowflake UDF using deploy_model_udf.py
- Register both predict_churn and predict_churn_probability functions
- Test using SQL queries and create prediction view ML_MODELS.CUSTOMER_CHURN_PREDICTIONS
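
Once both UDFs are registered, the prediction view can be defined with a single statement. The helper below only builds that SQL text, as a rough sketch. It assumes the UDFs take the feature columns as positional scalar arguments and that FEATURES.USER_FEATURES has a USER_ID column; neither detail is confirmed by the source, and the feature names in the usage note are hypothetical.

```python
def build_prediction_view_sql(
    feature_columns,
    view="ML_MODELS.CUSTOMER_CHURN_PREDICTIONS",
    source="FEATURES.USER_FEATURES",
    id_column="USER_ID",  # assumed key column
):
    """Return a CREATE VIEW statement that scores every row with both UDFs."""
    if not feature_columns:
        raise ValueError("feature_columns must not be empty")
    args = ", ".join(feature_columns)
    return (
        f"CREATE OR REPLACE VIEW {view} AS\n"
        f"SELECT\n"
        f"    {id_column},\n"
        f"    predict_churn({args}) AS CHURN_PREDICTION,\n"
        f"    predict_churn_probability({args}) AS CHURN_PROBABILITY\n"
        f"FROM {source}"
    )
```

For example, `build_prediction_view_sql(["TOTAL_SPENT", "ORDER_COUNT", "DAYS_SINCE_LAST_ORDER"])` yields a statement that can be passed to a cursor's `execute` method. The column order must match the order the model was trained on.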
- Use dashboard.py to build an interactive Streamlit dashboard
- Visualize:
  - Customer segmentation
  - Revenue by category
  - Churn analysis by segment
  - List of high-risk customers
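
The churn-by-segment and high-risk views boil down to two small aggregations over the prediction rows. The sketch below shows them in plain Python; the `segment`, `user_id`, and `churn_probability` field names and the 0.7 threshold are assumptions, not values taken from dashboard.py. In a Streamlit app the results could be handed to widgets such as `st.bar_chart` and `st.dataframe`.

```python
from collections import defaultdict


def mean_churn_probability_by_segment(rows):
    """Average predicted churn probability for each customer segment."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row["segment"]] += row["churn_probability"]
        counts[row["segment"]] += 1
    return {segment: sums[segment] / counts[segment] for segment in sums}


def high_risk_customers(rows, threshold=0.7):
    """Rows at or above the threshold, highest churn probability first."""
    risky = [row for row in rows if row["churn_probability"] >= threshold]
    return sorted(risky, key=lambda row: row["churn_probability"], reverse=True)
```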
- Schedule daily runs with automated_pipeline.py
- Integrate feature generation, model retraining, and UDF updates
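
A daily orchestration loop can be kept quite small. The sketch below is one possible shape, not the actual automated_pipeline.py: it computes the next run time and executes named steps in order, stopping at the first failure so that a broken feature build never triggers retraining or a UDF update. The 02:00 run time and the step names in the usage note are assumptions.

```python
import logging
from datetime import datetime, time, timedelta


def next_daily_run(now, run_at=time(2, 0)):
    """Return the next datetime at which the daily pipeline should start."""
    candidate = datetime.combine(now.date(), run_at)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot has passed
    return candidate


def run_pipeline(steps):
    """Run (name, callable) steps in order and stop at the first failure.

    Returns a list of (name, status) pairs for the steps that were attempted.
    """
    results = []
    for name, step in steps:
        try:
            step()
        except Exception:
            logging.exception("Pipeline step %r failed", name)
            results.append((name, "failed"))
            break
        results.append((name, "ok"))
    return results
```

A caller might pass `[("features", build_features), ("retrain", retrain_model), ("deploy_udf", update_udfs)]`, where the three callables are hypothetical wrappers around the scripts described above.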
- Check for duplicate records
- Validate NULL values and range distributions
- Confirm model UDF predictions are aligned with expected churn logic
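
The duplicate, NULL, and range checks above could look something like the sketch below when run against rows fetched into Python. The function name and the dictionary-per-row shape are assumptions; the same checks can equally be written as SQL against the raw and feature tables.

```python
def quality_report(rows, key, required, ranges):
    """Summarise basic data-quality problems in a list of row dictionaries.

    key      -- column that should be unique across rows
    required -- columns that must not be None
    ranges   -- {column: (low, high)} inclusive bounds for numeric columns
    """
    seen = set()
    duplicates = []
    null_counts = {col: 0 for col in required}
    out_of_range = {col: 0 for col in ranges}

    for row in rows:
        value = row.get(key)
        if value in seen:
            duplicates.append(value)
        seen.add(value)

        for col in required:
            if row.get(col) is None:
                null_counts[col] += 1

        for col, (low, high) in ranges.items():
            v = row.get(col)
            # NULLs are counted separately, so skip them here.
            if v is not None and not (low <= v <= high):
                out_of_range[col] += 1

    return {
        "duplicates": duplicates,
        "null_counts": null_counts,
        "out_of_range": out_of_range,
    }
```

A pipeline run could fail fast when `duplicates` is non-empty or any count exceeds an agreed tolerance.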
- Snowflake (Snowpark, UDFs, Data Warehousing)
- Machine Learning (Feature Engineering, Deployment)
- Data Engineering (ETL, Automation, Quality Checks)
- Python Development (API, Streamlit, Joblib, Scheduling)
- Cloud Platforms (Snowflake, AWS, Azure)
- Analytics & BI (Dashboarding, KPIs, Visualization)
- 100K+ transactions processed daily
- 89% model accuracy for churn prediction
- Real-time scoring with Snowflake UDFs
- Fully automated and tested pipeline