This project analyzes Reddit discussions from skincare subreddits to predict skincare product ratings. It collects data from multiple sources including Reddit and Adore Beauty to build a comprehensive dataset for analysis.
- data/: Raw and processed datasets
  - raw/: Raw data from Reddit and Adore Beauty
  - processed/: Cleaned and transformed data
- notebooks/: Jupyter notebooks for analysis
- src/: Main source code
  - ingestion/: Data collection scripts
    - reddit_scraper.py: Scrapes posts and comments from skincare subreddits
    - adore_beauty_scraper.py: Scrapes product reviews from Adore Beauty
  - processing/: Data cleaning and transformation
  - modeling/: ML model development
- airflow/: Airflow DAGs for the data pipeline
  - dags/: DAG definitions
    - reddit_scraper_dag.py: Daily Reddit data collection
- dbt/: Data transformation models
- mlflow/: ML experiment tracking
- Reddit Data Collection: Scrapes posts and comments from multiple skincare subreddits (AsianBeauty, SkincareAddiction, 30PlusSkinCare)
- Adore Beauty Scraping: Collects product reviews with error handling and proxy rotation
- Automated Data Pipeline: Airflow DAG for daily data collection
- Error Handling: Robust error handling and retry mechanisms for API rate limits
- Data Storage: Organized storage of raw and processed data with timestamps
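The retry behavior for API rate limits could be sketched as a small decorator with exponential backoff and jitter. This is illustrative only: the helper name `with_retries`, its parameters, and the `fetch_posts` placeholder are assumptions, not code from the repo.

```python
import random
import time


def with_retries(max_attempts=5, base_delay=2.0):
    """Retry a function with exponential backoff plus jitter.

    Illustrative helper; the project's actual retry logic may differ.
    """
    def decorator(func):
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise
                    # Backoff: base_delay, 2x, 4x, ... plus random jitter
                    time.sleep(base_delay * (2 ** attempt) + random.random())
        return wrapper
    return decorator


@with_retries(max_attempts=3)
def fetch_posts(subreddit):
    # Placeholder for a real Reddit API call
    return f"posts from r/{subreddit}"
```

Wrapping each API call this way lets a transient 429 response resolve itself instead of killing the whole scrape.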
- Install requirements:

  ```
  pip install -r requirements.txt
  ```

- Configure API keys in src/config.py:
  - Reddit API credentials
  - Adore Beauty API credentials (if applicable)
- Run data ingestion manually:
  - Reddit:

    ```
    python src/ingestion/reddit_scraper.py
    ```

  - Adore Beauty:

    ```
    python src/ingestion/adore_beauty_scraper.py
    ```
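A src/config.py along these lines would hold the credentials; this sketch assumes they are read from environment variables, and all variable names are illustrative, not the repo's actual ones.

```python
import os

# Reddit API credentials (register an app at reddit.com/prefs/apps)
REDDIT_CLIENT_ID = os.getenv("REDDIT_CLIENT_ID", "")
REDDIT_CLIENT_SECRET = os.getenv("REDDIT_CLIENT_SECRET", "")
REDDIT_USER_AGENT = os.getenv("REDDIT_USER_AGENT", "skincare-scraper/0.1")

# Adore Beauty credentials (if applicable)
ADORE_BEAUTY_API_KEY = os.getenv("ADORE_BEAUTY_API_KEY", "")

# Subreddits targeted by the scraper
SUBREDDITS = ["AsianBeauty", "SkincareAddiction", "30PlusSkinCare"]
```

Reading from the environment keeps secrets out of version control; a committed config module with hard-coded keys would work too but is riskier.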
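Each ingestion run can write its output to a timestamped file under data/raw, matching the storage layout described in the features above. The save_raw helper and its filename scheme are assumptions for illustration:

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_raw(records, source, out_dir="data/raw"):
    """Write scraped records to a timestamped JSON file,
    e.g. data/raw/reddit_20240101T120000.json.

    Illustrative helper; the repo's actual storage code may differ.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    path = Path(out_dir) / f"{source}_{stamp}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(records, indent=2))
    return path
```

Timestamped filenames make each daily run append-only, so reprocessing never overwrites earlier raw pulls.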
- Install Airflow:

  ```
  pip install apache-airflow
  ```

- Initialize the database:

  ```
  airflow db init
  ```

- Create a user:

  ```
  airflow users create --username admin --firstname Admin --lastname User --role Admin --email [email protected] --password admin
  ```

- Start the scheduler:

  ```
  airflow scheduler
  ```

- Start the webserver:

  ```
  airflow webserver
  ```

- Access the Airflow UI at http://localhost:8080
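Once the scheduler and webserver are running, a daily DAG along the lines of reddit_scraper_dag.py would drive the collection. This is a sketch only: the task ID, schedule, and imported entry point are assumptions, not the repo's actual definitions.

```python
# Illustrative sketch of airflow/dags/reddit_scraper_dag.py
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Assumed entry point; the real scraper module may expose a different name
from src.ingestion.reddit_scraper import main as scrape_reddit

default_args = {
    "owner": "airflow",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="reddit_scraper",
    default_args=default_args,
    schedule_interval="@daily",  # daily Reddit data collection
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    PythonOperator(
        task_id="scrape_reddit",
        python_callable=scrape_reddit,
    )
```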
The project can be deployed to Google Cloud Platform using Cloud Composer for managed Airflow:
- Create a GCP project and enable necessary APIs
- Set up Cloud Storage for DAGs and data
- Create a Cloud Composer environment
- Upload DAGs and source code to Cloud Storage
- Configure environment variables and connections
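With the gcloud CLI, the Composer steps above could look roughly like this; the environment name, region, and DAG path are placeholders, not values from the project.

```shell
# Enable the Cloud Composer API (placeholder project assumed active)
gcloud services enable composer.googleapis.com

# Create a managed Airflow environment
gcloud composer environments create skincare-pipeline \
    --location us-central1

# Upload the DAG file to the environment's DAGs bucket
gcloud composer environments storage dags import \
    --environment skincare-pipeline \
    --location us-central1 \
    --source airflow/dags/reddit_scraper_dag.py
```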
- Move the Airflow DAGs to a cloud deployment on GCP
- Implement ML model training pipeline
- Add data quality checks
- Expand to additional data sources
- Create a web dashboard for insights