🚨 NOTE: This project is hosted on a temporary GCP environment. The dashboards may become unavailable after the cloud credits expire. 🚨
This project is an end-to-end data pipeline that ingests, transforms, and visualizes grocery sales data using:
- Apache Airflow for orchestration
- Apache Spark for transformations
- Google Cloud Platform (GCS & BigQuery) for storage and querying
- Looker Studio for visualization
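At a high level, data moves through the pipeline like this:

```text
raw CSVs ──> GCS ──> BigQuery external tables
              |
              └──> PySpark transformations ──> Parquet + BigQuery tables ──> Looker Studio
```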
Project structure:

```
.
├── airflow/
│   └── dags/
│       ├── data_ingestion_gcs_dag.py
│       └── spark_grocery_transformations_dag.py
├── credentials/
│   └── google_credentials.json
├── terraform/
│   ├── main.tf
│   └── variables.tf
├── .env.example
├── docker-compose.yml
├── Dockerfile
└── README.md
```
Prerequisites:

- Docker installed
- Terraform installed
- A GCP project with:
  - BigQuery and GCS enabled
  - A service account with the following roles:
    - BigQuery Admin
    - Storage Admin
Copy the example environment file:

```bash
cp .env.example .env
```

Then edit `.env` and populate it with your actual values:
```env
# Terraform
TF_VAR_project=your-gcp-project-id
TF_VAR_region=us-central1
TF_VAR_location=US
TF_VAR_zone=us-central1-a
TF_VAR_bq_dataset_name=grocery_sales_pipeline
TF_VAR_gcs_storage=your-gcs-bucket-name

# Airflow
AIRFLOW_PROJ_DIR=./airflow
AIRFLOW_UID=501
GCP_PROJECT_ID=your-gcp-project-id
GCP_GCS_BUCKET=your-gcs-bucket-name
BIGQUERY_DATASET=grocery_sales_pipeline
AIRFLOW_CONN_SPARK='{
    "conn_type": "spark",
    "host": "spark://spark-master",
    "port": 7077
}'
```
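The `AIRFLOW_CONN_SPARK` entry registers an Airflow connection whose ID is derived from the variable name (here, `spark`). As a hedged illustration of how a DAG could reference it through the `SparkSubmitOperator` (the DAG ID, task ID, and application path below are assumptions for illustration, not this repo's actual code):

```python
# Illustrative only -- the real DAGs live in airflow/dags/.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(dag_id="spark_submit_sketch", start_date=datetime(2024, 1, 1), schedule=None):
    SparkSubmitOperator(
        task_id="transform_grocery_data",              # assumed task name
        conn_id="spark",                               # resolves to spark://spark-master:7077
        application="/opt/airflow/jobs/transform.py",  # assumed path
    )
```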
Place your GCP service account key at:

```
credentials/google_credentials.json
```
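To sanity-check the key before running the pipeline, you can load it with the GCS client library (a minimal sketch, assuming the `google-cloud-storage` package is installed locally):

```python
# Optional credential check -- verifies the key file parses and can
# authenticate against GCS.
from google.cloud import storage

client = storage.Client.from_service_account_json(
    "credentials/google_credentials.json"
)
print(list(client.list_buckets()))  # should list the buckets in your project
```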
Navigate to the `terraform` folder and run:

```bash
cd terraform
terraform init
terraform apply
```
This will create:
- GCS bucket for raw and processed data
- BigQuery dataset for storing external and transformed tables
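Once `terraform apply` completes, you can confirm the resources exist with the Google Cloud client libraries (a sketch, assuming `google-cloud-storage` and `google-cloud-bigquery` are installed and the variables from `.env` are exported):

```python
import os

from google.cloud import bigquery, storage

os.environ.setdefault(
    "GOOGLE_APPLICATION_CREDENTIALS", "credentials/google_credentials.json"
)

# Bucket created by Terraform (TF_VAR_gcs_storage)
bucket = storage.Client().get_bucket(os.environ["GCP_GCS_BUCKET"])
print("bucket OK:", bucket.name)

# Dataset created by Terraform (TF_VAR_bq_dataset_name)
dataset = bigquery.Client().get_dataset(os.environ["BIGQUERY_DATASET"])
print("dataset OK:", dataset.dataset_id)
```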
Start all services:

```bash
docker-compose up
```
Once the services are running, visit:

- Airflow UI: http://localhost:8080 (log in with the default credentials: username `airflow`, password `airflow`)
- Spark master UI: http://localhost:8081 (monitor running Spark applications)
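If you prefer scripting a readiness check over opening a browser, the Airflow webserver exposes a `/health` endpoint (a small sketch using `requests`, which is not part of this project's dependencies):

```python
import requests

# Returns JSON with the scheduler and metadatabase status.
resp = requests.get("http://localhost:8080/health", timeout=5)
print(resp.status_code, resp.json())
```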
The project ships two production DAGs:

- `data_ingestion_gcs_dag` (see the sketch below)
  - Downloads raw CSVs
  - Uploads them to GCS
  - Creates BigQuery external tables
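For reference, a hedged sketch of this ingestion pattern using the standard Google provider operators; the DAG ID, task IDs, and file paths are illustrative assumptions, not the repo's actual code:

```python
# Illustrative sketch -- see airflow/dags/data_ingestion_gcs_dag.py for the real DAG.
import os
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCreateExternalTableOperator,
)
from airflow.providers.google.cloud.transfers.local_to_gcs import (
    LocalFilesystemToGCSOperator,
)

BUCKET = os.environ["GCP_GCS_BUCKET"]

with DAG(dag_id="ingestion_sketch", start_date=datetime(2024, 1, 1), schedule=None):
    # Upload a previously downloaded CSV to the raw area of the bucket
    upload = LocalFilesystemToGCSOperator(
        task_id="upload_sales_csv",
        src="/opt/airflow/data/sales.csv",  # assumed local path
        dst="raw/sales.csv",
        bucket=BUCKET,
    )

    # Expose the uploaded file as a BigQuery external table
    external_table = BigQueryCreateExternalTableOperator(
        task_id="create_external_table",
        table_resource={
            "tableReference": {
                "projectId": os.environ["GCP_PROJECT_ID"],
                "datasetId": os.environ["BIGQUERY_DATASET"],
                "tableId": "external_sales",
            },
            "externalDataConfiguration": {
                "sourceFormat": "CSV",
                "sourceUris": [f"gs://{BUCKET}/raw/sales.csv"],
            },
        },
    )

    upload >> external_table
```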
- `spark_grocery_transformations_dag` (see the sketch below)
  - Reads from GCS
  - Transforms with PySpark
  - Writes Parquet and transformed tables into BigQuery
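A minimal sketch of the PySpark transformation pattern; the paths, column names, and aggregation are assumptions for illustration, and the real logic lives in the Spark job this DAG submits:

```python
# Illustrative PySpark job -- reads raw CSVs from GCS, aggregates,
# and writes Parquet back to GCS (paths and columns are assumed).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("grocery_transformations").getOrCreate()

raw = spark.read.option("header", True).csv("gs://your-gcs-bucket-name/raw/")

# Example aggregation: daily sales totals (column names are hypothetical).
daily_sales = raw.groupBy("sale_date").agg(F.sum("amount").alias("total_amount"))

daily_sales.write.mode("overwrite").parquet(
    "gs://your-gcs-bucket-name/processed/daily_sales/"
)

# Loading into BigQuery could use the spark-bigquery connector, e.g.:
# daily_sales.write.format("bigquery") \
#     .option("table", "grocery_sales_pipeline.fact_sales") \
#     .option("temporaryGcsBucket", "your-gcs-bucket-name") \
#     .save()
```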
🛑 `hello_spark_dag` is for testing only and is not part of the production pipeline.
Connect to the following BigQuery tables for visualization:

- `fact_sales`

Live dashboard: https://lookerstudio.google.com/s/vzCIMs1DR7k
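To spot-check `fact_sales` outside Looker Studio, you could query it with the `google-cloud-bigquery` client (a minimal sketch; the dataset name comes from `.env`, and the unqualified table reference assumes the client's default project):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Peek at the transformed fact table (dataset name from BIGQUERY_DATASET in .env).
query = "SELECT * FROM `grocery_sales_pipeline.fact_sales` LIMIT 10"
for row in client.query(query).result():
    print(dict(row))
```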
📌 Screenshots of dashboards go here:
Made with ❤️ using open data, open source, and cloud tools.