This project is a cloud-based data engineering pipeline for analyzing beverage sales. The pipeline ingests raw sales data, transforms it using dbt, and orchestrates workflows with Kestra. The processed data is stored in Google BigQuery, and insights are visualized through Looker Studio dashboards.
- Orchestration: Kestra (workflow automation)
- Transformations: dbt (Data Build Tool) for data modeling and transformations
- Cloud Platform: Google Cloud Platform (GCP)
- Infrastructure as Code: Terraform
- Visualization: Looker Studio
```
.
├── README.md                  # Project documentation
├── .env                       # Environment variables (update as needed)
├── .env.example               # Example environment file
├── dashboard                  # Looker Studio dashboards
├── dbt                        # dbt transformation logic
│   └── beverage_sales         # dbt project directory
│       ├── models             # Core, marts, and staging models
│       ├── macros             # Custom dbt macros
│       ├── tests              # dbt tests
│       ├── seeds              # Seed data
│       ├── snapshots          # Snapshot tables
│       └── dbt_project.yml    # dbt project config
├── docker                     # Docker configuration
│   └── docker-compose.yml     # Services setup
├── kestra                     # Kestra workflows for orchestration
│   ├── config.yml             # Kestra configuration
│   ├── flows                  # Workflow definitions
│   └── data                   # Ingested data
└── terraform                  # Infrastructure as code
    ├── main.tf                # Terraform configuration
    └── variables.tf           # Terraform variables
```
- Data Ingestion
  - Raw CSV data is stored in Google Cloud Storage (GCS).
  - Kestra orchestrates ingestion into an external BigQuery table.
- Transformations with dbt
  - The data is cleaned, modeled, and enriched into core fact and dimension tables.
  - Analytical tables cover customer insights, sales performance, seasonal trends, and top products.
- Orchestration with Kestra
  - Automates data loading, transformations, and scheduled runs (a flow sketch follows this list).
- Visualization in Looker Studio
  - Data is presented in interactive dashboards for analysis.
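To make the orchestration concrete, here is a minimal sketch of a Kestra flow in the shape this pipeline uses. It is illustrative only: the flow id, input, bucket, and table names are placeholders, and the real definitions live in `kestra/flows`.

```yaml
id: ingest_and_load_sketch        # illustrative; see kestra/flows for the actual flows
namespace: zoomcamp

inputs:
  - id: file                      # assumed FILE input holding the raw CSV
    type: FILE

tasks:
  # 1. Upload the raw CSV to Google Cloud Storage
  - id: upload_csv
    type: io.kestra.plugin.gcp.gcs.Upload
    from: "{{ inputs.file }}"
    to: "gs://<YOUR_BUCKET>/raw/beverage_sales.csv"

  # 2. Expose the uploaded file as an external BigQuery table
  - id: create_external_table
    type: io.kestra.plugin.gcp.bigquery.Query
    sql: |
      CREATE OR REPLACE EXTERNAL TABLE `<YOUR_GCP_PROJECT_ID>.<YOUR_GCP_DATASET>.<YOUR_DATA_TABLE_NAME>`
      OPTIONS (format = 'CSV', uris = ['gs://<YOUR_BUCKET>/raw/beverage_sales.csv'])
```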
- Google Cloud project with BigQuery and Cloud Storage enabled.
- Terraform installed (`brew install terraform` or download).
- Docker installed (`brew install docker` or Docker Desktop).
- Clone the repository:

  ```bash
  git clone https://github.com/Sharonsyra/beverage-sales-data-engineering-project
  cd beverage-sales-data-engineering-project
  ```
- Make a copy of the `.env.example` file and rename it to `.env`:

  ```bash
  cp .env.example .env
  ```
- Fill `.env` with your Google Cloud credentials.
- Run this script to set up secrets for Kestra:

  ```bash
  x=23  # Change this to the line number you want to start from
  awk "NR >= $x" .env | while IFS='=' read -r key value; do
    echo "SECRET_$key=$(echo -n "$value" | base64)"
  done >> .env  # Append to the existing .env file
  ```
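  This works because Kestra's `secret()` function reads environment variables named `SECRET_<KEY>` whose values are base64-encoded, which is exactly what this loop produces: a hypothetical line `GCP_PROJECT_ID=my-project` in `.env` would gain a companion `SECRET_GCP_PROJECT_ID=bXktcHJvamVjdA==`.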
- Set environment variables:

  ```bash
  set -o allexport; source .env; set +o allexport
  ```
- Deploy infrastructure with Terraform:

  ```bash
  cd terraform
  terraform init
  terraform apply
  ```
- Start Docker services:

  ```bash
  cd docker
  docker-compose --env-file ../.env up --build
  ```
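  The authoritative service definitions are in `docker/docker-compose.yml`; the outline below is only a sketch of what a typical Kestra compose setup contains, assuming a Postgres backing store and default ports (none of these values are read from the project):

  ```yaml
  # Hypothetical outline, not the project's actual docker-compose.yml
  services:
    postgres:
      image: postgres:16            # assumed metadata store for Kestra
      environment:
        POSTGRES_DB: kestra
        POSTGRES_USER: kestra
        POSTGRES_PASSWORD: k3str4   # placeholder; supply your own via .env

    kestra:
      image: kestra/kestra:latest
      command: server standalone    # single-node Kestra server
      ports:
        - "8080:8080"               # matches the UI URL used in the steps below
      depends_on:
        - postgres
  ```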
- Update `staging/schema.yml`: before running dbt, you must edit `dbt/beverage_sales/models/staging/schema.yml` with your GCP details. Update the following section:

  ```yaml
  sources:
    - name: staging
      database: <YOUR_GCP_PROJECT_ID>   # Replace with your actual GCP project ID
      schema: <YOUR_GCP_DATASET>        # Replace with your actual BigQuery dataset
      tables:
        - name: <YOUR_DATA_TABLE_NAME>  # Replace with your actual table name
  ```

  Make sure the values match those in your `.env` file.
Run dbt transformations:
cd dbt/beverage_sales dbt compile dbt run
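  The dbt project ships `tests` and model schema files (see the tree above); as a hedged sketch of the kind of column checks a core fact table here might declare (the model and column names are assumptions, not taken from the project):

  ```yaml
  # Hypothetical models/core/schema.yml entry; fct_sales and its columns are illustrative
  models:
    - name: fct_sales
      columns:
        - name: sale_id
          tests:
            - unique
            - not_null
        - name: customer_id
          tests:
            - not_null
  ```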
- Access dashboards in Looker Studio (links above).
To securely store your Google Cloud Service Account JSON, follow these steps:

- Open the Kestra UI at http://localhost:8080.
- Navigate to Namespaces.
- Select the zoomcamp namespace.
- Click KV Store and enter the following:
  - Key: `GCP_CREDS`
  - Type: `JSON`
  - Value: paste the contents of your `service-account.json` file.
- Click Save.

Once added, the service account credentials will be securely accessible inside Kestra workflows (see the snippet below).
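As a rough sketch of how a flow task can read that entry, assuming a hypothetical task id and query (`kv()` is Kestra's expression function for reading namespace KV entries):

```yaml
# Hypothetical task snippet: authenticate a BigQuery task with the stored JSON
- id: check_connection                        # assumed task id
  type: io.kestra.plugin.gcp.bigquery.Query
  serviceAccount: "{{ kv('GCP_CREDS') }}"     # reads the GCP_CREDS value saved above
  sql: SELECT 1
```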
- Load Kestra KV Store
  - Open the Kestra UI at http://localhost:8080.
  - Navigate to Flows.
  - Select `zoomcamp.gcp_kv`.
  - Click the Execute button (top-right).
  - Ensure `GCP_CREDS` is set as described in the instructions above.
- Ingest Data
  - Open the Kestra UI at http://localhost:8080.
  - Navigate to Flows.
  - Select `zoomcamp.gcp_ingest_and_load`.
  - Click the Execute button (top-right).
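  These manual Execute clicks can be replaced by a schedule: the overview above mentions scheduled runs, and in Kestra that is done with a Schedule trigger on the flow. A minimal sketch, assuming a daily 06:00 run (the trigger id and cron expression are illustrative; on older Kestra versions the type is `io.kestra.core.models.triggers.types.Schedule`):

  ```yaml
  # Hypothetical trigger block added to a flow such as zoomcamp.gcp_ingest_and_load
  triggers:
    - id: daily_ingest                            # assumed trigger id
      type: io.kestra.plugin.core.trigger.Schedule
      cron: "0 6 * * *"                           # every day at 06:00
  ```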
- Run dbt Transformations
  - Open the Kestra UI at http://localhost:8080.
  - Navigate to Flows.
  - Select `zoomcamp.gcp_dbt`.
  - Click the Execute button (top-right).
  - Note: this flow only runs on my GCP instance, since it syncs from GitHub; it is just an extra flow.
The Kestra flows in this project:

- `gcp_dbt_flow`
- `gcp_ingest_and_load`
- `gcp_kv`
- `zoomcamp_kv_store`
- Implement data quality checks using Great Expectations.
- Optimize cost efficiency in BigQuery storage.
- Add real-time streaming with Pub/Sub and Dataflow.
- Sharon Waithîra – GitHub