Project for the DataTalksClub/Data Engineering Zoomcamp
This project is part of the Data Engineering Zoomcamp, a course organized by DataTalksClub. The goal of this project is to apply everything learned in the course to build an end-to-end data pipeline.
-
This project uses data from English Premier League matches over the past 10 seasons (2014/2015-2023/2024), taken from https://www.football-data.co.uk/. The goal is to build a dashboard for evaluating football teams, helping bettors assess risk before any money goes to a bookmaker. Because players are transferred every season, squad changes cannot be predicted from historical data, so all figures are for reference only. Think carefully before placing a bet.
- Betting is illegal in some countries and can result in criminal prosecution. I do not endorse online betting and accept no liability for losses incurred by relying on this dashboard.
- Container: Docker
- IaC: Terraform
- Cloud: Google Cloud Platform (GCP)
- Orchestration: Airflow
- Data Lake: Google Cloud Storage (GCS)
- Data Warehouse: BigQuery
- Transformation: Data build tool (dbt)
- Visualization: Looker Studio
- Installed locally:
- Terraform
- Python 3
- Docker & docker-compose
- A project in Google Cloud Platform
-
To run this project, you need to clone this repository:
```bash
git clone https://github.com/truongvude/epl_statistics
```
-
Terraform
- Set up GCP for the first time.
- Move to the terraform folder. Update the credentials, gcs_bucket_name, and bq_dataset_name variables in the variables.tf file as desired.
- Run these commands to execute Terraform:
```bash
# Login to gcloud CLI
gcloud auth application-default login

# Initialize state file (.tfstate)
terraform init

# Check changes to new infra
terraform plan

# Create new infra
terraform apply
```
-
Airflow + BigQuery
- Set up Airflow with Docker.
- Change GCP_PROJECT_ID and GCP_GCS_BUCKET in docker-compose.yaml, and BIGQUERY_DATASET in data_ingestion_gcs_dag.py, to match your configuration (a hedged sketch of how the DAG might read these values appears after this list).
- Run these commands:
```bash
# Move to airflow folder
cd airflow

# Build the image (only first-time, or when there's any change in the Dockerfile; takes ~15 mins the first time)
docker compose build

# Initialize the Airflow scheduler, DB, and other config
docker compose up airflow-init

# Kick up all the services from the container
docker compose up
```
- Login to Airflow web UI on localhost:8080 with default creds (username/password): airflow/airflow
- Run the DAG on the web UI. When the run finishes, or to shut down the containers, run:
```bash
docker compose down
```
- Check your external table in BigQuery (see the verification sketch below).
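The snippet below is a minimal sketch of what a DAG like data_ingestion_gcs_dag.py could look like, assuming it downloads one season's CSV from football-data.co.uk, uploads it to GCS, and registers a BigQuery external table. The environment variable names follow this README; the season code, URL pattern, file paths, schedule, and table names are illustrative assumptions and may differ from the actual DAG in this repository.

```python
# Hypothetical sketch: download -> GCS -> BigQuery external table.
# GCP_PROJECT_ID, GCP_GCS_BUCKET and BIGQUERY_DATASET come from the README;
# everything else (URL, file name, schedule) is illustrative.
import os
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCreateExternalTableOperator,
)
from google.cloud import storage

PROJECT_ID = os.environ.get("GCP_PROJECT_ID")
BUCKET = os.environ.get("GCP_GCS_BUCKET")
BIGQUERY_DATASET = os.environ.get("BIGQUERY_DATASET", "epl_data")

SEASON = "2324"  # example season; E0 = Premier League on football-data.co.uk
CSV_URL = f"https://www.football-data.co.uk/mmz4281/{SEASON}/E0.csv"
LOCAL_FILE = f"/tmp/epl_{SEASON}.csv"
GCS_OBJECT = f"raw/epl_{SEASON}.csv"


def download_csv() -> None:
    """Download one season's results CSV to the local filesystem."""
    response = requests.get(CSV_URL, timeout=60)
    response.raise_for_status()
    with open(LOCAL_FILE, "wb") as f:
        f.write(response.content)


def upload_to_gcs() -> None:
    """Upload the downloaded CSV to the data-lake bucket."""
    client = storage.Client()
    client.bucket(BUCKET).blob(GCS_OBJECT).upload_from_filename(LOCAL_FILE)


with DAG(
    dag_id="data_ingestion_gcs_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    download_task = PythonOperator(task_id="download_csv", python_callable=download_csv)
    upload_task = PythonOperator(task_id="upload_to_gcs", python_callable=upload_to_gcs)
    external_table_task = BigQueryCreateExternalTableOperator(
        task_id="create_external_table",
        table_resource={
            "tableReference": {
                "projectId": PROJECT_ID,
                "datasetId": BIGQUERY_DATASET,
                "tableId": f"epl_{SEASON}_external",
            },
            "externalDataConfiguration": {
                "sourceFormat": "CSV",
                "sourceUris": [f"gs://{BUCKET}/{GCS_OBJECT}"],
                "autodetect": True,
            },
        },
    )

    download_task >> upload_task >> external_table_task
```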
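To verify the external table outside the BigQuery console, a quick row count with the official Python client is enough. The project, dataset, and table names below are assumptions; substitute whatever your DAG created.

```python
# Sanity check of the external table using the BigQuery Python client.
# The dataset/table names are assumptions; adjust to match your configuration.
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

query = """
    SELECT COUNT(*) AS matches
    FROM `your-project-id.epl_data.epl_2324_external`
"""
for row in client.query(query).result():
    print(f"Rows in external table: {row.matches}")
```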
-
dbt
-
Looker Studio
In this step you connect your BigQuery table to Looker Studio:
- Go to Looker Studio: https://lookerstudio.google.com/.
- Create a blank report, then select BigQuery under Google Connectors. Select your project, dataset, and table.
- Create your dashboard.
- Link dashboard: EPL Statistics