epl_statistics

Project for the DataTalksClub/Data Engineering Zoomcamp

Overview

This project is part of the Data Engineering Zoomcamp, a course organized by DataTalksClub. The goal of this project is to apply everything learned in the course to build an end-to-end data pipeline.

Problem

  • This project uses data from English Premier League matches over the past 10 seasons (2014/2015-2023/2024), taken from https://www.football-data.co.uk/. The goal is to build a dashboard for scouting potential football teams, so bettors can assess their favorite team and reduce risk before handing money to a bookmaker. Because players are transferred every season and squad changes cannot be predicted, all data is for reference only. Think carefully before placing a bet.

    Disclaimer

    • Betting is illegal in some countries and can result in criminal prosecution. I do not endorse online betting, and I accept no liability for losses or decisions made in reliance on this dashboard.

Dataflow diagram

Stack

  • Container: Docker
  • IaC: Terraform
  • Cloud: Google Cloud Platform (GCP)
  • Orchestration: Airflow
  • Data Lake: Google Cloud Storage (GCS)
  • Data Warehouse: BigQuery
  • Transformation: Data build tool (dbt)
  • Visualization: Looker Studio

Tutorial

Prerequisites

  • Installed locally:
    • Terraform
    • Python 3
    • Docker & docker-compose
  • A project in Google Cloud Platform

Setup

  1. To run this project, you need to clone this repository:

    git clone https://github.com/truongvude/epl_statistics
  2. Terraform

    • Set up GCP for the first time.
    • Move to the terraform folder. Update the credentials, gcs_bucket_name, and bq_dataset_name variables in variables.tf to your desired values.
    • Run the following commands to execute Terraform:
    # Log in to the gcloud CLI
    gcloud auth application-default login
    # Initialize state file (.tfstate)
    terraform init
    # Check changes to new infra plan
    terraform plan
    # Create new infra
    terraform apply
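
    Once terraform apply finishes, it can help to confirm that the bucket and dataset actually exist before moving on to Airflow. Below is a minimal sketch using the Google Cloud Python clients, assuming application-default credentials are configured; the names are placeholders for whatever you set in variables.tf, not values from this repository.

    # check_infra.py - verify the Terraform-created resources exist (illustrative, not part of the repo)
    from google.cloud import bigquery, storage

    BUCKET_NAME = "your-gcs-bucket"     # placeholder: value of gcs_bucket_name in variables.tf
    DATASET_NAME = "your-bq-dataset"    # placeholder: value of bq_dataset_name in variables.tf

    bucket = storage.Client().get_bucket(BUCKET_NAME)      # raises NotFound if the bucket is missing
    print(f"Bucket OK: {bucket.name} (location: {bucket.location})")

    dataset = bigquery.Client().get_dataset(DATASET_NAME)  # raises NotFound if the dataset is missing
    print(f"Dataset OK: {dataset.dataset_id}")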
    
  3. Airflow + BigQuery

    # Move to airflow folder
    cd airflow
    # Build the image (only needed the first time or when the Dockerfile changes; the first build takes ~15 minutes)
    docker compose build
    # Initialize the Airflow scheduler, DB, and other config
    docker compose up airflow-init
    # Start up all the services from the container:
    docker compose up
    • Log in to the Airflow web UI at localhost:8080 with the default credentials (username/password): airflow/airflow
    • Run the DAG from the web UI. When your run finishes, or to shut down the containers:
    docker compose down
    • Check your external table in BigQuery.
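
    The DAGs in this repository handle downloading the season CSVs and loading them into GCS and BigQuery. The sketch below shows the general shape of such a pipeline, not the repo's actual code: the bucket, dataset, and table names are placeholders, and the football-data.co.uk URL pattern (E0 is the Premier League file) is assumed from the site's download layout.

    # epl_ingest_dag.py - illustrative Airflow DAG: CSV -> GCS -> BigQuery external table
    import pendulum
    import requests
    from airflow.decorators import dag, task
    from google.cloud import bigquery, storage

    BUCKET = "your-gcs-bucket"       # placeholder: gcs_bucket_name from variables.tf
    DATASET = "your-bq-dataset"      # placeholder: bq_dataset_name from variables.tf
    SEASON = "2324"                  # football-data.co.uk encodes 2023/2024 as 2324
    URL = f"https://www.football-data.co.uk/mmz4281/{SEASON}/E0.csv"

    @dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
    def epl_ingest():
        @task
        def download_to_gcs() -> str:
            # Download one season's CSV and land it in the data lake
            data = requests.get(URL, timeout=60).content
            blob_name = f"epl/{SEASON}/E0.csv"
            storage.Client().bucket(BUCKET).blob(blob_name).upload_from_string(data)
            return f"gs://{BUCKET}/{blob_name}"

        @task
        def create_external_table(uri: str) -> None:
            # Expose the GCS file as an external table in BigQuery
            client = bigquery.Client()
            table = bigquery.Table(f"{client.project}.{DATASET}.epl_matches_external")
            external_config = bigquery.ExternalConfig("CSV")
            external_config.source_uris = [uri]
            external_config.autodetect = True
            table.external_data_configuration = external_config
            client.create_table(table, exists_ok=True)

        create_external_table(download_to_gcs())

    epl_ingest()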
  4. dbt

    • Set up your dbt Cloud account and project.
    • Go to Develop -> Cloud IDE.
    • Copy the code from this folder into your dbt project.
    • Run dbt build to execute the models.
    • Check your dataset in BigQuery.
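
    After dbt build completes, one quick way to sanity-check a transformed table from outside the BigQuery console is a small query with the Python client. The dataset and model names below are placeholders for whatever your dbt project produces.

    # verify_dbt.py - count rows in a dbt-built table (names are placeholders)
    from google.cloud import bigquery

    client = bigquery.Client()
    query = "SELECT COUNT(*) AS n FROM `your-bq-dataset.your_dbt_model`"
    for row in client.query(query).result():
        print(f"Rows in dbt model: {row.n}")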
  5. Looker Studio

    In this step you connect your BigQuery table to Looker Studio.

    • Go to Looker Studio: https://lookerstudio.google.com/.
    • Create a blank report -> select BigQuery under Google Connectors, then select your project, dataset, and table.
    • Create your dashboard.

Dashboard
