Problem statement: develop a dashboard with two tiles by (with my progress):
- Selecting a dataset of interest
- Creating a pipeline for processing this dataset and loading it into a data lake
- Creating a pipeline for moving the data from the lake to a data warehouse
- Transforming the data in the data warehouse to prepare it for the dashboard
- Building a dashboard to visualize the data
- In a terminal, run the following to generate an SSH key pair:
ssh-keygen -t ed25519 -f ~/.ssh/covid_project_gcp -C cncPomper
NOTE: cncPomper becomes our username on the VM created later.
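The NOTE above says the key comment ends up as the VM username, so it can be worth checking which comment a public key carries before pasting it into GCP. A minimal sketch, assuming the usual one-line OpenSSH public key layout (type, key, comment); `key_comment` is a hypothetical helper name, not part of any tool.

```shell
# key_comment: print the comment field of an OpenSSH public key file.
# Assumes the standard one-line layout: <type> <base64-key> <comment>.
key_comment() {
    awk 'NR == 1 { print $3 }' "$1"
}

# Example (path taken from the ssh-keygen command above):
# key_comment ~/.ssh/covid_project_gcp.pub
```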
- Add the generated SSH key to GCP
  - go to Settings > Metadata > Add SSH key
  - add the generated public key
- Create VM instance
  - Region <- europe
  - Zone <- europe-b
  - Machine type <- e2-standard-4 (4 vCPU, 16 GB memory)
  - Boot disk
    - OS <- Ubuntu
    - Version <- Ubuntu 20.04 LTS
    - Size <- 50 GB
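The same VM settings could probably be expressed with the gcloud CLI instead of the console, although the original setup does not use it. The sketch below only prints the command so it can be reviewed first. Assumptions: `print_vm_create_cmd` is a hypothetical helper, the instance name and zone are supplied by you (the exact zone is not given above), and `ubuntu-2004-lts` in `ubuntu-os-cloud` is assumed to be the image family for Ubuntu 20.04 LTS.

```shell
# print_vm_create_cmd: print (do not run) a gcloud command matching the settings above.
# Arguments: $1 = instance name, $2 = zone (both chosen by you).
print_vm_create_cmd() {
    echo "gcloud compute instances create $1" \
         "--zone=$2" \
         "--machine-type=e2-standard-4" \
         "--image-family=ubuntu-2004-lts" \
         "--image-project=ubuntu-os-cloud" \
         "--boot-disk-size=50GB"
}

# Example (placeholders, not real values):
# print_vm_create_cmd INSTANCE_NAME ZONE
```

Printing rather than executing keeps the sketch safe to try on a machine without gcloud installed; copy the printed line into a terminal once the values look right.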
- Connect to VM
- SSH into the VM
ssh -i ~/.ssh/covid_project_gcp cncPomper@EXTERNAL_IP_ADDRESS_OF_VM
NOTE: To make our lives easier, we could create a Host profile in ~/.ssh/config
Host covid-project
    HostName EXTERNAL_IP_ADDRESS_OF_VM
    User cncPomper
    IdentityFile ~/.ssh/covid_project_gcp
On Windows, use c:/Users/MS_USERNAME/.ssh/covid_project_gcp as the IdentityFile instead. With this profile in place, connect with
ssh covid-project
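The Host profile above can also be written from a script, which helps when the VM's external IP changes and the entry has to be recreated. A minimal sketch, assuming a plain-text SSH config file; `add_ssh_host` is a hypothetical helper that appends the block only when no `Host` line with the same alias exists yet.

```shell
# add_ssh_host: append a Host block to an SSH config file unless the alias exists.
# Arguments: $1 = config file, $2 = alias, $3 = host name or IP,
#            $4 = user, $5 = identity file.
add_ssh_host() {
    if grep -qxF "Host $2" "$1" 2>/dev/null; then
        echo "Host $2 already present in $1"
        return 0
    fi
    cat >> "$1" <<EOF
Host $2
    HostName $3
    User $4
    IdentityFile $5
EOF
}

# Example (single quotes keep ~ literal so ssh expands it later):
# add_ssh_host ~/.ssh/config covid-project EXTERNAL_IP_ADDRESS_OF_VM cncPomper '~/.ssh/covid_project_gcp'
```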
- Install Anaconda
wget https://repo.anaconda.com/archive/Anaconda3-2024.02-1-Linux-x86_64.sh
bash Anaconda3-2024.02-1-Linux-x86_64.sh
Reload .bashrc (if you chose to run conda init during installation)
source ~/.bashrc
- Install docker
sudo apt-get update
sudo apt-get install docker.io
Follow this instruction to run Docker on the VM without sudo
Test whether Docker was installed successfully
docker run hello-world
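The linked instruction is not reproduced here. The usual approach is to add your user to the `docker` group and then log in again, so the commented commands below are an assumption about what that instruction covers. The small check that follows is a sketch for confirming the group membership is active in the current session; `user_in_group` is a hypothetical helper name.

```shell
# Typical one-time setup on the VM (requires sudo; log out and back in afterwards):
#   sudo groupadd docker              # the group may already exist
#   sudo usermod -aG docker "$USER"

# user_in_group: succeed if the current session's user belongs to the named group.
# Group changes only show up here after a fresh login.
user_in_group() {
    id -nG | tr ' ' '\n' | grep -qxF "$1"
}

# Example:
# user_in_group docker && echo "docker should work without sudo"
```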
Now we need to set up Docker Compose
mkdir -p ~/bin
cd ~/bin
wget https://github.com/docker/compose/releases/download/v2.26.0/docker-compose-linux-x86_64 -O docker-compose
Now, in the ~/bin folder, make the downloaded binary executable
chmod +x docker-compose
Add docker-compose to PATH:
- add the following at the end of the ~/.bashrc file
export PATH="${HOME}/bin:${PATH}"
- reload .bashrc by running
source ~/.bashrc
Now, to check that everything works, run
docker-compose version
To list running containers, run
docker ps
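Editing .bashrc by hand works, but repeating the setup on a fresh VM can easily leave duplicate PATH lines behind. A minimal sketch of an idempotent alternative; `ensure_line` is a hypothetical helper that appends a line only when that exact line is not already in the file.

```shell
# ensure_line: append line $2 to file $1 only if that exact line is not there yet.
ensure_line() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Example (single quotes keep ${HOME} and ${PATH} unexpanded in the file):
# ensure_line ~/.bashrc 'export PATH="${HOME}/bin:${PATH}"'
```

Running the example twice leaves a single export line in the file, so the step is safe to include in a setup script.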
- Install terraform
wget -O- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install terraform
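After the installs above, a quick check that every tool is reachable from PATH can save some debugging later. A minimal sketch; `check_tools` is a hypothetical helper, and the tool list in the example is only a guess at what this setup needs.

```shell
# check_tools: report which of the given commands can be found on PATH.
# Returns non-zero if at least one is missing.
# Note: uses the shell-global variables "missing" and "tool".
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok:      $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example:
# check_tools conda docker docker-compose terraform
```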
- Terraform setup
- Go to IAM & Admin > Service Accounts
- Click the CREATE SERVICE ACCOUNT button at the top
  - name: covid-project
  - service account access:
    - Cloud Storage > Storage Admin
    - BigQuery > BigQuery Admin
    - Compute Engine > Compute Admin
- Go to IAM & Admin > Service Accounts > the service account you just created > Manage keys
  - Add key > Create new JSON key (this downloads the key to your machine)
- Create a directory for GCP keys
mkdir keys
cd keys
- Put the downloaded key in the keys folder
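Once the key is in place, Terraform still needs to find it. One common option, assumed here rather than taken from the original setup, is the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, which Google client libraries and the Terraform Google provider consult for credentials. A minimal sketch; `use_gcp_key` and the file name in the example are hypothetical.

```shell
# use_gcp_key: restrict permissions on a service-account key and export its path.
use_gcp_key() {
    if [ ! -f "$1" ]; then
        echo "key file not found: $1" >&2
        return 1
    fi
    # Only the owner should be able to read the key.
    chmod 600 "$1"
    # Store an absolute path so the variable stays valid after changing directory.
    GOOGLE_APPLICATION_CREDENTIALS="$(cd "$(dirname "$1")" && pwd)/$(basename "$1")"
    export GOOGLE_APPLICATION_CREDENTIALS
}

# Example (hypothetical file name):
# use_gcp_key keys/my-service-account.json
```

If the project lives in a git repository, it is probably wise to keep the keys folder out of version control.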
- From the directory that contains the Terraform configuration files, run
terraform init
terraform plan
To create the resources in the cloud, run
terraform apply
To destroy the resources managed by Terraform, run
terraform destroy
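The Terraform configuration itself is not shown above. As an illustration only, the sketch below writes a minimal main.tf that declares one Cloud Storage bucket (the data lake) and one BigQuery dataset (the warehouse), in line with the roles granted to the service account. Everything in it is an assumption: the helper name `write_main_tf`, the variable names, and the resource labels `data_lake` and `warehouse` are hypothetical, and the real project may be organised differently.

```shell
# write_main_tf: write a minimal, hypothetical main.tf into directory $1.
write_main_tf() {
    mkdir -p "$1"
    cat > "$1/main.tf" <<'EOF'
terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
    }
  }
}

variable "project" { type = string }
variable "location" { type = string }
variable "bucket_name" { type = string }
variable "dataset_id" { type = string }

# Credentials are assumed to come from the GOOGLE_APPLICATION_CREDENTIALS
# environment variable pointing at the downloaded JSON key.
provider "google" {
  project = var.project
}

# Data lake: a Cloud Storage bucket.
resource "google_storage_bucket" "data_lake" {
  name          = var.bucket_name
  location      = var.location
  force_destroy = true
}

# Data warehouse: a BigQuery dataset.
resource "google_bigquery_dataset" "warehouse" {
  dataset_id = var.dataset_id
  location   = var.location
}
EOF
}

# Example:
# write_main_tf terraform
```

With a file like this, `terraform plan` asks for the four variables interactively unless they are passed with `-var` or a `.tfvars` file.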
Probably the most convenient way to get this particular dataset is to download it manually from Kaggle and then:
- unzip it in the /data directory
I have used data from this dataset
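The unzip step can be wrapped in a small helper so that a mistyped archive path fails early instead of leaving an empty data directory behind. A minimal sketch, assuming the `unzip` tool is installed; `extract_dataset` and the archive name in the example are hypothetical, since the dataset's file name is not given here.

```shell
# extract_dataset: unzip archive $1 into directory $2, creating the directory if needed.
extract_dataset() {
    if [ ! -f "$1" ]; then
        echo "archive not found: $1" >&2
        return 1
    fi
    mkdir -p "$2"
    # -o overwrites existing files without prompting, -d sets the target directory.
    unzip -o "$1" -d "$2"
}

# Example (hypothetical archive name):
# extract_dataset ~/Downloads/archive.zip /data
```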