
End project for the data engineering bootcamp (https://github.com/DataTalksClub/data-engineering-zoomcamp). The idea is to show how to stream data from Coinbase, aggregate it in real time, display it in a data visualisation tool, and deploy and manage the infrastructure with code.


Real-time Bitcoin volume calculation (end project - Datazoomcamp 2024)

Problem statement

We want to calculate the volume of "BUY" and "SELL" Bitcoin transactions for specific periods of time (per minute, hour, day...).

Transactions can be collected from different data sources. In this example we use Coinbase, which is one of the largest centralized cryptocurrency exchanges.

One of the challenges is the large amount of data: the number of transactions can be huge.

This makes it difficult to produce aggregated data, such as volume over short periods of time, without too much delay, i.e. in near real time.

The goal of this project is to show a way to address this issue.

  1. By connecting to Coinbase through their websocket, we get transactions in real time.
  2. Transactions are injected into Kafka, which processes them in real time as well.
  3. The volume aggregation is done with a stateful streaming technique, thanks to the Kafka Streams API.

Description of the data aggregation part (Kafka Streams)

In order to get the Bitcoin transaction volume, we need to define dimensions for our aggregation.

  • the first obvious one is the "side" of a transaction (buy or sell)
  • the second one is the period of time used to bucket our transactions, also called a "time window": in our case we limit ourselves to one-minute windows

In summary, transactions are grouped by side and time window, and for each group we sum the price of each transaction.

After each Kafka event is processed, a new state of the volume for that specific side and time window is emitted and stored in a database, so we can later expose the volumes over a larger time scale.

For more detail, check the Java class OneMinuteWindowKstream, which implements the mechanics explained above; a simplified sketch of the same idea follows below.
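The following is a minimal sketch of this kind of windowed, stateful aggregation with the Kafka Streams API (a recent kafka-streams 3.x dependency is assumed). It is not the repo's actual OneMinuteWindowKstream: the topic name "transactions", the field names "side" and "price", and the naive JSON handling are illustrative assumptions only.

import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.TimeWindows;

public class OneMinuteVolumeSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "btc-volume-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("transactions", Consumed.with(Serdes.String(), Serdes.String()))
               // dimension 1: re-key each transaction by its side ("BUY" or "SELL")
               .groupBy((key, json) -> extractSide(json), Grouped.with(Serdes.String(), Serdes.String()))
               // dimension 2: bucket each group into one-minute tumbling windows
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
               // stateful aggregation: sum the price of every transaction in the (side, window) group
               .aggregate(() -> 0.0,
                          (side, json, total) -> total + extractPrice(json),
                          Materialized.with(Serdes.String(), Serdes.Double()))
               // every processed event emits an updated state for its (side, window) pair
               .toStream()
               .foreach((windowedSide, volume) -> System.out.printf(
                       "%s %s -> %.2f%n", windowedSide.key(), windowedSide.window(), volume));

        new KafkaStreams(builder.build(), props).start();
    }

    // Naive field extraction to keep the sketch self-contained;
    // a real implementation would deserialize the JSON properly.
    private static String extractSide(String json) {
        return json.contains("\"side\":\"buy\"") ? "BUY" : "SELL";
    }

    private static double extractPrice(String json) {
        int start = json.indexOf("\"price\":\"") + 9;
        return Double.parseDouble(json.substring(start, json.indexOf('"', start)));
    }
}

In a real deployment the downstream of toStream() would write each updated (side, window, volume) state to the database instead of printing it.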

Description of the ingestion part (websocket -> Kafka)

This part is more straightforward. A websocket connection is attached to the Coinbase websocket endpoint, and each event received is broadcast to a Kafka topic.

Please refer to the Python connector.
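The repo's connector is written in Python; purely as an illustration of the same pattern, the sketch below does it in Java with the JDK 11 WebSocket client and a plain Kafka producer. The Coinbase endpoint, the subscribe message and the topic name "transactions" are assumptions, not taken from the repo.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.Properties;
import java.util.concurrent.CompletionStage;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CoinbaseToKafkaSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        WebSocket.Listener listener = new WebSocket.Listener() {
            @Override
            public CompletionStage<?> onText(WebSocket ws, CharSequence data, boolean last) {
                // broadcast each received frame to the Kafka topic
                // (a real connector would reassemble partial frames using the 'last' flag)
                producer.send(new ProducerRecord<>("transactions", data.toString()));
                ws.request(1); // ask the websocket for the next message
                return null;
            }
        };

        WebSocket ws = HttpClient.newHttpClient()
                .newWebSocketBuilder()
                .buildAsync(URI.create("wss://ws-feed.exchange.coinbase.com"), listener)
                .join();

        // subscribe to the BTC-USD matches (trades) channel
        ws.sendText("{\"type\":\"subscribe\",\"product_ids\":[\"BTC-USD\"],\"channels\":[\"matches\"]}", true);

        Thread.currentThread().join(); // keep the process alive
    }
}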

Cloud-based solution

The infrastructure chosen to handle the data consists of BigQuery in GCP and a Kafka cluster in Confluent Cloud.

Please refer to the terraform directory to see the infrastructure definition of both the Google Cloud and Confluent Cloud parts.

For simplicity, there is no cloud infrastructure definition or cloud deployment for the two apps (the Python and Java parts), so we run them locally in Docker.

How to reproduce the solution

Prerequisites

Install Python 3 and Java 11.

Create the cloud infrastructure

Install Terraform

Follow the instructions here: https://developer.hashicorp.com/terraform/install

Infrastructure as code - Kafka

This section lets you create or update a Kafka cluster in Confluent Cloud using a Terraform template.

Create a Confluent Cloud account

If you already have a Confluent Cloud account, you can skip this step.

Go to https://confluent.cloud/ and follow the instructions to get your account working. You can skip the payment method; you will have a trial budget for a few weeks.

Create a Confluent API key and secret

This is required for Terraform to have access to your Confluent Cloud account. Go to https://confluent.cloud/settings/api-keys/create?user_dir=ASC&user_sort=status and follow the procedure.

Copy or download them. Keep in mind that storing them is a security concern, but you will need them to run the Terraform code.

Running the Terraform code

  • Go into the terraform/confluent-cloud folder: cd terraform/confluent-cloud/
  • Initialise Terraform: terraform init
  • Check the plan: terraform plan. A prompt will ask you for the key and secret you just created, as well as your account ID (found at https://confluent.cloud/settings/org/accounts/users under the ID column). Check the output: it should display the list of components Terraform is planning to create:
    • an environment
    • a Kafka cluster
    • some keys to access it
    • 2 topics (used by the apps in this repo)
  • Apply the plan: terraform apply, then check that everything was created at https://confluent.cloud/environments. Note that you will need the newly created Kafka cluster API key and secret to run the apps; they can be found in the terraform.tfstate file by looking for credentials (if you have jq installed, execute terraform state pull | jq '.resources[] | select(.type == "confluent_kafka_topic") | .instances[0].attributes.credentials')

Infrastructure as code - BigQuery

This section lets you create or update a BigQuery dataset in Google Cloud Platform with Terraform.

Create a GCP account

If you already have a Google Cloud account, you can skip this step.

Go to https://console.cloud.google.com/ and follow the instructions to get your account working. You have to provide a payment method, BUT don't worry: you will have a trial budget for 3 months (it's advised to set up a budget alert, since the solution provided is quite data-intensive and could incur a certain amount of paid service if you keep it running for a few days without interruption).

Configure the gcloud CLI (required for Terraform to authenticate to GCP): https://cloud.google.com/sdk/docs/initializing

Running the Terraform code

  • Go into the terraform/google-cloud folder: cd terraform/google-cloud/
  • Initialise Terraform: terraform init
  • Check the plan: terraform plan
  • Apply the plan: terraform apply, then check that everything was created at https://console.cloud.google.com/bigquery

Also check that gcp-key.json has been created locally; it is used by the Java (Kafka Streams) app to authenticate to GCP using the service account created by Terraform.
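As a hedged sketch of how such a key file can be used (not necessarily how the app in this repo wires it up), the google-cloud-bigquery Java client can load it explicitly:

import com.google.auth.oauth2.ServiceAccountCredentials;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import java.io.FileInputStream;

public class BigQueryAuthSketch {
    public static void main(String[] args) throws Exception {
        // Load the service-account key written by the terraform apply step
        BigQuery bigquery = BigQueryOptions.newBuilder()
                .setCredentials(ServiceAccountCredentials.fromStream(
                        new FileInputStream("gcp-key.json")))
                .build()
                .getService();

        // List the datasets visible to the service account as a quick connectivity check
        bigquery.listDatasets().iterateAll()
                .forEach(dataset -> System.out.println(dataset.getDatasetId().getDataset()));
    }
}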

Build the data aggregation part

To build the data consumer executable: ./gradlew clean build -PcustomName=consumer -PmainClass=org.example.OneMinuteWindowKstream

Run local applications against cloud-based Kafka and BigQuery

Create the env file

Use the given env template file to create your own .env: cp env .env. Replace the cluster key, secret and URL with yours (created after applying the Terraform plan), as well as your GCP project ID.

Run the producer against remote Kafka

We use Docker in order to provide a working environment for Python.

1. Build the image: docker-compose -f docker-compose-prod.yml build --no-cache producer
2. Build and run the Docker container: docker-compose -f docker-compose-prod.yml up producer

Run the consumer against remote Kafka

1. Build the image: docker-compose -f docker-compose-prod.yml build --no-cache consumer
2. Execute: docker-compose -f docker-compose-prod.yml up consumer

Run the producer and consumer against remote Kafka

1. Build the images: docker-compose -f docker-compose-prod.yml build
2. Execute: docker-compose -f docker-compose-prod.yml up

Run everything with local Kafka (missing the BigQuery part)

If you want to use Kafka locally, we use Docker in order to provide a working environment without pre-installing Java, etc.

1. Build the artifact (jar file)
2. Build and run the Docker containers

Start the containers:

docker-compose build --no-cache
docker-compose up

To check the Kafka cluster info:

http://localhost:9021/
