This data platform is designed to solve the following problems:
- Business Intelligence: prepared data will allow analysts to assess the state of the business at various levels.
- [plans] Proactive actions based on user activity: the architecture should allow making in-app offers to users based on real-time data.
Requirements & SLA:
- Data must be in a centralized repository;
- Events should be written to the storage with a delay of no more than one minute;
- Data model must reflect business processes;
- Data must be clean and reliable;
- Data must be accessible 24/7;
- Data must be documented.
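
The one-minute write delay above can be checked per event by comparing the event timestamp with the time it was stored. This is a minimal sketch only: `IngestionSla` and its methods are hypothetical names, not part of the project.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: class and method names are illustrative, not taken from the project.
final class IngestionSla {
    // Maximum allowed delay between an event and its write to storage (from the requirements).
    static final Duration MAX_DELAY = Duration.ofMinutes(1);

    // Delay between the moment the event happened and the moment it was stored.
    static Duration delay(Instant eventTime, Instant writtenAt) {
        return Duration.between(eventTime, writtenAt);
    }

    // True when the event reached storage within the one-minute limit (inclusive).
    static boolean withinSla(Instant eventTime, Instant writtenAt) {
        return delay(eventTime, writtenAt).compareTo(MAX_DELAY) <= 0;
    }
}
```

A check like this could run over a sample of stored rows, assuming each row keeps both the event time and the write time.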
The project is based on events generated by https://github.com/viirya/eventsim. These events reflect user behaviour on a fictional music streaming website (like Spotify).
Launch instructions are here: data-generation/readme
Kafka is used to store events before they are sent to the data warehouse.
There are two deployment options:
- [preferred] Deploy Kafka with Google Kubernetes Engine and Terraform: infrastructure/terraform
- Deploy Kafka locally in Docker: infrastructure/kafka
#terraform #kubernetes #docker #kafka
Custom Java consumers read events from Kafka topics and write them to data warehouse tables.
Link to Java application implementation: audio-streaming-java-consumer
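
One way a consumer could stay within a write-delay limit is to buffer events and flush them when the batch is full or the oldest event has waited too long. The sketch below shows only that buffering logic with the standard library. `EventBatcher`, its thresholds, and the `sink` callback are assumptions for illustration, and the linked implementation may work differently.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical micro-batching buffer; names and thresholds are illustrative assumptions.
final class EventBatcher {
    private final int maxBatchSize;
    private final Duration maxAge;
    private final Consumer<List<String>> sink; // e.g. a call that inserts rows into a warehouse table
    private final List<String> buffer = new ArrayList<>();
    private Instant oldest; // arrival time of the oldest buffered event, null when the buffer is empty

    EventBatcher(int maxBatchSize, Duration maxAge, Consumer<List<String>> sink) {
        this.maxBatchSize = maxBatchSize;
        this.maxAge = maxAge;
        this.sink = sink;
    }

    // Adds one event (as a JSON string) and flushes if the batch is full or too old.
    void add(String event, Instant now) {
        if (buffer.isEmpty()) {
            oldest = now;
        }
        buffer.add(event);
        flushIfDue(now);
    }

    // Meant to be called regularly (for example after every poll) so a quiet topic still flushes.
    void flushIfDue(Instant now) {
        if (buffer.isEmpty()) {
            return;
        }
        boolean full = buffer.size() >= maxBatchSize;
        boolean stale = Duration.between(oldest, now).compareTo(maxAge) >= 0;
        if (full || stale) {
            flush();
        }
    }

    // Hands a copy of the buffered events to the sink and resets the buffer.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer));
        buffer.clear();
        oldest = null;
    }
}
```

In a real consumer, `add` would be called for each polled record and offsets would be committed only after a successful flush. Both steps are left out here to keep the sketch free of external dependencies.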
#java #kafka-consumers
The data warehouse is built on BigQuery.
Documentation of the data model (including table specifications & optimizations): audio-streaming-data-platform/data-warehouse
There are three main data layers:
- Raw - raw data ingested from Kafka;
- Core - cleaned and normalized data according to Data Vault 2.0;
- Data Marts - wide tables that are easy to analyze and to build reports & dashboards on. This is the main entry point into the data for data analysts & scientists.
The data is transformed with dbt and the dbtvault package: audio-streaming-dbt-datavault
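
In Data Vault 2.0, hubs are typically keyed by a hash of the normalized business key. The sketch below illustrates that idea in Java, since the hashing in this project is done in SQL by the dbt package. The `HashKeys` class, the choice of MD5, the `||` separator, and the `^^` placeholder for empty parts are assumptions for illustration, so the package's actual macros may differ in detail.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical illustration of a Data Vault style hash key; not the project's actual logic.
final class HashKeys {
    // Builds a hash key from one or more business key parts.
    static String hashKey(String... parts) {
        StringBuilder joined = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                joined.append("||"); // assumed separator between composite key parts
            }
            // Normalize: trim and upper-case so " user42 " and "USER42" map to the same key.
            String part = parts[i] == null ? "" : parts[i].trim().toUpperCase(Locale.ROOT);
            joined.append(part.isEmpty() ? "^^" : part); // assumed placeholder for null or empty parts
        }
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(joined.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02X", b)); // two upper-case hex digits per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Because the key is derived only from the business key, the same user or song gets the same hash key no matter which load produced the row. This lets hubs, links, and satellites be loaded independently.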
#bigquery #dbt #dbtvault #data-vault
Looker Studio is used to create reports & dashboards. The dashboard shown in the picture below is available at: https://lookerstudio.google.com/s/iWa4oRy9nc4
#looker
- Try a workflow orchestration tool: https://flyte.org/