Technology used: Azure (VM, Blob Storage, Azure Functions, Azure DevOps, Azure Data Lake Analytics, Azure Synapse), Kafka, Python
The goal of the project was to learn how Apache Kafka streaming works. For training purposes, I used a free Azure D2S machine and simulated the data stream to avoid memory problems on the Azure instance. Each event in the simulated stream was created as a sample JSON file from an existing dataset. The streamed data was uploaded to Azure Blob Storage and analyzed by Data Factory, which created a table schema in Azure Data Lake Analytics. That allowed me to run queries in Azure Synapse Analytics.
- An active Azure account subscription
Set up the Kafka server on an Azure VM:
- Create a VM instance in Azure. I used Ubuntu Server 24.04 LTS.
- Start the instance and check in the Azure console that it is running
- Kafka port: create an inbound security rule for port 9092
- Connect to the instance over SSH using the private key provided when the VM was created
- Install and start Docker Engine on the instance
make install_docker
- Kafka setup
make start_kafka
make create_topic
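Once the broker is running and the inbound rule for port 9092 is in place, it can be worth checking from your local machine that the port is actually reachable before starting a producer. The following is a minimal sketch using only the Python standard library; `port_is_open` is a hypothetical helper name, not part of this repository, and a successful TCP connection only shows that the port is open, not that Kafka itself is healthy.

```python
import socket


def port_is_open(host: str, port: int = 9092, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False
```

For example, `port_is_open("<vm-public-ip>")` should return `True` if the security rule and the broker listener are both configured; the address is a placeholder for your VM's public IP.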
python src/kafka_producer.py
python src/kafka_consumer.py
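The producer script itself is not reproduced here, so the following is only a sketch of how a simulated stream could work: each row of an existing dataset is serialized as one JSON event and sent with a short pause in between. It assumes the `kafka-python` client; the function names, the topic name `demo_topic`, the dataset path and the broker address are all hypothetical placeholders, not values taken from this repository.

```python
import csv
import json
import time
from typing import Callable, Iterable, Iterator


def rows_to_events(rows: Iterable[dict]) -> Iterator[bytes]:
    """Serialize each dataset row as one UTF-8 encoded JSON event."""
    for row in rows:
        yield json.dumps(row).encode("utf-8")


def stream(events: Iterable[bytes], send: Callable[[bytes], object],
           delay_s: float = 1.0) -> int:
    """Send events one at a time, pausing between them to simulate a live stream."""
    count = 0
    for event in events:
        send(event)
        count += 1
        time.sleep(delay_s)
    return count


def main() -> None:
    # Imported lazily so the helpers above work without kafka-python installed.
    from kafka import KafkaProducer  # assumption: kafka-python client

    # Placeholder address: replace with your VM's public IP.
    producer = KafkaProducer(bootstrap_servers="<vm-public-ip>:9092")
    with open("data/sample.csv", newline="") as f:  # hypothetical dataset path
        stream(rows_to_events(csv.DictReader(f)),
               lambda event: producer.send("demo_topic", event))  # hypothetical topic
    producer.flush()  # make sure buffered events are delivered before exiting
```

Keeping the serialization and pacing separate from the Kafka client makes the stream logic easy to try locally, for instance by passing `print` as the `send` callable. Call `main()` to send to a real broker.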
The goal is to create an Azure Function that copies the live data (streamed by Kafka into Blob Storage) into Azure Data Lake Storage Gen2 every 10 minutes.
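As a rough illustration of what such a function could look like, here is a sketch using the Python v2 programming model of Azure Functions with a timer trigger; the NCRONTAB expression `0 */10 * * * *` fires every 10 minutes. Several things are assumptions rather than facts from this project: the container names `kafka-events` and `raw`, the setting name `DATALAKE_CONNECTION_STRING`, the helper names, and the choice to write to the Data Lake Gen2 account through the Blob API with `azure-storage-blob`.

```python
import os


def blobs_to_copy(source_names, already_copied):
    """Return the source blob names not yet present in the destination, sorted."""
    return sorted(set(source_names) - set(already_copied))


def copy_new_blobs(source_container, dest_container) -> int:
    """Copy blobs missing from the destination; arguments are ContainerClient-like."""
    existing = [blob.name for blob in dest_container.list_blobs()]
    names = blobs_to_copy((blob.name for blob in source_container.list_blobs()), existing)
    for name in names:
        data = source_container.download_blob(name).readall()
        dest_container.upload_blob(name=name, data=data, overwrite=True)
    return len(names)


def build_app():
    """Create a Function App whose timer trigger runs the copy every 10 minutes."""
    # Imported lazily so the copy logic above can be tried without the Azure SDKs.
    import azure.functions as func
    from azure.storage.blob import BlobServiceClient

    app = func.FunctionApp()

    @app.timer_trigger(schedule="0 */10 * * * *", arg_name="timer")
    def copy_to_datalake(timer: func.TimerRequest) -> None:
        src = BlobServiceClient.from_connection_string(
            os.environ["BLOB_CONNECTION_STRING"])
        dst = BlobServiceClient.from_connection_string(
            os.environ["DATALAKE_CONNECTION_STRING"])  # hypothetical setting name
        copy_new_blobs(src.get_container_client("kafka-events"),  # hypothetical
                       dst.get_container_client("raw"))           # hypothetical

    return app
```

In a `function_app.py` file, the module-level line `app = build_app()` would expose the app to the Functions host. Copying only the blobs that are missing from the destination keeps each run cheap, at the cost of not detecting blobs that were modified in place.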
- Create a new Azure Function App in the Azure portal
- Install the Azure Functions extension in VS Code, and don't forget to sign in to your Azure account
- Install the Azure CLI in your terminal and sign in. The CLI is needed to add the Function App environment variables used to connect to Blob Storage and Data Lake Storage
make install_azure_cli
Verify that the installation worked and connect to your Azure account:
az login
- Use the Azure CLI to add your Blob Storage and Data Lake connection strings as environment variables for your Azure Function App
az functionapp config appsettings set --name <FunctionAppName> --resource-group <ResourceGroupName> --settings "BLOB_CONNECTION_STRING=your_blob_connection_string"
Replace <FunctionAppName> with the name of your Function App, <ResourceGroupName> with the name of your resource group, and your_blob_connection_string with your actual connection string.
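App settings defined this way are exposed to the function code as environment variables. A small sketch of how the Python code could read `BLOB_CONNECTION_STRING` and fail with a clear message when it is missing; `require_setting` is a hypothetical helper name, not something provided by Azure.

```python
import os


def require_setting(name: str) -> str:
    """Return the app setting `name`, or raise if it is missing or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"App setting {name!r} is missing or empty")
    return value
```

Calling `require_setting("BLOB_CONNECTION_STRING")` when the function starts surfaces a configuration mistake immediately, rather than as a less readable error from the storage client later on.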