
Open Datalakehouse - Bootstrapping a Datalakehouse on Kubernetes


Open in DevPod!

DISCLAIMER - THIS IS NOT MEANT FOR PRODUCTION! - Open a GitHub issue first! - DISCLAIMER


Whoami

Just a really big nerd who likes Distributed Systems and bootstrapping stuff

Josh Yorko - @joshyorko - [email protected]

Goal

To simplify the deployment and management of a complete data lakehouse on Kubernetes, demonstrating best practices in GitOps, distributed systems, and data engineering. This project assumes a basic understanding of Kubernetes and GitOps principles, as well as some experience with the tools and technologies used in the data lakehouse architecture. The deployment works out of the box, but you will need to tune your workloads' resources to match your scale.

Technologies Used

  • Kubernetes (The foundation of our platform)
  • ArgoCD (GitOps continuous delivery)
  • Minio (S3-compatible object storage, using Bitnami chart)
  • Dremio (SQL query engine for data lakes, using Bitnami chart)
  • Project Nessie (Multi-modal versioned data catalog, using Bitnami chart)
    • PostgreSQL (Database for Nessie, using Bitnami chart)
  • Apache Superset (Business intelligence and data visualization, using official chart)
  • JupyterLab (custom PySpark notebook image with Spark built in)

Prerequisites

  • Kubernetes cluster (tested on Minikube, k3s, EKS)
  • Helm (v3.15.2)
  • kubectl (compatible with your cluster version)
  • Basic understanding of Kubernetes concepts and ArgoCD

Quick Start

TL;DR:

## For Minikube:

curl -sSL https://raw.githubusercontent.com/joshyorko/open-datalakehouse/main/setup_datalkehouse.sh | bash -s -- --platform minikube

## For K3s:

curl -sSL https://raw.githubusercontent.com/joshyorko/open-datalakehouse/main/setup_datalkehouse.sh | bash -s -- --platform k3s

## For the Current Kubernetes Context:

curl -sSL https://raw.githubusercontent.com/joshyorko/open-datalakehouse/main/setup_datalkehouse.sh | bash -s -- --platform current 

Or, assuming you already have a cluster set up:

git clone https://github.com/joshyorko/open-datalakehouse.git
cd open-datalakehouse
kubectl create ns argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
kubectl apply -f app-of-apps.yaml
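
ArgoCD needs a minute or two to become ready before it can reconcile the app-of-apps. To watch the rollout (namespaces match the commands above):

    kubectl rollout status deployment/argocd-server -n argocd
    kubectl get applications -n argocd -w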

Automated Setup Script

To streamline the setup process, a bash script is provided that automates the creation of a high-availability Minikube cluster (when needed) and the deployment of the data lakehouse components. The script walks you through the following steps:

  1. Use Minikube or Current Context: The script will detect if you have a Kubernetes context available. If not, it will use Minikube for local development.

  2. Graceful Exit: If you decline to use Minikube and no Kubernetes context is detected, the script will exit gracefully.

  3. Deploy Components: The script will automatically install ArgoCD and apply the Open Datalakehouse from the app-of-apps.yaml manifest located in the root of the repository.
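
For reference, the script's flow is roughly equivalent to the sketch below; the real logic lives in setup_datalkehouse.sh at the root of the repository and handles the --platform flag and other edge cases.

    # Rough sketch only -- not the actual script.
    if ! kubectl config current-context >/dev/null 2>&1; then
        minikube start   # fall back to a local Minikube cluster
    fi
    kubectl create ns argocd
    kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
    kubectl apply -f app-of-apps.yaml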

Architecture Overview

This project deploys a complete data lakehouse architecture on Kubernetes:

  • Dremio provides SQL query capabilities over the data lake (deployed using Bitnami Helm chart)

  • Project Nessie acts as a versioned metadata catalog (deployed using Bitnami Helm chart)

    • Nessie relies on a PostgreSQL database, also deployed using a Bitnami Helm chart
  • Minio serves as the object storage layer (deployed using Bitnami Helm chart)

  • Apache Superset offers data visualization and exploration (deployed using the official Helm chart)

  • A custom JupyterLab image with Spark built in enables distributed PySpark processing in notebooks (image built and maintained by the project author)

By using Bitnami charts for Dremio, Nessie, Minio, and PostgreSQL, we ensure consistent and well-maintained deployments of these components. The official Superset chart provides the latest features and best practices for deploying Superset. The custom Spark image allows for tailored configuration and dependencies specific to this project's needs.

Connecting Dremio to Minio (for S3-compatible storage) and Nessie (for metadata management) requires a small amount of source configuration; the setup script takes care of this for you, as described in the next section.
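
If you prefer to wire the sources up by hand, you will need the in-cluster addresses of Minio and Nessie. The service names below are assumptions based on typical chart release names; verify them against your cluster:

    kubectl get svc -n data-lakehouse
    # Typical defaults (verify against the output above):
    #   Minio (S3 API):   http://minio.data-lakehouse.svc.cluster.local:9000
    #   Nessie REST API:  http://nessie.data-lakehouse.svc.cluster.local:19120  (paths under /api/v1 or /api/v2)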

Some Nice to Haves

Dremio UI setup for Nessie and S3 storage (when you use the setup_datalakehouse.sh script)

After deploying Dremio, you will notice that the following has been set up for you:

  1. Log in to the Dremio UI
  2. Nessie Has been added as a data source
  3. Minio has been added as an Object Store
  4. Three workspaces (Dremio Spaces) have been created for you:
    • Bronze
    • Silver
    • Gold
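
If you skip the setup script, or want to automate this step yourself, the same objects can be created through Dremio's REST API. The sketch below assumes Dremio is reachable on localhost:9047 and uses a placeholder service name and credentials; adjust them for your cluster.

    # Service name is an assumption -- check `kubectl get svc -n data-lakehouse` first.
    kubectl port-forward -n data-lakehouse svc/dremio-client 9047:9047 &

    # Log in and capture an auth token (replace the placeholder credentials).
    TOKEN=$(curl -s -X POST http://localhost:9047/apiv2/login \
      -H 'Content-Type: application/json' \
      -d '{"userName":"admin","password":"CHANGE_ME"}' | jq -r .token)

    # Create the Bronze/Silver/Gold spaces.
    for space in Bronze Silver Gold; do
      curl -s -X POST http://localhost:9047/api/v3/catalog \
        -H "Authorization: _dremio${TOKEN}" \
        -H 'Content-Type: application/json' \
        -d "{\"entityType\":\"space\",\"name\":\"${space}\"}"
    done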

This project includes several tools to help you generate sample data and analyze it within your data lakehouse:

Data Generation Scripts

  1. Go Script (scripts/main_minio.go):

    • Generates fake company, employee, and department data.
    • Writes data directly to MinIO in Parquet format.
    • Supports concurrent data generation and upload for improved performance.
  2. Python Script (scripts/company.py):

    • Generates fake company, employee, and department data.
    • Writes data to CSV files locally.
    • Provides a simpler alternative to the Go script.
  3. FastAPI Application (scripts/app.py):

    • Offers a RESTful API for generating and uploading fake data to S3.
    • Useful for programmatic data generation and integration with other tools.

To use these scripts, navigate to the scripts directory and run them with Python or Go, depending on the script.
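
Roughly, that looks like the following; flags, environment variables, and the FastAPI entrypoint are assumptions, so check each script for the exact options it expects (MinIO endpoint, credentials, bucket names, and so on):

    cd scripts

    # Go generator -- writes Parquet directly to MinIO.
    go run main_minio.go

    # Python generator -- writes CSV files locally.
    python company.py

    # FastAPI app -- assumes the FastAPI instance inside app.py is named `app`.
    uvicorn app:app --host 0.0.0.0 --port 8000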

Jupyter Notebooks

The project includes two Jupyter notebooks in the DockerFiles/notebooks directory:

  1. start_here.ipynb:

    • Demonstrates how to initialize a Spark session and interact with the data lakehouse.
    • Shows examples of querying Iceberg tables and loading data into DuckDB for analysis.
  2. test.ipynb:

    • Contains examples of writing data to Iceberg tables using Spark.
    • Demonstrates querying and analyzing data using Spark and DuckDB.
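
The notebooks carry their own Spark configuration, but for orientation, an equivalent session started from a shell looks roughly like this. Package versions, service names, the warehouse bucket, and credentials are assumptions; adjust them to your cluster.

    pyspark \
      --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.95.0 \
      --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions \
      --conf spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog \
      --conf spark.sql.catalog.nessie.catalog-impl=org.apache.iceberg.nessie.NessieCatalog \
      --conf spark.sql.catalog.nessie.uri=http://nessie.data-lakehouse.svc.cluster.local:19120/api/v1 \
      --conf spark.sql.catalog.nessie.ref=main \
      --conf spark.sql.catalog.nessie.warehouse=s3a://warehouse/ \
      --conf spark.hadoop.fs.s3a.endpoint=http://minio.data-lakehouse.svc.cluster.local:9000 \
      --conf spark.hadoop.fs.s3a.path.style.access=true \
      --conf spark.hadoop.fs.s3a.access.key=CHANGE_ME \
      --conf spark.hadoop.fs.s3a.secret.key=CHANGE_ME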

Pre-built Docker Image

A pre-built Docker image (jedock87/datalake-spark) is available on Docker Hub, containing all the necessary dependencies for running the Jupyter notebooks and interacting with the data lakehouse.

  1. Pull the image:
    docker pull jedock87/datalake-spark:latest
  2. You can also build and run the notebook locally yourself using the provided Compose file:
    docker compose -f DockerFiles/docker-compose.yaml up -d

This will start a Jupyter Lab instance with PySpark and all required dependencies pre-installed.
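
JupyterLab prints a tokenized URL on startup. Assuming the Compose file maps the default port 8888 (check DockerFiles/docker-compose.yaml), you can find it with:

    docker compose -f DockerFiles/docker-compose.yaml logs | grep -m1 'http://127.0.0.1'
    # then open the printed URL, e.g. http://127.0.0.1:8888/lab?token=...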

Troubleshooting

  1. Check the application status in ArgoCD:

    kubectl get applications -n argocd
  2. View logs for a specific pod:

    kubectl logs -n data-lakehouse <pod-name>
  3. Describe a pod for more details:

    kubectl describe pod -n data-lakehouse <pod-name>
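
  4. Access the ArgoCD UI to see why an application is stuck (standard ArgoCD port-forward and initial admin password, not specific to this project):

    kubectl port-forward svc/argocd-server -n argocd 8080:443
    kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath='{.data.password}' | base64 -d
    # UI: https://localhost:8080 (user: admin)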

Conclusion

This project demonstrates a Kubernetes-native approach to building a modern data lakehouse. It leverages GitOps principles for deployment and management, showcasing the integration of various open-source technologies in a distributed systems architecture.

Remember, this setup is intended for development and testing purposes. For production deployments, additional security measures, high availability configurations, and performance tuning would be necessary.

Contributions and feedback are welcome! Open an issue or submit a pull request to help improve this project.
