Data Lakehouse Boilerplate

A minimalistic boilerplate to serve a modern stack data lakehouse.
Uses the following stack:

S3(Minio) + Iceberg(S3 Table Format)
Trino(For querying data)
Spark(For inserting data at scale)
Airflow(Task orchestration)

Includes an implementation of an ETL service, extracting data, processing it and loading into the datalake using Spark. Orchestrated by Airflow. Project can be found in scripts folder.

Starting up

In order to start up the project, you must have docker running on your machine.
Run the following command to start up:

docker-compose up -d

Exposed ports:

Port 9090: Trino
Port 9000: Minio S3
Port 9001: Minio Admin UI
Port 9123: Iceberg postgres(for JDBC connection)
Port 8080: Airflow Webserver
Port 8081: Airflow driver Spark UI
Port 7077: Spark Master
Port 8082: Spark Master UI
Port 8083: Spark Worker UI

Setting up The DataLake

To start working with the datalake, you must first generate a pair of Access/Secret Keys in Minio.
To do so, head over to localhost:9001 and login using default credentials:

username=minio
password=minio123

Head over to "Access Keys" and generate yourself a pair of keys.
Replace the default settings that are set in assets/trino_iceberg.properties(Will require restart to docker-compose).
You may use the same pair for Spark connection/set up different pairs for Spark/Trino.

Extras - Connecting to Trino

Note - All those settings are already a part of the docker-compose file.
You must first set up a trino "Iceberg Connector" to serve the iceberg tables for a JDBC connection. to do so, docker-compose copies the trino_iceberg.properties file into your trino config.
An example connector would look like the following:

connector.name=iceberg
iceberg.catalog.type=jdbc

# JDBC catalog configuration for PostgreSQL
# Uses 5432 Postgres port as it is from within the docker closed network and not from our machine
iceberg.jdbc-catalog.connection-url=jdbc:postgresql://postgres:5432/<JDBC_DB>
iceberg.jdbc-catalog.connection-user=<JDBC_USER>
iceberg.jdbc-catalog.connection-password=<JDBC_PASSWORD>
iceberg.jdbc-catalog.driver-class=org.postgresql.Driver
iceberg.jdbc-catalog.catalog-name=<CATALOG_NAME> #Name of your choice
iceberg.jdbc-catalog.default-warehouse-dir=s3a://<LAKEHOUSE_DIRECTORY>/ #Directory on S3 configured for iceberg

# S3 configuration
fs.hadoop.enabled=false
fs.native-s3.enabled=true
s3.endpoint=http://minio:9000
s3.region=<MINIO_S3_REGION>
s3.aws-access-key=<MINIO_ACCESS_KEY>
s3.aws-secret-key=<MINIO_SECRET_KEY>
s3.path-style-access=true

# File format configuration
iceberg.file-format=PARQUET
iceberg.compression-codec=GZIP

To connect to trino with third party IDE(Such as DataGrip), you must setup a JDBC connection.
Default credentials to trino's JDBC connection:

username=trino
password=<BLANK>

Extras - Connecting to Spark

Spark connects to the JDBC metastore and interacts with Iceberg tables natively(using the right JARs)
An example connection would look like the following:

from pyspark.sql import SparkSession

    session = SparkSession.builder \
        .appName("DataLake") \
        .config("spark.jars", get_spark_jars()) \
        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
        .config("spark.sql.session.timeZone", "Asia/Jerusalem") \
        .config("spark.sql.parquet.enableVectorizedReader", "false") \
        .config("spark.sql.codegen.wholeStage", "false") \
        .config(f"spark.sql.catalog.{CATALOG_NAME}", "org.apache.iceberg.spark.SparkCatalog") \
        .config(f"spark.sql.catalog.{CATALOG_NAME}.catalog-impl", "org.apache.iceberg.jdbc.JdbcCatalog") \
        .config(f"spark.sql.catalog.{CATALOG_NAME}.uri", JDBC_URL) \
        .config(f"spark.sql.catalog.{CATALOG_NAME}.jdbc.user", JDBC_USERNAME) \
        .config(f"spark.sql.catalog.{CATALOG_NAME}.jdbc.password", JDBC_PASSWORD) \
        .config(f"spark.sql.catalog.{CATALOG_NAME}.warehouse", S3_LAKEHOUSE_DIRECTORY) \
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.endpoint", S3_ENDPOINT) \
        .config("spark.hadoop.fs.s3a.access.key", S3_ACCESS_KEY) \
        .config("spark.hadoop.fs.s3a.secret.key", S3_SECRET_KEY) \
        .config("spark.hadoop.fs.s3a.path.style.access", "true") \
        .getOrCreate()

    tables = session.sql("SHOW SCHEMAS in iceberg")
    tables.show()

Make sure you provide the necessary JARs for spark to run as expected.
All JARs must match versions for each component of the datalake:

Postgres(JDBC connection)
Spark Iceberg(Must match both spark master(or local) and iceberg version of 3rd party modules like Trino)

A list of JARs is provided in assets/spark_jars.txt.

Extras - Running spark locally

To run spark locally, you must first download pyspark using pip command:

pip install pyspark

You also need to have an Hadoop instance on your machine. For windows, you can download the following Hadoop binaries and add it to your environment path for Spark to work as it should: WinUtils
Latest binaries(Hadoop 3.6.6) would work fine.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
assets		assets
dags		dags
requirements		requirements
scripts		scripts
.gitignore		.gitignore
Dockerfile.airflow		Dockerfile.airflow
README.md		README.md
docker-compose.yml		docker-compose.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data Lakehouse Boilerplate

Starting up

Setting up The DataLake

Extras - Connecting to Trino

Extras - Connecting to Spark

Extras - Running spark locally

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Data Lakehouse Boilerplate

Starting up

Setting up The DataLake

Extras - Connecting to Trino

Extras - Connecting to Spark

Extras - Running spark locally

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages