Data Engineering Lab

This repository contains Scala and PySpark examples for working with Delta Lake, including:

  • Creating and managing Delta tables
  • Writing and overwriting data
  • Exploring Delta metadata, _delta_log, and table history

You can run the lab either inside a Docker container using SBT (Scala) or directly in Google Colab (Python / PySpark).

Environment Versions

Component     Docker/SBT                    Colab/PySpark
Scala         2.13.16                       N/A
SBT           1.11.7                        N/A
Spark         4.0.0                         3.5.1
Delta Lake    4.0.0                         3.2.0
Docker image  deltaio/delta-docker:latest   N/A

Option 1: Run in Docker (Scala + SBT)

1. Start the Delta Docker container

docker run --name delta-lake-lab --rm -it -p 4040:4040 -u root --entrypoint bash deltaio/delta-docker
  • --name delta-lake-lab → assigns a fixed container name for easy reference
  • --rm → automatically removes the container when it stops
  • -p 4040:4040 → exposes the Spark UI on port 4040
  • -u root → runs as root so packages can be installed
  • --entrypoint bash → opens a bash shell instead of the default entrypoint

2. Install SBT and utilities

Inside the container:

apt-get update
apt-get install -y apt-transport-https curl jq

# Add SBT repository key
curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x2EE0EA64E40A89B84B2DF73499E82A75642AC823" | apt-key add -

# Add SBT repository
echo "deb https://repo.scala-sbt.org/scalasbt/debian all main" | tee /etc/apt/sources.list.d/sbt.list

# Update and install SBT
apt-get update
apt-get install -y sbt
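
To verify the installation, print the SBT version (the first invocation also bootstraps SBT's launcher):

sbt --version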

3. Clone the lab repository

git clone https://github.com/CynicDog/delta-lake-lab.git
cd delta-lake-lab/delta-lake-lab

4. Project structure

delta-lake-lab/
├── build.sbt           # SBT build configuration
├── README.md           # This file
├── src/                # Scala source files
│   └── main/scala/
│       ├── DeltaApp.scala
│       └── 01_table_batch_reads_and_writes/
│           └── DeltaMetaExample.scala
└── target/             # SBT build outputs (ignored in Git)

5. Run the project

sbt run
  • SBT will compile the project and prompt you to select the main class.
  • Enter the number corresponding to the app you wish to run.
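
To skip the interactive prompt, you can also launch a single main class directly. The object name below is an assumption based on the DeltaApp.scala file in the project structure above:

sbt "runMain DeltaApp"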

Option 2: Run in Google Colab (PySpark + Delta Lake)

You can also run the Delta Lake examples in Python directly in Google Colab.

1. Install dependencies

!pip uninstall -y pyspark delta-spark dataproc-spark-connect
!pip install -q pyspark==3.5.1 delta-spark==3.2.0
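
If an older PySpark was already imported in the current session, restart the Colab runtime after this step. A quick sanity check confirms the pinned versions:

import pyspark
from importlib.metadata import version

print(pyspark.__version__)     # expected: 3.5.1
print(version("delta-spark"))  # expected: 3.2.0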

2. Create a SparkSession configured for Delta

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

def get_spark():
    builder = (
        SparkSession.builder.appName("DeltaLakeApp")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    return configure_spark_with_delta_pip(builder).getOrCreate()

spark = get_spark()
spark

3. Test Delta table features

from delta.tables import DeltaTable
data = [(1, "Alice", 100), (2, "Bob", 200), (3, "Charlie", 300)]
df = spark.createDataFrame(data, ["id", "name", "amount"])

# Write Delta table
delta_path = "/tmp/delta-table"
df.write.format("delta").mode("overwrite").save(delta_path)

# Read Delta table
delta_df = spark.read.format("delta").load(delta_path)
delta_df.show()

# Update
delta_table = DeltaTable.forPath(spark, delta_path)
delta_table.update(condition="name = 'Bob'", set={"amount": "250"})
delta_table.toDF().show()

# Delete
delta_table.delete(condition="name = 'Alice'")
delta_table.toDF().show()

# Time travel
old_version_df = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)
old_version_df.show()
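
4. Inspect table history and the transaction log

Each committed operation above (the initial write, the update, and the delete) is recorded as a new table version, as mentioned in the introduction. A minimal sketch of inspecting both, reusing delta_table and delta_path from the previous step:

import os

# One row per table version, with the operation that produced it
delta_table.history().select("version", "operation", "timestamp").show(truncate=False)

# The transaction log: JSON commit files (one per version), plus checksum files
print(sorted(os.listdir(os.path.join(delta_path, "_delta_log"))))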

Notes

  • All directories generated by Spark or SBT (target/, spark-warehouse/, project/) are ignored in Git.
  • Docker is ideal for Scala-focused exercises; Colab is easier for quick Python-based experiments.
