Data Engineering Lab

This repository contains Scala and PySpark examples for working with Delta Lake, including:

  • Creating and managing Delta tables
  • Writing and overwriting data
  • Exploring Delta metadata, _delta_log, and table history

You can run the lab either inside a Docker container using SBT (Scala) or directly in Google Colab (Python / PySpark).

Environment Versions

Component     Docker/SBT                    Colab/PySpark
Scala         2.13.16                       N/A
SBT           1.11.7                        N/A
Spark         4.0.0                         3.5.1
Delta Lake    4.0.0                         3.2.0
Docker image  deltaio/delta-docker:latest   N/A

Option 1: Run in Docker (Scala + SBT)

1. Start the Delta Docker container

docker run --name delta-lake-lab --rm -it -p 4040:4040 -u root --entrypoint bash deltaio/delta-docker
  • --name delta-lake-lab → assigns a fixed container name for easy reference
  • --rm → automatically removes the container when it stops
  • -p 4040:4040 → exposes the Spark UI on port 4040
  • -u root → runs as root so packages can be installed
  • --entrypoint bash → opens a bash shell instead of the default entrypoint

2. Install SBT and utilities

Inside the container:

apt-get update
apt-get install -y apt-transport-https curl jq

# Add SBT repository key
curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x2EE0EA64E40A89B84B2DF73499E82A75642AC823" | apt-key add -

# Add SBT repository
echo "deb https://repo.scala-sbt.org/scalasbt/debian all main" | tee /etc/apt/sources.list.d/sbt.list

# Update and install SBT
apt-get update
apt-get install -y sbt
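
To verify the installation, print the SBT version (the first invocation also bootstraps SBT's launcher):

sbt --version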

3. Clone the lab repository

git clone https://github.com/CynicDog/delta-lake-lab.git
cd delta-lake-lab/delta-lake-lab

4. Project structure

delta-lake-lab/
├── build.sbt           # SBT build configuration
├── README.md           # This file
├── src/                # Scala source files
│   └── main/scala/
│       ├── DeltaApp.scala
│       └── 01_table_batch_reads_and_writes/
│           └── DeltaMetaExample.scala
└── target/             # SBT build outputs (ignored in Git)

5. Run the project

sbt run
  • SBT will compile the project and prompt you to select the main class.
  • Enter the number corresponding to the app you wish to run.
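
To skip the interactive prompt, you can also launch a single main class directly. The object name below is an assumption based on the DeltaApp.scala file in the project structure above:

sbt "runMain DeltaApp"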

Option 2: Run in Google Colab (PySpark + Delta Lake)

You can also run the Delta Lake examples in Python directly in Google Colab.

1. Install dependencies

!pip uninstall -y pyspark delta-spark dataproc-spark-connect
!pip install -q pyspark==3.5.1 delta-spark==3.2.0
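
If an older PySpark was already imported in the current session, restart the Colab runtime after this step. A quick sanity check confirms the pinned versions:

import pyspark
from importlib.metadata import version

print(pyspark.__version__)     # expected: 3.5.1
print(version("delta-spark"))  # expected: 3.2.0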

2. Create a SparkSession configured for Delta

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

def get_spark():
    builder = (
        SparkSession.builder.appName("DeltaLakeApp")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    return configure_spark_with_delta_pip(builder).getOrCreate()

spark = get_spark()
spark

3. Test Delta table features

from delta.tables import DeltaTable
data = [(1, "Alice", 100), (2, "Bob", 200), (3, "Charlie", 300)]
df = spark.createDataFrame(data, ["id", "name", "amount"])

# Write Delta table
delta_path = "/tmp/delta-table"
df.write.format("delta").mode("overwrite").save(delta_path)

# Read Delta table
delta_df = spark.read.format("delta").load(delta_path)
delta_df.show()

# Update
delta_table = DeltaTable.forPath(spark, delta_path)
delta_table.update(condition="name = 'Bob'", set={"amount": "250"})
delta_table.toDF().show()

# Delete
delta_table.delete(condition="name = 'Alice'")
delta_table.toDF().show()

# Time travel
old_version_df = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)
old_version_df.show()
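
4. Inspect table history and the transaction log

Each committed operation above (the initial write, the update, and the delete) is recorded as a new table version, as mentioned in the introduction. A minimal sketch of inspecting both, reusing delta_table and delta_path from the previous step:

import os

# One row per table version, with the operation that produced it
delta_table.history().select("version", "operation", "timestamp").show(truncate=False)

# The transaction log: JSON commit files (one per version), plus checksum files
print(sorted(os.listdir(os.path.join(delta_path, "_delta_log"))))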

Notes

  • All directories generated by Spark or SBT (target/, spark-warehouse/, project/) are ignored in Git.
  • Docker is ideal for Scala-focused exercises; Colab is easier for quick Python-based experiments.
