This repository contains Scala and PySpark examples for working with Delta Lake, including:
- Creating and managing Delta tables
- Writing and overwriting data
- Exploring Delta metadata, `_delta_log`, and table history
You can run the lab either inside a Docker container using SBT (Scala) or directly in Google Colab (Python / PySpark).
| Component | Docker/SBT | Colab/PySpark |
|---|---|---|
| Scala | 2.13.16 | N/A |
| SBT | 1.11.7 | N/A |
| Spark | 4.0.0 | 3.5.1 |
| Delta Lake | 4.0.0 | 3.2.0 |
| Docker image | deltaio/delta-docker:latest | N/A |
```bash
docker run --name delta-lake-lab --rm -it -p 4040:4040 -u root --entrypoint bash deltaio/delta-docker
```

- `--rm` → automatically remove the container when stopped
- `-p 4040:4040` → exposes the Spark UI on port 4040
- `-u root` → run as root to install packages
- `--entrypoint bash` → opens a bash shell
Inside the container:
```bash
apt-get update
apt-get install -y apt-transport-https curl jq

# Add SBT repository key
curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x2EE0EA64E40A89B84B2DF73499E82A75642AC823" | apt-key add -

# Add SBT repository
echo "deb https://repo.scala-sbt.org/scalasbt/debian all main" | tee /etc/apt/sources.list.d/sbt.list

# Update and install SBT
apt-get update
apt-get install -y sbt
```

Clone the repository and move into the project directory:

```bash
git clone https://github.com/CynicDog/delta-lake-lab.git
cd delta-lake-lab/delta-lake-lab
```

The project layout:

```
delta-lake-lab/
├── build.sbt # SBT build configuration
├── README.md # This file
├── src/ # Scala source files
│ └── main/scala/
│ ├── DeltaApp.scala
│ └── 01_table_batch_reads_and_writes/
│ └── DeltaMetaExample.scala
└── target/ # SBT build outputs (ignored in Git)
```

Build and run the examples:

```bash
sbt run
```

- SBT will compile the project and prompt you to select the main class.
- Enter the number corresponding to the app you wish to run.
You can also run Delta Lake examples in Python directly in Colab.
```python
!pip uninstall -y pyspark delta-spark dataproc-spark-connect
!pip install -q pyspark==3.5.1 delta-spark==3.2.0
```

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip
def get_spark():
    builder = (
        SparkSession.builder.appName("DeltaLakeApp")
        # Register Delta's SQL extension and catalog so Delta tables can be
        # created and queried through the normal Spark SQL APIs
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    # configure_spark_with_delta_pip adds the Delta Lake jars to the session
    return configure_spark_with_delta_pip(builder).getOrCreate()
spark = get_spark()
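
# Optional sanity check (not part of the original lab): confirm the runtime
# matches the version table above (Spark 3.5.1, Delta Lake 3.2.0 on Colab).
import importlib.metadata

print("Spark version:", spark.version)
print("delta-spark version:", importlib.metadata.version("delta-spark"))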
spark
```

```python
from delta.tables import DeltaTable
data = [(1, "Alice", 100), (2, "Bob", 200), (3, "Charlie", 300)]
df = spark.createDataFrame(data, ["id", "name", "amount"])
# Write Delta table
delta_path = "/tmp/delta-table"
df.write.format("delta").mode("overwrite").save(delta_path)
# Read Delta table
delta_df = spark.read.format("delta").load(delta_path)
delta_df.show()
# Update
delta_table = DeltaTable.forPath(spark, delta_path)
delta_table.update(condition="name = 'Bob'", set={"amount": "250"})
delta_table.toDF().show()
# Delete
delta_table.delete(condition="name = 'Alice'")
delta_table.toDF().show()
# Time travel
old_version_df = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)
old_version_df.show()
```
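
The introduction lists exploring Delta metadata, `_delta_log`, and table history among the lab topics; the Colab cells above stop at time travel. Below is a minimal sketch of that exploration, not taken from the repository, assuming the same `spark`, `delta_table`, and `delta_path` defined above (the table lives on Colab's local filesystem, so the log files can be read directly):

```python
import os

# Table history: one row per commit (the initial write, the update, the delete, ...)
delta_table.history().select("version", "timestamp", "operation").show(truncate=False)

# Table-level metadata such as location, file count, and size
spark.sql(f"DESCRIBE DETAIL delta.`{delta_path}`") \
    .select("location", "numFiles", "sizeInBytes").show(truncate=False)

# The transaction log itself: one numbered JSON file per commit under _delta_log/
log_dir = os.path.join(delta_path, "_delta_log")
print(sorted(f for f in os.listdir(log_dir) if f.endswith(".json")))
```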
- All directories generated by Spark or SBT (`target/`, `spark-warehouse/`, `project/`) are ignored in Git.
- Docker is ideal for Scala-focused exercises; Colab is easier for quick Python-based experiments.