This guide walks you through building a batch data pipeline with AWS Glue, Apache Iceberg, PySpark, and Amazon S3. Data is processed with PySpark ETL jobs, stored as Iceberg tables in S3, and queried with Athena. The main components are:
- AWS Glue for metadata management and ETL jobs.
- Apache Iceberg for efficient table formats with time travel and schema evolution.
- Athena for querying processed data.
- PyIceberg for programmatic interaction with Iceberg tables (see the sketch below this list).
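
For example, a minimal PyIceberg sketch for reading one of the project's Iceberg tables through the Glue Data Catalog (the `taxi_trips` table name is an assumption; the database and warehouse bucket come from the setup steps below):

```python
from pyiceberg.catalog import load_catalog

# Load the AWS Glue Data Catalog as an Iceberg catalog.
catalog = load_catalog(
    "glue",
    **{
        "type": "glue",
        "warehouse": "s3://nyc-taxi-data-project-warehouse/",
    },
)

# "taxi_trips" is a hypothetical table name -- use whatever table the ETL job creates.
table = catalog.load_table("nyc_taxi_data_db.taxi_trips")
print(table.scan().to_pandas().head())
```
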
The dataset used in this project is the NYC Taxi Trip Data. You can download it from the following link:
NYC Taxi Trip Data on Kaggle
- Create S3 Buckets:
  - `nyc-taxi-data-project-raw` for raw data.
  - `nyc-taxi-data-project-warehouse` for processed data.
  - `nyc-taxi-data-project-results` for Athena query results.
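
  A minimal sketch of creating the three buckets with boto3 (the `us-east-1` region is an assumption):

  ```python
  import boto3

  # Region is an assumption; outside us-east-1, also pass a CreateBucketConfiguration.
  s3 = boto3.client("s3", region_name="us-east-1")

  for bucket in (
      "nyc-taxi-data-project-raw",
      "nyc-taxi-data-project-warehouse",
      "nyc-taxi-data-project-results",
  ):
      s3.create_bucket(Bucket=bucket)
  ```
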
- Upload Raw Data: Upload `sample_nyc_taxi_data.csv` to `nyc-taxi-data-project-raw`.
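
  A sketch of the upload with boto3 (the local file path is an assumption):

  ```python
  import boto3

  s3 = boto3.client("s3")
  s3.upload_file(
      Filename="sample_nyc_taxi_data.csv",   # local path is an assumption
      Bucket="nyc-taxi-data-project-raw",
      Key="sample_nyc_taxi_data.csv",
  )
  ```
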
- Create Glue Database:
  - Name: `nyc_taxi_data_db`.
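
  A boto3 sketch of creating the database (the description is an assumption):

  ```python
  import boto3

  glue = boto3.client("glue")
  glue.create_database(
      DatabaseInput={
          "Name": "nyc_taxi_data_db",
          "Description": "Catalog database for the NYC taxi pipeline",  # description is an assumption
      }
  )
  ```
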
- Create Glue Crawler:
  - Input S3 path: `s3://nyc-taxi-data-project-raw/`.
  - Output Database: `nyc_taxi_data_db`.
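
  A boto3 sketch of creating and starting the crawler (the crawler name and IAM role are assumptions):

  ```python
  import boto3

  glue = boto3.client("glue")
  glue.create_crawler(
      Name="nyc-taxi-raw-crawler",            # crawler name is an assumption
      Role="AWSGlueServiceRole-nyc-taxi",     # an existing Glue service role (assumption)
      DatabaseName="nyc_taxi_data_db",
      Targets={"S3Targets": [{"Path": "s3://nyc-taxi-data-project-raw/"}]},
  )
  glue.start_crawler(Name="nyc-taxi-raw-crawler")
  ```
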
- Submit Glue ETL Job:
  - Use `scripts/glue_etl_job.py`.
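
  A boto3 sketch of starting a run once the job has been created from `scripts/glue_etl_job.py` (the job name is an assumption):

  ```python
  import boto3

  glue = boto3.client("glue")

  # "nyc-taxi-etl-job" is a hypothetical job name defined from scripts/glue_etl_job.py.
  run = glue.start_job_run(
      JobName="nyc-taxi-etl-job",
      Arguments={"--datalake-formats": "iceberg"},  # enables Iceberg support on Glue jobs
  )
  state = glue.get_job_run(JobName="nyc-taxi-etl-job", RunId=run["JobRunId"])
  print(state["JobRun"]["JobRunState"])
  ```
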
- Run Athena Queries:
  - Use `scripts/test_queries.sql`.
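
  A boto3 sketch of running a query and writing results to the results bucket (the query and table name are illustrative; the project's queries live in `scripts/test_queries.sql`):

  ```python
  import boto3

  athena = boto3.client("athena")
  resp = athena.start_query_execution(
      QueryString="SELECT COUNT(*) FROM nyc_taxi_data_db.taxi_trips",  # table name is an assumption
      QueryExecutionContext={"Database": "nyc_taxi_data_db"},
      ResultConfiguration={"OutputLocation": "s3://nyc-taxi-data-project-results/"},
  )
  print(resp["QueryExecutionId"])
  ```
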
Learn more about setting up this project step-by-step in my Medium article:
Getting Started with AWS Glue and Apache Iceberg for ETL Pipelines