This guide walks you through building a batch data pipeline with AWS Glue, Apache Iceberg, PySpark, and Amazon S3. Data is processed with PySpark ETL jobs, stored as Iceberg tables in S3, and queried with Athena. The main components are:
- AWS Glue for metadata management and ETL jobs.
- Apache Iceberg for efficient table formats with time travel and schema evolution.
- Athena for querying processed data.
- PyIceberg for programmatic interaction with Iceberg tables (see the sketch below this list).
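
For example, a minimal PyIceberg sketch for reading one of the project's Iceberg tables through the Glue Data Catalog (the `taxi_trips` table name is an assumption; the database and warehouse bucket come from the setup steps below):

```python
from pyiceberg.catalog import load_catalog

# Load the AWS Glue Data Catalog as an Iceberg catalog.
catalog = load_catalog(
    "glue",
    **{
        "type": "glue",
        "warehouse": "s3://nyc-taxi-data-project-warehouse/",
    },
)

# "taxi_trips" is a hypothetical table name -- use whatever table the ETL job creates.
table = catalog.load_table("nyc_taxi_data_db.taxi_trips")
print(table.scan().to_pandas().head())
```
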
The dataset used in this project is the NYC Taxi Trip Data. You can download it from the following link:
NYC Taxi Trip Data on Kaggle
- Create S3 Buckets:
  - `nyc-taxi-data-project-raw` for raw data.
  - `nyc-taxi-data-project-warehouse` for processed data.
  - `nyc-taxi-data-project-results` for Athena query results.
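
  A minimal sketch of creating the three buckets with boto3 (the `us-east-1` region is an assumption):

  ```python
  import boto3

  # Region is an assumption; outside us-east-1, also pass a CreateBucketConfiguration.
  s3 = boto3.client("s3", region_name="us-east-1")

  for bucket in (
      "nyc-taxi-data-project-raw",
      "nyc-taxi-data-project-warehouse",
      "nyc-taxi-data-project-results",
  ):
      s3.create_bucket(Bucket=bucket)
  ```
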
- Upload Raw Data: Upload `sample_nyc_taxi_data.csv` to `nyc-taxi-data-project-raw`.
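
  A sketch of the upload with boto3 (the local file path is an assumption):

  ```python
  import boto3

  s3 = boto3.client("s3")
  s3.upload_file(
      Filename="sample_nyc_taxi_data.csv",   # local path is an assumption
      Bucket="nyc-taxi-data-project-raw",
      Key="sample_nyc_taxi_data.csv",
  )
  ```
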
- Create Glue Database:
  - Name: `nyc_taxi_data_db`.
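
  A boto3 sketch of creating the database (the description is an assumption):

  ```python
  import boto3

  glue = boto3.client("glue")
  glue.create_database(
      DatabaseInput={
          "Name": "nyc_taxi_data_db",
          "Description": "Catalog database for the NYC taxi pipeline",  # description is an assumption
      }
  )
  ```
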
- Create Glue Crawler:
  - Input S3 path: `s3://nyc-taxi-data-project-raw/`.
  - Output Database: `nyc_taxi_data_db`.
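
  A boto3 sketch of creating and starting the crawler (the crawler name and IAM role are assumptions):

  ```python
  import boto3

  glue = boto3.client("glue")
  glue.create_crawler(
      Name="nyc-taxi-raw-crawler",            # crawler name is an assumption
      Role="AWSGlueServiceRole-nyc-taxi",     # an existing Glue service role (assumption)
      DatabaseName="nyc_taxi_data_db",
      Targets={"S3Targets": [{"Path": "s3://nyc-taxi-data-project-raw/"}]},
  )
  glue.start_crawler(Name="nyc-taxi-raw-crawler")
  ```
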
- Submit Glue ETL Job:
  - Use `scripts/glue_etl_job.py`.
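
  A boto3 sketch of starting a run once the job has been created from `scripts/glue_etl_job.py` (the job name is an assumption):

  ```python
  import boto3

  glue = boto3.client("glue")

  # "nyc-taxi-etl-job" is a hypothetical job name defined from scripts/glue_etl_job.py.
  run = glue.start_job_run(
      JobName="nyc-taxi-etl-job",
      Arguments={"--datalake-formats": "iceberg"},  # enables Iceberg support on Glue jobs
  )
  state = glue.get_job_run(JobName="nyc-taxi-etl-job", RunId=run["JobRunId"])
  print(state["JobRun"]["JobRunState"])
  ```
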
- Run Athena Queries:
  - Use `scripts/test_queries.sql`.
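
  A boto3 sketch of running a query and writing results to the results bucket (the query and table name are illustrative; the project's queries live in `scripts/test_queries.sql`):

  ```python
  import boto3

  athena = boto3.client("athena")
  resp = athena.start_query_execution(
      QueryString="SELECT COUNT(*) FROM nyc_taxi_data_db.taxi_trips",  # table name is an assumption
      QueryExecutionContext={"Database": "nyc_taxi_data_db"},
      ResultConfiguration={"OutputLocation": "s3://nyc-taxi-data-project-results/"},
  )
  print(resp["QueryExecutionId"])
  ```
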
Learn more about setting up this project step-by-step in my Medium article:
Getting Started with AWS Glue and Apache Iceberg for ETL Pipelines