🚀 Automated Big Data ETL Pipeline: Scalable Processing with AWS EMR, Airflow, and Snowflake ❄️

🏒 Introduction

This project demonstrates how to build an automated big data ETL pipeline using Apache Airflow 📚, Amazon EMR 📊, Amazon S3 💾, and Snowflake ❄️. The pipeline:

  • 🏭 Creates an EMR Cluster
  • 🔥 Runs Spark jobs to process raw data
  • 💾 Stores the transformed data in Amazon S3
  • ❄️ Loads the processed data into Snowflake
  • 🚫 Shuts down the EMR cluster after processing

The DAG is orchestrated using Apache Airflow, automating the entire ETL process. ⚙️


πŸ“ Repository Structure

πŸ› οΈ Clone the Project

git clone https://github.com/Abd-al-RahmanH/Automated-Big-Data-ETL-Pipeline-Scalable-Processing-with-AWS-EMR-Airflow-and-Snowflake.git

cd Automated-Big-Data-ETL-Pipeline-Scalable-Processing-with-AWS-EMR-Airflow-and-Snowflake

📌 Project Files

FINAL_EMR_REDFIN/
├── dags/
│   └── redfin_analytics.py       # Airflow DAG for EMR cluster and Spark jobs
│
├── scripts/
│   ├── ingest.sh                 # Shell script for data ingestion
│   └── transform_redfin_data.py  # Spark script for data transformation
│
├── commands.txt                  # Commands for setting up AWS & Airflow
├── README.md                     # Project documentation
└── requirements.txt              # Python dependencies
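
For context, transform_redfin_data.py is the Spark job that EMR runs. Below is a minimal sketch of what such a script generally looks like; the S3 paths, cleanup steps, and output format are illustrative, not the actual implementation.

# transform_sketch.py -- illustrative outline only, not the real transform_redfin_data.py
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("RedfinTransformSketch").getOrCreate()

    # Read the raw CSV that the ingestion step placed in S3
    raw_df = spark.read.csv(
        "s3://redfin-emr-project/raw-data/", header=True, inferSchema=True
    )

    # Example cleanup: drop rows with missing values and deduplicate
    cleaned_df = raw_df.na.drop().dropDuplicates()

    # Write the cleaned data back to the transformed-data prefix
    # (CSV here so it lines up with the Snowflake COPY shown later in this README)
    cleaned_df.write.mode("overwrite").option("header", True).csv(
        "s3://redfin-emr-project/transformed-data/"
    )

    spark.stop()

if __name__ == "__main__":
    main()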

⚡ Prerequisites

Before running the project, ensure you have the following:

βœ”οΈ AWS Account with permissions for VPC, S3, EMR, IAM, and Snowflake integration
βœ”οΈ S3 Bucket for raw and processed data
βœ”οΈ Apache Airflow πŸ“š installed and configured
βœ”οΈ AWS credentials configured in Airflow
βœ”οΈ Snowflake ❄️ Account with the necessary database and tables


πŸ›‘οΈ Step 1: Setting Up a VPC

Before creating an S3 bucket, set up a VPC (Virtual Private Cloud) to ensure a secure and controlled networking environment.

🔹 Create a VPC

aws configure  # Enter your access key, secret access key, and default region
aws ec2 create-vpc --cidr-block 10.0.0.0/16

🔹 Create a Subnet

aws ec2 create-subnet --vpc-id <your-vpc-id> --cidr-block 10.0.1.0/24

🔹 Create an Internet Gateway

aws ec2 create-internet-gateway
aws ec2 attach-internet-gateway --vpc-id <your-vpc-id> --internet-gateway-id <your-igw-id>

Alternatively, you can create these resources in the AWS Console by following the image below.
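
If you prefer scripting this step in Python, the same VPC, subnet, and internet gateway can be created with boto3. A minimal sketch, assuming credentials are already configured via aws configure and using an example region:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an example

# Create the VPC and a subnet inside it (same CIDR blocks as the CLI commands above)
vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")
vpc_id = vpc["Vpc"]["VpcId"]
subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")

# Create an internet gateway and attach it to the VPC
igw = ec2.create_internet_gateway()
igw_id = igw["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(VpcId=vpc_id, InternetGatewayId=igw_id)

print(f"VPC: {vpc_id}, Subnet: {subnet['Subnet']['SubnetId']}, IGW: {igw_id}")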


💾 Step 2: Setting Up an S3 Bucket

Create an S3 bucket with the following structure:

s3://redfin-emr-project/
├── raw-data/            # Stores raw CSV data
├── scripts/             # Stores Spark and ingestion scripts
├── transformed-data/    # Stores processed Parquet files
└── emr-logs/            # Stores EMR logs

Upload necessary scripts to the S3 bucket:

aws s3 cp scripts/ s3://redfin-emr-project/scripts/ --recursive
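
Equivalently, the bucket layout and script upload can be done with boto3. A small sketch, using the bucket name above and assuming the local paths match the repository tree:

import os
import boto3

s3 = boto3.client("s3")
bucket = "redfin-emr-project"

# Create the bucket (outside us-east-1, also pass
# CreateBucketConfiguration={"LocationConstraint": "<your-region>"})
s3.create_bucket(Bucket=bucket)

# Create the logical "folders" as zero-byte prefix markers
for prefix in ("raw-data/", "scripts/", "transformed-data/", "emr-logs/"):
    s3.put_object(Bucket=bucket, Key=prefix)

# Upload everything under the local scripts/ directory
for name in os.listdir("scripts"):
    s3.upload_file(os.path.join("scripts", name), bucket, f"scripts/{name}")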

🔄 Step 3: Configuring the Airflow DAG

The redfin_analytics.py DAG automates the ETL workflow.

🔗 DAG Flow:

1️⃣ Start pipeline (tsk_start_pipeline)
2️⃣ Create EMR cluster (tsk_create_emr_cluster)
3️⃣ Check cluster status (tsk_is_emr_cluster_created)
4️⃣ Submit Spark jobs (tsk_add_extraction_step, tsk_add_transformation_step)
5️⃣ Check Spark job completion (tsk_is_extraction_completed, tsk_is_transformation_completed)
6️⃣ Load data into Snowflake (tsk_load_to_snowflake)
7️⃣ Terminate EMR cluster (tsk_remove_cluster)
8️⃣ End pipeline (tsk_end_pipeline)
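
These tasks map onto operators and sensors from the Airflow Amazon provider package. The following is a heavily condensed sketch of how such a DAG is typically wired, not the exact contents of redfin_analytics.py; the task IDs mirror the list above, while the cluster and step definitions are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrJobFlowSensor, EmrStepSensor

# Placeholder definitions; the fields you must edit are listed under "Key Configurations" below
JOB_FLOW_OVERRIDES = {"Name": "Your-EMR-Cluster-Name"}
SPARK_STEPS = [{
    "Name": "transform_redfin_data",
    "ActionOnFailure": "CANCEL_AND_WAIT",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit",
                 "s3://your-default-bucket-name/scripts/transform_redfin_data.py"],
    },
}]

with DAG(
    dag_id="redfin_analytics_spark_job_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:

    create_cluster = EmrCreateJobFlowOperator(
        task_id="tsk_create_emr_cluster",
        job_flow_overrides=JOB_FLOW_OVERRIDES,
        aws_conn_id="aws_default",
    )

    cluster_ready = EmrJobFlowSensor(
        task_id="tsk_is_emr_cluster_created",
        job_flow_id=create_cluster.output,
        target_states=["WAITING"],   # wait until the cluster is up and idle
    )

    add_step = EmrAddStepsOperator(
        task_id="tsk_add_transformation_step",
        job_flow_id=create_cluster.output,
        steps=SPARK_STEPS,
    )

    step_done = EmrStepSensor(
        task_id="tsk_is_transformation_completed",
        job_flow_id=create_cluster.output,
        step_id="{{ ti.xcom_pull(task_ids='tsk_add_transformation_step')[0] }}",
    )

    remove_cluster = EmrTerminateJobFlowOperator(
        task_id="tsk_remove_cluster",
        job_flow_id=create_cluster.output,
    )

    create_cluster >> cluster_ready >> add_step >> step_done >> remove_cluster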

⚡ Key Configurations in the DAG

Make sure to update the following in redfin_analytics.py:

  • ✅ Change the Subnet ID: "Ec2SubnetId": "subnet-XXXXXX"
  • ✅ Update EMR Cluster Name: "Name": "Your-EMR-Cluster-Name"
  • ✅ Specify Your Key Name: "Ec2KeyName": "YourKeyName"
  • ✅ Update S3 Bucket Name: s3://your-default-bucket-name/

Also make sure you add the IAM roles from commands.txt.
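
For orientation, the bullet points above correspond to fields in the JOB_FLOW_OVERRIDES dictionary passed to EmrCreateJobFlowOperator. The trimmed example below is only a sketch: the release label, instance types, and counts are illustrative, and the role names assume the EMR defaults created via commands.txt.

JOB_FLOW_OVERRIDES = {
    "Name": "Your-EMR-Cluster-Name",                      # <-- your cluster name
    "ReleaseLabel": "emr-6.9.0",                          # illustrative EMR release
    "Applications": [{"Name": "Spark"}],
    "LogUri": "s3://your-default-bucket-name/emr-logs/",  # <-- your S3 bucket
    "Instances": {
        "Ec2KeyName": "YourKeyName",                      # <-- your EC2 key pair
        "Ec2SubnetId": "subnet-XXXXXX",                   # <-- subnet from Step 1
        "InstanceGroups": [
            {"Name": "Master node", "Market": "ON_DEMAND",
             "InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core nodes", "Market": "ON_DEMAND",
             "InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",   # instance profile from commands.txt
    "ServiceRole": "EMR_DefaultRole",       # service role from commands.txt
}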


▶️ Step 4: Running the DAG

1️⃣ Start the Airflow webserver and scheduler:

airflow webserver -p 8080 &
airflow scheduler &

2️⃣ Open the Airflow UI (http://localhost:8080) and get the password from the terminal output.

3️⃣ Enable the DAG redfin_analytics_spark_job_dag by clicking its toggle/play button.

4️⃣ Trigger the DAG manually or wait for its scheduled execution.
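
Runs can also be triggered from a script through the Airflow 2 REST API instead of the UI. A minimal sketch with requests, assuming basic authentication is enabled on your Airflow instance and using placeholder credentials:

import requests

AIRFLOW_URL = "http://localhost:8080/api/v1"
DAG_ID = "redfin_analytics_spark_job_dag"

# Trigger a new DAG run; the username/password here are placeholders
resp = requests.post(
    f"{AIRFLOW_URL}/dags/{DAG_ID}/dagRuns",
    auth=("airflow", "your-password"),
    json={"conf": {}},
)
resp.raise_for_status()
print(resp.json()["dag_run_id"])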

▶️ Step 5: The Job Runs

  1. Once the DAG is triggered, it creates the EMR cluster.

  2. The ingestion step fetches the raw data.

  3. The data is passed to the Spark jobs for processing, and the transformed output is written to the transformed-data folder in the S3 bucket.

  4. After that, the EMR cluster is terminated automatically.

  5. The job is complete.


❄️ Step 6: Setting Up Snowflake for Data Storage and Querying

Now that the ETL pipeline has processed the data and stored it in Amazon S3, the next step is to load it into Snowflake for further analysis.

πŸ—οΈ Creating a Snowflake Account

  1. Sign up for a Snowflake account on the Snowflake website.
  2. Log in to the Snowflake console and navigate to the Worksheets section.
  3. Create a new worksheet to run SQL queries.

📌 Executing Snowflake Queries

  1. Open the snowflake_dbscripts.txt file.
  2. Copy each SQL query and execute it in Snowflake.
  3. Replace the S3 bucket name with your actual bucket where the final cleaned CSV is stored, and make sure the stage definition gives Snowflake credentials (or a storage integration) to read from that bucket.

Example Query to Create an External Stage:

CREATE OR REPLACE STAGE redfin_stage
URL = 's3://redfin-emr-project/transformed-data/'
FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY='"');

Example Query to Load Data into Snowflake Table:

COPY INTO redfin_housing_data
FROM @redfin_stage
FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY='"')
ON_ERROR = 'CONTINUE';
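
One way to run this same COPY from Python (for example, from a task like tsk_load_to_snowflake) is the snowflake-connector-python package. A minimal sketch, with placeholder account, credentials, and object names:

import snowflake.connector

# Placeholder credentials -- in Airflow these would come from a connection or secret backend
conn = snowflake.connector.connect(
    account="your_account_identifier",
    user="your_user",
    password="your_password",
    warehouse="COMPUTE_WH",
    database="REDFIN_DB",
    schema="PUBLIC",
)

try:
    cur = conn.cursor()
    cur.execute("""
        COPY INTO redfin_housing_data
        FROM @redfin_stage
        FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY='"')
        ON_ERROR = 'CONTINUE'
    """)
    print(cur.fetchall())  # per-file load results
finally:
    conn.close()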

πŸ” Verify Data in Snowflake

Run the following query to verify the loaded data:

SELECT * FROM redfin_housing_data LIMIT 10;


📊 Step 7: Connecting Snowflake to a BI Tool

Once the data is available in Snowflake, it can be visualized using BI tools like Tableau, Power BI, or Looker.

πŸ› οΈ Connecting Snowflake to Power BI

Follow these steps to integrate Snowflake with Power BI:

  1. Open Power BI Desktop.
  2. Click on Get Data → More.
  3. Select Snowflake as the data source.
  4. Enter the Snowflake account URL.
  5. Provide username & password.
  6. Select the warehouse, database, and schema.
  7. Load the data into Power BI and start creating dashboards.

🎥 Final Output: Watch the Data Pipeline in Action!

Check out the complete execution of the Automated Big Data ETL Pipeline in the following GIF:


πŸ› οΈ Troubleshooting

❗ Common Issues & Fixes

  • ❌ EMR creation fails? Ensure correct subnet ID and AWS key pair.
  • ❌ S3 path issues? Verify bucket structure and permissions.
  • πŸ” Airflow task failure? Check task logs and IAM roles.

🚀 Conclusion

This end-to-end Big Data ETL Pipeline demonstrates how to automate data processing using AWS EMR, Apache Airflow, and Snowflake. By following this approach, organizations can efficiently handle large-scale data transformations and integrate with BI tools for analysis.

✅ Key Takeaways:

  • Used Apache Airflow to automate the ETL process.
  • Processed raw data using Spark on AWS EMR.
  • Stored the transformed data in Amazon S3.
  • Loaded the cleaned data into Snowflake.
  • Connected Snowflake to Power BI for data visualization.

For any questions or improvements, feel free to drop a comment below! 🚀

