Automated Big Data ETL Pipeline: Scalable Processing with AWS EMR, Airflow, and Snowflake
This project demonstrates how to build an automated big data ETL pipeline using Apache Airflow, Amazon EMR, Amazon S3, and Snowflake. The pipeline:
- Creates an EMR cluster
- Runs Spark jobs to process raw data
- Stores the transformed data in Amazon S3
- Loads the processed data into Snowflake
- Shuts down the EMR cluster after processing
The DAG is orchestrated using Apache Airflow, automating the entire ETL process.
```bash
git clone https://github.com/Abd-al-RahmanH/Automated-Big-Data-ETL-Pipeline-Scalable-Processing-with-AWS-EMR-Airflow-and-Snowflake.git
cd Automated-Big-Data-ETL-Pipeline-Scalable-Processing-with-AWS-EMR-Airflow-and-Snowflake
```

```
FINAL_EMR_REDFIN/
├── dags/
│   └── redfin_analytics.py          # Airflow DAG for EMR cluster and Spark jobs
├── scripts/
│   ├── ingest.sh                    # Shell script for data ingestion
│   └── transform_redfin_data.py     # Spark script for data transformation
├── commands.txt                     # Commands for setting up AWS & Airflow
├── README.md                        # Project documentation
└── requirements.txt                 # Python dependencies
```
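The Python dependencies can be installed from requirements.txt. A minimal sketch (the virtual environment is optional and just an assumption here):

```bash
python3 -m venv .venv && source .venv/bin/activate   # optional: isolate the environment
pip install -r requirements.txt
```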
Before running the project, ensure you have the following:
- An AWS account with permissions for VPC, S3, EMR, IAM, and Snowflake integration
- An S3 bucket for raw and processed data
- Apache Airflow installed and configured
- AWS credentials configured in Airflow (see the connection example below)
- A Snowflake account with the necessary database and tables
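One way to register the AWS credentials with Airflow is to create a connection from the CLI; the connection ID and region below are assumptions, and the UI's Admin > Connections page works just as well:

```bash
airflow connections add 'aws_default' \
    --conn-type 'aws' \
    --conn-login '<YOUR_AWS_ACCESS_KEY_ID>' \
    --conn-password '<YOUR_AWS_SECRET_ACCESS_KEY>' \
    --conn-extra '{"region_name": "us-east-1"}'
```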
Before creating an S3 bucket, set up a VPC (Virtual Private Cloud) to ensure a secure and controlled networking environment.
```bash
aws configure   # enter your access key, secret access key, and default region
aws ec2 create-vpc --cidr-block 10.0.0.0/16
aws ec2 create-subnet --vpc-id <your-vpc-id> --cidr-block 10.0.1.0/24
aws ec2 create-internet-gateway
aws ec2 attach-internet-gateway --vpc-id <your-vpc-id> --internet-gateway-id <your-igw-id>
```

Create an S3 bucket with the following structure:
```
s3://redfin-emr-project/
├── raw-data/            # Stores raw CSV data
├── scripts/             # Stores Spark and ingestion scripts
├── transformed-data/    # Stores processed Parquet files
└── emr-logs/            # Stores EMR logs
```
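If the bucket does not exist yet, it can be created from the CLI. The bucket name and region below are placeholders; the prefixes shown above are created implicitly once the first objects are uploaded:

```bash
aws s3 mb s3://redfin-emr-project --region us-east-1
```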
Upload the necessary scripts to the S3 bucket:

```bash
aws s3 cp scripts/ s3://redfin-emr-project/scripts/ --recursive
```

The redfin_analytics.py DAG automates the ETL workflow through the following tasks:
1. Start pipeline (tsk_start_pipeline)
2. Create EMR cluster (tsk_create_emr_cluster)
3. Check cluster status (tsk_is_emr_cluster_created)
4. Submit Spark jobs (tsk_add_extraction_step, tsk_add_transformation_step)
5. Check Spark job completion (tsk_is_extraction_completed, tsk_is_transformation_completed)
6. Load data into Snowflake (tsk_load_to_snowflake)
7. Terminate EMR cluster (tsk_remove_cluster)
8. End pipeline (tsk_end_pipeline)
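For orientation, here is a condensed sketch of how such a DAG is typically wired with the Amazon provider operators. It is not a copy of redfin_analytics.py: the real DAG splits extraction and transformation into separate steps and also loads the result into Snowflake, and the placeholder dictionaries stand in for the actual cluster spec and Spark steps (a recent Airflow 2.x and apache-airflow-providers-amazon release is assumed).

```python
# Condensed sketch only -- see dags/redfin_analytics.py for the full pipeline.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrJobFlowSensor, EmrStepSensor

JOB_FLOW_OVERRIDES = {"Name": "Your-EMR-Cluster-Name"}  # placeholder cluster spec (Instances, Ec2SubnetId, ...)
SPARK_STEPS = []  # placeholder: spark-submit steps for ingest.sh and transform_redfin_data.py

with DAG(
    dag_id="redfin_analytics_spark_job_dag",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    start = EmptyOperator(task_id="tsk_start_pipeline")

    create_cluster = EmrCreateJobFlowOperator(
        task_id="tsk_create_emr_cluster",
        job_flow_overrides=JOB_FLOW_OVERRIDES,
    )

    # Wait until the cluster is up and idle before submitting steps.
    cluster_ready = EmrJobFlowSensor(
        task_id="tsk_is_emr_cluster_created",
        job_flow_id=create_cluster.output,
        target_states=["WAITING"],
    )

    # The real DAG adds extraction and transformation as separate steps.
    add_steps = EmrAddStepsOperator(
        task_id="tsk_add_steps",
        job_flow_id=create_cluster.output,
        steps=SPARK_STEPS,
    )

    # Block until the submitted step has finished.
    steps_done = EmrStepSensor(
        task_id="tsk_is_step_completed",
        job_flow_id=create_cluster.output,
        step_id="{{ task_instance.xcom_pull(task_ids='tsk_add_steps')[0] }}",
    )

    # (tsk_load_to_snowflake sits here in the real DAG.)

    remove_cluster = EmrTerminateJobFlowOperator(
        task_id="tsk_remove_cluster",
        job_flow_id=create_cluster.output,
    )

    end = EmptyOperator(task_id="tsk_end_pipeline")

    start >> create_cluster >> cluster_ready >> add_steps >> steps_done >> remove_cluster >> end
```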
Make sure to update the following in redfin_analytics.py:
- Change the subnet ID: "Ec2SubnetId": "subnet-XXXXXX"
- Update the EMR cluster name: "Name": "Your-EMR-Cluster-Name"
- Specify your key name: "Ec2KeyName": "YourKeyName"
- Update the S3 bucket name: s3://your-default-bucket-name/
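These values typically live in the job flow configuration passed to the create-cluster task. The excerpt below is illustrative only, not a copy of the file: field names follow the EMR RunJobFlow API, and every value is a placeholder.

```python
# Illustrative placeholder values -- replace with your own before running.
JOB_FLOW_OVERRIDES = {
    "Name": "Your-EMR-Cluster-Name",
    "LogUri": "s3://your-default-bucket-name/emr-logs/",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "Ec2SubnetId": "subnet-XXXXXX",
        "Ec2KeyName": "YourKeyName",
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}
```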
1. Start the Airflow webserver and scheduler:

   ```bash
   airflow webserver -p 8080 &
   airflow scheduler &
   ```

2. Open the Airflow UI (http://localhost:8080) and get the password from the terminal.
3. Enable the DAG redfin_analytics_spark_job_dag by clicking the play button.
4. Trigger the DAG manually or wait for its scheduled execution.
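Alternatively, the run can be started from the command line (assuming the Airflow CLI is available in the same environment):

```bash
airflow dags trigger redfin_analytics_spark_job_dag
```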
Once triggered, the pipeline proceeds as follows:
- The job creates the EMR cluster.
- It fetches the raw data.
- The raw data is passed to the Spark jobs for processing, and the transformed output is written to the transformed-data folder in the S3 bucket.
- After that, the EMR cluster is terminated automatically.
- The job is complete.
Now that the ETL pipeline has processed the data and stored it in Amazon S3, the next step is to load it into Snowflake for further analysis.
- Sign up for a Snowflake account.
- Log in to the Snowflake console and navigate to the Worksheets section.
- Create a new worksheet to run SQL queries.
- Open the snowflake_dbscripts.txt file.
- Copy each SQL query and execute it in Snowflake.
- Ensure you replace the S3 bucket name with your actual bucket where the final cleaned CSV is stored.
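The target database and table must exist before the load; the exact DDL is in snowflake_dbscripts.txt. Purely for orientation, here is a minimal sketch of that kind of setup (the database, schema, and column names are placeholders, not the real definitions):

```sql
-- Illustrative placeholders only; run the real statements from snowflake_dbscripts.txt.
CREATE DATABASE IF NOT EXISTS redfin_db;
CREATE SCHEMA IF NOT EXISTS redfin_db.redfin_schema;

CREATE OR REPLACE TABLE redfin_housing_data (
    period_begin       DATE,
    city               STRING,
    state              STRING,
    property_type      STRING,
    median_sale_price  NUMBER
);
```

With the table in place, the stage and COPY statements below load the transformed data from S3 into it.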
```sql
CREATE OR REPLACE STAGE redfin_stage
  URL = 's3://redfin-emr-project/transformed-data/'
  FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '"');

COPY INTO redfin_housing_data
  FROM @redfin_stage
  FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '"')
  ON_ERROR = 'CONTINUE';
```

Run the following query to verify the loaded data:

```sql
SELECT * FROM redfin_housing_data LIMIT 10;
```

Once the data is available in Snowflake, it can be visualized using BI tools like Tableau, Power BI, or Looker.
Follow these steps to integrate Snowflake with Power BI:
1. Open Power BI Desktop.
2. Click Get Data → More.
3. Select Snowflake as the data source.
4. Enter the Snowflake account URL.
5. Provide your username and password.
6. Select the warehouse, database, and schema.
Check out the complete execution of the Automated Big Data ETL Pipeline in the following GIF:
- EMR creation fails? Ensure the subnet ID and AWS key pair are correct.
- S3 path issues? Verify the bucket structure and permissions.
- Airflow task failure? Check the task logs and IAM roles.
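If the cluster never reaches a ready state, describing it from the AWS CLI usually surfaces the reason (the cluster ID below is a placeholder taken from the Airflow task logs or the EMR console):

```bash
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
  --query 'Cluster.Status.StateChangeReason'
```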
This end-to-end Big Data ETL Pipeline demonstrates how to automate data processing using AWS EMR, Apache Airflow, and Snowflake. By following this approach, organizations can efficiently handle large-scale data transformations and integrate with BI tools for analysis.
- Used Apache Airflow to automate the ETL process.
- Processed raw data using Spark on AWS EMR.
- Stored the transformed data in Amazon S3.
- Loaded the cleaned data into Snowflake.
- Connected Snowflake to Power BI for data visualization.
For any questions or improvements, feel free to drop a comment below!