
NYC-Citibike-Data-Pipeline

1. Project Description

This project implements a batch data pipeline for NYC's Citibike data. It extracts raw data, stores it in Google Cloud Storage and BigQuery, transforms it using DBT, and visualizes insights with Google Looker Data Studio. The pipeline showcases the end-to-end data engineering process.

2. Dataset

The Citi Bike dataset offers detailed information about bike rides in New York City, including insights into usage patterns, ride durations, and station popularity. You can download the dataset from the following link: Citi Bike Dataset.
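To make the ride-duration idea concrete, here is a minimal, dependency-free sketch of reading trip records and computing per-ride durations. The two sample rows and the column names (`started_at`, `ended_at`, etc.) follow the general shape of recent Citibike trip CSVs, but are illustrative only; the exact schema varies between dataset vintages.

```python
import csv
import io
from datetime import datetime

# Hypothetical two-row sample in the shape of a Citibike trip CSV;
# real files have more columns and may use different names.
SAMPLE = """ride_id,started_at,ended_at,start_station_name,end_station_name
A1,2024-01-05 08:00:00,2024-01-05 08:25:00,W 21 St & 6 Ave,Broadway & E 14 St
B2,2024-01-05 09:10:00,2024-01-05 09:18:00,Broadway & E 14 St,W 21 St & 6 Ave
"""

def ride_minutes(row):
    """Duration of one ride in minutes, from its start/end timestamps."""
    fmt = "%Y-%m-%d %H:%M:%S"
    start = datetime.strptime(row["started_at"], fmt)
    end = datetime.strptime(row["ended_at"], fmt)
    return (end - start).total_seconds() / 60

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
durations = [ride_minutes(r) for r in rows]
print(durations)  # [25.0, 8.0]
```

In the actual pipeline this kind of per-ride arithmetic happens downstream in BigQuery/dbt, but the same timestamp logic applies.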

3. Tools and Technologies

  • Google Cloud Platform (GCP): A suite of cloud computing services that runs on the same infrastructure that Google uses internally for its end-user products.

    • Google Cloud Storage: A scalable, fully-managed object storage service that allows you to store and retrieve any amount of data at any time.
    • BigQuery: A fully-managed, serverless data warehouse that enables scalable analysis over petabytes of data.
    • Google Looker Data Studio: A business intelligence tool that helps you turn your data into informative dashboards and reports that are easy to read, easy to share, and fully customizable.
  • DBT (Data Build Tool): A command-line tool that enables data analysts and engineers to transform data in their warehouse more effectively by allowing them to write data transformation code in SQL.

  • Terraform: An open-source infrastructure as code software tool that provides a consistent CLI workflow to manage hundreds of cloud services.

  • Prefect: A workflow management system that allows you to build, run, and monitor data pipelines at scale.

4. Citibike Pipeline Architecture


Data Flow

  1. Extraction: Raw data is extracted and stored in Google Cloud Storage.
  2. Loading: Data is loaded into BigQuery for further processing.
  3. Transformation: DBT (Data Build Tool) is used to transform and model the data within BigQuery.
  4. Visualization: Insights are visualized using Google Looker Data Studio.
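The four stages above can be sketched as plain functions to show the hand-offs. In the real pipeline these are Prefect tasks talking to GCS, BigQuery, and dbt; everything below (function names, the toy rows, the `fact_citibike` fields) is an illustrative stand-in, not the repo's actual code.

```python
# Dependency-free sketch of the extraction -> loading -> transformation flow.

def extract():
    # Stands in for downloading the raw Citibike CSVs.
    return [{"ride_id": "A1", "minutes": 25}, {"ride_id": "B2", "minutes": 8}]

def load(rows):
    # Stands in for writing to GCS and loading a BigQuery staging table.
    return {"staging_trips": rows}

def transform(warehouse):
    # Stands in for the dbt models that build the final fact table.
    trips = warehouse["staging_trips"]
    warehouse["fact_citibike"] = {
        "ride_count": len(trips),
        "avg_minutes": sum(t["minutes"] for t in trips) / len(trips),
    }
    return warehouse

warehouse = transform(load(extract()))
print(warehouse["fact_citibike"])  # {'ride_count': 2, 'avg_minutes': 16.5}
```

The visualization stage then reads the transformed table (here `fact_citibike`) rather than the raw trips, which is why the transformation step sits between loading and Looker Studio.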

5. Steps to Execute

To successfully execute this project, follow the steps outlined below to set up the necessary environments and tools:

πŸ’» Code Setup

1. Clone the git repo to your system

git clone <your-repo-url>

2. Python Environment Setup

python3 -m venv .venv
source .venv/bin/activate

3. Install the necessary packages and libraries

  pip install -r requirements.txt

🌐 Google Cloud Environment Setup

1. Log In with the Desired Google Account and Create a Project

2. Configure Identity and Access Management (IAM) for the Service Account

  • Assign the following roles:
    • BigQuery Admin
    • Storage Admin
    • Storage Object Admin

3. Authenticate Your Google Account

  • To authenticate with your Google account, use the following command:
    gcloud auth login
  • Set the project for your account:
    gcloud config set project YOUR_PROJECT_ID

πŸ› οΈ Terraform Setup

1. Installing Terraform and Adding it to Your PATH

  • If you don't have Terraform installed, you can download it from the official Terraform downloads page and then add it to your PATH.

2. Navigate to the Terraform Folder

  • Command to navigate to the terraform folder:
     cd terraform/

3. Run the Following Commands to Create Your Project Infrastructure

  • Terraform commands:
     terraform init
     terraform validate
     terraform plan -var="project=nyc-citibike-data-pipeline"
     terraform apply -var="project=nyc-citibike-data-pipeline"
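For orientation, the `terraform/` configuration presumably declares the GCS bucket and BigQuery dataset that the pipeline writes to, parameterized by the `project` variable passed on the command line above. The resource names, bucket naming scheme, dataset id, and location below are illustrative assumptions, not the repo's actual values.

```hcl
variable "project" {
  description = "GCP project id"
  type        = string
}

# Data-lake bucket for the raw Citibike files (name is illustrative).
resource "google_storage_bucket" "data_lake" {
  name          = "${var.project}-data-lake"
  location      = "US"
  force_destroy = true
}

# BigQuery dataset for staging and dbt models (id is illustrative).
resource "google_bigquery_dataset" "citibike" {
  dataset_id = "citibike_data"
  project    = var.project
  location   = "US"
}
```

Running `terraform plan`/`apply` with `-var="project=..."` as shown fills in `var.project` for both resources.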
    

🧩 Prefect Framework Setup

1. Confirm the Prefect Installation in Your Virtual Environment

  • Command to check the current version of the Prefect CLI:
     prefect --version

2. Start Prefect server

  • Command to initiate the Prefect server to begin managing and orchestrating your workflows
     prefect server start

3. Accessing and Configuring Blocks in the Prefect UI

  • Access the UI at http://127.0.0.1:4200/.
  • Update the blocks to register them with your credentials for Google Cloud Storage (GCS) and BigQuery. This can be done in the Blocks options.
  • You can either keep the block names as they appear in the code or rename them. If you choose to rename them, ensure that you update the code to reference the new block names.

4. Running the Prefect Data Pipeline

  • Return to the terminal and navigate to the prefect/ directory:
     cd prefect/
  • Execute the data pipeline script:
     python citibike_data_pipeline.py
  • The Python script will then store the Citibike data in both your GCS bucket and BigQuery.
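A batch pipeline like this typically derives both the public download URL and the GCS object path from a year/month pair, one file per month. The URL template below follows Citibike's public S3 naming for recent years, and the bucket-relative path is purely illustrative; check `citibike_data_pipeline.py` for the repo's actual scheme.

```python
# Illustrative helpers for monthly-partitioned paths (names are assumptions).

def source_url(year: int, month: int) -> str:
    # Citibike publishes monthly zips; exact file naming varies by year.
    return (f"https://s3.amazonaws.com/tripdata/"
            f"{year}{month:02d}-citibike-tripdata.csv.zip")

def gcs_path(year: int, month: int) -> str:
    # Bucket-relative object path for the cleaned monthly file.
    return f"data/citibike/{year}/{year}{month:02d}-tripdata.parquet"

paths = [gcs_path(2024, m) for m in range(1, 4)]
print(paths)
```

Keeping path construction in small pure functions like these makes the flow easy to fan out over months (e.g. one Prefect task run per `(year, month)` pair).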

πŸ” Running the dbt Flow

1. Create a dbt Account

  • Create a dbt account and log in using dbt Cloud.

2. Clone the Repository

  • Once logged in, clone the repository for use.

3. Run the dbt Command

  • In the CLI at the bottom, execute the following command:
    dbt run
  • This command will run all the models and create the final dataset called fact_citibike.

4. Verify Successful Execution

  • Upon a successful run, the lineage graph of fact_citibike will be displayed in dbt Cloud.

πŸ“Š Visualization in Looker Studio

1. Utilize the Dataset

  • You can now use the fact_citibike dataset within Looker Studio for creating visualizations.

2. Access the Report

  • The half-year Citibike analysis is available as Report-2024.

6. Citibike Dashboard 2024

Access the NYC Citibike Dashboard - Report-2024

  • Half-Yearly Report: NYC Citi Bike User Analysis (Jan - Jul 2024)
  • Half-Yearly Report: NYC Citi Bike Monthly Ride Analysis (Jan - Jul 2024)
