
NYC-Citibike-Data-Pipeline

1. Project Description

This project implements a batch data pipeline for NYC's Citibike data. It extracts raw data, stores it in Google Cloud Storage and BigQuery, transforms it using DBT, and visualizes insights with Google Looker Data Studio. The pipeline showcases the end-to-end data engineering process.

2. Dataset

The Citi Bike dataset offers detailed information about bike rides in New York City, including insights into usage patterns, ride durations, and station popularity. You can download the dataset from the following link: Citi Bike Dataset.
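To make the ride-duration idea concrete, here is a minimal, dependency-free sketch of reading trip records and computing per-ride durations. The two sample rows and the column names (`started_at`, `ended_at`, etc.) follow the general shape of recent Citibike trip CSVs, but are illustrative only; the exact schema varies between dataset vintages.

```python
import csv
import io
from datetime import datetime

# Hypothetical two-row sample in the shape of a Citibike trip CSV;
# real files have more columns and may use different names.
SAMPLE = """ride_id,started_at,ended_at,start_station_name,end_station_name
A1,2024-01-05 08:00:00,2024-01-05 08:25:00,W 21 St & 6 Ave,Broadway & E 14 St
B2,2024-01-05 09:10:00,2024-01-05 09:18:00,Broadway & E 14 St,W 21 St & 6 Ave
"""

def ride_minutes(row):
    """Duration of one ride in minutes, from its start/end timestamps."""
    fmt = "%Y-%m-%d %H:%M:%S"
    start = datetime.strptime(row["started_at"], fmt)
    end = datetime.strptime(row["ended_at"], fmt)
    return (end - start).total_seconds() / 60

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
durations = [ride_minutes(r) for r in rows]
print(durations)  # [25.0, 8.0]
```

In the actual pipeline this kind of per-ride arithmetic happens downstream in BigQuery/dbt, but the same timestamp logic applies.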

3. Tools and Technologies

  • Google Cloud Platform (GCP): A suite of cloud computing services that runs on the same infrastructure that Google uses internally for its end-user products.

    • Google Cloud Storage: A scalable, fully-managed object storage service that allows you to store and retrieve any amount of data at any time.
    • BigQuery: A fully-managed, serverless data warehouse that enables scalable analysis over petabytes of data.
    • Google Looker Data Studio: A business intelligence tool that helps you turn your data into informative dashboards and reports that are easy to read, easy to share, and fully customizable.
  • DBT (Data Build Tool): A command-line tool that enables data analysts and engineers to transform data in their warehouse more effectively by allowing them to write data transformation code in SQL.

  • Terraform: An open-source infrastructure as code software tool that provides a consistent CLI workflow to manage hundreds of cloud services.

  • Prefect: A workflow management system that allows you to build, run, and monitor data pipelines at scale.

4. Citibike Pipeline Architecture


Data Flow

  1. Extraction: Raw data is extracted and stored in Google Cloud Storage.
  2. Loading: Data is loaded into BigQuery for further processing.
  3. Transformation: DBT (Data Build Tool) is used to transform and model the data within BigQuery.
  4. Visualization: Insights are visualized using Google Looker Data Studio.
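The four stages above can be sketched as plain functions to show the hand-offs. In the real pipeline these are Prefect tasks talking to GCS, BigQuery, and dbt; everything below (function names, the toy rows, the `fact_citibike` fields) is an illustrative stand-in, not the repo's actual code.

```python
# Dependency-free sketch of the extraction -> loading -> transformation flow.

def extract():
    # Stands in for downloading the raw Citibike CSVs.
    return [{"ride_id": "A1", "minutes": 25}, {"ride_id": "B2", "minutes": 8}]

def load(rows):
    # Stands in for writing to GCS and loading a BigQuery staging table.
    return {"staging_trips": rows}

def transform(warehouse):
    # Stands in for the dbt models that build the final fact table.
    trips = warehouse["staging_trips"]
    warehouse["fact_citibike"] = {
        "ride_count": len(trips),
        "avg_minutes": sum(t["minutes"] for t in trips) / len(trips),
    }
    return warehouse

warehouse = transform(load(extract()))
print(warehouse["fact_citibike"])  # {'ride_count': 2, 'avg_minutes': 16.5}
```

The visualization stage then reads the transformed table (here `fact_citibike`) rather than the raw trips, which is why the transformation step sits between loading and Looker Studio.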

5. Steps to Execute

To successfully execute this project, follow the steps outlined below to set up the necessary environments and tools:

πŸ’» Code Setup

1. Clone the git repo to your system

git clone <your-repo-url>

2. Python Environment Setup

python3 -m venv .venv
source .venv/bin/activate

3. Install the necessary packages and libraries

  pip install -r requirements.txt

🌐 Google Cloud Environment Setup

1. Log In with the Desired Google Account and Create a Project

2. Configure Identity and Access Management (IAM) for the Service Account

  • Assign the following roles:
    • BigQuery Admin
    • Storage Admin
    • Storage Object Admin

3. Authenticate Your Google Account

  • To authenticate with your Google account, use the following command:
    gcloud auth login
  • Set the project for your account:
    gcloud config set project YOUR_PROJECT_ID

πŸ› οΈ Terraform Setup

1. Installing Terraform and Adding it to Your PATH

  • If you don't have Terraform installed, you can download it from the official Terraform downloads page and then add it to your PATH.

2. Navigate to the Terraform Folder

  • Command to navigate to the terraform folder:
     cd terraform/

3. Run the Following Commands to Create Your Project Infrastructure

  • Terraform commands:
     terraform init
     terraform validate
     terraform plan -var="project=nyc-citibike-data-pipeline"
     terraform apply -var="project=nyc-citibike-data-pipeline"
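For orientation, the `terraform/` configuration presumably declares the GCS bucket and BigQuery dataset that the pipeline writes to, parameterized by the `project` variable passed on the command line above. The resource names, bucket naming scheme, dataset id, and location below are illustrative assumptions, not the repo's actual values.

```hcl
variable "project" {
  description = "GCP project id"
  type        = string
}

# Data-lake bucket for the raw Citibike files (name is illustrative).
resource "google_storage_bucket" "data_lake" {
  name          = "${var.project}-data-lake"
  location      = "US"
  force_destroy = true
}

# BigQuery dataset for staging and dbt models (id is illustrative).
resource "google_bigquery_dataset" "citibike" {
  dataset_id = "citibike_data"
  project    = var.project
  location   = "US"
}
```

Running `terraform plan`/`apply` with `-var="project=..."` as shown fills in `var.project` for both resources.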
    

🧩 Prefect Framework Setup

1. Confirm the Prefect Installation in Your Virtual Environment

  • Command to check the current version of the Prefect CLI:
     prefect --version

2. Start Prefect server

  • Command to initiate the Prefect server to begin managing and orchestrating your workflows
     prefect server start

3. Accessing and Configuring Blocks in the Prefect UI

  • Access the UI at http://127.0.0.1:4200/.
  • Update the blocks to register them with your credentials for Google Cloud Storage (GCS) and BigQuery. This can be done in the Blocks options.
  • You can either keep the block names as they appear in the code or rename them. If you choose to rename them, ensure that you update the code to reference the new block names.

4. Running the Prefect Data Pipeline

  • Return to the terminal and navigate to the prefect/ directory:
     cd prefect/
  • Execute the data pipeline script:
     python citibike_data_pipeline.py
  • The Python script will then store the Citibike data in both your GCS bucket and BigQuery.
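A batch pipeline like this typically derives both the public download URL and the GCS object path from a year/month pair, one file per month. The URL template below follows Citibike's public S3 naming for recent years, and the bucket-relative path is purely illustrative; check `citibike_data_pipeline.py` for the repo's actual scheme.

```python
# Illustrative helpers for monthly-partitioned paths (names are assumptions).

def source_url(year: int, month: int) -> str:
    # Citibike publishes monthly zips; exact file naming varies by year.
    return (f"https://s3.amazonaws.com/tripdata/"
            f"{year}{month:02d}-citibike-tripdata.csv.zip")

def gcs_path(year: int, month: int) -> str:
    # Bucket-relative object path for the cleaned monthly file.
    return f"data/citibike/{year}/{year}{month:02d}-tripdata.parquet"

paths = [gcs_path(2024, m) for m in range(1, 4)]
print(paths)
```

Keeping path construction in small pure functions like these makes the flow easy to fan out over months (e.g. one Prefect task run per `(year, month)` pair).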

πŸ” Running the dbt Flow

1. Create a dbt Account

  • Create a dbt account and log in using dbt Cloud.

2. Clone the Repository

  • Once logged in, clone the repository for use.

3. Run the dbt Command

  • In the CLI at the bottom, execute the following command:
    dbt run
  • This command will run all the models and create the final dataset called fact_citibike.

4. Verify Successful Execution

  • Upon a successful run, the lineage graph of fact_citibike will be displayed in dbt Cloud.

πŸ“Š Visualization in Looker Studio

1. Utilize the Dataset

  • You can now use the fact_citibike dataset within Looker Studio for creating visualizations.

2. Access the Report

  • The half-year Citibike analysis is available as Report-2024.

6. Citibike Dashboard 2024

Access the NYC Citibike Dashboard - Report-2024

  • Half-Yearly Report: NYC Citi Bike User Analysis (Jan - Jul 2024)
  • Half-Yearly Report: NYC Citi Bike Monthly Ride Analysis (Jan - Jul 2024)
