This project implements a batch data pipeline for NYC's Citi Bike data. It extracts raw data, stores it in Google Cloud Storage and BigQuery, transforms it with dbt, and visualizes insights in Looker Studio (formerly Google Data Studio). The pipeline showcases an end-to-end data engineering process.
The Citi Bike dataset offers detailed information about bike rides in New York City, including insights into usage patterns, ride durations, and station popularity. You can download the dataset from the following link: Citi Bike Dataset.
- Google Cloud Platform (GCP): A suite of cloud computing services that runs on the same infrastructure that Google uses internally for its end-user products.
- Google Cloud Storage: A scalable, fully-managed object storage service that allows you to store and retrieve any amount of data at any time.
- BigQuery: A fully-managed, serverless data warehouse that enables scalable analysis over petabytes of data.
- Looker Studio (formerly Google Data Studio): A business intelligence tool that turns your data into dashboards and reports that are easy to read, easy to share, and fully customizable.
- dbt (data build tool): A command-line tool that lets data analysts and engineers transform data in their warehouse by writing transformation code in SQL.
- Terraform: An open-source infrastructure-as-code tool that provides a consistent CLI workflow to manage hundreds of cloud services.
- Prefect: A workflow management system that allows you to build, run, and monitor data pipelines at scale.
- Extraction: Raw data is extracted and stored in Google Cloud Storage.
- Loading: Data is loaded into BigQuery for further processing.
- Transformation: dbt (data build tool) transforms and models the data within BigQuery.
- Visualization: Insights are visualized using Looker Studio.
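As a rough illustration of the extraction and loading steps above, the sketch below plans where each month of raw data could land in Cloud Storage and BigQuery. The object prefix, dataset name, table naming scheme, and six-month window are assumptions made for illustration; they are not taken from the project's code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MonthlyExtract:
    """One month of raw trip data and its (hypothetical) storage locations."""

    year: int
    month: int

    @property
    def period(self) -> str:
        # Zero-padded YYYYMM key, e.g. March 2024 becomes "202403".
        return f"{self.year}{self.month:02d}"

    def gcs_object(self, prefix: str = "raw/citibike") -> str:
        # Hypothetical object path inside the GCS bucket (extraction target).
        return f"{prefix}/{self.period}.csv"

    def bigquery_table(self, dataset: str = "citibike_raw") -> str:
        # Hypothetical dataset.table identifier in BigQuery (loading target).
        return f"{dataset}.tripdata_{self.period}"


def plan_months(year: int, first_month: int, count: int) -> List[MonthlyExtract]:
    """List `count` consecutive months of one year, starting at `first_month`."""
    last_month = first_month + count - 1
    if first_month < 1 or count < 1 or last_month > 12:
        raise ValueError("months must fall within a single calendar year")
    return [MonthlyExtract(year, m) for m in range(first_month, last_month + 1)]


if __name__ == "__main__":
    for extract in plan_months(2024, first_month=1, count=6):
        print(extract.gcs_object(), "->", extract.bigquery_table())
```

Keeping both destinations on one object means the extraction and loading steps derive their names from the same month key, so they cannot drift apart.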
To successfully execute this project, follow the steps outlined below to set up the necessary environments and tools:
git clone <your-repo-url>
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
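To confirm that the virtual environment created above is the one actually in use before installing or running anything, a small check like the following can help. It is a minimal sketch that relies only on the standard library; the function name is illustrative.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside an environment created with `python3 -m venv`, sys.prefix points
    # at the environment while sys.base_prefix points at the base interpreter.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter:", sys.executable)
```

If the check reports that no environment is active, run `source .venv/bin/activate` again in the current shell.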
- Access Google Cloud at Google Cloud Console.
- Assign the following roles:
- BigQuery Admin
- Storage Admin
- Storage Object Admin
- To authenticate with your Google account, use the following command:
gcloud auth login
- Set the project for your account:
gcloud config set project YOUR_PROJECT_ID
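The three roles listed above can also be granted from the command line rather than through the console. The sketch below only prints the corresponding `gcloud projects add-iam-policy-binding` commands. It assumes the roles are granted to a service account, which the steps above do not state explicitly, and both the project ID and the service account email are placeholders.

```python
import shlex
from typing import List

# IAM role IDs for the three roles listed above.
ROLE_IDS = {
    "BigQuery Admin": "roles/bigquery.admin",
    "Storage Admin": "roles/storage.admin",
    "Storage Object Admin": "roles/storage.objectAdmin",
}


def role_binding_commands(project_id: str, service_account_email: str) -> List[str]:
    """Build one `gcloud projects add-iam-policy-binding` command per role."""
    # Assumption: the principal is a service account, not a user account.
    member = f"serviceAccount:{service_account_email}"
    return [
        shlex.join([
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member={member}", f"--role={role_id}",
        ])
        for role_id in ROLE_IDS.values()
    ]


if __name__ == "__main__":
    # Placeholder values; replace with your own project and service account.
    for command in role_binding_commands(
        "YOUR_PROJECT_ID",
        "pipeline-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com",
    ):
        print(command)
```

Printing the commands instead of executing them lets you review each binding before it is applied.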
- If you don't have Terraform installed, download it from the official Terraform website and add it to your PATH.
- Navigate to the terraform folder:
cd terraform/
- Run the Terraform commands in order:
terraform init
terraform validate
terraform plan -var="project=nyc-citibike-data-pipeline"
terraform apply -var="project=nyc-citibike-data-pipeline"
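The order of the Terraform commands matters: each step should succeed before the next one runs. The sketch below shows one way to enforce that from Python. The helper names and the default working directory (which assumes you start from the repository root) are illustrative, not part of the project.

```python
import subprocess
from typing import List


def terraform_commands(project: str) -> List[List[str]]:
    """Return the four Terraform commands in the order they should run."""
    var_flag = f"-var=project={project}"
    return [
        ["terraform", "init"],
        ["terraform", "validate"],
        ["terraform", "plan", var_flag],
        ["terraform", "apply", var_flag],
    ]


def run_terraform(project: str, workdir: str = "terraform") -> None:
    """Run each command inside `workdir`, stopping at the first failure."""
    for command in terraform_commands(project):
        # check=True raises CalledProcessError on a non-zero exit code,
        # so `apply` never runs after a failed `validate` or `plan`.
        subprocess.run(command, cwd=workdir, check=True)


if __name__ == "__main__":
    # Only prints the commands; call run_terraform(...) to execute them.
    for command in terraform_commands("nyc-citibike-data-pipeline"):
        print(" ".join(command))
```

Because the subprocess inherits the terminal, `terraform apply` still asks for confirmation before creating any resources.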
- Check the installed version of the Prefect CLI:
prefect --version
- Start the Prefect server to manage and orchestrate your workflows:
prefect server start
- Access the UI at http://127.0.0.1:4200/.
- In the Blocks section of the UI, register the blocks with your credentials for Google Cloud Storage (GCS) and BigQuery.
- You can either keep the block names as they appear in the code or rename them. If you choose to rename them, ensure that you update the code to reference the new block names.
- Return to the terminal and navigate to the prefect/ directory:
cd prefect/
- Execute the data pipeline script:
python citibike_data_pipeline.py
- The Python script will then store the Citibike data in both your GCS bucket and BigQuery.
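If you rename the Prefect blocks, one way to avoid editing names in several places is to resolve them once from a single mapping. The sketch below is an assumption about how that could look, not the project's actual code. The default block names and the `CITIBIKE_BLOCK_*` environment variables are hypothetical.

```python
import os
from typing import Dict, Mapping

# Hypothetical default block names; use whatever names your blocks
# actually have in the Prefect UI.
DEFAULT_BLOCK_NAMES = {
    "gcs_bucket": "citibike-gcs-bucket",
    "gcp_credentials": "citibike-gcp-credentials",
}


def resolve_block_names(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return block names, letting CITIBIKE_BLOCK_* variables override defaults."""
    return {
        key: env.get(f"CITIBIKE_BLOCK_{key.upper()}", default)
        for key, default in DEFAULT_BLOCK_NAMES.items()
    }


if __name__ == "__main__":
    names = resolve_block_names()
    # The pipeline code would pass these names to the blocks' load calls.
    print(names)
```

With this approach, renaming a block in the UI only requires setting one environment variable (for example `CITIBIKE_BLOCK_GCS_BUCKET`) instead of changing the script.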
- Create a dbt account and log in using dbt Cloud.
- Once logged in, clone the repository for use.
- In the CLI at the bottom, execute the following command:
dbt run
- This command runs all the models and creates the final dataset, fact_citibike.
- Upon a successful run, the lineage of fact_citibike will be displayed.
- You can now use the fact_citibike dataset within Looker Studio for creating visualizations.
- The report for the half-year Citi Bike analysis is available here: Report-2024.
Access the NYC Citibike Dashboard - Report-2024



