This project implements a data pipeline that uses the USGS Earthquake API to ingest, process, and visualize global earthquake data in a dashboard.
The dashboard shows where earthquakes occur around the globe, which countries and continents experience the most severe earthquakes, and how earthquake magnitudes change over time.
A video preview of the dashboard is available in dashboard-preview.mp4.
The dashboard consists of four primary visualization components:
- Pie Charts: Visualize the distribution of earthquakes by magnitude category, country, and continent.
- Earthquake Map: Displays the geographical locations where earthquakes occurred.
- Time Series Plot: Shows the average earthquake magnitude by continent over time.

  NOTE: Since the data for these plots is aggregated by continent and country, the average magnitude for the $i$-th continent is defined as the average of the per-country average magnitudes, weighted by the number of earthquakes:

  $$\frac{\sum_{j=1}^{n_i} \bar x_{ij} m_{ij}}{\sum_{j=1}^{n_i} m_{ij}},$$

  where $n_i$ is the number of countries in the $i$-th continent, $\bar x_{ij}$ is the average magnitude for the $j$-th country in the $i$-th continent, and $m_{ij}$ is the number of earthquakes in that country. A small numeric check of this formula appears after the list.
- Bar Plot: Average earthquake magnitude by country.
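As a quick numeric check of the weighted average above, here is a minimal sketch in Python (the column names and values are illustrative, not the project's actual summary-table schema):

```python
import pandas as pd

# Per-country aggregates for a single continent (illustrative values):
# avg_magnitude is x̄_ij and n_earthquakes is m_ij from the formula above.
countries = pd.DataFrame({
    "country": ["Chile", "Peru", "Argentina"],
    "avg_magnitude": [5.1, 4.8, 4.5],
    "n_earthquakes": [120, 80, 40],
})

# Weighted average: sum(x̄_ij * m_ij) / sum(m_ij)
continent_avg = (
    (countries["avg_magnitude"] * countries["n_earthquakes"]).sum()
    / countries["n_earthquakes"].sum()
)
print(round(continent_avg, 2))  # 4.9
```

Weighting by the earthquake count keeps countries with only a handful of events from dominating the continent average.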
- Earthquake Data: Obtained from the USGS Earthquake Hazards API.
- Geolocation Data: Countries and continents are assigned using reverse geolocation with Natural Earth shapefiles (a sketch of this step appears after the notes below).
- Due to the data source, there is a notable concentration of recorded earthquakes in the United States and nearby regions.
- USGS can detect very small earthquakes within the USA, which skews the average magnitude downward for the USA and for North America, so these figures end up differing greatly from other regions.
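The reverse-geolocation step can be sketched with geopandas as follows (a minimal illustration under assumed file and column names, not the project's actual code; NAME and CONTINENT are attributes of the Natural Earth admin-0 countries layer):

```python
import geopandas as gpd
from shapely.geometry import Point

# Earthquake epicenters as WGS84 points (illustrative coordinates).
quakes = gpd.GeoDataFrame(
    {"earthquake_id": ["q1", "q2"]},
    geometry=[Point(-70.6, -33.4), Point(139.7, 35.7)],
    crs="EPSG:4326",
)

# Natural Earth admin-0 country polygons (path is an assumption).
countries = gpd.read_file("ne_110m_admin_0_countries.shp")

# Spatial join: each point inherits the attributes of the polygon containing it.
located = gpd.sjoin(
    quakes,
    countries[["NAME", "CONTINENT", "geometry"]],
    how="left",
    predicate="within",
)
print(located[["earthquake_id", "NAME", "CONTINENT"]])
```

Offshore epicenters fall outside every country polygon and come back as NaN with this join, so they need a fallback rule (for example, nearest country).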
The schema for the earthquake data is defined in earthquakes_schema.json.
The table is partitioned daily and clustered by earthquake_id, continent, and country, as specified in main.tf.
The daily partition on the earthquake events helps build the incremental table for the time series plots.
Clustering by earthquake_id improves query performance when inserting new rows.
Finally, clustering by country and continent improves query performance for the aggregation queries used by the dashboard.
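For reference, the same daily partitioning and clustering can be expressed with the BigQuery Python client (a sketch only; the table is actually defined in main.tf, and the project, dataset, and column names below are assumptions):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Minimal schema covering the partition and clustering columns (assumed names).
schema = [
    bigquery.SchemaField("earthquake_id", "STRING"),
    bigquery.SchemaField("event_time", "TIMESTAMP"),
    bigquery.SchemaField("magnitude", "FLOAT"),
    bigquery.SchemaField("country", "STRING"),
    bigquery.SchemaField("continent", "STRING"),
]

table = bigquery.Table("my-project.earthquakes.events", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_time",  # daily partition on the event timestamp
)
table.clustering_fields = ["earthquake_id", "continent", "country"]

client.create_table(table, exists_ok=True)
```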
The pipeline runs daily on a Google Cloud Compute instance with Fedora CoreOS.
The instance starts at 00:00 UTC and shuts down at 01:00 UTC.
While active, a systemd service defined in cloud-startup starts the required containers and workflows.
The workflows, implemented as Apache Airflow DAGs, are located in the src/dags directory. The main DAGs are:
get_earthquake_data.py (ELT - Extract, Load, Transform):
- Fetches data from the USGS API.
- Stores the raw geojson data in a Google Cloud Storage data lake.
- Processes and loads cleaned data into BigQuery.
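The overall shape of such an ELT DAG, sketched with Airflow's TaskFlow API (the task bodies, bucket name, and parameters are illustrative assumptions, not the project's actual code):

```python
import json

import pendulum
import requests
from airflow.decorators import dag, task

USGS_URL = "https://earthquake.usgs.gov/fdsnws/event/1/query"

# schedule= requires Airflow 2.4+; older versions use schedule_interval=.
@dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def get_earthquake_data():
    @task
    def extract(data_interval_start=None, data_interval_end=None):
        # Fetch one day of events as geojson from the USGS FDSN API.
        resp = requests.get(USGS_URL, params={
            "format": "geojson",
            "starttime": data_interval_start.to_date_string(),
            "endtime": data_interval_end.to_date_string(),
        })
        resp.raise_for_status()
        return resp.json()

    @task
    def load_raw(payload: dict) -> str:
        # Store the raw geojson in the GCS data lake (bucket name assumed).
        from google.cloud import storage

        blob = storage.Client().bucket("earthquake-data-lake").blob("raw/events.geojson")
        blob.upload_from_string(json.dumps(payload))
        return f"gs://earthquake-data-lake/{blob.name}"

    @task
    def load_bigquery(payload: dict):
        # Flatten the geojson features and append them to the partitioned table.
        ...  # cleaning and loading are left out of this sketch

    payload = extract()
    load_raw(payload)
    load_bigquery(payload)

get_earthquake_data()
```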
generate_summary_tables.py (Transform & Aggregate):
- Uses dbt to generate precomputed summary statistics for dashboard visualization.
- The transformation logic is implemented in earthquake_analysis.
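Invoking dbt from such a DAG is commonly done with a BashOperator along these lines (a sketch; the schedule and the project and profiles paths are assumptions):

```python
import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="generate_summary_tables",
    schedule="@daily",
    start_date=pendulum.datetime(2024, 1, 1),
    catchup=False,
):
    # Run the dbt project that materializes the summary tables.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=(
            "dbt run --project-dir /opt/airflow/earthquake_analysis "
            "--profiles-dir /opt/airflow/.dbt"
        ),
    )
```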
The cloud environment is provisioned using Terraform.
Start by creating an .env file from the example:
```
cp .env.example .env
```

Then, initialize and apply the Terraform configuration:

```
terraform init
terraform plan
terraform apply
```

Next, update the .env file with the generated cloud information.
Retrieve the Airflow service account key with:

```
terraform output -raw airflow_gcs_key | base64 -d > /path/to/your/private/key.json
```

Set the GOOGLE_APPLICATION_CREDENTIALS environment variable in .env to the key file path.
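The resulting line in .env then looks like this (the variable name comes from the step above; use whatever path you wrote the key to):

```
GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/private/key.json
```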
Follow the instructions in cloud-setup.md to configure the VM.
The project includes two docker-compose files:
- docker-compose.yaml: Used for the compute engine.
- docker-compose-dev.yaml: Adds local secrets for development.

To start Airflow in production:

```
docker compose up
```

To start Airflow in development:

```
docker compose -f docker-compose.yaml -f docker-compose-dev.yaml up
```

Ensure environment variables are correctly configured and credentials are provided in both environments.
Clone the Superset repository:
```
git submodule update --init --recursive
```

Start Superset:

```
docker compose -f ./superset/docker-compose-image-tag.yml up
```

To create the dashboard:
- Connect Superset to BigQuery (see the connection URI sketch after this list).
- Add the dataset to Superset.
- Enable maps by setting your MAPBOX_API_KEY.
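For the BigQuery connection, the SQLAlchemy URI generally takes the form below (via the sqlalchemy-bigquery driver; the project id is a placeholder):

```
bigquery://your-gcp-project-id
```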
- Apache Airflow: Workflow orchestration.
- Apache Superset: Business Intelligence & data visualization.
- Docker: Containerization.
- Fedora CoreOS: Cloud-optimized OS.
- Google Cloud Platform: Cloud services.
- Mapbox: Geospatial visualization.
- Terraform: Infrastructure as code.
- USGS Earthquake API: Earthquake data source.
- dbt: Data transformation.
