Airflow documentation (#2833)
* wip

* overall readme tweaks

* remove unused code

* wip docs updates

* add more dag readmes

* more readmes and delete unused DAG

* more overall docs updates

* fix broken ref

* updates per PR review
lauriemerrell authored Jul 26, 2023
1 parent c60043b commit 7e3935c
Showing 62 changed files with 131 additions and 3,407 deletions.
46 changes: 0 additions & 46 deletions .github/workflows/build-sentry-error-loader-image.yml

This file was deleted.

3 changes: 2 additions & 1 deletion .gitignore
@@ -1,9 +1,10 @@
# Vim
.*.sw[po]

# PyCharm, etc.
# PyCharm, VSCode, etc.
.idea/
.DS_Store
.vscode/

# Byte-compiled / optimized / DLL files
__pycache__/
18 changes: 11 additions & 7 deletions airflow/README.md
@@ -1,18 +1,22 @@
# Airflow

The following folder contains the project level directory for all our Apache Airflow DAGs, which are deployed automatically to Google Cloud Composer from the `main` branch.
The following folder contains the project level directory for all our [Apache Airflow](https://airflow.apache.org/) [DAGs](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html). Airflow is an orchestration tool that we use to manage our raw data ingest. Airflow DAG tasks are scheduled at regular intervals to perform data processing steps, for example unzipping raw GTFS zipfiles and writing the contents out to Google Cloud Storage.

Our DAGs are deployed automatically to Google Cloud Composer when a PR is merged into the `main` branch; see the section on deployment below for more information.

## Structure

The DAGs for this project are stored and version controlled in the `dags` folder.
The DAGs for this project are stored and version controlled in the `dags` folder. Each DAG has its own `README` with further information about its specific purpose and considerations. We use [gusty](https://github.com/pipeline-tools/gusty) to simplify DAG management.

Each DAG folder contains a [`METADATA.yml` file](https://github.com/pipeline-tools/gusty#metadata) that contains overall DAG settings, including the DAG's schedule (if any).
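
For reference, a gusty `METADATA.yml` is essentially a set of DAG-level settings expressed as YAML. A hypothetical minimal example (the path, keys, and values below are illustrative assumptions, not copied from any DAG in this repo) might look like:

```console
$ cat airflow/dags/example_dag/METADATA.yml
description: "Example DAG for a hypothetical raw data source"
schedule_interval: "0 10 * * *"  # daily at 10:00 UTC; a DAG without a schedule can set this to null
tags:
  - example
default_args:
  owner: airflow
  retries: 1
```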

The logs are stored locally in the `logs` folder. You should be unable to add files here, but it is gitkeep'ed so that it is available when testing and debugging.

Finally, Airflow plugins can be found in `plugins`; this includes general utility functions as well as custom operator definitions.
Finally, Airflow plugins can be found in `plugins`; this includes general utility functions as well as custom [operator](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/operators.html) definitions.

## Developing Locally

This project is developed using docker and docker-compose. Before getting started, please make sure you have installed both on your system.
This project is developed using Docker and docker-compose. Before getting started, please make sure you have [installed Docker on your system](https://docs.docker.com/get-docker/).

First, if you're on Linux, you'll need to make sure that the UID and GID of the container match; to do so, run

@@ -22,7 +26,7 @@ mkdir ./dags ./logs ./plugins
echo -e "AIRFLOW_UID=$(id -u)\nAIRFLOW_GID=0" > .env
```

Second, ensure you have a default authentication file, by [installing google sdk](https://cloud.google.com/sdk/docs/install) and running
Second, ensure you have a default authentication file, by [installing Google SDK](https://cloud.google.com/sdk/docs/install) and running

```console
unset GOOGLE_APPLICATION_CREDENTIALS
@@ -34,7 +38,7 @@ gcloud init
# gcloud auth application-default login
```

Next, run the initial database migration which also creates a default user named `airflow.
Next, run the initial database migration which also creates a default user named `airflow`.
```shell
docker-compose run airflow db init
```
@@ -61,4 +65,4 @@ Additional reading about this setup can be found on the [Airflow Docs](https://a

## Deploying to production

We have a [GitHub Action](../.github/workflows/deploy_airflow_dags.yml) defined that updates requirements and syncs the [dags](./airflow/dags) and [plugins](./airflow/plugins) directories to the bucket which Composer watches for code/data to parse. As of 2023-04-11, this bucket is `us-west2-calitp-airflow2-pr-171e4e47-bucket`. Our production Composer instance is called [calitp-airflow2-prod](https://console.cloud.google.com/composer/environments/detail/us-west2/calitp-airflow2-prod/monitoring); its configuration (including worker count, Airflow config overrides, and environment variables) is manually managed through the web console.
We have a [GitHub Action](../.github/workflows/deploy_airflow_dags.yml) that runs when PRs touching this directory merge to the `main` branch. The GitHub Action updates requirements and syncs the [DAGs](./airflow/dags) and [plugins](./airflow/plugins) directories to the bucket that Composer watches for code/data to parse. As of 2023-07-18, this bucket is `us-west2-calitp-airflow2-pr-171e4e47-bucket`. Our production Composer instance is called [calitp-airflow2-prod](https://console.cloud.google.com/composer/environments/detail/us-west2/calitp-airflow2-prod/monitoring); its configuration (including worker count, Airflow config overrides, and environment variables) is manually managed through the web console.
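
Conceptually, the sync performed by that GitHub Action resembles an `rsync` of the local `dags` and `plugins` directories into the Composer bucket. A rough manual equivalent is sketched below (for illustration only; the flags and destination paths are assumptions, and the workflow file itself is the source of truth):

```console
# Illustrative sketch only; the deploy workflow, not these commands, is how production is updated.
gsutil -m rsync -r airflow/dags gs://us-west2-calitp-airflow2-pr-171e4e47-bucket/dags
gsutil -m rsync -r airflow/plugins gs://us-west2-calitp-airflow2-pr-171e4e47-bucket/plugins
```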
6 changes: 6 additions & 0 deletions airflow/dags/airtable_loader_v2/README.md
@@ -0,0 +1,6 @@
# `airtable_loader_v2`

Type: [Now / Scheduled](https://docs.calitp.org/data-infra/airflow/dags-maintenance.html)

This DAG orchestrates raw data ingest from the Cal-ITP Airtable bases.
To run these DAGs locally, you will need an Airtable API key stored in a `CALITP_AIRTABLE_API_KEY` environment variable. Contact the internal Airtable owner to receive an API key.
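
For example, one way to supply the key for a local run (assuming the docker-compose setup described in `airflow/README.md`; the key value is a placeholder, the DAG ID is assumed to match the folder name, and the exact Airflow CLI invocation may vary):

```console
# Hypothetical local invocation; replace the placeholder with a key from the Airtable owner.
export CALITP_AIRTABLE_API_KEY="key_xxxxxxxxxxxx"
docker-compose run -e CALITP_AIRTABLE_API_KEY airflow dags test airtable_loader_v2 2023-07-18
```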
8 changes: 0 additions & 8 deletions airflow/dags/amplitude_benefits/METADATA.yml

This file was deleted.

6 changes: 0 additions & 6 deletions airflow/dags/amplitude_benefits/api_to_jsonl.yml

This file was deleted.

21 changes: 0 additions & 21 deletions airflow/dags/check_data_freshness/METADATA.yml

This file was deleted.

48 changes: 0 additions & 48 deletions airflow/dags/check_data_freshness/dbt_source_freshness.yml

This file was deleted.

5 changes: 5 additions & 0 deletions airflow/dags/create_external_tables/README.md
@@ -0,0 +1,5 @@
# `create_external_tables`

Type: [Now / Scheduled](https://docs.calitp.org/data-infra/airflow/dags-maintenance.html)

This DAG orchestrates the creation of [external tables](https://cloud.google.com/bigquery/docs/external-data-sources), which serve as the interface between our raw / parsed data (stored in Google Cloud Storage) and our data warehouse (BigQuery). Most of our external tables are [hive-partitioned](https://cloud.google.com/bigquery/docs/hive-partitioned-loads-gcs).
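
As a sketch of what one of these tables involves, a hive-partitioned external table could be defined with the `bq` CLI roughly as follows (bucket, dataset, and table names are placeholders; the DAG defines the real tables through its own task configuration, not ad-hoc commands):

```console
# Hypothetical example only; all names and paths are placeholders.
bq mkdef \
  --source_format=NEWLINE_DELIMITED_JSON \
  --hive_partitioning_mode=AUTO \
  --hive_partitioning_source_uri_prefix=gs://example-bucket/parsed/example_feed \
  "gs://example-bucket/parsed/example_feed/*" > example_table_def.json
bq mk --external_table_definition=example_table_def.json example_dataset.example_table
```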
122 changes: 0 additions & 122 deletions airflow/dags/create_external_tables/legacy_benefits_events.yml

This file was deleted.

