Add instructions to diagnose external tables in GCS deprecation #2863

Merged
1 change: 1 addition & 0 deletions runbooks/data/deprecation-stored-files.md
@@ -10,6 +10,7 @@ Occasionally, we want to assess our Google Cloud Storage buckets for outdatednes

3. For the non-test buckets that constitute the deprecation candidate list, the path forward relies on investigation of internal project configuration and conversation with data stakeholders. Some data may need to be kept in place because it is frequently accessed despite being infrequently updated (NTD data or static website assets, for instance). Some data may need to be cold-stored rather than deleted outright because it represents raw data collected once that can't otherwise be recovered, or to conform with regulatory requirements, or to provide a window for future research access. Each of the following steps should be taken to determine which path to take:
* Search the source code of the [data-infra repository](https://github.com/cal-itp/data-infra), the [data-analyses repository](https://github.com/cal-itp/data-analyses), and the [reports repository](https://github.com/cal-itp/reports) for the name of the bucket, and check the environment variables [set in Cloud Composer](https://console.cloud.google.com/composer/environments/detail/us-west2/calitp-airflow2-prod/variables?project=cal-itp-data-infra) for it as well. If you find it referenced anywhere, investigate whether the reference is in active use. For an extra step of safety, you could also search the entire Cal-ITP GitHub organization's source code via GitHub's web user interface.
* Note: [External tables](https://cloud.google.com/bigquery/docs/external-tables) in BigQuery, created from GCS objects via [our `create_external_tables` DAG](https://o1d2fa0877cf3fb10p-tp.appspot.com/dags/create_external_tables/grid) in Airflow, do not generate read or write activity that shows up in the GCS request count metric we used in step one. If you find a reference to a deprecation candidate bucket within the [`create_external_tables` subfolder](https://github.com/cal-itp/data-infra/tree/main/airflow/dags/create_external_tables) of the data-infra repository, check the [BigQuery audit logs](https://cloud.google.com/bigquery/docs/reference/auditlogs/#data_access_data_access) to see whether anyone is querying the external tables that rely on that bucket. If they are, remove the bucket from the deprecation list.
* Post in `#data-warehouse-devs` and any other relevant channels in Slack (this may vary by domain; for example, if investigating a bucket related to GTFS quality, you may post in `#gtfs-quality`). Ask whether anybody knows of ongoing use of the bucket(s) in question. If there are identifiable stakeholders who aren't active in Slack, like external research partners, reach out to them directly.
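
The repository search described above could be scripted roughly as follows. This is a minimal sketch, assuming you already have a local checkout of the repository being searched; the function name `find_bucket_references` and the bucket name in the usage note are hypothetical. It uses plain substring matching, so a hit still needs a human to judge whether the reference is in active use. Hits under the `create_external_tables` subfolder are flagged because, per the note above, those are the ones that call for a BigQuery audit log check.

```python
from pathlib import Path

# Subfolder named in the note above. Hits here mean BigQuery audit logs
# should be checked, since the GCS request count metric won't show the usage.
EXTERNAL_TABLES_DIR = "airflow/dags/create_external_tables"


def find_bucket_references(repo_root, bucket_name):
    """Return (relative_path, line_number, in_external_tables_dir) per hit.

    Hypothetical helper: scans every readable text file under repo_root
    for a plain substring match on bucket_name.
    """
    root = Path(repo_root)
    hits = []
    for path in sorted(root.rglob("*")):
        # Skip directories and anything inside the .git metadata folder.
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # binary or unreadable file
        rel = path.relative_to(root).as_posix()
        in_external_tables = rel.startswith(EXTERNAL_TABLES_DIR + "/")
        for number, line in enumerate(text.splitlines(), start=1):
            if bucket_name in line:
                hits.append((rel, number, in_external_tables))
    return hits
```

For example, `find_bucket_references("path/to/data-infra", "example-bucket")` would list every file and line mentioning the placeholder name `example-bucket`, with the third tuple element set to `True` for references inside the external tables subfolder.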

4. For each bucket that hasn't been removed from the deprecation list via the investigation in the last step, create a new bucket named "[EXISTING BUCKET NAME]-deprecated" and follow [these steps](https://cloud.google.com/storage/docs/moving-buckets#permissions-console) to transfer the original bucket contents into the newly created bucket(s). Delete the original bucket(s), inform stakeholders about the newly deprecated buckets via `#data-warehouse-devs` and other relevant channels, and monitor for two weeks for any new code or process breakages related to the deletion of the old buckets.
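
Before creating the replacement buckets in step 4, it may help to confirm that each "-deprecated" name is still a legal bucket name. The sketch below is an assumption-laden illustration, not part of the runbook: the helper `deprecated_bucket_name` is hypothetical, and it only checks the Cloud Storage length limits (63 characters per dot-separated component, 222 characters overall), not the other naming rules.

```python
DEPRECATED_SUFFIX = "-deprecated"
MAX_COMPONENT_LENGTH = 63   # limit for a name, or for each dot-separated part
MAX_TOTAL_LENGTH = 222      # overall limit for names that contain dots


def deprecated_bucket_name(existing_name):
    """Return the '<existing name>-deprecated' bucket name from step 4.

    Hypothetical helper: raises ValueError if appending the suffix would
    push the name past the Cloud Storage length limits.
    """
    new_name = existing_name + DEPRECATED_SUFFIX
    components = new_name.split(".")
    too_long = len(new_name) > MAX_TOTAL_LENGTH or any(
        len(component) > MAX_COMPONENT_LENGTH for component in components
    )
    if too_long:
        raise ValueError(f"{new_name!r} exceeds the bucket name length limits")
    return new_name
```

A bucket whose name is already close to the limit would raise here, which is a cue to agree on a shorter naming scheme with stakeholders before transferring anything.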