The publication script [publish.py](https://github.com/cal-itp/data-infra/blob/main/warehouse/scripts/publish.py), typically used within the [publish_open_data Airflow workflow](https://o1d2fa0877cf3fb10p-tp.appspot.com/dags/publish_open_data/grid), relies on a [dbt exposure](https://docs.getdbt.com/docs/build/exposures) to determine what to publish - in practice, that exposure is titled `california_open_data`. The tables included in that exposure, their CKAN destinations, and their published descriptions are defined in [_gtfs_schedule_latest.yml](https://github.com/cal-itp/data-infra/blob/main/warehouse/models/mart/gtfs_schedule_latest/_gtfs_schedule_latest.yml) under the `exposures` heading.
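
For orientation, a dbt exposure is just another entry in a dbt YAML file. The snippet below is a minimal, illustrative exposure definition following the dbt documentation linked above; it is not the actual `california_open_data` exposure, whose real definition (including its Cal-ITP-specific `meta` fields) lives in `_gtfs_schedule_latest.yml`.

```yaml
exposures:
  - name: california_open_data
    type: application            # dbt requires one of: dashboard, notebook, analysis, ml, application
    maturity: high
    url: https://data.ca.gov/dataset/cal-itp-gtfs-ingest-pipeline-dataset
    description: Tables published to the California Open Data portal via CKAN.
    depends_on:
      - ref('dim_stops_latest')  # illustrative model reference, not the full list
    owner:
      name: Cal-ITP
```
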
By default, the columns of a table included in the exposure are _not_ published on the portal. This prevents fields that are useful for internal data management but hard for public users to interpret, like `_is_current`, from appearing on the open data portal. Columns meant for publication are explicitly opted in via the dbt `meta` tag `publish.include: true`, which you can see on various columns of the models in the same YAML file where the exposure itself is defined.
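
As a sketch of what that opt-in looks like in the model YAML (the model and column names here are illustrative, not the exact entries in `_gtfs_schedule_latest.yml`):

```yaml
models:
  - name: dim_stops_latest          # illustrative model name
    columns:
      - name: stop_id
        description: Identifies a stop, as published on the open data portal.
        meta:
          publish.include: true     # explicitly opt this column in to publication
      - name: _is_current
        description: Internal data-management flag.
        # no publish.include tag, so this column is excluded from the portal by default
```
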
The publication script does _not_ read from that YAML file directly when publishing - it reads from the manifest file generated by the `dbt_run_and_upload_artifacts` Airflow job. By default, that manifest file is read from the GCS bucket where `run_and_upload.py` stores it during regular daily runs. If you need to make changes to the Cal-ITP GTFS-Ingest Pipeline Dataset in between runs of the daily `dbt_run_and_upload_artifacts` Airflow job (e.g. if there's a time-sensitive bug you're fixing in open data), you'll need to either kick off an ad hoc run of that job in Airflow or generate a new manifest locally, and use that manifest to underpin the `publish.py` run. Take care when generating a manifest locally - you don't want any information in your local dbt project to differ from the production project besides the models you're changing.
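
A rough sketch of that local workflow, assuming the commands shown later in this doc (verify flags against the script's `--help` output before relying on them):

```bash
cd warehouse

# Recompile the dbt project locally; dbt writes a fresh manifest to target/manifest.json.
poetry run dbt compile

# Point the publication script at the local manifest instead of the default manifest in GCS.
poetry run python scripts/publish.py publish-exposure california_open_data \
    --manifest ./target/manifest.json
```
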
## General open data publication process
### Develop data models
Generally, data models should be built in dbt/BigQuery if possible. For example,
we have [latest-only GTFS schedule models](https://github.com/cal-itp/data-infra/tree/main/warehouse/models/mart/gtfs_schedule_latest)
we can use to update and expand the existing [CKAN dataset](https://data.ca.gov/dataset/cal-itp-gtfs-ingest-pipeline-dataset).

California Open Data requires two documentation files for published datasets.

1. `metadata.csv` - one row per resource (i.e. file) to be published
2. `dictionary.csv` - one row per column across all resources

We use dbt exposure-based data publishing to automatically generate
these two files using the main `publish.py` script (specifically the `document-exposure`
subcommand). The documentation from the dbt models' corresponding YAML will be
converted into appropriate CSVs and written out locally. By default, the script will read the latest `manifest.json` in GCS uploaded by the `dbt_run_and_upload_artifacts` Airflow job.

Run this command inside the `warehouse` folder, assuming you have local dbt
artifacts in `target/` from a `dbt run` or `dbt compile`.
```bash
poetry run python scripts/publish.py document-exposure california_open_data
```
Each day, a new version of `manifest.json` is automatically generated for tables in the production warehouse by the `dbt_run_and_upload_artifacts` job in [the `transform_warehouse` DAG](https://o1d2fa0877cf3fb10p-tp.appspot.com/dags/transform_warehouse/grid), and placed inside the `calitp-dbt-artifacts` GCS bucket. If you want the generated documentation to reflect changes that haven't yet run in production, you'll need to generate a new `manifest.json` locally first.
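
If you want to confirm that a recent production manifest exists before relying on the default, listing the bucket works; the exact object layout inside `calitp-dbt-artifacts` is an assumption here, so adjust the wildcard to match what you actually see:

```bash
# List manifest.json objects uploaded to the dbt artifacts bucket (layout assumed).
gsutil ls -l 'gs://calitp-dbt-artifacts/**/manifest.json'
```
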
### Create dataset and metadata
Once you've generated the necessary metadata and dictionary CSVs, you need to get
approval from the Caltrans Geospatial Data Officer (at the time of writing, Chad Baker) for publication. Send the dictionary and metadata CSVs via email, and explain what changes are coming to the dataset - have columns been added or removed from one of the tables, do you have a new table to add, or is there some other change?

For new tables, a CKAN destination will be created with UUIDs corresponding to each
model that will be published. If you are using dbt exposures, you will need to
update the `meta` field [here](https://github.com/cal-itp/data-infra/blob/main/warehouse/models/mart/gtfs_schedule_latest/_gtfs_schedule_latest.yml) to map the dbt models to the appropriate UUIDs.
For example:
```yaml
meta:
  methodology: |
    ...
```
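
The elided portion of that example is where the model-to-UUID mapping lives. Purely as an illustration of the idea (the exact keys `publish.py` expects may differ; mirror the real structure in `_gtfs_schedule_latest.yml`), such a mapping might look like:

```yaml
meta:
  destinations:
    - type: ckan                                       # hypothetical destination entry
      url: https://data.ca.gov
      resources:
        dim_stops_latest:                              # dbt model name (illustrative)
          id: 00000000-0000-0000-0000-000000000000     # placeholder CKAN resource UUID
          description: Latest stops from GTFS schedule feeds.
```
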
With dbt exposure-based publishing, the `publish-exposure` subcommand of `publish.py`
will query BigQuery, write out CSV files, and upload those files to CKAN.
[An Airflow job](https://o1d2fa0877cf3fb10p-tp.appspot.com/dags/publish_open_data/grid) refreshes/updates the data at a specified
frequency, or the publication script can be run manually. By default, the `--no-publish` flag is set, so the run does not upload to CKAN (effectively a dry run against the portal). You can also write to GCS without uploading to CKAN by passing an arbitrary bucket destination via the `--bucket` flag.

The weekly publishing Airflow job supports referencing `gs://` paths for the manifest, which is used to determine which tables and columns to publish; by default, the script will read the latest manifest in GCS uploaded by the `dbt_run_and_upload_artifacts` Airflow job.
You may also choose to run dbt models and/or run the publish script locally; these
operations can be mixed and matched. If you are running `publish.py` locally, you
will need to set `$CALITP_CKAN_GTFS_SCHEDULE_KEY` ahead of time.

By default, the script will upload artifacts to GCS, but will not actually
upload data to CKAN. In addition, the script will upload the metadata and dictionary
files to GCS for eventual sharing with Caltrans employees responsible for the open data portal.
```bash
$ poetry run python scripts/publish.py publish-exposure california_open_data --manifest ./target/manifest.json
reading manifest from ./target/manifest.json
...
```

You can add the `--publish` flag to actually upload artifacts to CKAN after they
are written to GCS. You must be using a production bucket to publish, either
by setting `$CALITP_BUCKET__PUBLISH` or using the `--bucket` flag. In addition,
you may specify a manifest file in GCS if desired.
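
Putting those pieces together, a production publish run might look roughly like the following; the bucket name and manifest path are placeholders, so substitute the real values for your environment:

```bash
# CKAN API key, required for any run that actually publishes.
export CALITP_CKAN_GTFS_SCHEDULE_KEY="<your CKAN API key>"

# Production publish bucket; alternatively pass it with the --bucket flag.
export CALITP_BUCKET__PUBLISH="gs://<production-publish-bucket>"

poetry run python scripts/publish.py publish-exposure california_open_data \
    --manifest "gs://calitp-dbt-artifacts/<path-to>/manifest.json" \
    --publish
```
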