Dbt setup #4011
File: `pyproject.toml`

```diff
@@ -7,7 +7,7 @@ name = "catalystcoop.pudl"
 description = "An open data processing pipeline for US energy data"
 readme = { file = "README.rst", content-type = "text/x-rst" }
 authors = [{ name = "Catalyst Cooperative", email = "[email protected]" }]
-requires-python = ">=3.12,<3.13"
+requires-python = ">=3.10,<3.13"
 dynamic = ["version"]
 license = { file = "LICENSE.txt" }
 dependencies = [
@@ -23,10 +23,12 @@ dependencies = [
     "conda-lock>=2.5.7",
     "coverage>=7.6",
     "dagster>=1.9",
+    "dagster-dbt>=0.25.6,<1",
     "dagster-postgres>=0.24,<1", # Update when dagster-postgres graduates to 1.x
     "dask>=2024",
     "dask-expr", # Required for dask[dataframe]
     "datasette>=0.65",
+    "dbt-duckdb",
     "doc8>=1.1",
     "duckdb>=1.1.3",
     "email-validator>=1.0.3", # pydantic[email]
@@ -83,6 +85,7 @@ dependencies = [
     "sphinxcontrib_googleanalytics>=0.4",
     "sqlalchemy>=2",
     "sqlglot>=25",
+    "s3fs>=2024",
     "timezonefinder>=6.2",
     "universal_pathlib>=0.2",
     "urllib3>=1.26.18",
@@ -343,7 +346,7 @@ nodejs = ">=20"
 pandoc = ">=2"
 pip = ">=24"
 prettier = ">=3.0"
-python = ">=3.12,<3.13"
+python = ">=3.10,<3.13"
 sqlite = ">=3.47"
 zip = ">=3.0"
```
File: `src/pudl/dbt/.gitignore` (new)

```
target/
dbt_packages/
logs/
```
File: `src/pudl/dbt/.user.yml` (new; dbt-generated user id)

```yaml
id: 143b9efc-6985-409a-8029-865947b8f8f1
```
File: `src/pudl/dbt/README.md` (new)
### Overview
This directory contains an initial setup of a `dbt` project meant to write
[data tests](https://docs.getdbt.com/docs/build/data-tests) for PUDL data. The
project is set up with profiles that allow you to run tests against `nightly`
builds, `etl-full`, or `etl-fast` outputs. The `nightly` profile operates
directly on parquet files in our S3 bucket, while both the `etl-full` and `etl-fast`
profiles look for parquet files based on your `PUDL_OUTPUT` environment
variable. See the `Usage` section below for examples using these profiles.
### Development
To set up the `dbt` project, simply install the PUDL `conda` environment as normal,
then run the following command from this directory:

```
dbt deps
```
#### Adding new tables
To add a new table to the project, you must add it as a
[dbt source](https://docs.getdbt.com/docs/build/sources). You can do this by editing
the file `src/pudl/dbt/models/schema.yml`. I've already added the table
`out_vcerare__hourly_available_capacity_factor`, which can be used as a reference.
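For instance, registering an additional table is just another entry in the `tables` list of the `pudl` source (the second table name below is illustrative, not part of this PR):

```yaml
sources:
  - name: pudl
    tables:
      - name: out_vcerare__hourly_available_capacity_factor
      # hypothetical additional table being registered as a source:
      - name: out_eia860__yearly_generators
```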
#### Adding tests
Once a table is included as a `source`, you can add tests for the table. You can
either add a generic test directly in `src/pudl/dbt/models/schema.yml`, or create
a `sql` file in the directory `src/pudl/dbt/tests/` which references the `source`.
When adding `sql` tests like this, you should construct a query that selects rows
indicating a failure. That is, if the query returns any rows, `dbt` will raise a
failure for that test.

> Review: Is it required to have one monster `schema.yml`?
> Reply: Pretty sure you can just shove more YAMLs into …
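As a sketch, a failure-row test in `src/pudl/dbt/tests/` might look like the following (the bounds are illustrative, not the project's actual validation thresholds):

```sql
-- Any row returned by this query is reported by dbt as a test failure.
select county_id_fips, datetime_utc, capacity_factor_solar_pv
from {{ source('pudl', 'out_vcerare__hourly_available_capacity_factor') }}
where capacity_factor_solar_pv < 0.0
   or capacity_factor_solar_pv > 1.02
```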
The project includes [dbt-expectations](https://github.com/calogica/dbt-expectations)
and [dbt-utils](https://github.com/dbt-labs/dbt-utils) as dependencies. These
packages include useful tests out of the box that can be applied to any tables
in the project. There are several examples in `src/pudl/dbt/models/schema.yml` which
use `dbt-expectations`.
#### Modifying a table before testing
In many cases we modify a table slightly before executing a test. There are a couple
of ways to accomplish this. First, when creating a `sql` test in `src/pudl/dbt/tests/`,
you can structure your query to modify the table/column before selecting failure
rows. The second method is to create a [model](https://docs.getdbt.com/docs/build/models)
in `src/pudl/dbt/models/validation`. Any model created here will produce a view
in the `duckdb` database used by `dbt`. You can then reference this model in
`src/pudl/dbt/models/schema.yml` and apply tests to it as you would with `sources`. There's
an example of this pattern which takes the table `out_ferc1__yearly_steam_plants_fuel_by_plant_sched402`,
computes fuel cost per mmbtu in the `sql` model, then applies `dbt_expectations` tests
to this model.
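As a sketch of the pattern (names here are hypothetical; the real `ferc1_fbp_cost_per_mmbtu` example ships with this PR), a derived-column model:

```sql
-- src/pudl/dbt/models/validation/my_derived_metric.sql (hypothetical)
select
    fuel_cost / nullif(fuel_mmbtu, 0) as cost_per_mmbtu
from {{ source('pudl', 'out_ferc1__yearly_steam_plants_fuel_by_plant_sched402') }}
```

can then be tested from `schema.yml` like any other table:

```yaml
models:
  - name: my_derived_metric # hypothetical model name
    columns:
      - name: cost_per_mmbtu
        data_tests:
          - not_null
```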
#### Usage
There are a few ways to execute tests. To run all tests with a single command:

```
dbt build
```

This command will first run any models, then execute all tests.

For more fine-grained control, first run:

```
dbt run
```

This will run all models, thus preparing any `sql` views that will be referenced in
tests. Once you've done this, you can run all tests with:

```
dbt test
```

To run all tests for a single source table:

```
dbt test --select source:pudl.{table_name}
```

To run all tests for a model table:

```
dbt test --select {model_name}
```
##### Selecting target profile
To select between the `nightly`, `etl-full`, and `etl-fast` profiles, append
`--target {target_name}` to any of the previous commands.
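For example, to run the fuel-cost model tests against your local fast ETL outputs:

```
dbt test --select ferc1_fbp_cost_per_mmbtu --target etl-fast
```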
> Review: Right now only the …
> Reply: How will the …
File: `src/pudl/dbt/dbt_project.yml` (new)

```yaml
# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: "pudl_dbt"
version: "1.0.0"

# This setting configures which "profile" dbt uses for this project.
profile: "pudl_dbt"

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
test-paths: ["tests"]
```
File: `src/pudl/dbt/models/schema.yml` (new)

```yaml
version: 2

sources:
  - name: pudl
    meta:
      external_location: |
        {%- if target.name == "nightly" -%} 'https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/{name}.parquet'
        {%- else -%} '{{ env_var('PUDL_OUTPUT') }}/parquet/{name}.parquet'
        {%- endif -%}
```

> Review: Would it also work to point this directly at S3 rather than going through the HTTPS interface?
> Reply: I changed this to …
> Reply: Interestingly, with the …
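A sketch of the S3-native variant raised in that thread (the `s3://` bucket path is inferred from the HTTPS URL, and reading it would rely on the `s3` filesystem configured for the `nightly` target in `profiles.yml` below):

```yaml
      external_location: |
        {%- if target.name == "nightly" -%} 's3://pudl.catalyst.coop/nightly/{name}.parquet'
        {%- else -%} '{{ env_var('PUDL_OUTPUT') }}/parquet/{name}.parquet'
        {%- endif -%}
```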
```yaml
    tables:
      - name: out_eia923__boiler_fuel
      - name: out_eia923__monthly_boiler_fuel
      - name: out_ferc1__yearly_steam_plants_fuel_by_plant_sched402
      - name: out_vcerare__hourly_available_capacity_factor
```

> Review: In a data warehouse with hundreds of tables, would this file be created and managed by hand? Or would there be some rule-based way to generate it, or parts of it, along the lines of what we're doing with the Pandera schema checks right now? For example, the … Or in the case of row counts, is there a clean, non-manual way to update the row counts to reflect whatever the currently observed counts are? Especially if we're trying to regenerate expected row counts for each individual year, filling it all in manually could be pretty tedious and error prone. We've moved toward specifying per-year row counts on the newer assets so that they work transparently in either the fast or full ETL cases, and the asset checks don't need to be aware of which kind of job they're being run in, which seems both more specific and more robust.
> Reply: Looks like the "X column is not null" checks are currently defined in … I think it would be nice to have auto-generated tests like the non-null tests & row counts defined alongside manually added tests. Then all the tests will be defined in one place, except for the tests that we need to write custom Python code for. That seems pretty doable: YAML is easy to work with, and dbt lets us tag tests, so we could easily tag all the auto-generated tests so our generation scripts know to replace them but leave the manually-added tests alone.
> Reply: In addition to the field specific constraints I think we automatically add …
> Reply: It seems totally possible to auto-generate tests, but there are probably many ways to accomplish this, so we should figure out what we want from it. For example, when we talk about auto-generating row count/not null tests, will these be generated once and committed into the repo, or will some/all of them be dynamically generated at runtime? It definitely seems tricky to minimize duplication between …
> Reply: It feels like we may need to clearly define the data tests that are ready to be migrated in a straightforward way, and the things that still need design work, so we can point Margay folks at the stuff that's ready to go and keep thinking about the things that still need some scaffolding?
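A sketch of the regeneration idea discussed in that thread. Nothing here exists in the PR: the `auto_generated` tag convention, paths, and script are all assumptions. The point is that tagged tests can be rewritten from current outputs while hand-written tests are left alone:

```python
# Hypothetical sketch: refresh auto-generated row count tests in schema.yml,
# leaving manually written tests untouched. The "auto_generated" tag
# convention and the paths below are assumptions, not part of this PR.
from pathlib import Path

import duckdb
import yaml

SCHEMA_PATH = Path("src/pudl/dbt/models/schema.yml")
PARQUET_DIR = Path("parquet")  # e.g. $PUDL_OUTPUT/parquet


def is_auto_generated(test) -> bool:
    """True for dict-style tests whose config carries the auto_generated tag."""
    if not isinstance(test, dict):
        return False  # string shorthand tests like "not_null" are hand-written
    config = next(iter(test.values()))
    return isinstance(config, dict) and "auto_generated" in config.get("tags", [])


schema = yaml.safe_load(SCHEMA_PATH.read_text())
for src in schema.get("sources", []):
    for table in src.get("tables", []):
        parquet = PARQUET_DIR / f"{table['name']}.parquet"
        n_rows = duckdb.sql(f"select count(*) from '{parquet}'").fetchone()[0]
        # Drop stale generated tests, keeping everything added by hand...
        tests = [t for t in table.get("data_tests", []) if not is_auto_generated(t)]
        # ...then append a fresh row count test reflecting current outputs.
        tests.append(
            {
                "dbt_expectations.expect_table_row_count_to_equal": {
                    "value": n_rows,
                    "tags": ["auto_generated"],
                }
            }
        )
        table["data_tests"] = tests

SCHEMA_PATH.write_text(yaml.safe_dump(schema, sort_keys=False))
```

Diffing the regenerated file against the committed version would then double as the row-count regression review described above.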
```yaml
        data_tests:
          - dbt_expectations.expect_table_row_count_to_equal:
              value: |
                {%- if target.name == "etl-fast" -%} 27287400
                {%- else -%} 136437000
                {%- endif -%}
```

> Review: Is there a clean way to specify the expected row counts for each year of data (or some other meaningful subset) within a table, as we've started doing for the newer assets in Dagster asset checks, so we don't have to differentiate between fast and full validations, and can identify where the changes are?
> Reply: We'd probably need to create a custom macro for this, but that seems totally doable. Big question is how we want to generate/store all of those tests.
> Reply: The row count tests have functionally become regression tests -- we want to know when they change, and verify that the magnitude and nature of the change is expected based on the code or data that we've changed. Given that there are hundreds of tables (and thousands of table-years) it doesn't seem practical to hand-code all of the expected row counts. It would be nice to have the per table-year row counts stored in (say) YAML somewhere, and be able to generate a new version of that file based on current ETL outputs. Then we could look at the diffs between the old and the new versions of the file when trying to assess changes in the lengths of the outputs.
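The custom macro mentioned above could be a dbt generic test along these lines (the name, signature, and file path are hypothetical):

```sql
-- src/pudl/dbt/tests/generic/row_counts_per_year.sql (hypothetical path)
{% test row_counts_per_year(model, column_name, expected_counts) %}
with actual as (
    select date_part('year', {{ column_name }}) as report_year, count(*) as n_rows
    from {{ model }}
    group by 1
)
-- One failure row per year whose observed count deviates from expectations;
-- years missing from expected_counts always fail, so new data gets noticed.
select report_year, n_rows
from actual
where n_rows != case report_year
    {% for year, expected in expected_counts.items() %}
    when {{ year }} then {{ expected }}
    {% endfor %}
    else -1
end
{% endtest %}
```

A table entry could then invoke it with, e.g., `expected_counts: {2019: 8760, 2020: 8784}` (illustrative numbers), making the test indifferent to fast vs. full targets.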
```yaml
          - dbt_expectations.expect_compound_columns_to_be_unique:
              column_list: ["county_id_fips", "datetime_utc"]
              row_condition: "county_id_fips is not null"
```

> Review: Could be generated based on the PK that's defined for every table?
> Reply: Should be possible. We can also probably come up with a way to generate foreign key checks so we can actually verify foreign keys for tables only in parquet.
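For reference, dbt's built-in `relationships` test can already express a foreign-key check in YAML; a sketch against a hypothetical column/parent-table pair:

```yaml
        columns:
          - name: plant_id_eia # hypothetical foreign key column
            data_tests:
              - relationships:
                  to: source('pudl', 'core_eia__entity_plants')
                  field: plant_id_eia
```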
```yaml
        columns:
          - name: capacity_factor_solar_pv
            data_tests:
              - not_null
              - dbt_expectations.expect_column_max_to_be_between:
                  max_value: 1.02
              - dbt_expectations.expect_column_min_to_be_between:
                  min_value: 0.00
          - name: capacity_factor_offshore_wind
            data_tests:
              - not_null
              - dbt_expectations.expect_column_max_to_be_between:
                  max_value: 1.00
              - dbt_expectations.expect_column_min_to_be_between:
                  min_value: 0.00
          - name: hour_of_year
            data_tests:
              - not_null
              - dbt_expectations.expect_column_max_to_be_between:
                  min_value: 8759
                  max_value: 8761
          - name: datetime_utc
            data_tests:
              - not_null
              - dbt_expectations.expect_column_values_to_not_be_in_set:
                  value_set: ["{{ dbt_date.date(2020, 12, 31) }}"]
          - name: county_or_lake_name
            data_tests:
              - not_null
              - dbt_expectations.expect_column_values_to_not_be_in_set:
                  value_set: ["bedford_city", "clifton_forge_city"]
```
```yaml
models:
  - name: ferc1_fbp_cost_per_mmbtu
    columns:
      - name: gas_cost_per_mmbtu
        data_tests:
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.05
              min_value: 1.5
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.90
              max_value: 15.0
          - dbt_expectations.expect_column_median_to_be_between:
              min_value: 2.0
              max_value: 10.0
```

> Review: I'm guessing these are not using the weighted quantiles?
> Reply: Yeah, these are just basic quantiles. It's not too hard to get a …
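A weighted quantile could be computed in a `sql` model using cumulative weight fractions. A sketch for the weighted median of gas cost, weighting each record by its gas heat content (the weighting scheme is an assumption, not something this PR implements):

```sql
-- Weighted median: the first value whose cumulative weight fraction reaches 0.5.
with costs as (
    select
        gas_fraction_cost * fuel_cost
            / (gas_fraction_mmbtu * fuel_mmbtu) as gas_cost_per_mmbtu,
        gas_fraction_mmbtu * fuel_mmbtu as gas_mmbtu -- weight
    from {{ source('pudl', 'out_ferc1__yearly_steam_plants_fuel_by_plant_sched402') }}
),
ordered as (
    select
        gas_cost_per_mmbtu,
        sum(gas_mmbtu) over (order by gas_cost_per_mmbtu)
            / sum(gas_mmbtu) over () as cum_weight_frac
    from costs
)
select min(gas_cost_per_mmbtu) as weighted_median_gas_cost
from ordered
where cum_weight_frac >= 0.5
```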
```yaml
      - name: oil_cost_per_mmbtu
        data_tests:
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.10
              min_value: 3.5
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.90
              max_value: 25.0
          - dbt_expectations.expect_column_median_to_be_between:
              min_value: 6.5
              max_value: 17.0
      - name: coal_cost_per_mmbtu
        data_tests:
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.10
              min_value: 0.75
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.90
              max_value: 4.0
          - dbt_expectations.expect_column_median_to_be_between:
              min_value: 1.0
              max_value: 2.5
```
> Review: If we do end up needing to define these intermediate tables it seems like we would want to have some kind of clear naming convention for them?
> Reply: Yeah, I think that seems like a good idea. Maybe just use a …
File: `src/pudl/dbt/models/validation/ferc1_fbp_cost_per_mmbtu.sql` (new)

```sql
select
    {% for fuel_type in ["gas", "oil", "coal"] %}
    {{ fuel_type }}_fraction_cost * fuel_cost / ({{ fuel_type }}_fraction_mmbtu * fuel_mmbtu) as {{ fuel_type }}_cost_per_mmbtu,
    {% endfor %}
from {{ source('pudl', 'out_ferc1__yearly_steam_plants_fuel_by_plant_sched402') }}
```
File: `src/pudl/dbt/package-lock.yml` (new)

```yaml
packages:
  - package: calogica/dbt_expectations
    version: 0.10.4
  - package: dbt-labs/dbt_utils
    version: 1.3.0
  - package: calogica/dbt_date
    version: 0.10.1
sha1_hash: 29571f46f50e6393ca399c3db7361c22657f2d6b
```
File: `src/pudl/dbt/packages.yml` (new)

```yaml
packages:
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]
  - package: dbt-labs/dbt_utils
    version: [">=1.3.0", "<1.4.0"]
```

> Review: I see neither of these is available in …
File: `src/pudl/dbt/profiles.yml` (new)

```yaml
pudl_dbt:
  outputs:
    # Define targets for nightly builds, and local ETL full/fast
    # See models/schema.yml for further configuration
    nightly:
      type: duckdb
      path: "{{ env_var('PUDL_OUTPUT') }}/pudl.duckdb"
      filesystems:
        - fs: s3
    etl-full:
      type: duckdb
      path: "{{ env_var('PUDL_OUTPUT') }}/pudl.duckdb"
    etl-fast:
      type: duckdb
      path: "{{ env_var('PUDL_OUTPUT') }}/pudl.duckdb"
  target: nightly
```