Dbt setup #4011

Merged · 63 commits · Feb 22, 2025

Commits
bf40ffb
Add basic dbt setup
zschira Jan 9, 2025
9aac625
Update to dagster 1.9.7 & grpcio 1.67.1
zaneselvans Jan 10, 2025
415a113
Setup multiple dbt profiles
zschira Jan 10, 2025
ba32bd8
Merge remote-tracking branch 'refs/remotes/origin/dbt_setup' into dbt…
zaneselvans Jan 10, 2025
dc51c8f
Add all vcerare dbt tests
zschira Jan 10, 2025
590b02a
Add more example dbt tests
zschira Jan 13, 2025
63e663a
Merge branch 'dbt_setup' of github.com:catalyst-cooperative/pudl into…
zaneselvans Jan 13, 2025
d428b5d
Merge changes from main and revert to python 3.12
zaneselvans Jan 13, 2025
784cf96
Bump gdal to v3.10.1 bugfix release.
zaneselvans Jan 14, 2025
48a16e1
Merge branch 'main' into dbt_setup
zaneselvans Jan 14, 2025
6f45ba5
Merge branch 'main' into dbt_setup
zaneselvans Jan 15, 2025
ac41a41
Update to dagster 1.9.9
zaneselvans Jan 19, 2025
0ce1648
Merge branch 'main' into dbt_setup
zaneselvans Jan 20, 2025
c19cfd8
Merge branch 'main' into dbt_setup
zaneselvans Jan 20, 2025
6335e94
Reorganize dbt into multiple schema.yml files
zschira Jan 21, 2025
2585eca
Merge branch 'dbt_setup' of github.com:catalyst-cooperative/pudl into…
zschira Jan 21, 2025
e24af8c
Move dbt project to top level of repo
zschira Jan 22, 2025
1ed85b3
Only set parquet path in dbt project once
zschira Jan 30, 2025
e92f5be
Standardize dbt maning scheme
zschira Jan 30, 2025
5de9ebe
Add more detail to README
zschira Jan 30, 2025
da9ae93
Add script to generate dbt scaffolding and row count tests
zschira Feb 5, 2025
7461786
Add documentation for dbt helper script
zschira Feb 5, 2025
0d120c6
Add out_ferc1__yearly_steam_plants_fuel_by_plant_sched402 to yearly r…
zschira Feb 5, 2025
3666360
Add weighted quantile test (broken)
zschira Feb 5, 2025
a3579dc
Change row count test name
zschira Feb 5, 2025
c98219c
Update dbt initialization process
zschira Feb 5, 2025
f9b3fa7
Make dbt helper script work properly with non-yearly partitioned tables
zschira Feb 5, 2025
79e2153
Update dbt readme
zschira Feb 5, 2025
012ba4a
Regenerate ferc dbt schemas
zschira Feb 6, 2025
ff766b3
Merge branch 'main' into dbt_setup
zaneselvans Feb 10, 2025
94267a5
Improve dbt_helper command line usability
zschira Feb 10, 2025
8f660fd
Merge branch 'dbt_setup' of github.com:catalyst-cooperative/pudl into…
zschira Feb 10, 2025
70e6895
Flesh out test migration command
zschira Feb 13, 2025
389c540
Add test migration tutorial
zschira Feb 14, 2025
eb0765a
Merge branch 'main' into dbt_setup
zschira Feb 14, 2025
68040fd
Use duckdb for getting row counts per partition
zschira Feb 18, 2025
58aee27
Merge branch 'main' into 'dbt_setup'
zschira Feb 19, 2025
e6a0884
Add seperate etl full/fast row count checks
zschira Feb 19, 2025
b49ba78
Update quantile tests to work with both fast and full etl
zschira Feb 19, 2025
6bac07e
Update dbt readme
zschira Feb 19, 2025
4f34a58
Clarify dbt model documentation
zschira Feb 19, 2025
55d8160
Add no null cols test
zschira Feb 20, 2025
40448c3
Fix no_null_rows test
zschira Feb 20, 2025
5dfce94
Add tutorial for migrating a bespoke test
zschira Feb 20, 2025
0193c53
Merge branch 'main' into dbt_setup
zaneselvans Feb 21, 2025
16346fa
Avoid regressing changes on main
zaneselvans Feb 21, 2025
f17f14c
Revert to older mamba Docker image since 2.0.6 was yanked.
zaneselvans Feb 21, 2025
7d0523a
Re-lock dependencies since they've changed both here and on main.
zaneselvans Feb 21, 2025
0786bcf
Merge branch 'dbt_setup' of github.com:catalyst-cooperative/pudl into…
zschira Feb 21, 2025
c050114
Add better docstrings/help messages to dbt_helper script
zschira Feb 21, 2025
8ab821e
Update dbt readme
zschira Feb 21, 2025
4f71d1a
Merge branch 'main' into dbt_setup
zaneselvans Feb 21, 2025
3738507
Minor docs fixes
zschira Feb 21, 2025
89689d8
Remove option from dbt helper script
zschira Feb 21, 2025
f5da280
Add example to dbt helper command description
zschira Feb 21, 2025
31313e4
Merge branch 'dbt_setup' of github.com:catalyst-cooperative/pudl into…
zschira Feb 21, 2025
640b750
Update devtools/dbt_helper.py
zschira Feb 21, 2025
c41a3ef
Update devtools/dbt_helper.py
zschira Feb 21, 2025
244b4da
Update devtools/dbt_helper.py
zschira Feb 21, 2025
0244e2a
Update devtools/dbt_helper.py
zschira Feb 21, 2025
30de865
Update descriptions of row count dbt test
zschira Feb 21, 2025
b52577d
Merge branch 'dbt_setup' of github.com:catalyst-cooperative/pudl into…
zschira Feb 21, 2025
14cb119
Reflow README text; use 4 threads by default.
zaneselvans Feb 22, 2025
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
```diff
@@ -29,7 +29,7 @@ repos:
   # Formatters: hooks that re-write Python & documentation files
   ####################################################################################
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.9.6
+    rev: v0.9.7
     hooks:
       - id: ruff
         args: [--fix, --exit-non-zero-on-fix]
```
385 changes: 385 additions & 0 deletions dbt/README.md
> **Review comment (Member):** I think we will want to migrate all of this
> documentation into our development docs. I realize some of it only pertains to the
> migration, but other parts will be more durable and we can edit it down as we go.

@@ -0,0 +1,385 @@
# Overview

This directory contains the initial setup of a `dbt` project meant to write [data
tests](https://docs.getdbt.com/docs/build/data-tests) for PUDL data. The project is
set up with profiles that let you run tests against `nightly` builds, `etl-full`
outputs, or `etl-fast` outputs. The `nightly` profile operates directly on parquet
files in our S3 bucket, while the `etl-full` and `etl-fast` profiles look for parquet
files based on your `PUDL_OUTPUT` environment variable. See the `Usage` section below
for examples using these profiles.
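
A sketch of how these profiles are wired up, assuming the `dbt-duckdb` adapter (the
profile name, paths, and keys here are illustrative; see the project's `profiles.yml`
for the real configuration):

```yaml
pudl_dbt: # hypothetical profile name
  target: etl-full
  outputs:
    nightly:
      type: duckdb
      path: ":memory:" # queries parquet directly from the nightly-build S3 bucket
      threads: 4
    etl-full:
      type: duckdb
      path: "{{ env_var('PUDL_OUTPUT') }}/pudl.duckdb" # local parquet outputs
      threads: 4
    etl-fast:
      type: duckdb
      path: "{{ env_var('PUDL_OUTPUT') }}/pudl.duckdb"
      threads: 4
```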

# Development

To set up the `dbt` project, simply install the PUDL `conda` environment as normal,
then run the following commands from this directory.

> **Review comment (Member):** Add a link to the section of the development docs that
> covers creating the pudl-dev environment.

```
dbt deps
dbt seed
```

## Adding new tables

### Helper script

To add a new table to the project, you must add it as a [dbt
source](https://docs.getdbt.com/docs/build/sources). We've included a helper script at
`devtools/dbt_helper.py` to automate the process.

### Usage

#### `add-tables`

The first command provided by the helper script is `add-tables`. This has already been
run for all existing tables, but may be useful when we add new tables to PUDL.

```
python devtools/dbt_helper.py add-tables {table_name(s)}
```

This will add a file called `dbt/models/{data_source}/{table_name}/schema.yml` for each
table. This YAML file tells `dbt` about the table and its schema. It also applies the
`check_row_counts_per_partition` test, which stores expected row counts per year (or a
single total for time-independent tables) in the CSV files `etl_fast_row_counts.csv`
and `etl_full_row_counts.csv`, and checks those counts against the actual table.
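
As a rough sketch, the generated `schema.yml` looks something like this (the real file
is generated from PUDL's metadata and contains more detail, including arguments to the
row count test):

```yaml
version: 2
sources:
  - name: pudl
    tables:
      - name: {table_name}
        data_tests:
          - check_row_counts_per_partition
        columns:
          - name: report_date
          # ...one entry per column in the table's schema
```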

To see all options for this command run:

```
python devtools/dbt_helper.py add-tables --help
```

#### `migrate-tests`

The second command in the helper script is called `migrate-tests`. This command
migrates existing `vs_bounds` tests. There are many of these tests in our existing
validation framework, and their configuration is relatively verbose, so this script
should make migrating them much faster.

The basic usage of this command looks like:

```
python devtools/dbt_helper.py migrate-tests \
    --test-config-name {config_variable} \
    --table-name {table_name}
```

Here `config_variable` points to a variable defined in `src/pudl/validate.py` that
contains a list of configuration `dict`s for the `vs_bounds` tests, and `table_name`
refers to the table the test should be applied to. See below for an example of using
this command.
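
For context, each `vs_bounds` config variable in `src/pudl/validate.py` is a list of
`dict`s describing one quantile check apiece. Roughly (the field names here are
approximate, so check `validate.py` for the real structure):

```python
mcoe_gas_capacity_factor = [
    {
        "title": "Natural gas capacity factor (middle, 2015+)",
        "query": "fuel_type_code_pudl=='gas' and report_date>='2015-01-01' and capacity_factor!=0.0",
        "low_q": 0.65,
        "low_bound": 0.40,
        "hi_q": 0.65,
        "hi_bound": 0.70,
        "data_col": "capacity_factor",
        "weight_col": "capacity_mw",
    },
    # ...additional quantile checks for the same column...
]
```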

## Adding tests

### Default case

Once a table is included as a `source`, you can add tests for it. You can either add a
generic test directly in `dbt/models/{data_source}/{table_name}/schema.yml`, or create
a SQL file in the `dbt/tests/` directory which references the `source`. When adding SQL
tests like this, construct a query that selects rows indicating a failure: if the query
returns any rows, `dbt` marks the test as failed.
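
For instance, a minimal singular test might look like the following sketch (the file
name and the specific check are hypothetical):

```sql
-- dbt/tests/no_negative_capacity.sql (hypothetical)
-- dbt fails this test if the query returns any rows.
select *
from {{ source('pudl', 'out_eia__yearly_generators') }}
where capacity_mw < 0
```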

The project includes [dbt-expectations](https://github.com/calogica/dbt-expectations)
and [dbt-utils](https://github.com/dbt-labs/dbt-utils) as dependencies. These packages
include useful tests out of the box that can be applied to any tables in the project.
There are several examples in
`dbt/models/vcerare/out_vcerare__hourly_available_capacity_factor/schema.yml` which use
`dbt-expectations`.
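
For example, a `dbt-expectations` range check can be attached to a column in a
`schema.yml` like this (the bounds shown are illustrative):

```yaml
columns:
  - name: capacity_factor
    data_tests:
      - dbt_expectations.expect_column_values_to_be_between:
          min_value: 0.0
          max_value: 1.5
```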

### Modifying a table before test

In some cases you may want to modify the table before applying tests. There are two
ways to accomplish this. First, you can add the table as a `source` as described above,
then create a SQL file in the `tests/` directory like
`tests/{data_source}/{table_name}.sql`. From there you can construct a SQL query that
modifies the table and executes a test on the intermediate result. `dbt` expects a SQL
test to be a query that returns 0 rows for a successful test. See the `dbt` [source
function](https://docs.getdbt.com/reference/dbt-jinja-functions/source) for guidance on
how to reference a `source` from a SQL file.

The second method is to create a [model](https://docs.getdbt.com/docs/build/models)
that produces the intermediate table you want to test. To use this approach, add a SQL
file named `validate_{table_name}.sql` to `dbt/models/{data_source}/{table_name}/` and
define the model producing the intermediate table there. Finally, add the model to the
`schema.yml` file and define tests exactly as you would for a `source` table. See
`models/ferc1/out_ferc1__yearly_steam_plants_fuel_by_plant_sched402` for an example of
this pattern.

Note: when adding a model, it will be stored as a SQL `view` in the file
`{PUDL_OUTPUT}/pudl.duckdb`.
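
As an illustrative sketch of such a model (the table and column names here are
hypothetical, not a real PUDL schema):

```sql
-- dbt/models/{data_source}/{table_name}/validate_{table_name}.sql (hypothetical)
-- Aggregate the source table into the intermediate shape the tests need.
select
    report_year,
    sum(fuel_consumed_mmbtu) as fuel_consumed_mmbtu
from {{ source('pudl', 'some_hypothetical_table') }}
group by report_year
```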

## Running tests

There are a few ways to execute tests. To run all tests with a single command:

```
dbt build
```

This command will first run any models, then execute all tests.

For more fine-grained control, you can use the `--select` option to only run tests
on a specific table.

To run all tests for a single source table:

```
dbt build --select source:pudl.{table_name}
```

To run all tests on a table that uses an intermediate `dbt model`, you can do:

```
dbt build --select {model_name}
```

### Selecting target profile

To select between `nightly`, `etl-full`, and `etl-fast` profiles, append `--target
{target_name}` to any of the previous commands.
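
For example, to run all tests for a single source table against your local fast ETL
outputs:

```
dbt build --select source:pudl.{table_name} --target etl-fast
```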

## Updating a table

### Modify `schema.yml`

Once we have generated an initial `schema.yml` file, we expect this configuration to be
maintained/updated manually in the future. For example, we can add
[data-tests](https://docs.getdbt.com/docs/build/data-tests) as described in the `dbt`
docs, or add/remove columns if the table schema is changed.

In the future we might migrate much of our schema/constraint testing into `dbt` as
well. The easiest approach would be to update the `dbt_helper.py` script to generate
the necessary configuration from our existing metadata structures. If we do, we should
be careful that the script doesn't overwrite any manual configuration changes we make
between now and then.

### Update row counts

When we run the `add-tables` command, it generates a test for each table called
`check_row_counts_per_partition`. This test compares row counts stored in the CSV files
`etl_fast_row_counts.csv` and `etl_full_row_counts.csv` against the row counts found in
the actual table when the test is run. The row counts are partitioned by year, so these
CSV files contain a number of rows for each table (unless the table has no time
dimension).
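
Each row in those files records the expected count for one partition of one table. The
layout is roughly as follows (column names and values shown here are for illustration
only):

```
table_name,partition,row_count
out_eia__yearly_generators,2015,29000
out_eia__yearly_generators,2016,30000
```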

During development, row counts often change for normal and expected reasons like adding
new data or updating transformations. When these changes happen, the tests will fail
unless we update the row counts stored in the CSV files mentioned above. To see where
these tests failed, you can run:

```
dbt build --select source:pudl.{table_name} --store-failures --target={etl-fast|etl-full}
```

The output of this command should show a SQL query you can use to see the partitions
where the row count test failed. To run it, launch the `duckdb` CLI:

```
duckdb {PUDL_OUTPUT}/pudl.duckdb
```

Then copy and paste the query into the `duckdb` CLI (you'll need to add a semicolon to
the end). This should show you the years along with the expected and observed row
counts. If the changes seem reasonable and expected, you can update these files
manually, or you can run the command:

```
python devtools/dbt_helper.py add-tables {table_name} --row-counts-only --clobber --use-local-tables
```

This will tell the helper script to overwrite the existing row counts with new row
counts from the table in your local `PUDL_OUTPUT` stash. If you want to update the
`etl-fast` row counts, simply append the option `--etl-fast` to the above command.

### Test migration example

This section will walk through migrating an existing collection of validation tests. As
an example, we'll use the tests in `test/validate/mcoe_test.py`. This file contains a
collection of tests which use some common methods defined in `pudl/validate.py`. These
tests are representative of many of the types of tests we're looking to migrate.

### `vs_bounds` tests

We'll start with a set of tests which use the `vs_bounds` method defined in
`validate.py`. These check that a specified quantile of a numeric column falls within a
configurable set of bounds. The quantiles and bounds are defined in config `dict`s in
`validate.py`. For example, the test `test_gas_capacity_factor` references
configuration stored in the variable `mcoe_gas_capacity_factor`. Once we know where the
configuration is stored, we need to identify the specific tables used in the test. This
test produces its input tables with the `PudlTabl` method `mcoe_generators`, which
returns the tables `out_eia__monthly_generators` and `out_eia__yearly_generators`. To
migrate these tests we'll first want to generate `dbt` configuration for these tables.

```
python devtools/dbt_helper.py add-tables out_eia__monthly_generators out_eia__yearly_generators
```

This will create the files `dbt/models/output/out_eia__{AGG}_generators/schema.yml`,
which tell `dbt` about the tables and generate row count tests. Note that the exact
format of these files might be modified by our linting tools, but this doesn't change
their functionality.

Next, we want to update the `dbt` configuration to add tests equivalent to the
`vs_bounds` tests. The `dbt_helper` script contains a command that can automatically
generate this `dbt` configuration from the existing Python config. To do this you can
use a command like:

```
python devtools/dbt_helper.py migrate-tests \
--table-name out_eia__AGG_generators \
--test-config-name mcoe_gas_capacity_factor
```

This command will add the necessary tests. There is a chance that formatting could get
slightly mangled in translation, but the generated tests should work out of the box in
the general case.

To see the tests generated by these commands, look at the `schema.yml` files mentioned
above. You can also execute these tests with the following command.

```
dbt test --select source:pudl.out_eia__AGG_generators
```

When we run these tests, there will be a couple of errors, as the generated tests are
not a perfect translation of the old ones. Namely, the
`expect_column_weighted_quantile_values_to_be_between` tests compute a discrete
weighted quantile, while the old Python tests compute continuous quantiles. First, the
following test on the yearly table will fail:

```
- expect_column_weighted_quantile_values_to_be_between:
    quantile: 0.65
    max_value: 0.7
    row_condition: fuel_type_code_pudl='gas' and report_date>=CAST('2015-01-01' AS DATE) and capacity_factor<>0.0
    weight_column: capacity_mw
```

To debug this test, we can use `duckdb` directly. There are many ways to interact with
`duckdb`; here we'll use the CLI (see the
[installation directions](https://duckdb.org/docs/installation/)). To launch the CLI,
navigate to the directory that your `PUDL_OUTPUT` environment variable points to, and
execute:

```
duckdb pudl.duckdb
```

Now we want to execute portions of the compiled SQL produced by `dbt`. To find it, look
at the test failure output, where you should see a line that looks like
`compiled code at {path_to_sql}`. Looking at this file, we'll pull out the section

```sql
WITH CumulativeWeights AS (
    SELECT
        capacity_factor,
        capacity_mw,
        SUM(capacity_mw) OVER (ORDER BY capacity_factor) AS cumulative_weight,
        SUM(capacity_mw) OVER () AS total_weight
    FROM 'https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/out_eia__yearly_generators.parquet'
    WHERE capacity_factor IS NOT NULL OR capacity_mw IS NOT NULL
),
QuantileData AS (
    SELECT
        capacity_factor,
        capacity_mw,
        cumulative_weight,
        total_weight,
        cumulative_weight / total_weight AS cumulative_probability
    FROM CumulativeWeights
)
SELECT capacity_factor
FROM QuantileData
WHERE cumulative_probability >= 0.65
ORDER BY capacity_factor
LIMIT 1
```

> **Review comment (Member, on lines +276 to +298):** How do the row_condition
> criteria get applied such that the query only operates on the right fuel type and
> date range?

This is where the weighted quantile is actually calculated. We can copy this into the
`duckdb` CLI, add a semicolon to the end of the last line and hit `Enter`. This produces
the output:

```
┌─────────────────┐
│ capacity_factor │
│ float │
├─────────────────┤
│ 0.82587963 │
└─────────────────┘
```

This is failing because `max_value` is set to `0.7`. If we change this value to `0.83`,
the test should pass.

### `no_null_cols` tests

The method `test_no_null_cols_mcoe` in `mcoe_test.py` checks a number of tables to
validate that they contain no completely null columns. We've added a custom test to the
`dbt` project that accomplishes this. To use it, simply apply the `not_all_null` test
to the column in question:

```
columns:
  - name: report_date
    data_tests:
      - not_all_null
```

See `dbt/models/_out_eia__monthly_heat_rate_by_unit/schema.yml` for specific examples.

Note: there is also a built-in test in `dbt` called `not_null` that makes sure there
are no null values at all in a column.
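
For reference, a generic test with `not_all_null`'s behavior can be written in a few
lines of `jinja`-templated SQL. This is a hedged sketch, not necessarily the project's
actual macro:

```sql
{% test not_all_null(model, column_name) %}
-- count() ignores nulls, so this returns one (failing) row only when
-- every value in the column is null.
select 1
from {{ model }}
having count({{ column_name }}) = 0
{% endtest %}
```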

### `no_null_rows` tests

The method `test_no_null_rows_mcoe` in `mcoe_test.py` checks a number of tables to
validate that they contain no completely null rows. We've added a custom test to the
`dbt` project that accomplishes this. To use it, simply apply the `no_null_rows` test
to a table:

```
version: 2
sources:
  - name: pudl
    tables:
      - name: _out_eia__monthly_derived_generator_attributes
        data_tests:
          - no_null_rows
```

See `dbt/models/_out_eia__monthly_derived_generator_attributes/schema.yml` for
a specific example.
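
A hedged sketch of how a test like this can be implemented as a generic `dbt` test (the
project's actual macro may differ):

```sql
{% test no_null_rows(model) %}
-- Fail for any row in which every single column is null.
{%- set columns = adapter.get_columns_in_relation(model) %}
select *
from {{ model }}
where
    {% for col in columns -%}
    {{ adapter.quote(col.name) }} is null
    {%- if not loop.last %} and{% endif %}
    {% endfor %}
{% endtest %}
```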

### `bespoke` tests

As the name implies, `bespoke` tests require some degree of custom handling. That being
said, there are common patterns for how and where to add these tests, and some tests
can make use of built-in tests provided by `dbt_expectations`, `dbt_utils`, and `dbt`
itself.

For an example, we'll demonstrate migrating the test `test_idle_capacity`. This test is
applied to the `out_eia__{freq}_generators` tables. Because this test is applied to two
different tables, we'll define it as a
[generic data test](https://docs.getdbt.com/docs/build/data-tests#generic-data-tests),
which allows us to reuse a single test on both tables. If the test were only used on a
single table, we could create a
[singular data test](https://docs.getdbt.com/docs/build/data-tests#singular-data-tests).

To create a new generic test, we add a new file called
`dbt/macros/test_idle_capacity.sql`. Then we develop a SQL query that mimics the
behavior of the original test (see the SQL file for details; it uses `jinja` fairly
extensively to set and access bounds). Once the new test has been developed, we can
apply it to the tables by modifying their `schema.yml` like so:

```
version: 2
sources:
  - name: pudl
    tables:
      - name: out_eia__yearly_generators
        data_tests:
          - test_idle_capacity
```
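
As a rough illustration of the generic-test pattern only (the real logic, including its
`jinja` parameterization, lives in `dbt/macros/test_idle_capacity.sql`; the threshold
and column names below are assumptions):

```sql
{% test test_idle_capacity(model, max_idle_frac=0.15) %}
-- Fail if the share of capacity reporting no generation exceeds max_idle_frac.
with capacity as (
    select
        sum(capacity_mw) as total_capacity_mw,
        sum(case when coalesce(net_generation_mwh, 0) = 0
                 then capacity_mw else 0 end) as idle_capacity_mw
    from {{ model }}
)
select *
from capacity
where idle_capacity_mw / total_capacity_mw > {{ max_idle_frac }}
{% endtest %}
```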