-
-
Notifications
You must be signed in to change notification settings - Fork 129
Description
Overview
validate/ferc1_test.py:test_filter_for_freshest_data() was originally slated to be migrated to dbt as a bespoke "hard" test, however after discussing the structure of the test we decided it wasn't a good fit for dbt but should still be part of our stable of tests. Ideally, we move this test to an asset check.
The test currently checks a selection of ferc1 tables that use the filter_for_freshest_data()
function to collect and apply the mixed selection of full-record and diff-record updated filings typical of the FERC1 submission process. One quirk of these updates is that if a value was misfiled in the wrong field, the update always includes the value in its correct place, but may or may not include the removal of the value in its old incorrect place. The function uses two heuristics to correct for this effect. The test attempts to quantify how much the two heuristics disagree, ensure that the nature of the disagreement doesn't fundamentally change over time, and ensure that the total disagreement remains below expected thresholds.
The existing test checks nine tables. Checking a table consists of looking up its raw sources in TABLE_NAME_MAP_FERC1[table_name]["xbrl"]
and then running the filter_for_freshest_data()
function for each source, using a special mode that includes the additional checks. The asserts for this test are two call layers deep, which is what makes this test a poor fit for dbt but a possible fit for an asset check.
The tables we currently check are:
[ # some sample wide guys
"core_ferc1__yearly_sales_by_rate_schedules_sched304",
"core_ferc1__yearly_pumped_storage_plants_sched408",
"core_ferc1__yearly_steam_plants_sched402",
"core_ferc1__yearly_hydroelectric_plants_sched406",
"core_ferc1__yearly_transmission_lines_sched422",
# some sample guys found to have higher filtering diffs
"core_ferc1__yearly_utility_plant_summary_sched200",
"core_ferc1__yearly_plant_in_service_sched204",
"core_ferc1__yearly_operating_expenses_sched320",
"core_ferc1__yearly_income_statements_sched114",
]
These tables usually have a single raw source, but sometimes there are many raw tables that go into a single one of the core tables. Currently core_ferc1__yearly_sales_by_rate_schedules_sched304
is the only core table with more than one raw table; it has 9 of them.
Current status
I cannot seem to construct a functional asset check for this.
Things I have tried:
- For each core table, look up its raw sources. Generate an asset check with the core table as the asset name and
additional_ins
mapping raw source table names to anAssetIn
for that table. Return a function whose signature is(pd.DataFrame, **kwargs) -> AssetCheckResult
. Cannot be loaded by dagster, with error messageError loading repository location pudl.ferc_to_sqlite:dagster._core.errors.DagsterInvalidDefinitionError: ‘raw_ferc1_xbrl__sales_of_electricity_by_rate_schedules_account_440_residential_304_instant’ is specified in ‘additional_ins’ but isn’t a parameter.
- For each core table, look up its raw sources. For each raw source, generate an asset check with the core table as the asset name and
additional_ins
mappingraw_table
to anAssetIn
for the raw source table. Yield a function whose signature is(pd.DataFrame, pd.DataFrame) -> AssetCheckResult
. Can be loaded by dagster, but no new checks for the core table appear. - For each core table, generate an asset check with the core table as the asset name. Within the asset check, look up the raw sources, and fetch the asset from
pudl.etl.defs
. Return a function whose signature is(pd.DataFrame) -> AssetCheckResult
. Cannot be loaded by Python, sincepudl.etl.defs
is not sufficiently populated at the time when the file forpudl.output.ferc1
is being loaded.
I am using pudl.output.ferc1:ferc1_output_asset_factory()
as a model for passing in a collection of AssetIn
as **kwargs
. I am confused why kwargs works for asset defs but not asset checks.
Things I have not tried:
- Writing 9 manual asset checks, each with its raw table(s) hard-coded as
additional_ins
, replicating the information inTABLE_NAME_MAP_FERC1[table_name]["xbrl"]
- Writing 1 manual asset check for
core_ferc1__yearly_sales_by_rate_schedules_sched304
(the only one with more than one raw table at the moment) and generating the other 8 usingTABLE_NAME_MAP_FERC1[table_name]["xbrl"]
Metadata
Metadata
Assignees
Labels
Type
Projects
Status