Skip to content

Move ferc1 test_filter_for_freshest_data to asset check #4150

@krivard

Description

@krivard

Overview

validate/ferc1_test.py:test_filter_for_freshest_data() was originally slated to be migrated to dbt as a bespoke "hard" test, however after discussing the structure of the test we decided it wasn't a good fit for dbt but should still be part of our stable of tests. Ideally, we move this test to an asset check.

The test currently checks a selection of ferc1 tables that use the filter_for_freshest_data() function to collect and apply the mixed selection of full-record and diff-record updated filings typical of the FERC1 submission process. One quirk of these updates is that if a value was misfiled in the wrong field, the update always includes the value in its correct place, but may or may not include the removal of the value in its old incorrect place. The function uses two heuristics to correct for this effect. The test attempts to quantify how much the two heuristics disagree, ensure that the nature of the disagreement doesn't fundamentally change over time, and ensure that the total disagreement remains below expected thresholds.

The existing test checks nine tables. Checking a table consists of looking up its raw sources in TABLE_NAME_MAP_FERC1[table_name]["xbrl"] and then running the filter_for_freshest_data() function for each source, using a special mode that includes the additional checks. The asserts for this test are two call layers deep, which is what makes this test a poor fit for dbt but a possible fit for an asset check.

The tables we currently check are:

    [  # some sample wide guys
        "core_ferc1__yearly_sales_by_rate_schedules_sched304",
        "core_ferc1__yearly_pumped_storage_plants_sched408",
        "core_ferc1__yearly_steam_plants_sched402",
        "core_ferc1__yearly_hydroelectric_plants_sched406",
        "core_ferc1__yearly_transmission_lines_sched422",
        # some sample guys found to have higher filtering diffs
        "core_ferc1__yearly_utility_plant_summary_sched200",
        "core_ferc1__yearly_plant_in_service_sched204",
        "core_ferc1__yearly_operating_expenses_sched320",
        "core_ferc1__yearly_income_statements_sched114",
    ]

These tables usually have a single raw source, but sometimes there are many raw tables that go into a single one of the core tables. Currently core_ferc1__yearly_sales_by_rate_schedules_sched304 is the only core table with more than one raw table; it has 9 of them.

Current status

I cannot seem to construct a functional asset check for this.

Things I have tried:

  • For each core table, look up its raw sources. Generate an asset check with the core table as the asset name and additional_ins mapping raw source table names to an AssetIn for that table. Return a function whose signature is (pd.DataFrame, **kwargs) -> AssetCheckResult. Cannot be loaded by dagster, with error message Error loading repository location pudl.ferc_to_sqlite:dagster._core.errors.DagsterInvalidDefinitionError: ‘raw_ferc1_xbrl__sales_of_electricity_by_rate_schedules_account_440_residential_304_instant’ is specified in ‘additional_ins’ but isn’t a parameter.
  • For each core table, look up its raw sources. For each raw source, generate an asset check with the core table as the asset name and additional_ins mapping raw_table to an AssetIn for the raw source table. Yield a function whose signature is (pd.DataFrame, pd.DataFrame) -> AssetCheckResult. Can be loaded by dagster, but no new checks for the core table appear.
  • For each core table, generate an asset check with the core table as the asset name. Within the asset check, look up the raw sources, and fetch the asset from pudl.etl.defs. Return a function whose signature is (pd.DataFrame) -> AssetCheckResult. Cannot be loaded by Python, since pudl.etl.defs is not sufficiently populated at the time when the file for pudl.output.ferc1 is being loaded.

I am using pudl.output.ferc1:ferc1_output_asset_factory() as a model for passing in a collection of AssetIn as **kwargs. I am confused why kwargs works for asset defs but not asset checks.

Things I have not tried:

  • Writing 9 manual asset checks, each with its raw table(s) hard-coded as additional_ins, replicating the information in TABLE_NAME_MAP_FERC1[table_name]["xbrl"]
  • Writing 1 manual asset check for core_ferc1__yearly_sales_by_rate_schedules_sched304 (the only one with more than one raw table at the moment) and generating the other 8 using TABLE_NAME_MAP_FERC1[table_name]["xbrl"]

Metadata

Metadata

Assignees

No one assigned

    Labels

    dagsterIssues related to our use of the Dagster orchestratordata-validationIssues related to checking whether data meets our quality expectations.

    Type

    No type

    Projects

    Status

    Icebox

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions