Create Dagster job & assets that produce FERC-EIA ID mapping inputs #4338

@zaneselvans

Overview

Currently, producing the inputs required to map new FERC & EIA Plants and Utilities means running:

make unmapped-ids

This uses pytest to run the full ETL on a single CPU, which takes 5+ hours. Going through pytest is necessary to avoid clobbering the user's existing database / Parquet outputs, since the ETL has to be run in a slightly unusual way: without FK constraints, with a fresh FERC Form 1 DB, etc.

But we could just as easily define a few new assets in Dagster that contain the same outputs, and don't require everything to be re-run -- just the upstream FERC & EIA assets -- and it would probably only take a few minutes.

The resulting job could also be used to check for new unmapped IDs automatically on a regular basis.

This would mean converting a bunch of the code in pudl.glue.ferc1_eia into assets rather than janky scripty things that read directly from Parquet files.

The mapping outputs we currently generate are:

  • missing_plant_id_pudl_in_plants_ferc1.csv
  • missing_plants_in_plants_eia.csv
  • missing_plants_in_plants_ferc1.csv
  • missing_utility_id_eia_in_utilities_eia.csv
  • missing_utility_id_ferc1_dbf_in_raw_dbf.csv
  • missing_utility_id_ferc1_in_plants_ferc1.csv
  • missing_utility_id_ferc1_in_utilities_ferc1_dbf.csv
  • missing_utility_id_ferc1_in_utilities_ferc1_xbrl.csv
  • missing_utility_id_ferc1_xbrl_in_raw_xbrl.csv
  • missing_utility_id_pudl_in_utilities_ferc1.csv
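Judging by the file names, each output appears to list records from one table that lack a corresponding ID or mapping entry. A rough sketch of that kind of check in plain Python, with hypothetical column names and toy rows; the real logic lives in pudl.glue.ferc1_eia and presumably works on DataFrames rather than lists of dicts.

```python
import csv
import io


def find_missing(records, id_column, known_ids):
    """Return the records whose ID is not in the set of known (mapped) IDs."""
    return [row for row in records if row[id_column] not in known_ids]


def to_csv(rows, fieldnames):
    """Serialize rows to CSV text, header first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Toy data with hypothetical column names and values.
utilities_ferc1 = [
    {"utility_id_ferc1": 101, "utility_name_ferc1": "Example Power Co"},
    {"utility_id_ferc1": 102, "utility_name_ferc1": "Sample Electric"},
]
mapped_utility_ids = {101}

missing = find_missing(utilities_ferc1, "utility_id_ferc1", mapped_utility_ids)
csv_text = to_csv(missing, ["utility_id_ferc1", "utility_name_ferc1"])
print(csv_text)
```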

Success Criteria

  • We can generate the above outputs as assets in Dagster.
  • Only the necessary upstream dependencies need to be materialized to do it.
  • The dependencies are explicit in pudl.glue.ferc1_eia instead of being implicit as they are now.

Metadata

    Labels

    • dagster: Issues related to our use of the Dagster orchestrator
    • eia860: Anything having to do with EIA Form 860
    • eia923: Anything having to do with EIA Form 923
    • ferc1: Anything having to do with FERC Form 1
    • glue: PUDL specific structures & metadata. Stuff that connects datasets together.
    • performance: Make PUDL run faster!

    Projects

    Status

    Backlog
