Description
Overview
Currently, producing the inputs required to map new FERC & EIA Plants and Utilities means running
make unmapped-ids
This uses pytest to run the full ETL on a single CPU, which takes 5+ hours. Running it this way is necessary to avoid clobbering the user's existing database and Parquet outputs, since the ETL has to be run in a slightly unusual way: without foreign key (FK) constraints, with a fresh FERC Form 1 DB, etc.
But we could just as easily define a few new assets in Dagster that produce the same outputs and only require the upstream FERC & EIA assets to be re-run, rather than everything. That would probably take only a few minutes.
These assets could also be used to check for new unmapped IDs automatically on a regular basis.
This would mean converting much of the code in pudl.glue.ferc1_eia into assets, replacing the janky scripts that read directly from Parquet files.
The mapping outputs we currently generate are:
missing_plant_id_pudl_in_plants_ferc1.csv
missing_plants_in_plants_eia.csv
missing_plants_in_plants_ferc1.csv
missing_utility_id_eia_in_utilities_eia.csv
missing_utility_id_ferc1_dbf_in_raw_dbf.csv
missing_utility_id_ferc1_in_plants_ferc1.csv
missing_utility_id_ferc1_in_utilities_ferc1_dbf.csv
missing_utility_id_ferc1_in_utilities_ferc1_xbrl.csv
missing_utility_id_ferc1_xbrl_in_raw_xbrl.csv
missing_utility_id_pudl_in_utilities_ferc1.csv
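Judging by the file names, each of these outputs appears to list IDs that show up in some source table but are missing from the existing mapping. Assuming that is right, the core logic is an anti-join, sketched below with only the standard library. The function names, column names, and data here are hypothetical and are not taken from pudl.glue.ferc1_eia.

```python
import csv
import io


def find_unmapped_ids(source_rows, mapped_ids, id_column):
    """Return the rows whose ID is not in mapped_ids, sorted by ID."""
    unmapped = [row for row in source_rows if row[id_column] not in mapped_ids]
    return sorted(unmapped, key=lambda row: row[id_column])


def to_csv(rows, fieldnames):
    """Serialize rows (a list of dicts) to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Toy data (assumption): three plants, of which only plant 1 is already mapped.
plants_eia = [
    {"plant_id_eia": 3, "plant_name_eia": "Gamma"},
    {"plant_id_eia": 1, "plant_name_eia": "Alpha"},
    {"plant_id_eia": 2, "plant_name_eia": "Beta"},
]
already_mapped = {1}

missing = find_unmapped_ids(plants_eia, already_mapped, "plant_id_eia")
csv_text = to_csv(missing, ["plant_id_eia", "plant_name_eia"])
```

Here missing holds the rows for plants 2 and 3, and csv_text is the content that would be written out for manual mapping.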
Success Criteria
- We can generate the above outputs as assets in Dagster.
- Only the necessary upstream dependencies need to be materialized to do it.
- The dependencies are explicit in pudl.glue.ferc1_eia instead of being implicit as they are now.