-
-
Notifications
You must be signed in to change notification settings - Fork 118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable multi_asset
subsetting
#2773
base: main
Are you sure you want to change the base?
Changes from all commits
a091e24
718c186
ee19172
eec3db7
a5e4921
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,19 +9,14 @@ | |
logger = pudl.logging_helpers.get_logger(__name__) | ||
|
||
|
||
# TODO (bendnorman): Currently loading all glue tables. Could potentially allow users | ||
# to load subsets of the glue tables, see: https://docs.dagster.io/concepts/assets/multi-assets#subsetting-multi-assets | ||
# Could split out different types of glue tables into different assets. For example the cross walk table could be a separate asset | ||
# that way dagster doesn't think all glue tables depend on generators_entity_eia, boilers_entity_eia. | ||
|
||
|
||
@multi_asset( | ||
outs={ | ||
table_name: AssetOut(io_manager_key="pudl_sqlite_io_manager") | ||
table_name: AssetOut(io_manager_key="pudl_sqlite_io_manager", is_required=False) | ||
for table_name in Package.get_etl_group_tables("glue") | ||
# do not load epacamd_eia glue assets bc they are stand-alone assets below. | ||
if "epacamd_eia" not in table_name | ||
}, | ||
can_subset=True, | ||
required_resource_keys={"datastore", "dataset_settings"}, | ||
) | ||
def create_glue_tables(context): | ||
|
@@ -46,9 +41,9 @@ def create_glue_tables(context): | |
# Ensure they are sorted so they match up with the asset outs | ||
glue_dfs = dict(sorted(glue_dfs.items())) | ||
|
||
return ( | ||
Output(output_name=table_name, value=df) for table_name, df in glue_dfs.items() | ||
) | ||
for table_name, df in glue_dfs.items(): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Non-blocking - would a generator expression work here instead? return (
Output(output_name=table_name, value=df)
for table_name, df in glue_dfs.items()
if table_name in context.selected_output_names
) ? I guess that's not really much better, is it. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure. I am just working off of the dagster docs examples. IDK if dagster cares if its a return or yield statement. |
||
if table_name in context.selected_output_names: | ||
yield Output(output_name=table_name, value=df) | ||
|
||
|
||
##################### | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -99,7 +99,7 @@ def get_dtypes(page, **partition): | |
|
||
|
||
# TODO (bendnorman): Add this information to the metadata | ||
eia_raw_table_names = ( | ||
raw_table_names = ( | ||
"raw_eia923__boiler_fuel", | ||
"raw_eia923__fuel_receipts_costs", | ||
"raw_eia923__generation_fuel", | ||
|
@@ -109,7 +109,7 @@ def get_dtypes(page, **partition): | |
# from being extracted currently. When we update to a new DOI this problem will | ||
# probably fix itself. See comments on this issue: | ||
# https://github.com/catalyst-cooperative/pudl/issues/2448 | ||
# "raw_emissions_control_eia923", | ||
# "raw_eia923__emissions_control", | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we've moved on to a new DOI by now if we want to turn this back on. |
||
) | ||
|
||
|
||
|
@@ -118,7 +118,11 @@ def get_dtypes(page, **partition): | |
|
||
# TODO (bendnorman): Figure out type hint for context keyword and mutli_asset return | ||
@multi_asset( | ||
outs={table_name: AssetOut() for table_name in sorted(eia_raw_table_names)}, | ||
outs={ | ||
table_name: AssetOut(is_required=False) | ||
for table_name in sorted(raw_table_names) | ||
}, | ||
can_subset=True, | ||
required_resource_keys={"datastore", "dataset_settings"}, | ||
) | ||
def extract_eia923(context, eia923_raw_dfs): | ||
|
@@ -137,12 +141,12 @@ def extract_eia923(context, eia923_raw_dfs): | |
|
||
eia923_raw_dfs = dict(sorted(eia923_raw_dfs.items())) | ||
|
||
return ( | ||
Output(output_name=table_name, value=df) | ||
for table_name, df in eia923_raw_dfs.items() | ||
for table_name, df in eia923_raw_dfs.items(): | ||
# There's an issue with the EIA-923 archive for 2018 which prevents this table | ||
# from being extracted currently. When we update to a new DOI this problem will | ||
# probably fix itself. See comments on this issue: | ||
# https://github.com/catalyst-cooperative/pudl/issues/2448 | ||
if table_name != "raw_eia923__emissions_control" | ||
) | ||
if (table_name in context.selected_output_names) and ( | ||
table_name != "raw_eia923__emissions_control" | ||
): | ||
yield Output(output_name=table_name, value=df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jdangerx @zschira is there a reason why y'all didn't use
materialize()
when you initially created this script?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope! This is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though I think this whole script could probably be replaced with
dagster asset materialize
, like:$ dagster asset materialize -m pudl.etl --select "raw_eia860__fgd_equipment"
Which works on this branch but fails with the
This AssetsDefinition does not support subsetting.
error ondev
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True but I'm not sure you can feed the CLI command run config like you can with the
materialize()
function which is helpful to running subsets of years:Also, can we use the vs code debugger with the CLI command?