Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable multi_asset subsetting #2773

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

Enable multi_asset subsetting #2773

wants to merge 5 commits into from

Conversation

bendnorman
Copy link
Member

@bendnorman bendnorman commented Aug 7, 2023

PR Overview

This PR enables multi_asset subsetting so we can run a single asset that has an upstream dependency created by a multi_asset. Without it enabled we get this kind of error:


Error
dagster._core.errors.DagsterInvalidSubsetError: When building job, the AssetsDefinition 'extract_eia923' contains asset keys [AssetKey(['raw_eia923__boiler_fuel']), AssetKey(['raw_eia923__fuel_receipts_costs']), AssetKey(['raw_eia923__generation_fuel']), AssetKey(['raw_eia923__generator']), AssetKey(['raw_eia923__stocks'])], but attempted to select only [AssetKey(['raw_eia923__fuel_receipts_costs'])]. This AssetsDefinition does not support subsetting. Please select all asset keys produced by this asset.

This makes it annoying/impossible to debug single assets using a debugger.

This PR also simplifies our debugging script called devtools/materialize_asset.py

TODOs

  • Write tests for multi_asset subsetting
  • Add docs on how to run an asset and upstream dependencies
  • Create a script that runs a specific asset and upstream dependencies
  • Add docs on how to profile a specific asset

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@bendnorman bendnorman marked this pull request as draft August 7, 2023 21:04
@codecov
Copy link

codecov bot commented Aug 7, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (e13039a) 88.7% compared to head (a5e4921) 88.7%.
Report is 4 commits behind head on dev.

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #2773   +/-   ##
=====================================
  Coverage   88.7%   88.7%           
=====================================
  Files         90      90           
  Lines      10994   11000    +6     
=====================================
+ Hits        9758    9764    +6     
  Misses      1236    1236           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@bendnorman bendnorman marked this pull request as ready for review November 17, 2023 03:19
@bendnorman bendnorman requested a review from jdangerx November 17, 2023 03:21
},
),
],
materialize(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jdangerx @zschira is there a reason why y'all didn't use materialize() when you initially created this script?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope! This is better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though I think this whole script could probably be replaced with dagster asset materialize, like:

$ dagster asset materialize -m pudl.etl --select "raw_eia860__fgd_equipment"

Which works on this branch but fails with the This AssetsDefinition does not support subsetting. error on dev.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True but I'm not sure you can feed the CLI command run config like you can with the materialize() function which is helpful to running subsets of years:

Usage: dagster asset materialize [OPTIONS]

  Execute a run to materialize a selection of assets

Options:
  -a, --attribute TEXT          Attribute that is either a 1) repository or job or 2) a function that returns a
                                repository or job
  --package-name TEXT           Specify Python package where repository or job function lives
  -m, --module-name TEXT        Specify module where dagster definitions reside as top-level symbols/variables and
                                load the module as a code location in the current python environment.
  -f, --python-file PATH        Specify python file where dagster definitions reside as top-level symbols/variables
                                and load the file as a code location in the current python environment.
  -d, --working-directory TEXT  Specify working directory to use when loading the repository or job
  --select TEXT                 Asset selection to target  [required]
  --partition TEXT              Asset partition to target
  -h, --help                    Show this message and exit.

Also, can we use the vs code debugger with the CLI command?

# "raw_emissions_control_eia923",
# "raw_eia923__emissions_control",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we've moved on to a new DOI by now if we want to turn this back on.

Copy link
Member

@jdangerx jdangerx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whee, I'm glad this is finally getting into dev!

Questions I'd like to see addressed before merging (and once you address them, feel free to dismiss this review and merge!)
I think we can just get rid of materialize_asset.py completely. What do you think?

There are still lots of multi_assets that don't support subsetting on this branch - how did you choose what to change and what to not change?

Non-blocking comment:

It would also be good to follow up on Zane's comment, about the new DOI maybe fixing things.

},
),
],
materialize(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though I think this whole script could probably be replaced with dagster asset materialize, like:

$ dagster asset materialize -m pudl.etl --select "raw_eia860__fgd_equipment"

Which works on this branch but fails with the This AssetsDefinition does not support subsetting. error on dev.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can exclude these conda-lock files from rendering by default by marking them as 'generated files': https://docs.github.com/en/repositories/working-with-files/managing-files/customizing-how-changed-files-appear-on-github

return (
Output(output_name=table_name, value=df) for table_name, df in glue_dfs.items()
)
for table_name, df in glue_dfs.items():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking - would a generator expression work here instead?

return (
    Output(output_name=table_name, value=df)
    for table_name, df in glue_dfs.items()
    if table_name in context.selected_output_names
)

? I guess that's not really much better, is it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. I am just working off of the dagster docs examples. IDK if dagster cares if its a return or yield statement.

Base automatically changed from dev to main January 5, 2024 04:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.

3 participants