### Description
Currently, the `pudl_etl` and `ferc_to_sqlite` CLI commands use the `dagster.build_reconstructable_job` method to execute multi-process dagster jobs. `build_reconstructable_job` is an experimental method and is confusing to work with. We can likely replace our `pudl_etl` and `ferc_to_sqlite` CLI command code entirely by creating preconfigured jobs and executing them with the dagster CLI:

`dagster job execute <name of job>`

The jobs will likely be the ones we currently have, plus a `nightly_build_etl_full` and a `nightly_build_ferc_to_sqlite_full` job. If we move to this, we'll need to define the preconfigured jobs in Python. We mostly do this now, with the exception of the args people pass to the `pudl_etl` and `ferc_to_sqlite` CLI commands.
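
If a thin Python wrapper is still wanted while the dagster CLI does the actual work, the command could be assembled as below. This is only a sketch: it assumes the jobs are loadable from a module named `pudl.etl`, it relies on the `-m` (module) and `-j` (job) options of `dagster job execute`, and the helper name `build_dagster_command` is hypothetical.

```python
import shlex


def build_dagster_command(job_name: str, module: str = "pudl.etl") -> list[str]:
    """Build the argv list for running a preconfigured dagster job.

    The default module name is an assumption; point it at wherever the
    job definitions actually live.
    """
    if not job_name:
        raise ValueError("job_name must be a non-empty string")
    return ["dagster", "job", "execute", "-m", module, "-j", job_name]


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch has no side effects.
    print(shlex.join(build_dagster_command("etl_fast")))
```

The resulting list can be handed to `subprocess.run(..., check=True)` by a nightly build script, which keeps the job definitions as the single source of configuration.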

How can we incorporate the `pudl_etl` arguments into the dagster configuration system? The args that aren't currently included are `loglevel` and `logfile`. The same args apply to `ferc_to_sqlite`, with the addition of the `dataset_only` arg.
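
One possible answer, sketched under assumptions: treat the logging options as configuration for a resource, following the same `resources` / `<name>` / `config` nesting that `dataset_settings` uses in the job definitions. The `logging_settings` resource name and the `build_logging_config` helper are hypothetical; neither exists in PUDL or dagster today.

```python
from typing import Optional

# Standard Python logging level names accepted by this sketch.
VALID_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


def build_logging_config(loglevel: str = "INFO", logfile: Optional[str] = None) -> dict:
    """Return a run-config fragment for a hypothetical `logging_settings` resource."""
    level = loglevel.upper()
    if level not in VALID_LEVELS:
        raise ValueError(f"Unknown log level: {loglevel!r}")
    settings = {"loglevel": level}
    # Only include the log file when one was requested.
    if logfile is not None:
        settings["logfile"] = logfile
    return {"resources": {"logging_settings": {"config": settings}}}
```

A fragment like this could be merged into a job's config at definition time, or written to a YAML file and passed to the dagster CLI, so that `loglevel` and `logfile` no longer need dedicated command-line flags.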

How do we want to generate the configurations? 90% of our config is generated in `pudl.etl.__init__.py` via a few strategies:

- Loading the default configuration of dagster resources

  Lines 262 to 266 in 548401f

  ```python
  define_asset_job(
      name="etl_full",
      description="This job executes all years of all assets.",
      config=default_config,
  ),
  ```

- Using default configuration + asset selection

  Lines 267 to 272 in 548401f

  ```python
  define_asset_job(
      name="etl_full_no_cems",
      selection=create_non_cems_selection(default_assets),
      description="This job executes all years of all assets except the "
      "core_epacems__hourly_emissions asset and all assets downstream.",
  ),
  ```

- Loading configuration from a yaml file

  Lines 273 to 284 in 548401f

  ```python
  define_asset_job(
      name="etl_fast",
      config=default_config
      | {
          "resources": {
              "dataset_settings": {
                  "config": load_dataset_settings_from_file("etl_fast")
              }
          }
      },
      description="This job executes the most recent year of each asset.",
  ),
  ```
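
One caveat worth checking with the `default_config | {...}` pattern used for `etl_fast`: the dict `|` operator merges only top-level keys, so if `default_config` ever gained its own `resources` entry, the override would replace it wholesale. A minimal sketch of a recursive merge follows; the helper name `deep_merge` is hypothetical and the config values are made up for illustration.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into a copy of `base`."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge their contents instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Illustrative values only; not the real PUDL configuration.
base = {"resources": {"io_manager": {"config": {"path": "out"}}}}
override = {"resources": {"dataset_settings": {"config": {"years": [2022]}}}}

shallow = base | override  # the top-level "resources" value is replaced entirely
deep = deep_merge(base, override)  # both resource entries survive
```

Whether this matters depends on what `default_config` actually contains; if it only ever sets non-overlapping top-level keys, the plain `|` merge is sufficient.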

We also have a `default_config` dictionary that should be shared by all jobs:

Lines 211 to 222 in 548401f

```python
default_tag_concurrency_limits = [
    {
        "key": "memory-use",
        "value": "high",
        "limit": 4,
    },
]
default_config = pudl.helpers.get_dagster_execution_config(
    tag_concurrency_limits=default_tag_concurrency_limits
)
default_config |= pudl.analysis.ml_tools.get_ml_models_config()
```

### Tasks
- [ ] Figure out how to specify `loglevel`, `logfile` and `dataset_only` args in dagster config system
- [ ] Create nightly build jobs
- [ ] Test out the dagster CLI with our jobs
- [ ] Rip out our cli code