-
-
Notifications
You must be signed in to change notification settings - Fork 128
Enforce more specific PUDL resource name constraints. #4438
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the changed lines are fixing names of test tables, both in the unit tests and in the docstring examples / tests in classes.py
Substantiative changes in:
classes.py
(not the docstrings)dbt_helper.py
sources.py
min_length=1, strict=True, pattern=r"^[a-z_][a-z0-9_]*(_[a-z0-9]+)*$" | ||
min_length=1, | ||
strict=True, | ||
pattern=r"^[a-z_][a-z0-9_]*(_[a-z0-9]+)*$", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No changes here -- just forced args onto individual lines.
PudlResourceName = Annotated[ | ||
str, | ||
StringConstraints( | ||
min_length=1, | ||
strict=True, | ||
pattern=rf"^(_?core_|out_)({'|'.join(ALL_PUDL_SOURCES)})__([a-z0-9_]+)$", | ||
), | ||
] | ||
"""A valid PUDL resource name conforming to our naming conventions :class:`str`.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was the lightest weight way I could see to do this. However, it means that the runtime enforcement of type conformance needs to be done with the function below. So maybe it would be better if it were a "real" Pydantic model, if it's not being used to type a field in a Pydantic model (as it is in the Resource
class and everything that hangs off of that)
|
||
ALL_PUDL_SOURCES = sorted(set(SOURCES.keys()).union(["epa", "eia", "ferc"])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- We use the
eia
andferc
"sources" for tables that are generic to the whole agency (i.e. they come from more than one data source). - However, we haven't been great about really defining what that means -- there are many tables with multiple "sources" listed which nevertheless have a single form listed in the resource name.
- The
epa
source on the other hand is a rogue, used I think becauseepacamd_eia
pertains to two data sources (it's glue...) and is the only data source we have that contains an underscore, which would mess up our table naming if we used it as the data source. - So maybe
epacamd_eia
shouldn't actually be allowed here. Maybe we should rename that source.
def get_data_source(table_name: str) -> str: | ||
def get_data_source(table_name: PudlResourceName) -> str: | ||
"""Return the data source element of the table's name.""" | ||
if not is_valid_pudl_resource_name(table_name): | ||
raise ValueError(f"Invalid PUDL resource name: {table_name}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoids the possibility of a malformed table name resulting in undefined behavior.
Overview
core
_core
andout
layers (since those are the only ones we write to the database or Parquet).Closes #4436
Testing
To-do list