Enforce more specific PUDL resource name constraints. #4438

Draft
wants to merge 1 commit into
base: main
from

Conversation

zaneselvans
Member

@zaneselvans zaneselvans commented Jul 16, 2025

Overview

  • Adds a more specific Annotated string type that uses Pydantic's StringConstraints to enforce our PUDL table naming conventions (at least, more than just being SnakeCase)
  • Requires the two-part (layer, data source) prefix of the table name before the double underscore, allowing only valid data sources and the core, _core, and out layers (since those are the only ones we write to the database or Parquet).
  • The new type was easy. Tracking down all the tests where we had dummy table names that didn't conform was a little more annoying.
  • I'm not sure if this is really the way to do it, but it's a way to do it. Open to other ideas!
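
For a quick feel of what the new pattern accepts and rejects, here is a minimal sketch using only the standard library. EXAMPLE_SOURCES is a hypothetical subset of data sources chosen for illustration (the PR builds the real list from SOURCES), and matches_convention is an illustrative helper, not a function in the PR.

```python
import re

# Hypothetical subset of data sources, for illustration only.
EXAMPLE_SOURCES = ["eia", "eia860", "eia923", "ferc1"]

# Same shape as the pattern in the PR: layer prefix, data source,
# double underscore, then the rest of the table name.
PATTERN = re.compile(
    rf"^(_?core_|out_)({'|'.join(EXAMPLE_SOURCES)})__([a-z0-9_]+)$"
)


def matches_convention(name: str) -> bool:
    """Return True if the name follows the layer_source__rest convention."""
    return PATTERN.match(name) is not None


for candidate in [
    "core_eia923__monthly_generation",  # accepted: core layer, known source
    "_core_eia860__assn_boiler",        # accepted: _core layer
    "out_ferc1__yearly_plants",         # accepted: out layer
    "raw_eia923__generation",           # rejected: raw is not an allowed layer
    "core_eia923_generation",           # rejected: no double underscore
    "core_notasource__table",           # rejected: unknown data source
]:
    print(candidate, matches_convention(candidate))
```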

Closes #4436

Testing

  • Ran the unit tests locally

To-do list

  • Review the PR yourself and call out any questions or issues you have.

@zaneselvans zaneselvans requested a review from krivard July 16, 2025 06:04
@zaneselvans zaneselvans self-assigned this Jul 16, 2025
@zaneselvans zaneselvans added metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. data-types Dtype conversions, standardization and implications of data types labels Jul 16, 2025
Member Author

@zaneselvans zaneselvans left a comment

Most of the changed lines fix the names of test tables, both in the unit tests and in the docstring examples / tests in classes.py. Substantive changes are in:

  • classes.py (not the docstrings)
  • dbt_helper.py
  • sources.py

Comment on lines -184 to +188
-    min_length=1, strict=True, pattern=r"^[a-z_][a-z0-9_]*(_[a-z0-9]+)*$"
+    min_length=1,
+    strict=True,
+    pattern=r"^[a-z_][a-z0-9_]*(_[a-z0-9]+)*$",
Member Author

No changes here -- just forced args onto individual lines.

Comment on lines +193 to +201
PudlResourceName = Annotated[
    str,
    StringConstraints(
        min_length=1,
        strict=True,
        pattern=rf"^(_?core_|out_)({'|'.join(ALL_PUDL_SOURCES)})__([a-z0-9_]+)$",
    ),
]
"""A valid PUDL resource name conforming to our naming conventions :class:`str`."""
Member Author

This was the lightest-weight way I could see to do this. However, it means that the runtime enforcement of type conformance needs to be done with the function below. So it might be better to make it a "real" Pydantic model for the cases where it isn't being used to type a field in a Pydantic model (as it is in the Resource class and everything that hangs off of that).

Comment on lines +946 to +947

ALL_PUDL_SOURCES = sorted(set(SOURCES.keys()).union(["epa", "eia", "ferc"]))
Member Author

  • We use the eia and ferc "sources" for tables that are generic to the whole agency (i.e. they come from more than one data source).
  • However, we haven't been great about really defining what that means -- there are many tables with multiple "sources" listed which nevertheless have a single form listed in the resource name.
  • The epa source, on the other hand, is a rogue. I think it's used because epacamd_eia pertains to two data sources (it's glue...) and is the only data source we have whose name contains an underscore, which would mess up our table naming if we used it as the data source.
  • So maybe epacamd_eia shouldn't actually be allowed here. Maybe we should rename that source.
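
To illustrate the underscore problem described above, here is a small sketch. EXAMPLE_SOURCE_KEYS is a hypothetical stand-in for the keys of the real SOURCES mapping, naive_prefix_parts is an illustrative helper that is not part of the PR, and the table name in the last call is made up.

```python
# Hypothetical stand-in for the keys of the real SOURCES mapping.
EXAMPLE_SOURCE_KEYS = {"eia860": {}, "eia923": {}, "epacamd_eia": {}, "ferc1": {}}

# Same construction as in the PR: known sources plus the agency-level names.
all_sources = sorted(set(EXAMPLE_SOURCE_KEYS.keys()).union(["epa", "eia", "ferc"]))
print(all_sources)


def naive_prefix_parts(table_name: str) -> list[str]:
    """Split the part before the double underscore on single underscores."""
    prefix = table_name.split("__", 1)[0]
    return prefix.split("_")


# A source without an underscore yields the expected two parts.
print(naive_prefix_parts("core_eia923__monthly_generation"))

# A source with an underscore yields three parts, so a naive
# (layer, source) unpacking of the prefix no longer works.
print(naive_prefix_parts("core_epacamd_eia__example_table"))
```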

Comment on lines -260 to +267
-def get_data_source(table_name: str) -> str:
+def get_data_source(table_name: PudlResourceName) -> str:
     """Return the data source element of the table's name."""
+    if not is_valid_pudl_resource_name(table_name):
+        raise ValueError(f"Invalid PUDL resource name: {table_name}")
Member Author

Avoids the possibility of a malformed table name resulting in undefined behavior.
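
As a rough illustration of the validate-then-extract idea, the sketch below pulls the data source out of a name with a regular expression and raises on anything malformed. The real body of get_data_source is not shown in this diff, so this is an assumption about one way it could work. EXAMPLE_SOURCES and get_data_source_sketch are hypothetical names.

```python
import re

# Hypothetical subset of data sources, for illustration only.
EXAMPLE_SOURCES = ["eia", "eia860", "eia923", "ferc1"]

_NAME_RE = re.compile(
    rf"(_?core_|out_)({'|'.join(EXAMPLE_SOURCES)})__([a-z0-9_]+)"
)


def get_data_source_sketch(table_name: str) -> str:
    """Return the data source element of a table name, or raise ValueError."""
    match = _NAME_RE.fullmatch(table_name)
    if match is None:
        # Fail loudly instead of returning a meaningless fragment.
        raise ValueError(f"Invalid PUDL resource name: {table_name}")
    # Group 2 is the data source between the layer and the double underscore.
    return match.group(2)


print(get_data_source_sketch("core_eia923__monthly_generation"))

try:
    get_data_source_sketch("plants")
except ValueError as err:
    print(err)
```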

@zaneselvans zaneselvans moved this from New to In progress in Catalyst Megaproject Jul 16, 2025
Labels
data-types Dtype conversions, standardization and implications of data types metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages.
Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.

Enforce PUDL table naming conventions in Pydantic Resource model
1 participant