Skip to content

Add macro for group_mean_continuity_check #4092

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

krivard
Copy link
Contributor

@krivard krivard commented Feb 28, 2025

Overview

Three tables require a group_mean_continuity_check, which does the following:

  1. Group by a specified column
  2. Compute the mean of each group
  3. Compute the percent change between successive groups
  4. Check against a per-column threshold

This PR contains a draft for a possible implementation.

What did you change?

  • New macro, group_mean_continuity_check(group_column, max_pct_change)
  • Demo of macro in use for _core_eia923__cooling_system_information, using columns and thresholds from transform/eia923.py

Debatable design choices

  • This is written as a column-level check
    • pros: macro source is simple (no loops); keeps all checks for a column together in the column entry
    • cons: the grouping column and number of acceptable outliers is replicated in each data-test entry
    • options: could rewrite as a table-level check with a list of columns and thresholds; see Add model/table-level macro for group_mean_continuity_check #4116

@krivard krivard added dbt Issues related to the data build tool aka dbt data-validation Issues related to checking whether data meets our quality expectations. labels Feb 28, 2025
@krivard krivard requested a review from marianneke February 28, 2025 18:51
@krivard krivard linked an issue Mar 4, 2025 that may be closed by this pull request
Copy link
Member

@marianneke marianneke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a question: you mention pct change between successive groups. This suggests the group column is inherently ordered, right? Is this something you would generally expect to be the case?

I guess what I'm saying is: to me it is not intuitive that a "group" is always an ordered thing. Maybe this variable name should be changed to something that makes it clear that it is both a partition and something that has an inherent order to it

@aesharpe
Copy link
Member

I'm inclined to say that this test should remain a column-level test (rather than a table test as in #4116). I think this because it doesn't have to do with the relationship between columns in a table or the relationship between a table and another table. For that reason, I think it makes sense to have tests that implicate a particular column be column level (unless, perhaps, it's something that implicates every single column in a table). But this to me feels like it should remain at the column-level.

@krivard
Copy link
Contributor Author

krivard commented Mar 12, 2025

@aesharpe even though 2 of the 3 arguments for the test are intended to be identical across all the column tests within a table? and if they ever change, we have to update N lines instead of just one?

I can live with that; I just want to make sure the impact is clear before we decide.

@aesharpe
Copy link
Member

@aesharpe even though 2 of the 3 arguments for the test are intended to be identical across all the column tests within a table? and if they ever change, we have to update N lines instead of just one?

I can live with that; I just want to make sure the impact is clear before we decide.

Yes, this is good clarification. I think it's important to be able to customize things like max_fr_change and n_outliers_allowed to each column, hence, column level test :)

@krivard krivard force-pushed the krivard/dbt-migrations_group_mean_continuity branch from e0da49a to 4c8c677 Compare March 19, 2025 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data-validation Issues related to checking whether data meets our quality expectations. dbt Issues related to the data build tool aka dbt
Projects
Status: New
Development

Successfully merging this pull request may close these issues.

Migrate group_mean_continuity_check validation tests to dbt
3 participants