Skip to content

Conversation

@ross-whatnot
Copy link
Contributor

@ross-whatnot ross-whatnot commented Apr 5, 2025

Following discussion in this issue and related draft PRs, taking a simpler approach of adding parents: list[Model | Source | Snapshot] to Model and Snapshot models

This should allow writing rules that compare a node to its parents ("any model with tag tier_1 may only have parents that also have tag tier_1", "any model that does not have the tag beta may not have any parents that do have the tag beta"), to be able to make assertions about model lineage expectation. One could also traverse the graph in full (upstream) with a recursive rule that walks the graph via parents.

Tested with a ~3000 model manifest locally, and things seem to work just fine; will do a bit more poking and prodding, though.

@ross-whatnot ross-whatnot changed the title Embedding parents and children into models, sources, and snapshots Embedding parents into models, sources, and snapshots Apr 5, 2025
@ross-whatnot ross-whatnot changed the title Embedding parents into models, sources, and snapshots Embedding parents into models and snapshots Apr 5, 2025
@jochemvandooren jochemvandooren linked an issue Apr 7, 2025 that may be closed by this pull request
@ross-whatnot ross-whatnot marked this pull request as ready for review April 7, 2025 13:35
Copy link
Contributor

@jochemvandooren jochemvandooren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work @ross-whatnot , thanks a lot!

I have left very small comments, the feature itself is 🥇 . Could you add a CHANGELOG entry?

@ross-whatnot
Copy link
Contributor Author

@jochemvandooren thanks! Think I've addressed everything.

Copy link
Contributor

@jochemvandooren jochemvandooren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙌

Copy link
Contributor

@sercancicek sercancicek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @ross-whatnot! Thank you for your contribution 🙌 The PR already looks very well. I left one comment/suggestion.

Comment on lines +569 to +578
def _populate_parents(self) -> None:
"""Populate `parents` for all models and snapshots."""
for node in list(self.models.values()) + list(self.snapshots.values()):
for parent_id in node.depends_on.get("nodes", []):
if parent_id in self.models:
node.parents.append(self.models[parent_id])
elif parent_id in self.snapshots:
node.parents.append(self.snapshots[parent_id])
elif parent_id in self.sources:
node.parents.append(self.sources[parent_id])
Copy link
Contributor

@sercancicek sercancicek Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _populate_parents(self) -> None:
"""Populate `parents` for all models and snapshots."""
for node in list(self.models.values()) + list(self.snapshots.values()):
for parent_id in node.depends_on.get("nodes", []):
if parent_id in self.models:
node.parents.append(self.models[parent_id])
elif parent_id in self.snapshots:
node.parents.append(self.snapshots[parent_id])
elif parent_id in self.sources:
node.parents.append(self.sources[parent_id])
def _populate_parents(self) -> None:
"""Populate `parents` for all models and snapshots."""
all_parents = {**self.models, **self.snapshots, **self.sources}
for node in list(self.models.values()) + list(self.snapshots.values()):
for parent_id in node.depends_on.get("nodes", []):
if parent := all_parents.get(parent_id):
node.parents.append(parent)

IMO, this looks easier to read and maintain but just a suggestion :)

@matthieucan
Copy link
Contributor

Tested with a ~3000 model manifest locally, and things seem to work just fine; will do a bit more poking and prodding, though.

Pretty curious about performances there: did you measure the delta?

@ross-whatnot
Copy link
Contributor Author

Pretty curious about performances there: did you measure the delta?

@matthieucan no significant difference in the base case; for a project with 3000+ models, a simple lint takes basically the same amount of time (pre-parsed manifest).
Using 0.11:
image

Using my branch:
image

I haven't done anything too complex with the history, but adding a test that compares attributes of a child to attributes of its parents (e.g. below) doesn't change that runtime in any meaningful way (~1.11s)

@rule(severity=Severity.CRITICAL)
def model_has_appropriate_refs(model: Model) -> RuleViolation | None:
    """Models should have appropriate lineage"""
    tiers = ["t1", "t2", "t3", "t4"]
    violations = []
    for ref in model.parents:
        if isinstance(ref, Model):
            if tiers.index(model.meta["service_tier"]) < tiers.index(
                ref.meta["service_tier"]
            ):
                violations.append((ref.name, ref.meta["service_tier"]))
    if len(violations) > 0:
        return RuleViolation(
            message=f"Model {model.name} (tier {model.meta['service_tier']}) may not have parents in lower tiers ({', '.join([f'{v[0]} (tier {v[1]})' for v in violations])})"
        )
    return None

@matthieucan
Copy link
Contributor

@ross-whatnot Awesome, thanks for sharing! 🎉

@jochemvandooren jochemvandooren merged commit edf0563 into PicnicSupermarket:master Apr 30, 2025
4 checks passed
@jochemvandooren
Copy link
Contributor

Amazing work @ross-whatnot , thank you 🙌

@jochemvandooren
Copy link
Contributor

I will release a new version once we merge #110, so you can benefit from this feature asap!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

Refs, and rules about model lineage

4 participants