Introduce enforced and versioned data product schemas when they are published for cross-domain governance #8914
fkostadinov
started this conversation in
Ideas
Replies: 0 comments
A key idea of a data mesh is that domains are superimposed onto a data lake to keep it from turning into a data swamp. This way, ownership of every dataset is clearly assigned. However, many metadata and data lineage tools only document dataset schemas; they do not enforce them. As a consequence, when a data product is published and nobody enforces its schema, downstream consumers may find that upstream producers break the schema contract at some point.
I was wondering whether we could introduce enforced schemas somehow. As long as a dataset resides within a domain, its schema is allowed to change. But the moment a data owner publishes a dataset for consumption beyond the domain, there should be an immutable contract that downstream consumers can rely on. Publishing a data product would put the schema contract under version control (similar to what e.g. Nessie does). Every change to the schema of a published data product would require a sign-off from the data owner and would notify registered downstream consumers.
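To make the proposed workflow more concrete, here is a minimal Python sketch of it. This is not Gravitino's API and not how Nessie works internally; `DataProduct`, `publish`, `subscribe`, and `change_schema` are hypothetical names I made up for illustration, and schemas are simplified to a plain column-to-type mapping.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Schema = Dict[str, str]  # column name -> type name (simplified assumption)
Subscriber = Callable[[str, int, Schema], None]


@dataclass
class DataProduct:
    """Hypothetical published data product with a versioned schema contract."""

    name: str
    owner: str
    versions: List[Schema] = field(default_factory=list)
    subscribers: List[Subscriber] = field(default_factory=list)

    @property
    def current_version(self) -> int:
        # Version numbers start at 1; 0 means "not published yet".
        return len(self.versions)

    def publish(self, schema: Schema) -> int:
        """Freeze the first schema version when the product leaves its domain."""
        if self.versions:
            raise ValueError("already published; use change_schema instead")
        self.versions.append(dict(schema))
        return self.current_version

    def subscribe(self, callback: Subscriber) -> None:
        """Register a downstream consumer to be notified of schema changes."""
        self.subscribers.append(callback)

    def change_schema(self, new_schema: Schema, approved_by: str) -> int:
        """Record a new schema version; requires sign-off from the owner."""
        if not self.versions:
            raise ValueError("not published yet; use publish first")
        if approved_by != self.owner:
            raise PermissionError(f"schema change needs sign-off from {self.owner}")
        # Old versions are never modified, only appended to.
        self.versions.append(dict(new_schema))
        for notify in self.subscribers:
            notify(self.name, self.current_version, dict(new_schema))
        return self.current_version


# Usage: one owner, one registered downstream consumer.
received = []
orders = DataProduct(name="orders", owner="alice")
orders.publish({"order_id": "bigint", "amount": "decimal"})
orders.subscribe(lambda name, version, schema: received.append((name, version)))
orders.change_schema(
    {"order_id": "bigint", "amount": "decimal", "currency": "string"},
    approved_by="alice",
)
```

In this sketch, earlier versions stay readable in `versions`, so a consumer pinned to version 1 could still look up the contract it was built against. A real implementation would also need persistence and authentication, which are left out here.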
This approach would allow domain-specific data product ownership while guaranteeing downstream consumers that things do not break without them at least being notified. Every breaking change could be clearly tracked, and we would always know who the assigned owner of the dataset is.
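Tracking breaking changes presupposes some rule for what counts as "breaking". As a rough sketch, assuming that removing a column or changing its type is breaking while adding a column is not (a real policy may well differ), a check could look like this. The function name `breaking_changes` and the column-to-type schema representation are hypothetical.

```python
from typing import Dict, List

Schema = Dict[str, str]  # column name -> type name (simplified assumption)


def breaking_changes(old: Schema, new: Schema) -> List[str]:
    """List the changes from old to new that could break existing consumers.

    Assumed policy: removed or retyped columns are breaking;
    newly added columns are not.
    """
    problems = []
    for column, old_type in old.items():
        if column not in new:
            problems.append(f"column removed: {column}")
        elif new[column] != old_type:
            problems.append(f"type changed: {column} {old_type} -> {new[column]}")
    return problems


v1 = {"order_id": "bigint", "amount": "decimal"}
v2 = {"order_id": "bigint", "amount": "decimal", "currency": "string"}
v3 = {"order_id": "string", "currency": "string"}

print(breaking_changes(v1, v2))  # added column only, so nothing is reported
print(breaking_changes(v1, v3))  # retyped order_id and removed amount
```

A non-empty result could then be what triggers the owner sign-off and the notification to registered consumers.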
I am not sure whether this is aligned with what Gravitino tries to be (e.g. only a metadata catalogue, or also a governance platform). Maybe my idea belongs at a level above Gravitino.