Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[wip] feat: Introduce TableConfiguration to jointly manage metadata, protocol, and column mapping #644

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

OussamaSaoudi
Copy link
Collaborator

@OussamaSaoudi OussamaSaoudi commented Jan 14, 2025

What changes are proposed in this pull request?

This PR introduces the TableConfiguration struct which is used to perform feature enablement checks on both the table's protocol and metadata.

Problem statement

To check that a feature is enabled, you often must check that a certain reader/writer feature is enabled and that a table property is set to true. For example, a writer must check both the delta.enableDeletionVectors table property, and check that the deletionVectors writer feature is present in the table's Protocol. Probing two disparate structs to do a single check is error-prone and may lead to these metadata/protocol checks to become out of sync. Moreover checks are performed in the CDF path, snapshot scan path, and in the read path. Thus there are many ways in which protocol and metadata checks can diverge with one another.

Put simply, the problems are:

  1. Protocol check may diverge with the metadata for the same code path
  2. P&M checks may diverge among different code paths
  3. There is duplicate work between all of these locations
  4. Any update to reader/writer feature check or table property check currently needs to be applied to several locations in the code.

Solution

TableConfiguration consolidates all protocol and metadata checks to one place. It also ensures that the logic for checking feature enablement is kept consistent throughout the codebase. This addresses problem 1-4 outlined above.

Closes: #571

How was this change tested?

existing tests pass.
TODO: add more tests

/// See the documentation of [`TableChanges`] for more details.
///
/// [`TableChanges`]: crate::table_changes::TableChanges
pub fn can_read_cdf(&self) -> DeltaResult<()> {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zachschuermann @nicklan wdyt of this sort of API? The general pattern would be can_(read | write)_{table_feature}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that we'd only create such functions if/when we need them, and if they make sense. For example if there is a feature that is only relevant to readers, there would be no can_write_{tablefeature}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason it doesn't return DeltaResult<bool>? Seems like there are cases (most of them?) where we don't want to necessarily throw, we just want to learn (but if something goes wrong while trying to learn, an error response still warranted).

Can always add a corresponding assert_can_XXX method if the error version is actually needed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was using the pattern of Ok(()) => true, Err(_) => false, with an explanation why. We use a similar pattern here. Do you think we should rethink this approach?

Copy link

codecov bot commented Jan 14, 2025

Codecov Report

Attention: Patch coverage is 42.01681% with 69 lines in your changes missing coverage. Please review.

Project coverage is 83.41%. Comparing base (c1c1dbe) to head (50ad379).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/table_configuration.rs 37.73% 63 Missing and 3 partials ⚠️
kernel/src/snapshot.rs 72.72% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #644      +/-   ##
==========================================
- Coverage   83.67%   83.41%   -0.27%     
==========================================
  Files          75       76       +1     
  Lines       16950    17055     +105     
  Branches    16950    17055     +105     
==========================================
+ Hits        14183    14226      +43     
- Misses       2100     2163      +63     
+ Partials      667      666       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

impl TableConfiguration {
pub fn new(metadata: Metadata, protocol: Protocol) -> DeltaResult<Self> {
// important! before a read/write to the table we must check it is supported
protocol.ensure_read_supported()?;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this and validate_schema_column_mapping belong here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add a new TableConfiguration struct to bundle protocol/metadata cross-validation
3 participants