feat: validate_deleted_data_files #1938

Open · wants to merge 18 commits into base: main

Conversation

@jayceslesar (Contributor) commented Apr 19, 2025

Comment on lines 118 to 128
if entry.snapshot_id not in new_snapshot_ids:
    continue

if entry.status != ManifestEntryStatus.DELETED:
    continue

if data_filter is not None and not evaluator(entry.data_file):
    continue

if partition_set is not None and (entry.data_file.spec_id, entry.data_file.partition) not in partition_set:
    continue
jayceslesar (Contributor, author):

Probably cleaner to put each of these conditions on its own line and wrap them in an `any` call.
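The suggested refactor might look roughly like this (a sketch with stand-in types; `ManifestEntry`, `DataFile`, and `ManifestEntryStatus` here are minimal stubs for illustration, not pyiceberg's actual classes):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Optional


class ManifestEntryStatus(Enum):  # minimal stub of pyiceberg's enum
    ADDED = 1
    EXISTING = 2
    DELETED = 3


@dataclass
class DataFile:  # minimal stub for illustration
    spec_id: int
    partition: Any


@dataclass
class ManifestEntry:  # minimal stub for illustration
    snapshot_id: int
    status: ManifestEntryStatus
    data_file: DataFile


def should_skip(
    entry: ManifestEntry,
    new_snapshot_ids: set,
    data_filter: Optional[object],
    evaluator: Callable[[DataFile], bool],
    partition_set: Optional[set],
) -> bool:
    # One condition per line, combined with any(); note the list is built
    # eagerly, so switch to a generator if lazy evaluation matters here.
    return any([
        entry.snapshot_id not in new_snapshot_ids,
        entry.status != ManifestEntryStatus.DELETED,
        data_filter is not None and not evaluator(entry.data_file),
        partition_set is not None
        and (entry.data_file.spec_id, entry.data_file.partition) not in partition_set,
    ])
```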

@jayceslesar jayceslesar changed the title [wip] feat: validate_deleted_data_files feat: validate_deleted_data_files May 1, 2025
@jayceslesar jayceslesar marked this pull request as ready for review May 2, 2025 00:40
Comment on lines +84 to +85
parent_snapshot: Optional[Snapshot],
partition_set: Optional[set[Record]],
Collaborator:

This looks out of order:

Suggested change:
- parent_snapshot: Optional[Snapshot],
- partition_set: Optional[set[Record]],
+ partition_set: Optional[set[Record]],
+ parent_snapshot: Optional[Snapshot],

It looks like the function call in validate_deleted_data_files is assuming that parent_snapshot is the last argument

    conflicting_entries = deleted_data_files(table, starting_snapshot, data_filter, None, parent_snapshot)
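Whichever order the signature ends up with, passing the optional arguments by keyword at the call site would make this class of bug impossible. A minimal sketch (the function below is a toy stand-in, not the PR's implementation):

```python
# Toy stand-in for deleted_data_files: it only echoes what it received,
# to show that keyword arguments survive positional reordering.
def deleted_data_files(table, starting_snapshot, data_filter=None,
                       partition_set=None, parent_snapshot=None):
    return {"partition_set": partition_set, "parent_snapshot": parent_snapshot}

# Keyword call site: stays correct even if partition_set and
# parent_snapshot swap positions in a later refactor.
result = deleted_data_files(
    "tbl", "snap-current",
    data_filter=None,
    partition_set=None,
    parent_snapshot="snap-parent",
)
assert result["parent_snapshot"] == "snap-parent"
```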

Contributor:

I think this needs some more work. Next to @sungwy's comment, we have the following in the code below:

(entry.data_file.spec_id, entry.data_file.partition) not in partition_set

Where it checks if a tuple is in a partition_set, but the partition_set only contains the Record according to the signature.

This triggered me, because if you do:

ALTER TABLE prod.db.taxis REPLACE PARTITION FIELD pickup_timestamp WITH day(pickup_timestamp);

-- and then
ALTER TABLE prod.db.taxis REPLACE PARTITION FIELD pickup_timestamp WITH day(dropoff_timestamp);

Both of the partitioning strategies will produce a Record[int] because it will contain the number of days since epoch. But the meaning is completely different.
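The collision can be made concrete with a small sketch (plain tuples stand in for pyiceberg `Record` objects, and the spec ids are hypothetical):

```python
# Two different partition specs (day(pickup_timestamp) vs
# day(dropoff_timestamp)) can yield identical partition values.
pickup_day = (20_000,)   # days since epoch under spec 1
dropoff_day = (20_000,)  # same value under spec 2, different meaning

# A bare set of partition values cannot tell the two specs apart:
values_only = {pickup_day}
assert dropoff_day in values_only  # collides, even though the specs differ

# Keying membership by (spec_id, partition) disambiguates, which matches
# the tuple-based check quoted from the PR:
partition_set = {(1, pickup_day)}
assert (2, dropoff_day) not in partition_set
```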

Comment on lines +105 to +106
starting_snapshot,
parent_snapshot,
Collaborator:

This looks out of order too, because validation_history has to_snapshot as the second argument, and from_snapshot as the third argument.

Suggested change:
- starting_snapshot,
- parent_snapshot,
+ parent_snapshot,
+ starting_snapshot,

Do you think it would be better to update validation_history function to use the following function signature instead? I think it's a lot more expected to have from_snapshot then to_snapshot

def validation_history(
    table: Table,
    from_snapshot: Snapshot,
    to_snapshot: Snapshot,
    matching_operations: set[Operation],
    manifest_content_filter: ManifestContent,
)

)

if data_filter is not None:
    evaluator = _StrictMetricsEvaluator(table.schema(), data_filter).eval
Collaborator:

I'm not too sure if this is correct, because ManifestGroup.entries seems to be using inclusive projection.

Should we be using inclusive_projection here instead?

Summoning @Fokko for a second review

Contributor:

Yes, I agree that this should be inclusive projection, since we want to know if there are any matches. Inclusive projection returns rows_might_match and rows_cannot_match. If they cannot be matched, then we can skip it :)
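The skip rule described here can be sketched with plain booleans; the assumption (worth verifying against pyiceberg) is that `ROWS_MIGHT_MATCH` / `ROWS_CANNOT_MATCH` are effectively the boolean results of the inclusive evaluator:

```python
# Sketch of the inclusive-projection contract: the evaluator answers
# "might any rows match?". Only a definite ROWS_CANNOT_MATCH justifies
# skipping a file; a "might match" must be kept, since inclusive
# projection never proves a positive match. Booleans stand in for
# pyiceberg's constants here.
ROWS_MIGHT_MATCH = True
ROWS_CANNOT_MATCH = False

def keep_file(evaluator_result: bool) -> bool:
    return evaluator_result is not ROWS_CANNOT_MATCH

assert keep_file(ROWS_MIGHT_MATCH)       # might match -> must keep
assert not keep_file(ROWS_CANNOT_MATCH)  # cannot match -> safe to skip
```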

@Fokko (Contributor) left a comment:

Thanks for following up on the previous PR @jayceslesar. This looks like a great start, but there are some missing elements. It would be good to add some more tests as well, since this is pretty critical code that might affect correctness.

@@ -69,3 +75,74 @@ def validation_history(
raise ValidationException("No matching snapshot found.")

return manifests_files, snapshots


def deleted_data_files(
Contributor:

Can we mark these as private? I don't think we want to expose this directly to the user, and marking it private makes it easier to change the signature later on (since we don't have to go through the deprecation cycle).

Suggested change:
- def deleted_data_files(
+ def _deleted_data_files(


"""
conflicting_entries = deleted_data_files(table, starting_snapshot, data_filter, None, parent_snapshot)
if any(conflicting_entries):
    raise ValidationException("Deleted data files were found matching the filter.")
Contributor:

It might also be nice to include in the exception which snapshot id(s) are conflicting.
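A possible shape for the richer error (a hypothetical sketch; `ValidationException` is redefined locally and manifest entries are modeled as dicts):

```python
# Hypothetical sketch: collect the conflicting snapshot ids and surface
# them in the exception message, as suggested above.
class ValidationException(Exception):
    pass

def raise_if_conflicting(conflicting_entries):
    snapshot_ids = sorted({e["snapshot_id"] for e in conflicting_entries})
    if snapshot_ids:
        raise ValidationException(
            "Deleted data files were found matching the filter "
            f"(conflicting snapshot ids: {snapshot_ids})"
        )

raise_if_conflicting([])  # no conflicts: nothing raised
```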

parent_snapshot: Ending snapshot on the branch being validated

Returns:
List of deleted data files matching the filter
Contributor:

Suggested change:
- List of deleted data files matching the filter
+ List of conflicting manifest-entries


if entry.status != ManifestEntryStatus.DELETED:
    continue

if data_filter is not None and not evaluator(entry.data_file):
Contributor:

Maybe better to make it explicit:

Suggested change:
- if data_filter is not None and not evaluator(entry.data_file):
+ if data_filter is not None and not evaluator(entry.data_file) is ROWS_CANNOT_MATCH:

Successfully merging this pull request may close these issues.

Support Concurrency Safety Validation: Implement validateDeletedDataFiles