[Spark] Validate computed state against checksum on checkpoint #3846

Open: wants to merge 1 commit into base: master


dhruvarya-db
Collaborator

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Follow-up to #3828.

This PR adds checksum validation logic. On every checkpoint, we take the table state computed from the previous checkpoint plus the subsequent deltas and compare it against the checksum that was written at that version. The same methods could also be used to validate more frequently if needed.
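As a rough illustration of the comparison described above (not the code in this PR), the idea can be sketched as a field-by-field check between the computed state and the recorded checksum. `StateSummary`, `ChecksumValidationSketch`, and `findMismatches` are hypothetical names chosen for this sketch, and the four fields are a simplified assumption about what a checksum records.

```scala
// Hypothetical, simplified stand-in for the fields a version checksum might record.
// This is an assumption for illustration, not the actual checksum class.
final case class StateSummary(
    tableSizeBytes: Long,
    numFiles: Long,
    numMetadata: Long,
    numProtocol: Long)

object ChecksumValidationSketch {

  /**
   * Compares the state computed from the log against the checksum written
   * at the same version. Returns one message per mismatching field; an
   * empty result means the two agree.
   */
  def findMismatches(computed: StateSummary, checksum: StateSummary): Seq[String] = {
    val checks: Seq[(String, Long, Long)] = Seq(
      ("tableSizeBytes", computed.tableSizeBytes, checksum.tableSizeBytes),
      ("numFiles", computed.numFiles, checksum.numFiles),
      ("numMetadata", computed.numMetadata, checksum.numMetadata),
      ("numProtocol", computed.numProtocol, checksum.numProtocol)
    )
    // Keep only the fields whose computed value differs from the recorded one.
    checks.collect {
      case (field, actual, expected) if actual != expected =>
        s"$field: computed=$actual, checksum=$expected"
    }
  }
}
```

Collecting every mismatch rather than stopping at the first one makes it easier to report all corrupted fields in a single validation pass; whether the real implementation logs, throws, or does both on a mismatch is not something this sketch assumes.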

How was this patch tested?

Added a new test case in ChecksumSuite that verifies the validation logic catches every logically corrupted field.

Does this PR introduce any user-facing changes?

No

Comment on lines +611 to +613
if (spark.conf.get(DeltaSQLConf.DELTA_WRITE_CHECKSUM_ENABLED)) {
  snapshot.validateChecksum(Map("context" -> "writeCheckpoint"))
}
Collaborator

prakharjain09, Nov 5, 2024

@scovich @dhruvarya-db Should this check be done at the beginning or at the end of writing the checkpoint?

Collaborator Author

I feel that it would be a waste of time to write a checkpoint only to find out that there is something wrong with the delta log.

Collaborator Author

@prakharjain09 Since most of this PR is about the validation logic itself, maybe we can address the question of when validation should be triggered in a follow-up?
