test: Port cdf tests from delta-spark to kernel #611

Merged
12 commits merged on Jan 15, 2025

Collaborator

OussamaSaoudi-db commented on Dec 20, 2024

What changes are proposed in this pull request?

This PR adds several CDF tests from delta-spark. We check the following:

  • CDF over various version ranges
  • Update operations are read correctly from cdc files
  • An action with data_change=false is skipped
  • A range with start > end is an error
  • A start version greater than the latest table version is an error
  • CDF works on partitioned tables
  • CDF works on tables with backticks in the column names
  • CDF is correct for deletes: unconditional deletes, conditional deletes that remove all rows, and selective conditional deletes

Table-changes construction is also changed so that the CDF version-range error is checked before any snapshots are created. This makes the error message clearer when the start version is beyond the end of the table.
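
For readers following along, here is a minimal sketch of the kind of up-front range check described above. It is not the kernel's code: `RangeError`, `validate_cdf_range`, and the idea that the caller already knows the latest table version are all assumptions made for illustration.

```rust
use std::fmt;

/// Hypothetical error type for this sketch; not the kernel's actual error enum.
#[derive(Debug, PartialEq)]
enum RangeError {
    StartAfterEnd { start: u64, end: u64 },
    StartBeyondLatest { start: u64, latest: u64 },
}

impl fmt::Display for RangeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            RangeError::StartAfterEnd { start, end } => {
                write!(f, "start version {start} is greater than end version {end}")
            }
            RangeError::StartBeyondLatest { start, latest } => {
                write!(f, "start version {start} is beyond the latest table version {latest}")
            }
        }
    }
}

/// Validate a requested CDF range before any snapshot work happens.
/// `end = None` means "read up to the latest version".
/// Assumption: `latest` is supplied by the caller. This sketch does not
/// decide what should happen when an explicit `end` exceeds `latest`.
fn validate_cdf_range(
    start: u64,
    end: Option<u64>,
    latest: u64,
) -> Result<(u64, u64), RangeError> {
    if start > latest {
        return Err(RangeError::StartBeyondLatest { start, latest });
    }
    let end = end.unwrap_or(latest);
    if start > end {
        return Err(RangeError::StartAfterEnd { start, end });
    }
    Ok((start, end))
}
```

Checking the range first means the caller sees a message about the requested versions rather than a lower-level failure from trying to build a snapshot at a version that does not exist.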

OussamaSaoudi-db changed the title from "Cdf delta spark tests" to "test: Cdf delta spark tests" on Dec 20, 2024

codecov bot commented Dec 20, 2024

Codecov Report

Attention: Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.

Project coverage is 83.70%. Comparing base (c1c1dbe) to head (8156f9b).
Report is 1 commit behind head on main.

Files with missing lines         | Patch % | Lines
kernel/src/table_changes/mod.rs  | 85.71%  | 0 missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #611      +/-   ##
==========================================
+ Coverage   83.67%   83.70%   +0.02%     
==========================================
  Files          75       75              
  Lines       16950    16951       +1     
  Branches    16950    16951       +1     
==========================================
+ Hits        14183    14188       +5     
  Misses       2100     2100              
+ Partials      667      663       -4     

OussamaSaoudi-db changed the title from "test: Cdf delta spark tests" to "test: Port cdf tests from delta-spark to kernel" on Dec 20, 2024
OussamaSaoudi-db marked this pull request as ready for review on December 20, 2024 00:47
Collaborator

nicklan left a comment

Looks great, thanks! One small thing: could you recreate the archives with --no-xattrs passed to tar? Otherwise, if you look at these zstd files on a non-macOS machine, you get lots of:
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.provenance'

tar -c --no-xattrs [etc] should do the trick

#[test]
fn simple_cdf_version_ranges() -> DeltaResult<()> {
    let batches = read_cdf_for_table("cdf-table-simple", 0, 0, None)?;
    let mut expected = vec![
Collaborator

I tend to prefer putting the expected output along with the data (like we do with dat) rather than spreading it out like this. I think this is okay for now though, and maybe we can make an issue to port each test by recreating the archive with an expected data parquet.

Collaborator Author

done: #626

Collaborator

this reminds me: we should make an issue (or perhaps we already have one?) to fix all the tests and move away from string matching and instead compare underlying (sorted) data/metadata

cc @nicklan
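
To make the suggestion concrete, here is a minimal sketch of comparing results as sorted data rather than as formatted strings. The `Row` shape and the `same_rows` helper are hypothetical and exist only for this illustration; the kernel's tests would work on their own batch types.

```rust
/// One CDF row in this sketch: (id, change type, commit version).
/// The shape is an assumption made for the example.
type Row = (i64, String, i64);

/// Compare two result sets as data, ignoring the order in which rows arrived.
/// Both sides are sorted first, so the check does not depend on how a
/// pretty-printer lays out the table or on scan order.
fn same_rows(mut actual: Vec<Row>, mut expected: Vec<Row>) -> bool {
    actual.sort();
    expected.sort();
    actual == expected
}
```

A test could then assert `same_rows(actual, expected)` and stay stable when row order or column formatting changes.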

@OussamaSaoudi-db
Collaborator Author

@nicklan rebuilt with --no-xattrs. It seems to have caused a diff on GitHub, but I'm still seeing com.apple.provenance when calling xattr. Does it still cause the error?

Collaborator

zachschuermann left a comment

did all the expected output come from delta-spark? if yes, then LGTM

Comment on lines +332 to +334
// Note: `update_pre` and `update_post` are technically not part of the delta spec, which
// uses `update_preimage` and `update_postimage` respectively. However, the tests in
// delta-spark use `update_pre` and `update_post`.
Collaborator

how are we observing update_pre and update_post here then? aren't we reading the CDF and then filling in our own update_preimage etc.?

Collaborator Author

All update _change_types come directly from a cdc file. We don't insert or modify them. In this test, delta-spark wrote update_pre and update_post directly into the cdc file.

@OussamaSaoudi-db
Collaborator Author

@zachschuermann regarding creating an issue: I already put one up for CDF in #626, but I can expand its scope to remove all string matching from our testing.

@OussamaSaoudi-db
Collaborator Author

@zachschuermann My methodology is as follows: I took their expected results (sample below), constructed the tables by hand, and only then verified the results against our implementation.

        checkCDCAnswer(
          log,
          CDCReader.changesToBatchDF(log, 0, 2, spark).filter("_change_type = 'insert'"),
          Range(0, 6).map { i => Row(i, "old", i % 2, "insert", 0) })
        checkCDCAnswer(
          log,
          CDCReader.changesToBatchDF(log, 0, 2, spark).filter("_change_type = 'delete'"),
          Seq(0, 2, 3, 4).map { i => Row(i, "old", i % 2, "delete", if (i % 2 == 0) 2 else 1) })
        // ...
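
As an illustration of how the Scala expectations above translate into expected rows, here is a small Rust sketch. The tuple layout follows the positional arguments of `Row(...)` in the sample; the type alias, the function names, and the reading of the last field as the commit version are assumptions, not names from the kernel or delta-spark.

```rust
/// Row shape assumed for this sketch, in the positional order of the Scala
/// `Row(...)` calls: (id, text, id % 2, change type, commit version).
type CdfRow = (i64, &'static str, i64, &'static str, i64);

/// Mirrors `Range(0, 6).map { i => Row(i, "old", i % 2, "insert", 0) }`.
/// Scala's `Range(0, 6)` excludes 6, like Rust's `0..6`.
fn expected_inserts() -> Vec<CdfRow> {
    (0i64..6).map(|i| (i, "old", i % 2, "insert", 0)).collect()
}

/// Mirrors `Seq(0, 2, 3, 4).map { i => Row(i, "old", i % 2, "delete", ...) }`,
/// where even ids carry version 2 and odd ids carry version 1.
fn expected_deletes() -> Vec<CdfRow> {
    [0i64, 2, 3, 4]
        .iter()
        .map(|&i| (i, "old", i % 2, "delete", if i % 2 == 0 { 2 } else { 1 }))
        .collect()
}
```

Generating the expectations this way keeps them close to the delta-spark source they were ported from, which makes it easier to audit that nothing was changed in translation.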

OussamaSaoudi merged commit 606db20 into delta-io:main on Jan 15, 2025
21 checks passed