
Incompatible schema changes break file skipping #712

Open
scovich opened this issue Feb 21, 2025 · 1 comment
Labels
bug Something isn't working

Comments

scovich (Collaborator) commented Feb 21, 2025

Describe the bug

Suppose a query includes a skipping-eligible predicate over LONG column c.

Then we expect the add.stats column to include min/max stats that can parse as LONG.

However, it is possible that a recent table replace operation changed the schema of c -- previously it was a STRING column, which is incompatible with c's new type. In that case, every file that had been in the table will be canceled by a remove (table replacement always truncates the original table), ensuring that no incompatible file actions survive log replay.

Unfortunately, the kernel currently attempts to parse the entire add.stats column before deduplicating (in order to avoid tracking pruned files), and is thus exposed to parsing failures for rows that contain canceled add actions (e.g. add.stats.minValues.c = 'A' cannot parse as LONG).
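The ordering problem can be sketched in plain Rust (illustrative std-only types, not the kernel's actual API): parsing stats for every add row trips over canceled actions, while consulting the "seen" set first means the bad stats are never touched.

```rust
use std::collections::HashSet;

// Hypothetical, simplified model of a log-replay row: a file path, whether it
// is an add or a remove, and (for adds) the raw min-value string for column c.
struct Action<'a> {
    path: &'a str,
    is_add: bool,
    min_c: Option<&'a str>, // raw stats value for column c
}

// Deduplicate *before* parsing: stats of canceled adds are skipped entirely,
// so an old STRING value like "A" can never cause an Int64 parse failure.
fn surviving_min_values(batches: &[Vec<Action>]) -> Result<Vec<i64>, String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut mins = Vec::new();
    for batch in batches {
        for action in batch {
            if !seen.insert(action.path.to_string()) {
                continue; // canceled by an earlier remove (or a duplicate add)
            }
            if action.is_add {
                if let Some(raw) = action.min_c {
                    let v: i64 = raw
                        .parse()
                        .map_err(|_| format!("failed to parse {raw:?} as Int64"))?;
                    mins.push(v);
                }
            }
        }
    }
    Ok(mins)
}
```

Reversing the order (parse all rows, then dedup) is exactly the current behavior, and it fails on the canceled add even though that file can never be returned.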

This issue has a second aspect: Data skipping doesn't track or exclude partition columns directly. So we attempt data skipping over c, with the same risk of parsing failures, even if it's now a partition column. Fixing the general problem would make this harmless, but it's probably worth specifically tracking and excluding partition columns from the data skipping machinery so we don't waste time trying to parse (usually non-existent) stats and evaluating (provably useless) data skipping expressions for partition columns.
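The partition-column exclusion could look roughly like this (a hypothetical helper, not existing kernel code): subtract the table's partition columns from the columns a skipping predicate references before building any stats-parsing schema or skipping expression.

```rust
use std::collections::HashSet;

// Hypothetical helper: keep only the predicate's referenced columns that can
// actually carry file-level min/max stats, i.e. non-partition columns.
fn stats_eligible_columns<'a>(
    referenced: &[&'a str],
    partition_columns: &[&str],
) -> Vec<&'a str> {
    let partitions: HashSet<&str> = partition_columns.iter().copied().collect();
    referenced
        .iter()
        .copied()
        .filter(|col| !partitions.contains(col))
        .collect()
}
```

With this filter in place, a column like c that became a partition column simply drops out of the data-skipping machinery instead of triggering stats parsing that can only fail or prove useless.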

NOTE: Ideally, this issue should not arise if column mapping is enabled, because the physical names of the new columns should differ from the originals even if their logical names still seem to match.

To Reproduce

Invoke LogReplayScanner::process_scan_batch twice -- once with a batch containing an incompatible remove (to mark the file as "seen"), and again with a batch containing a matching incompatible add. It will fail with e.g.

Arrow(JsonError("whilst decoding field 'minValues': whilst decoding field 'c': failed to parse \"A\" as Int64"))

Expected behavior

The previously-seen remove should eliminate the add before it gets a chance to cause trouble.

Additional context

No response

scovich added the bug label Feb 21, 2025
scovich (Collaborator, Author) commented Feb 24, 2025

NOTE: The Java kernel avoids this issue by deduplicating file actions before attempting to parse add.stats, and also because its JSON parser honors selection vectors and ignores unselected rows.

The Rust kernel's JSON parser also ignores null rows, but we don't currently (have a way to) update the null mask based on the deduplication the kernel performed. We'll need to figure out how to do that. Additionally, we would want to split the deduplication into "check" and "update" passes, so that we can:

  1. Sanitize the rows of a batch (eliminate non-file-action rows, eliminate previously seen files, etc.)
  2. Parse stats and partition values of surviving rows, apply further pruning
  3. Update the "seen" set only for files that survived pruning

That way, we get the best of both worlds: pruning minimizes the cardinality of the "seen" set, but the "seen" set can still protect pruning attempts from incompatible schema changes.
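The check/update split might look roughly like this (std-only sketch with invented names; `survives_pruning` stands in for stats parsing plus predicate evaluation):

```rust
use std::collections::HashSet;

// Sketch of the proposed split: a "check" pass that consults the seen set
// without mutating it, and an "update" pass run only for rows that survived
// stats-based pruning. Illustrative names, not the kernel's actual types.
#[derive(Default)]
struct FileActionDeduplicator {
    seen: HashSet<String>,
}

impl FileActionDeduplicator {
    /// Pass 1: has this file already been returned or canceled?
    fn check(&self, path: &str) -> bool {
        self.seen.contains(path)
    }

    /// Pass 3: record only survivors, keeping the seen set small.
    fn update(&mut self, path: &str) {
        self.seen.insert(path.to_string());
    }
}

// Drives the three passes over one batch of add paths.
fn process_batch<F>(
    dedup: &mut FileActionDeduplicator,
    adds: &[&str],
    survives_pruning: F,
) -> Vec<String>
where
    F: Fn(&str) -> bool,
{
    let mut survivors = Vec::new();
    for &path in adds {
        if dedup.check(path) {
            continue; // pass 1: previously seen (canceled or duplicate)
        }
        if !survives_pruning(path) {
            continue; // pass 2: stats parsing + predicate pruned it
        }
        dedup.update(path); // pass 3: remember survivors only
        survivors.push(path.to_string());
    }
    survivors
}
```

Note the tradeoff this sketch makes explicit: a pruned file is never added to the seen set, so pruning keeps the set's cardinality down, while the seen set still shields pass 2 from files whose stats predate an incompatible schema change.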
