
Incompatible schema changes break file skipping #712

Open
scovich opened this issue Feb 21, 2025 · 1 comment
Labels
bug Something isn't working

Comments

scovich (Collaborator) commented Feb 21, 2025

Describe the bug

Suppose a query includes a skipping-eligible predicate over LONG column c.

Then we expect the add.stats column to include min/max stats that can parse as LONG.

However, it is possible that a recent table replace operation changed the schema of c -- previously it was a STRING column, which is incompatible with c's new type. In that case, every file that had been in the table will be canceled by a remove (table replacement always truncates the original table), ensuring that no incompatible file actions survive log replay.

Unfortunately, the kernel currently attempts to parse the entire add.stats column before deduplicating (in order to avoid tracking pruned files), and is thus exposed to parsing failures for rows that contain canceled add actions (e.g. add.stats.minValues.c = 'A' cannot parse as LONG).
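The ordering problem can be sketched in plain Rust (illustrative std-only types, not the kernel's actual API): parsing stats for every add row trips over canceled actions, while consulting the "seen" set first means the bad stats are never touched.

```rust
use std::collections::HashSet;

// Hypothetical, simplified model of a log-replay row: a file path, whether it
// is an add or a remove, and (for adds) the raw min-value string for column c.
struct Action<'a> {
    path: &'a str,
    is_add: bool,
    min_c: Option<&'a str>, // raw stats value for column c
}

// Deduplicate *before* parsing: stats of canceled adds are skipped entirely,
// so an old STRING value like "A" can never cause an Int64 parse failure.
fn surviving_min_values(batches: &[Vec<Action>]) -> Result<Vec<i64>, String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut mins = Vec::new();
    for batch in batches {
        for action in batch {
            if !seen.insert(action.path.to_string()) {
                continue; // canceled by an earlier remove (or a duplicate add)
            }
            if action.is_add {
                if let Some(raw) = action.min_c {
                    let v: i64 = raw
                        .parse()
                        .map_err(|_| format!("failed to parse {raw:?} as Int64"))?;
                    mins.push(v);
                }
            }
        }
    }
    Ok(mins)
}
```

Reversing the order (parse all rows, then dedup) is exactly the current behavior, and it fails on the canceled add even though that file can never be returned.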

This issue has a second aspect: Data skipping doesn't track or exclude partition columns directly. So we attempt data skipping over c, with the same risk of parsing failures, even if it's now a partition column. Fixing the general problem would make this harmless, but it's probably worth specifically tracking and excluding partition columns from the data skipping machinery so we don't waste time trying to parse (usually non-existent) stats and evaluating (provably useless) data skipping expressions for partition columns.
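The partition-column exclusion could look roughly like this (a hypothetical helper, not existing kernel code): subtract the table's partition columns from the columns a skipping predicate references before building any stats-parsing schema or skipping expression.

```rust
use std::collections::HashSet;

// Hypothetical helper: keep only the predicate's referenced columns that can
// actually carry file-level min/max stats, i.e. non-partition columns.
fn stats_eligible_columns<'a>(
    referenced: &[&'a str],
    partition_columns: &[&str],
) -> Vec<&'a str> {
    let partitions: HashSet<&str> = partition_columns.iter().copied().collect();
    referenced
        .iter()
        .copied()
        .filter(|col| !partitions.contains(col))
        .collect()
}
```

With this filter in place, a column like c that became a partition column simply drops out of the data-skipping machinery instead of triggering stats parsing that can only fail or prove useless.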

NOTE: Ideally, this issue should not arise if column mapping is enabled, because the physical names of the new columns should differ from the originals even if their logical names still seem to match.

To Reproduce

Invoke LogReplayScanner::process_scan_batch twice -- once with a batch containing an incompatible remove (to mark the file as "seen"), and again with a batch containing a matching incompatible add. It will fail with e.g.

Arrow(JsonError("whilst decoding field 'minValues': whilst decoding field 'c': failed to parse \"A\" as Int64"))

Expected behavior

The previously-seen remove should eliminate the add before it gets a chance to cause trouble.

Additional context

No response

scovich added the bug label Feb 21, 2025
scovich (Collaborator, Author) commented Feb 24, 2025

NOTE: The Java kernel avoids this issue by deduplicating file actions before attempting to parse add.stats, and also because its JSON parser honors selection vectors and ignores unselected rows.

The Rust kernel's JSON parser also ignores null rows, but we don't currently (have a way to) update the null mask based on the deduplication the kernel performed. We'll need to figure out how to do that. Additionally, we would want to split the deduplication into "check" and "update" passes, so that we can:

  1. Sanitize the rows of a batch (eliminate non-file-action rows, eliminate previously seen files, etc.)
  2. Parse stats and partition values of surviving rows, apply further pruning
  3. Update the "seen" set only for files that survived pruning

That way, we get the best of both worlds: pruning minimizes the cardinality of the "seen" set, but the "seen" set can still protect pruning attempts from incompatible schema changes.
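The check/update split might look roughly like this (std-only sketch with invented names; `survives_pruning` stands in for stats parsing plus predicate evaluation):

```rust
use std::collections::HashSet;

// Sketch of the proposed split: a "check" pass that consults the seen set
// without mutating it, and an "update" pass run only for rows that survived
// stats-based pruning. Illustrative names, not the kernel's actual types.
#[derive(Default)]
struct FileActionDeduplicator {
    seen: HashSet<String>,
}

impl FileActionDeduplicator {
    /// Pass 1: has this file already been returned or canceled?
    fn check(&self, path: &str) -> bool {
        self.seen.contains(path)
    }

    /// Pass 3: record only survivors, keeping the seen set small.
    fn update(&mut self, path: &str) {
        self.seen.insert(path.to_string());
    }
}

// Drives the three passes over one batch of add paths.
fn process_batch<F>(
    dedup: &mut FileActionDeduplicator,
    adds: &[&str],
    survives_pruning: F,
) -> Vec<String>
where
    F: Fn(&str) -> bool,
{
    let mut survivors = Vec::new();
    for &path in adds {
        if dedup.check(path) {
            continue; // pass 1: previously seen (canceled or duplicate)
        }
        if !survives_pruning(path) {
            continue; // pass 2: stats parsing + predicate pruned it
        }
        dedup.update(path); // pass 3: remember survivors only
        survivors.push(path.to_string());
    }
    survivors
}
```

Note the tradeoff this sketch makes explicit: a pruned file is never added to the seen set, so pruning keeps the set's cardinality down, while the seen set still shields pass 2 from files whose stats predate an incompatible schema change.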
