Description
Is your feature request related to a problem?
PR #4111 adds scan_files_where_matches() to identify files containing matching rows for DML rewrites (UPDATE now, MERGE/DELETE next). The second scan is restricted to matched files via a plan-time IN (...) predicate:
```rust
let file_list = valid_files
    .iter()
    .flat_map(|arr| arr.iter().flatten().map(|v| v.to_string()))
    .map(|v| lit(wrap_partition_value_in_dict(ScalarValue::Utf8(Some(v)))))
    .collect_vec();

let plan = LogicalPlanBuilder::scan("source", table_source, None)?
    .filter(col(FILE_ID_COLUMN_DEFAULT).in_list(file_list, false))?
    // ...
```

This builds one `Expr::Literal` per matched file. For rewrites touching thousands of files, that means large expression trees, slow planning, and memory overhead. It is also a file-level restriction implemented as a DF expression over the synthetic file-id column, not a Parquet row predicate.
Describe the solution you'd like
Scan-planning allowlist:
Instead of constructing Expr::InList with one literal per matched file, plumb an allowlist of matched file ids into the scan planning path and filter the candidate files before building Parquet FileGroups / PartitionedFiles. This avoids plan-time bloat while preserving true file pruning (don't read Parquet for excluded files).
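A minimal sketch of that planning-time filtering, assuming the matched file ids arrive as a `HashSet<String>` and that candidate files are keyed by the same id that populates the synthetic file-id column (the `CandidateFile` type and function name here are illustrative, not existing delta-rs APIs):

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Illustrative stand-in for whatever the scan planner iterates over
/// before building PartitionedFiles / FileGroups (e.g. Add actions).
struct CandidateFile {
    file_id: String,
}

/// Drop candidate files that are not in the allowlist before any Parquet
/// read plan is built, so excluded files are never opened at all.
fn apply_file_id_allowlist(
    candidates: Vec<CandidateFile>,
    allowlist: Option<&Arc<HashSet<String>>>,
) -> Vec<CandidateFile> {
    match allowlist {
        // No allowlist configured: keep today's behavior untouched.
        None => candidates,
        Some(allowed) => candidates
            .into_iter()
            .filter(|f| allowed.contains(&f.file_id))
            .collect(),
    }
}
```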
Optional safety net:
exec-time membership filtering in DeltaScanExec (e.g., per run via split_by_file_id_runs() from #4112) to guard against mixed-batch edge cases. However, this happens after the Parquet read and doesn't provide I/O savings.
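For reference, a rough sketch of what such a guard could look like, assuming the file-id column is materialized as a plain Utf8 column in each batch (delta-rs dictionary-encodes it, and a run-based variant built on split_by_file_id_runs() would evaluate membership once per run instead of per row):

```rust
use std::collections::HashSet;
use std::sync::Arc;

use arrow::array::{Array, BooleanArray, StringArray};
use arrow::compute::filter_record_batch;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Keep only the rows of `batch` whose file id is in the allowlist.
/// This runs after the Parquet read, so it is a correctness guard only,
/// not an I/O saving.
fn filter_batch_by_file_id(
    batch: &RecordBatch,
    file_id_column: &str,
    allowlist: &Arc<HashSet<String>>,
) -> Result<RecordBatch, ArrowError> {
    let idx = batch.schema().index_of(file_id_column)?;
    let file_ids = batch
        .column(idx)
        .as_any()
        .downcast_ref::<StringArray>()
        .ok_or_else(|| ArrowError::SchemaError("file id column is not Utf8".to_string()))?;

    // Row-level membership mask; a run-based variant would compute this once per run.
    let mask: BooleanArray = file_ids
        .iter()
        .map(|v| Some(v.map(|id| allowlist.contains(id)).unwrap_or(false)))
        .collect();

    filter_record_batch(batch, &mask)
}
```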
Something like (sketched below):
- Add a `file_id_allowlist: Option<Arc<HashSet<String>>>` field to `DeltaScanConfig`
- Add a `with_file_id_allowlist()` builder method
- Filter in scan planning when assembling the file list / file groups (before the Parquet read plan)
- Optionally also guard in `DeltaScanExec` for mixed-batch correctness
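In terms of API shape, the additions could look roughly like this (a sketch only; the existing fields and builder methods of DeltaScanConfig are elided, and the derive/visibility details are assumptions):

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Sketch of the proposed additions only; the real DeltaScanConfig has
/// other fields and builder methods that are elided here.
#[derive(Clone, Default)]
pub struct DeltaScanConfig {
    // ... existing fields ...
    /// When set, only files whose synthetic file id is in this set are
    /// planned for reading; None preserves current behavior.
    pub file_id_allowlist: Option<Arc<HashSet<String>>>,
}

impl DeltaScanConfig {
    /// Builder-style setter, mirroring the existing with_* methods.
    pub fn with_file_id_allowlist(mut self, allowlist: Arc<HashSet<String>>) -> Self {
        self.file_id_allowlist = Some(allowlist);
        self
    }
}
```

The DML rewrite would then pass the matched ids once, e.g. `config.with_file_id_allowlist(Arc::new(matched_ids))`, instead of expanding them into an `IN (...)` expression at plan time.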
Describe alternatives you've considered
Alternatives considered:
- Semi-join against a MemTable - avoids the literal list and uses DF join machinery (see the sketch after this list)
- IN (subquery) - depends on DF planner behavior
- Chunked IN (...) - still builds large expressions, just delays the cliff
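For completeness, the semi-join alternative could be sketched like this (illustrative only; it glosses over the dictionary encoding of the file-id column and assumes a recent DataFusion LogicalPlanBuilder API):

```rust
use std::sync::Arc;

use datafusion::arrow::array::StringArray;
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::{provider_as_source, MemTable};
use datafusion::error::Result;
use datafusion::logical_expr::{JoinType, LogicalPlan, LogicalPlanBuilder};

/// Restrict `source_plan` to rows whose file id appears in `matched_file_ids`
/// via a LeftSemi join against a one-column in-memory table, instead of an
/// IN (...) list with one literal per file.
fn semi_join_on_file_id(
    source_plan: LogicalPlan,
    file_id_column: &str,
    matched_file_ids: Vec<String>,
) -> Result<LogicalPlan> {
    // One-column MemTable holding the matched file ids.
    let schema = Arc::new(Schema::new(vec![Field::new(
        file_id_column,
        DataType::Utf8,
        false,
    )]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(StringArray::from(matched_file_ids))],
    )?;
    let matched = MemTable::try_new(schema, vec![vec![batch]])?;
    let matched_scan =
        LogicalPlanBuilder::scan("matched_files", provider_as_source(Arc::new(matched)), None)?
            .build()?;

    // LeftSemi keeps source rows with a matching file id and adds no columns.
    LogicalPlanBuilder::from(source_plan)
        .join(
            matched_scan,
            JoinType::LeftSemi,
            (vec![file_id_column], vec![file_id_column]),
            None,
        )?
        .build()
}
```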
Priority
Medium - Would be helpful
Additional context
Related:
- refactor: move files scan to separate function #4111
- fix(datafusion): handle coalesced multi-file batches in next-scan #4112
Contribution
- I'm willing to submit a pull request for this feature
- I can help with testing this feature
- I can help with documentation for this feature