Skip to content

[Feature]: Matched file filtering via IN (...) doesn't scale for large file sets #4113

@ethan-tyler

Description

@ethan-tyler

Is your feature request related to a problem?

PR #4111 adds scan_files_where_matches() to identify files containing matching rows for DML rewrites (UPDATE now, MERGE/DELETE next). The second scan is restricted to matched files via a plan-time IN (...) predicate:

let file_list = valid_files
    .iter()
    .flat_map(|arr| arr.iter().flatten().map(|v| v.to_string()))
    .map(|v| lit(wrap_partition_value_in_dict(ScalarValue::Utf8(Some(v)))))
    .collect_vec();

let plan = LogicalPlanBuilder::scan("source", table_source, None)?
    .filter(col(FILE_ID_COLUMN_DEFAULT).in_list(file_list, false))?
    // ...

This builds one Expr::Literal per matched file. For rewrites touching thousands of files, that means large expression trees, slow planning, and memory overhead. This is a file level restriction implemented as a DF expression over the synthetic file-id column, not a Parquet row predicate.

Describe the solution you'd like

Scan-planning allowlist:

Instead of constructing Expr::InList with one literal per matched file, plumb an allowlist of matched file ids into the scan planning path and filter the candidate files before building Parquet FileGroups / PartitionedFiles. This avoids plan-time bloat while preserving true file pruning (don't read Parquet for excluded files).

Optional safety net:

exec-time membership filtering in DeltaScanExec (e.g., per run via split_by_file_id_runs() from #4112) to guard against mixed batch edge cases. However, this happens after Parquet reads and doesn't provide I/O savings.

Something like:

  1. Add file_id_allowlist: Option<Arc<HashSet<String>>> to DeltaScanConfig
  2. Add with_file_id_allowlist() builder method
  3. Filter in scan planning when assembling file list / file groups (before Parquet read plan)
  4. Optionally also guard in DeltaScanExec for mixed-batch correctness

Describe alternatives you've considered

Alternatives considered:

  • Semi-join against MemTable - avoids literal list, uses DF join machinery.
  • IN (subquery) - depends on DF planner behavior
  • Chunked IN (...) - still builds large expressions, just delays the cliff

Priority

Medium - Would be helpful

Additional context

Related:

Contribution

  • I'm willing to submit a pull request for this feature
  • I can help with testing this feature
  • I can help with documentation for this feature

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    Status

    No status

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions