Conversation

@16pierre

This PR adds a Parquet read configuration option that replaces the previously hard-coded constant MAX_LIST_VALUE_SIZE_REWRITE, which sets an upper bound on the number of elements an InList expr may contain to be eligible for pruning.
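
To make the effect of the limit concrete, here is a minimal sketch, not DataFusion's actual code. It expands an IN list into a chain of equality comparisons only when the list has at most max_in_list elements (inclusive), and otherwise falls back to "true", meaning no pruning. The function name rewrite_in_list, the string output, and the handling of empty lists are assumptions made for illustration; the real rewrite works on expressions and column statistics.

```rust
/// Hypothetical sketch: expand `col IN (v1, v2, ...)` into
/// `col = v1 OR col = v2 ...` when the list is small enough.
/// Returns "true" (i.e. "cannot prune") when the list is empty
/// or longer than `max_in_list`.
fn rewrite_in_list(column: &str, values: &[i64], max_in_list: usize) -> String {
    // The limit is treated as inclusive: a list of exactly
    // `max_in_list` elements is still rewritten.
    if values.is_empty() || values.len() > max_in_list {
        return "true".to_string();
    }
    values
        .iter()
        .map(|v| format!("{column} = {v}"))
        .collect::<Vec<_>>()
        .join(" OR ")
}
```

With a limit of 20, a three-element list on column c1 is expanded into three comparisons joined by OR; with a limit of 2, the same list yields "true" and the predicate is not used for pruning.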

Closes #8609

@github-actions github-actions bot added common Related to common crate proto Related to proto crate datasource Changes to the datasource crate labels Jan 21, 2026
}

#[derive(Debug, Clone)]
pub struct PruningPredicateConfig {
Author

I went for a struct to make later refactors easier if we add extra args; not sure whether this is aligned with the project's standards.
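
A minimal sketch of why a struct helps here, under stated assumptions: the field name max_in_list is hypothetical, and the default of 20 is taken from the linked issue's description of the old constant. Callers receive the whole config, so adding a second knob later does not change their signatures.

```rust
/// Hypothetical shape of the config struct. The field name and the
/// default value of 20 are assumptions for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct PruningPredicateConfig {
    /// Maximum number of InList elements (inclusive) eligible for pruning.
    pub max_in_list: usize,
}

impl Default for PruningPredicateConfig {
    fn default() -> Self {
        Self { max_in_list: 20 }
    }
}

/// Callers take the whole config by reference, so a new field added to
/// `PruningPredicateConfig` later does not change this signature.
fn in_list_is_prunable(list_len: usize, config: &PruningPredicateConfig) -> bool {
    list_len > 0 && list_len <= config.max_in_list
}
```

Call sites that build the config with PruningPredicateConfig::default() keep compiling when fields are added, which is the refactoring benefit described above.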

let expected_expr = "true";
let predicate_expr =
test_build_predicate_expression(&expr, &schema, &mut RequiredColumns::new());
test_build_predicate_expression_with_pruning_predicate_config(
Author

row_group_predicate_in_list_to_many_values felt like an obvious unit test to verify that the configuration is applied. I may be missing some test coverage, though: in particular, this change has no integration tests that verify the new Parquet configuration end to end. I'm not sure whether extra coverage is required or, if so, where to implement it; happy to get guidance.

Contributor

I would put it in datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt

Author

Tried this in f6624d7; happy to add more coverage if necessary. To some degree the test exercises the formatter logic rather than the internal logic, because fmt rebuilds the pruning predicate itself:

if let (Some(pruning_predicate), _) = build_pruning_predicates(
    Some(predicate),
    self.table_schema.table_schema(),
    &predicate_creation_errors,
    &pruning_predicate_config,
) {
    let mut guarantees = pruning_predicate
        .literal_guarantees()
        .iter()
        .map(|item| format!("{item}"))
        .collect_vec();
    guarantees.sort();
    write!(
        f,
        ", pruning_predicate={}, required_guarantees=[{}]",
        pruning_predicate.predicate_expr(),
        guarantees.join(", ")
    )?;
}

@github-actions github-actions bot added documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt) labels Jan 21, 2026
Comment on lines 109 to 110
/// Maximum number of elements (inclusive) in InList exprs to be eligible for pruning
pub pruning_max_inlist_limit: usize,
Contributor

I wonder if we could just make PruningPredicateConfig a field here instead of polluting with more fields. Also can this be pub(crate) instead of pub?
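
A rough sketch of that suggestion, with hypothetical names throughout (OpenerSketch stands in for the struct under review, and max_in_list for the config field): the config is held as a single crate-private field, and code outside the crate reads it through a method.

```rust
#[derive(Debug, Clone)]
pub struct PruningPredicateConfig {
    /// Hypothetical field name, used only for this sketch.
    pub max_in_list: usize,
}

/// Hypothetical stand-in for the struct under review.
pub struct OpenerSketch {
    // One grouped, crate-private field instead of one public field per knob.
    pub(crate) pruning_predicate_config: PruningPredicateConfig,
}

impl OpenerSketch {
    pub fn new(pruning_predicate_config: PruningPredicateConfig) -> Self {
        Self { pruning_predicate_config }
    }

    /// Read access from outside the crate goes through a method, so the
    /// field layout can change without breaking the public API.
    pub fn max_in_list(&self) -> usize {
        self.pruning_predicate_config.max_in_list
    }
}
```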

Author

Is this what you had in mind? 0f6cdeb

Comment on lines +818 to +821
#[prost(oneof = "parquet_options::PruningMaxInlistLimitOpt", tags = "35")]
pub pruning_max_inlist_limit_opt: ::core::option::Option<
    parquet_options::PruningMaxInlistLimitOpt,
>,
Contributor

Also here could we make PruningPredicateConfig serializable and send that across the wire?

Author

@16pierre 16pierre Jan 22, 2026

In general, in this PR I was unsure at which level we should start wiring PruningPredicateConfig versus wiring the individual usize pruning_max_inlist_limit.

On the ParquetOptions Rust struct I tried to follow the existing standards of "flat types", so I mirrored the exact same structure on the protobuf objects.

If I understand correctly, you're suggesting we introduce an explicit protobuf type for PruningPredicateConfig in the ParquetOptions proto format, without changing the current Rust-side ParquetOptions struct?
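
For reference, the "flat" mirroring described above can be sketched as follows. The types are plain-Rust stand-ins, not the prost-generated code: the variant name PruningMaxInlistLimit, the struct names, and the fallback default of 20 are assumptions. The oneof wrapper exists so that "unset" is representable on the wire.

```rust
/// Stand-in for the prost-generated oneof enum (variant name assumed).
#[derive(Debug, Clone, PartialEq)]
pub enum PruningMaxInlistLimitOpt {
    PruningMaxInlistLimit(u64),
}

/// Proto-side options, reduced to the one field relevant here.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ProtoParquetOptions {
    pub pruning_max_inlist_limit_opt: Option<PruningMaxInlistLimitOpt>,
}

/// Rust-side options, reduced to the one field relevant here.
#[derive(Debug, Clone, PartialEq)]
pub struct ParquetOptionsSketch {
    pub pruning_max_inlist_limit: usize,
}

/// Assumed default, taken from the constant mentioned in the linked issue.
const DEFAULT_PRUNING_MAX_INLIST_LIMIT: usize = 20;

impl From<&ParquetOptionsSketch> for ProtoParquetOptions {
    fn from(opts: &ParquetOptionsSketch) -> Self {
        Self {
            pruning_max_inlist_limit_opt: Some(
                PruningMaxInlistLimitOpt::PruningMaxInlistLimit(
                    opts.pruning_max_inlist_limit as u64,
                ),
            ),
        }
    }
}

impl From<&ProtoParquetOptions> for ParquetOptionsSketch {
    fn from(proto: &ProtoParquetOptions) -> Self {
        let limit = match &proto.pruning_max_inlist_limit_opt {
            // Saturate rather than panic if the value does not fit in usize.
            Some(PruningMaxInlistLimitOpt::PruningMaxInlistLimit(v)) => {
                usize::try_from(*v).unwrap_or(usize::MAX)
            }
            // An unset oneof falls back to the assumed default.
            None => DEFAULT_PRUNING_MAX_INLIST_LIMIT,
        };
        Self { pruning_max_inlist_limit: limit }
    }
}
```

The alternative raised in the review would replace the single oneof field with a nested message for the whole config, at the cost of the proto shape no longer mirroring the flat Rust struct.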

16pierre and others added 4 commits January 22, 2026 15:09
@adriangb
Contributor

@16pierre just in case it's helpful: you can update an slt file by running cargo test --test sqllogictests -- parquet_filter_pushdown --complete and such

@16pierre
Author

16pierre commented Jan 22, 2026

it's helpful you can update an slt by running cargo test --test sqllogictests -- parquet_filter_pushdown --complete and such

Yeah, I discovered this for parquet_filter_pushdown, but unfortunately for information_schema.slt I got local errors (which I patched manually).

Thanks for the pointer (and cool tooling btw)

Labels

common Related to common crate datasource Changes to the datasource crate documentation Improvements or additions to documentation proto Related to proto crate sqllogictest SQL Logic Tests (.slt)

Development

Successfully merging this pull request may close these issues.

Config the length of list when using In_list on parquet, rather than a const of 20.

2 participants