Add Parquet read pruning configuration for max elements in inList #19928
}

#[derive(Debug, Clone)]
pub struct PruningPredicateConfig {
I went with a struct to make later refactors easier if we add more arguments; not sure whether this matches the project's conventions.
let expected_expr = "true";
let predicate_expr =
    test_build_predicate_expression(&expr, &schema, &mut RequiredColumns::new());
    test_build_predicate_expression_with_pruning_predicate_config(
row_group_predicate_in_list_to_many_values felt like the obvious unit test to verify that the configuration is applied, though I may be missing some coverage. In particular, this change has no integration tests that verify the new Parquet configuration end to end. I'm not sure whether extra coverage is required and, if so, where to implement it; happy to get guidance.
I would put it in datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt
Tried this in f6624d7; happy to add more coverage if necessary. To some degree the test exercises the formatter rather than the internal logic, because fmt rebuilds the pruning predicate itself:
datafusion/datafusion/datasource-parquet/src/source.rs
Lines 650 to 667 in 118dc6f
if let (Some(pruning_predicate), _) = build_pruning_predicates(
    Some(predicate),
    self.table_schema.table_schema(),
    &predicate_creation_errors,
    &pruning_predicate_config,
) {
    let mut guarantees = pruning_predicate
        .literal_guarantees()
        .iter()
        .map(|item| format!("{item}"))
        .collect_vec();
    guarantees.sort();
    write!(
        f,
        ", pruning_predicate={}, required_guarantees=[{}]",
        pruning_predicate.predicate_expr(),
        guarantees.join(", ")
    )?;
/// Maximum number of elements (inclusive) in InList exprs to be eligible for pruning
pub pruning_max_inlist_limit: usize,
I wonder if we could just make PruningPredicateConfig a field here instead of polluting the struct with more fields. Also, can this be pub(crate) instead of pub?
Is this what you had in mind? 0f6cdeb
#[prost(oneof = "parquet_options::PruningMaxInlistLimitOpt", tags = "35")]
pub pruning_max_inlist_limit_opt: ::core::option::Option<
    parquet_options::PruningMaxInlistLimitOpt,
>,
Also here could we make PruningPredicateConfig serializable and send that across the wire?
In general, I hesitated in this PR over the level at which we should start wiring PruningPredicateConfig rather than the individual usize pruning_max_inlist_limit.
On the ParquetOptions Rust struct I tried to follow the existing convention of "flat types", so I mirrored the same structure in the protobuf objects.
If I understand correctly, you're suggesting we introduce an explicit protobuf type for PruningPredicateConfig in the ParquetOptions proto format, without changing the current Rust-side ParquetOptions struct?
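To make the "flat types" option concrete, here is a minimal sketch of how a flat optional proto field could map to and from a plain usize on the Rust side. Everything in it is an assumption for illustration: the types are hand-written stand-ins (not prost-generated code and not the PR's actual definitions), the u64 wire type is a guess, and the fallback value of 20 is assumed to match the former hard-coded constant.

```rust
/// Hand-written stand-in for a prost-generated `oneof` wrapper (assumed shape).
#[derive(Debug, Clone, PartialEq)]
pub enum PruningMaxInlistLimitOpt {
    PruningMaxInlistLimit(u64),
}

/// Hypothetical wire-side options: one flat, optional field.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ParquetOptionsProto {
    pub pruning_max_inlist_limit_opt: Option<PruningMaxInlistLimitOpt>,
}

/// Hypothetical Rust-side options: a plain `usize` field.
#[derive(Debug, Clone, PartialEq)]
pub struct ParquetOptions {
    pub pruning_max_inlist_limit: usize,
}

/// Assumed fallback used when the field is absent on the wire.
const ASSUMED_DEFAULT_LIMIT: usize = 20;

impl From<&ParquetOptions> for ParquetOptionsProto {
    fn from(opts: &ParquetOptions) -> Self {
        Self {
            pruning_max_inlist_limit_opt: Some(
                PruningMaxInlistLimitOpt::PruningMaxInlistLimit(
                    opts.pruning_max_inlist_limit as u64,
                ),
            ),
        }
    }
}

impl From<&ParquetOptionsProto> for ParquetOptions {
    fn from(proto: &ParquetOptionsProto) -> Self {
        // An absent field falls back to the assumed default.
        let limit = match &proto.pruning_max_inlist_limit_opt {
            Some(PruningMaxInlistLimitOpt::PruningMaxInlistLimit(v)) => *v as usize,
            None => ASSUMED_DEFAULT_LIMIT,
        };
        Self { pruning_max_inlist_limit: limit }
    }
}
```

The alternative raised in the review, a dedicated proto message for PruningPredicateConfig, would replace the flat field with a nested message; the conversion code would look much the same but would grow in one place as more pruning settings are added.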
Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
@16pierre just in case it's helpful you can update an slt by running
Yeah I discovered this for Thanks for the pointer (and cool tooling btw)
This PR adds a Parquet read configuration option that replaces the previously hard-coded constant MAX_LIST_VALUE_SIZE_REWRITE, which sets an upper bound on the number of elements in InList exprs when pruning is applied.

Closes #8609
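
As a rough illustration of what such a limit controls, the sketch below builds a textual min/max pruning predicate for an IN list and falls back to "true" (no pruning possible) when the list is longer than the configured limit. It is a simplified model rather than the PR's code: the field name max_in_list, the function in_list_pruning_expr, the string-based output, and the default of 20 are all assumptions made for the example.

```rust
/// Hypothetical stand-in for the PR's config struct; the field name is assumed.
#[derive(Debug, Clone)]
pub struct PruningPredicateConfig {
    /// Maximum number of IN-list elements (inclusive) eligible for pruning.
    pub max_in_list: usize,
}

impl Default for PruningPredicateConfig {
    fn default() -> Self {
        // Assumed to mirror the value of the former hard-coded constant.
        Self { max_in_list: 20 }
    }
}

/// Builds a textual min/max pruning predicate for `col IN (values)`.
/// Returns "true" (cannot prune) when the list is empty or exceeds the limit.
pub fn in_list_pruning_expr(
    col: &str,
    values: &[i64],
    cfg: &PruningPredicateConfig,
) -> String {
    if values.is_empty() || values.len() > cfg.max_in_list {
        return "true".to_string();
    }
    // Each value may be present only if it lies within the column's [min, max].
    values
        .iter()
        .map(|v| format!("({col}_min <= {v} AND {v} <= {col}_max)"))
        .collect::<Vec<_>>()
        .join(" OR ")
}
```

With a limit of 2, a two-element list still yields a min/max predicate, while a three-element list degrades to "true". That is the same behaviour the unit test in this PR checks through its expected_expr of "true".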