POC: Test DataFusion with experimental Parquet Filter Pushdown (try 2) #16562
base: main
Conversation
35d9940 to b44ba97
🤖: Benchmark completed
🤔 The results are not that far away. Queries to review:
Analysis of Q30

I took a quick look at q30 (which is now so much easier after @pepijnve broke the queries into their own files):

SELECT "SearchEngineID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth")
FROM hits
WHERE "SearchPhrase" <> ''
GROUP BY "SearchEngineID", "ClientIP"
ORDER BY c DESC
LIMIT 10;

Here is the comparison branch
Here is this branch (definitely slower):

andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ ./datafusion-cli-alamb_new_parquet_result_caching -f q30-times-10.sql | grep Elapsed
Elapsed 0.501 seconds.
Elapsed 0.488 seconds.
Elapsed 0.510 seconds.
Elapsed 0.527 seconds.
Elapsed 0.477 seconds.
Elapsed 0.480 seconds.
Elapsed 0.476 seconds.
Elapsed 0.484 seconds.
Elapsed 0.492 seconds.
Elapsed 0.489 seconds.
Elapsed 0.491 seconds.

Here is the plan:

> explain SELECT "SearchEngineID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "ClientIP" ORDER BY c DESC LIMIT 10;
+---------------+-------------------------------+
| plan_type | plan |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
| | │ SortPreservingMergeExec │ |
| | │ -------------------- │ |
| | │ c DESC, limit: 10 │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ SortExec(TopK) │ |
| | │ -------------------- │ |
| | │ c@2 DESC │ |
| | │ │ |
| | │ limit: 10 │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ ProjectionExec │ |
| | │ -------------------- │ |
| | │ ClientIP: ClientIP │ |
| | │ │ |
| | │ SearchEngineID: │ |
| | │ SearchEngineID │ |
| | │ │ |
| | │ avg(hits.ResolutionWidth):│ |
| | │ avg(hits.ResolutionWidth) │ |
| | │ │ |
| | │ c: count(Int64(1)) │ |
| | │ │ |
| | │ sum(hits.IsRefresh): │ |
| | │ sum(hits.IsRefresh) │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ AggregateExec │ |
| | │ -------------------- │ |
| | │ aggr: │ |
| | │ count(1), sum(hits │ |
| | │ .IsRefresh), avg │ |
| | │ (hits.ResolutionWidth) │ |
| | │ │ |
| | │ group_by: │ |
| | │ SearchEngineID, ClientIP │ |
| | │ │ |
| | │ mode: │ |
| | │ FinalPartitioned │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ CoalesceBatchesExec │ |
| | │ -------------------- │ |
| | │ target_batch_size: │ |
| | │ 8192 │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ RepartitionExec │ |
| | │ -------------------- │ |
| | │ partition_count(in->out): │ |
| | │ 16 -> 16 │ |
| | │ │ |
| | │ partitioning_scheme: │ |
| | │ Hash([SearchEngineID@0, │ |
| | │ ClientIP@1], 16) │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ AggregateExec │ |
| | │ -------------------- │ |
| | │ aggr: │ |
| | │ count(1), sum(hits │ |
| | │ .IsRefresh), avg │ |
| | │ (hits.ResolutionWidth) │ |
| | │ │ |
| | │ group_by: │ |
| | │ SearchEngineID, ClientIP │ |
| | │ │ |
| | │ mode: Partial │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ DataSourceExec │ |
| | │ -------------------- │ |
| | │ files: 115 │ |
| | │ format: parquet │ |
| | │ │ |
| | │ predicate: │ |
| | │ SearchPhrase != │ |
| | └───────────────────────────┘ |
| | |
+---------------+-------------------------------+

I poked around using Instruments and I think the breakdown is:
Of the 90% left,

Interestingly in this case

So my conclusion here is that the overhead of skip/scanning in the decoder takes longer than decoding the entire column and then applying a filter. My next plan will be
To start this analysis, I ran ClickBench Q30 and saved the results of evaluating the filters (boolean arrays). There are 325 parquet files corresponding to the 325 row groups in the 100 ClickBench files.
Which issue does this PR close?

Part of: Enable parquet filter pushdown (filter_pushdown) by default #3463

Rationale for this change
I am doing end to end testing of new parquet pushdown techniques.
My plan is to use this PR and analysis to guide the additional work needed to get filter pushdown on by default.
What changes are included in this PR?

Test Plan

Set filter_pushdown and reorder_filters to true.

Profiling Analysis
(in progress)
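For reference, a sketch of the session settings the test plan refers to, as they could be entered in datafusion-cli (option names assumed from DataFusion's configuration settings; verify against the current docs):

```sql
-- evaluate pushed-down filters while decoding parquet
SET datafusion.execution.parquet.pushdown_filters = true;
-- reorder pushed-down filters by estimated evaluation cost
SET datafusion.execution.parquet.reorder_filters = true;
```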