r7raul1984

What changes were proposed in this pull request?

This PR pushes the predicate of a count_if call down into a Filter below the Aggregate, so that rows which can never be counted are dropped before aggregation and less data has to be processed.

spark.sql("create table t1(a int, b int, c int) using parquet")
spark.sql("select count_if(a <>1) from t1").explain("cost") 

Current:

== Optimized Logical Plan ==
Aggregate [count(if (NOT _common_expr_0#6) null else _common_expr_0#6) AS count_if((NOT (a = 1)))#4L], Statistics(sizeInBytes=16.0 B, rowCount=1)
+- Project [NOT (a#0 = 1) AS _common_expr_0#6], Statistics(sizeInBytes=1.0 B)
   +- Relation spark_catalog.default.t1[a#0,b#1,c#2] parquet, Statistics(sizeInBytes=0.0 B) 

After this PR:

== Optimized Logical Plan ==
Aggregate [count(if (NOT _common_expr_2#22) null else _common_expr_2#22) AS count_if((NOT (a = 1)))#21L], Statistics(sizeInBytes=16.0 B, rowCount=1)
+- Project [NOT (a#3 = 1) AS _common_expr_2#22], Statistics(sizeInBytes=1.0 B)
   +- Filter (isnotnull(a#3) AND NOT (a#3 = 1)), Statistics(sizeInBytes=1.0 B)
      +- Relation spark_catalog.default.t1[a#3,b#4,c#5] parquet, Statistics(sizeInBytes=0.0 B) 
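
To see why the added Filter does not change the result, here is a minimal sketch in plain Python. It is not Spark code and not the implementation in this PR; the helper names `pred` and `count_if` and the sample rows are made up for illustration. It models SQL's three-valued logic with `None` standing in for NULL, which is also why the pushed-down condition carries `isnotnull(a)`.

```python
from typing import Optional

def pred(a: Optional[int]) -> Optional[bool]:
    """Hypothetical model of SQL `a <> 1`: a NULL input yields NULL (None)."""
    return None if a is None else a != 1

def count_if(values):
    """Count rows whose predicate is TRUE; FALSE and NULL rows are skipped."""
    return sum(1 for a in values if pred(a) is True)

# Made-up sample column `a`, including NULLs.
rows = [1, 2, None, 3, 1, None, 5]

# Equivalent of: Filter (isnotnull(a) AND NOT (a = 1)) below the Aggregate.
filtered = [a for a in rows if a is not None and not (a == 1)]

before = count_if(rows)      # aggregate over every row
after = count_if(filtered)   # aggregate over the pre-filtered rows only

print(before, after, len(filtered))
```

Every row removed by the filter is one that count_if would have skipped anyway, so both aggregates agree while the second one sees fewer rows.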

Why are the changes needed?

Push down the conditional filter expressions in COUNT_IF to improve query performance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Oct 9, 2025
*
* Rewritten Query:
* {{{
* SELECT count_if(c1 > 1) FROM ut1 where c1 > 1
Contributor

Why not rewrite it to SELECT count(*) FROM ut1 WHERE c1 > 1?

Author

When there is only one count_if, the query can indeed be rewritten as SELECT count(*) FROM ut1 WHERE c1 > 1. However, to keep the rule's logic uniform, I don't think it is worth adding a special case that checks how many count_if expressions the query contains.
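
The trade-off discussed here can be sketched in plain Python. This is an illustration only: the thread does not say how the rule treats several count_if calls, so filtering on the OR of their predicates is an assumption, and `count_true`, `gt1`, `lt1` and the sample rows are hypothetical names and data.

```python
# Made-up sample column `c1`; None stands in for NULL.
rows = [0, 1, 2, 3, None, 5]

def count_true(values, p):
    """Model of count_if: count non-NULL rows for which p is true."""
    return sum(1 for a in values if a is not None and p(a))

gt1 = lambda a: a > 1   # c1 > 1
lt1 = lambda a: a < 1   # c1 < 1 (second predicate, added for illustration)

# One count_if: after filtering on its predicate, count(*) gives the same number.
only_gt1 = [a for a in rows if a is not None and gt1(a)]
single_count_if = count_true(rows, gt1)
single_count_star = len(only_gt1)

# Two count_ifs: ASSUMED pushed-down filter is the OR of both predicates.
either = [a for a in rows if a is not None and (gt1(a) or lt1(a))]
two_counts = (count_true(either, gt1), count_true(either, lt1))
count_star_on_either = len(either)

print(single_count_if, single_count_star, two_counts, count_star_on_either)
```

With a single predicate, count(*) over the filtered rows matches count_if. Under the OR assumption, the filtered rows satisfy at least one predicate but not necessarily each one, so every count_if still has to evaluate its own condition. Keeping count_if in both cases means the rewrite needs only one code path.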
