[SPARK-51046][SQL][TEST] Reduce `numCols` in `withFilter()` to prevent `SubExprEliminationBenchmark` from failing due to a Codegen error #49938

wayneguow · 2025-02-13T16:20:04Z

What changes were proposed in this pull request?

This PR aims to reduce numCols in withFilter() to prevent SubExprEliminationBenchmark from failing due to a Codegen error.

While debugging, I found that on the current master branch the generated code contains one huge processNext method, whereas on branch-3.5 it is split into many small methods. Because the codegen logic differs, the benchmark can no longer run with its previous settings.

master branch:

protected void processNext() throws java.io.IOException {
    while (inputadapter_input_0.hasNext()) {
        ...
    }
}

3.5 branch:

public boolean eval(InternalRow i) {
    boolean value_2997 = Or_498(i);
    return !globalIsNull_498 && value_2997;
}
private boolean Or_498(InternalRow i) {
    ...
}
...
private boolean Or_xxxx(InternalRow i) {
    ...
}
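
To make the difference between the two shapes concrete, here is a minimal sketch of the splitting idea in plain Java. It is not Spark's code generator: the names SplitSketch, splitOr, entryPoint and or_N are hypothetical, and the terms are arbitrary strings standing in for generated sub-expressions. It only shows how a long OR chain can be emitted as several small helper methods plus a short entry point, instead of one large method body.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical names, not Spark's actual codegen classes. */
class SplitSketch {

    /** Emits one helper method per chunk of boolean terms (chunkSize must be positive). */
    static List<String> splitOr(List<String> terms, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<String> methods = new ArrayList<>();
        for (int start = 0, id = 0; start < terms.size(); start += chunkSize, id++) {
            int end = Math.min(start + chunkSize, terms.size());
            String body = String.join(" || ", terms.subList(start, end));
            methods.add("private boolean or_" + id + "(InternalRow i) { return " + body + "; }");
        }
        return methods;
    }

    /** Emits the small entry point, which only chains calls to the helpers. */
    static String entryPoint(int helperCount) {
        if (helperCount == 0) {
            // An empty OR chain evaluates to false.
            return "public boolean eval(InternalRow i) { return false; }";
        }
        List<String> calls = new ArrayList<>();
        for (int id = 0; id < helperCount; id++) {
            calls.add("or_" + id + "(i)");
        }
        return "public boolean eval(InternalRow i) { return " + String.join(" || ", calls) + "; }";
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<>();
        for (int c = 0; c < 10; c++) {
            terms.add("check_" + c + "(i)"); // placeholder sub-expressions
        }
        List<String> helpers = splitOr(terms, 4);
        helpers.forEach(System.out::println);
        System.out.println(entryPoint(helpers.size()));
    }
}
```

With ten terms and a chunk size of four, the sketch emits three helpers and an entry point that calls them in order, so no single method body grows with the total number of terms.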

Why are the changes needed?

If we run SubExprEliminationBenchmark:

build/sbt "sql/Test/runMain org.apache.spark.sql.execution.SubExprEliminationBenchmark"

It fails in CodeGenerator with the following error:

[info] Running benchmark: from_json as subExpr in Filter
[info] Running case: subExprElimination false, codegen: true
[info] 00:08:31.509 ERROR org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: Failed to compile the generated Java code.
[info] org.codehaus.commons.compiler.InternalCompilerException: Compiling "GeneratedClass" in File 'generated.java', Line 1, Column 1: File 'generated.java', Line 24, Column 16: Compiling "processNext()"

The root cause is:

[error] Caused by: org.codehaus.commons.compiler.InternalCompilerException: Code grows beyond 64 KB
[error] at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:699)
[error] at org.codehaus.janino.CodeContext.write(CodeContext.java:558)
[error] at org.codehaus.janino.UnitCompiler.write(UnitCompiler.java:13079)
[error] at org.codehaus.janino.UnitCompiler.store(UnitCompiler.java:12752)
[error] at org.codehaus.janino.UnitCompiler.store(UnitCompiler.java:12730)
[error] at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2742)
[error] ... 164 more
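
For context, the JVM class file format caps the bytecode of a single method at 65535 bytes, which is what Janino reports here as "Code grows beyond 64 KB". The following is a minimal sketch of how one might detect this root cause programmatically by walking an exception's cause chain. The class and method names are hypothetical, and matching on the message text is an assumption based only on the stack trace above, not a documented Janino or Spark API.

```java
/** Illustrative only: hypothetical helper, not part of Spark or Janino. */
class CodegenFailureSketch {

    // Message text copied from the stack trace in this PR description.
    static final String MARKER = "Code grows beyond 64 KB";

    /** Returns true if any exception in the cause chain carries the 64 KB message. */
    static boolean isMethodTooLarge(Throwable error) {
        int depth = 0;
        // The depth bound guards against a cyclic cause chain.
        for (Throwable t = error; t != null && depth < 64; t = t.getCause(), depth++) {
            String message = t.getMessage();
            if (message != null && message.contains(MARKER)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable root = new RuntimeException("Code grows beyond 64 KB");
        Throwable outer = new RuntimeException("Compiling \"processNext()\"", root);
        System.out.println(isMethodTooLarge(outer));
    }
}
```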

Does this PR introduce any user-facing change?

No. This only fixes a benchmark failure.

How was this patch tested?

Ran the benchmark manually.

Was this patch authored or co-authored using generative AI tooling?

No.

wayneguow · 2025-02-13T16:20:49Z

Benchmark jdk17: https://github.com/wayneguow/spark/actions/runs/13310975547
Benchmark jdk21: https://github.com/wayneguow/spark/actions/runs/13310979208

wayneguow · 2025-02-13T16:23:49Z

sql/core/benchmarks/SubExprEliminationBenchmark-results.txt

-subExprElimination true, codegen: false            2053           2079          33          0.0    20526629.8       3.5X
+subExprElimination false, codegen: true            2474           2512          44          0.0    24744107.3       1.0X
+subExprElimination false, codegen: false           2231           2246          20          0.0    22306061.2       1.1X
+subExprElimination true, codegen: true             2408           2509         100          0.0    24084091.2       1.0X
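
As a reading aid for the rows above, here is a minimal Java sketch that parses a result line and recomputes the last column. It assumes the six numeric columns are best time (ms), average time (ms), standard deviation (ms), rate (M/s), per-row time (ns) and relative speed, and that the relative value is the baseline's best time divided by the row's best time. That assumption is consistent with the three added rows, but the class and method names are hypothetical and this is not Spark's benchmark code.

```java
import java.util.Arrays;
import java.util.Locale;

/** Illustrative only: hypothetical parser for one benchmark result row. */
class BenchmarkRowSketch {
    final String name;
    final double bestMs;

    BenchmarkRowSketch(String name, double bestMs) {
        this.name = name;
        this.bestMs = bestMs;
    }

    /** Parses a row, tolerating a leading diff marker ('+' or '-'). */
    static BenchmarkRowSketch parse(String line) {
        String s = line.trim();
        if (s.startsWith("+") || s.startsWith("-")) {
            s = s.substring(1).trim();
        }
        String[] tokens = s.split("\\s+");
        int n = tokens.length;
        if (n < 7) {
            throw new IllegalArgumentException("expected a name plus six columns: " + line);
        }
        // Assumed layout: the last six tokens are the numeric columns, the rest is the case name.
        String name = String.join(" ", Arrays.copyOfRange(tokens, 0, n - 6));
        double bestMs = Double.parseDouble(tokens[n - 6]);
        return new BenchmarkRowSketch(name, bestMs);
    }

    /** Relative speed against the baseline row, formatted like the last column. */
    static String relative(BenchmarkRowSketch baseline, BenchmarkRowSketch row) {
        return String.format(Locale.ROOT, "%.1fX", baseline.bestMs / row.bestMs);
    }

    public static void main(String[] args) {
        BenchmarkRowSketch base = parse(
            "+subExprElimination false, codegen: true   2474   2512   44   0.0   24744107.3   1.0X");
        BenchmarkRowSketch row = parse(
            "+subExprElimination true, codegen: true    2408   2509  100   0.0   24084091.2   1.0X");
        System.out.println(row.name + " -> " + relative(base, row));
    }
}
```

Read this way, the case with both subExprElimination and codegen enabled is only about 3% faster than the baseline in the new results (2474 / 2408).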


What confuses me: when subExprElimination and codegen are both set to true, is there really such an obvious performance regression compared to before?

Yes, it's true. The performance regression is the actual reason we haven't made a decision yet. FYI, see:

[SPARK-50767][SQL] Remove codegen of from_json #49411

wayneguow · 2025-02-13T16:26:49Z

cc @dongjoon-hyun, could you take a look when you have some time? Maybe we can discuss this.

dongjoon-hyun · 2025-02-13T16:39:15Z

cc @panbingkun , @cloud-fan , @LuciferYang

It seems that we need to make a decision. Are we good with this codegen perf regression of from_json?

dongjoon-hyun · 2025-02-13T16:43:13Z

BTW, Thank you for your active contributions, @wayneguow !

wayneguow · 2025-02-13T16:55:10Z

BTW, Thank you for your active contributions, @wayneguow !

Happy to contribute! 😀 Keep going!

LuciferYang · 2025-02-14T03:35:55Z

cc @panbingkun , @cloud-fan , @LuciferYang

It seems that we need to make a decision. Are we good with this codegen perf regression of from_json?

@panbingkun Can #49573 be completed before the 4.0 release? If so, we can wait until the optimization is finished before refreshing this result. Additionally, if #49573 is completed, will we still need to change numCols from 500 to 330?

wayneguow added 2 commits February 13, 2025 23:02

reduce (9dde5ec)

benchmark (0ef5c8b)

github-actions bot added the SQL label Feb 13, 2025
