
Conversation


@srielau srielau commented Sep 13, 2025

What changes were proposed in this pull request?

This PR introduces a comprehensive parameter substitution system that significantly expands parameter marker support and implements major performance optimizations. The current implementation of parameter markers (:param and ?) is restricted to expressions and queries. This PR enables parameter markers to work universally across all SQL constructs.

Key Changes

Enhanced Parameter Substitution Architecture:

  • Added ParameterHandler class with centralized, optimized parameter substitution logic
  • Integrated parameter substitution directly into SparkSqlParser with parser-aware error context
  • Implemented ParameterSubstitutionContext with thread-local substitution information
  • Added sophisticated position mapping for accurate error reporting in original SQL text (sketched after this list)
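
As a sketch of the rewrite-plus-record idea (illustrative only: the names below are hypothetical, not the PR's actual ParameterHandler API, and a real pre-parser must also skip markers inside string literals and comments):

// Record where each replacement happened so error positions can be
// mapped back to the original SQL text.
case class Substitution(origStart: Int, origLen: Int, newLen: Int)

def substituteNamed(sql: String, params: Map[String, String]): (String, Seq[Substitution]) = {
  val out = new StringBuilder
  val subs = scala.collection.mutable.ArrayBuffer.empty[Substitution]
  val marker = """:(\w+)""".r
  var last = 0
  for (m <- marker.findAllMatchIn(sql)) {
    out.append(sql.substring(last, m.start))   // copy text before the marker
    val replacement = params(m.group(1))       // pre-rendered literal text
    subs += Substitution(m.start, m.end - m.start, replacement.length)
    out.append(replacement)
    last = m.end
  }
  out.append(sql.substring(last))
  (out.toString, subs.toSeq)
}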

EXECUTE IMMEDIATE Integration & Error Context:

  • Enhanced ResolveExecuteImmediate with comprehensive parameter validation and local variable isolation
  • Added HybridParameterContext supporting both named and positional parameters with strict validation
  • Implemented parser-aware error context that correctly maps errors back to original SQL before parameter substitution
  • Added "firewall" mechanism preventing EXECUTE IMMEDIATE from accessing outer SQL scripting variables

Legacy Mode Support:

  • Maintained full backward compatibility with spark.sql.legacy.parameterSubstitution.constantsOnly
  • In legacy mode: parameter substitution disabled, analyzer-based binding preserved
  • In new mode: full text-based parameter substitution with enhanced validation (mode toggle sketched below)
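
A minimal sketch of switching modes (this assumes the flag can be set on a live session; the config key is the one named above):

// Legacy mode: text substitution disabled, analyzer-based binding only.
spark.conf.set("spark.sql.legacy.parameterSubstitution.constantsOnly", "true")
spark.sql("SELECT :p", Map("p" -> 5))  // still bound by the analyzer

// New mode (default): text-based substitution before parsing.
spark.conf.set("spark.sql.legacy.parameterSubstitution.constantsOnly", "false")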

Why are the changes needed?

Functional Limitations:

  1. Limited scope: Parameter markers only worked in expressions/queries, not DDL/utility statements
  2. Analyzer dependency: parameters were bound during analysis, but many literal positions are consumed directly by the parser and never reach the analyzer, making parameter expansion there impossible

The enhanced architecture solves these by:

  • Universal parameter support through pre-parsing substitution
  • Parser-aware error context maintaining accurate error positions
  • High-performance algorithms optimized for large SQL texts
  • Comprehensive validation preventing common parameter usage errors

Does this PR introduce any user-facing changes?

Yes - Significantly Expanded Functionality:

New Parameter Support:

  • DDL statements: CREATE VIEW v(c1) AS (SELECT :p)
  • Utility statements: SHOW TABLES FROM schema LIKE :pattern
  • Type definitions: DECIMAL(?, ?)
  • Table properties: TBLPROPERTIES (':key' = ':value')
  • All SQL constructs where literals are valid (runnable sketches below)
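
Runnable sketches of the new spots (the database, table, and column names here are placeholders):

// Utility statement: marker in a LIKE pattern
spark.sql("SHOW TABLES FROM mydb LIKE :pattern", Map("pattern" -> "t%"))

// DDL: marker inside a view body
spark.sql("CREATE VIEW v(c1) AS (SELECT :p)", Map("p" -> 42))

// Type definition: positional markers for precision and scale
spark.sql("SELECT CAST(col AS DECIMAL(?, ?)) FROM tbl", Array(10, 2))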

Full Backward Compatibility:

  • All existing parameter usage continues working unchanged
  • Legacy mode behavior completely preserved
  • No breaking changes to existing APIs

How was this patch tested?

Comprehensive Unit Testing:

  • Enhanced ParametersSuite with 50+ test cases covering all parameter scenarios
  • Added ParameterSubstitutionSuite for algorithm-specific testing
  • SqlScriptingExecutionSuite tests for EXECUTE IMMEDIATE isolation
  • Position mapping accuracy tests ensuring correct error reporting

Integration Testing:

  • Cross-validated legacy vs modern mode behavior consistency
  • EXECUTE IMMEDIATE parameter validation across all scenarios
  • Error context accuracy for nested parameter substitution
  • Thread safety validation for concurrent parameter usage

Manual Verification:

  • DDL parameter substitution: CREATE TABLE :name (id INT)
  • Complex parameter scenarios: EXECUTE IMMEDIATE 'SELECT typeof(:p), :p' USING 5::INT AS p
  • Error position accuracy in multi-line SQL with parameters
  • Performance benchmarking on large SQL scripts

Example Enhanced Capabilities:

// Universal parameter support
spark.sql("CREATE TABLE :table_name (id :type)", 
  Map("table_name" -> "users", "type" -> "BIGINT"))

// Enhanced validation with accurate error context
spark.sql("EXECUTE IMMEDIATE 'SELECT :p' USING 1, 2 AS p") 
// Fails with ALL_PARAMETERS_MUST_BE_NAMED, showing exact position

// Performance: handles large SQL with many parameters efficiently
spark.sql(largeSqlWithManyParameters, parameterMap) // Now O(k) instead of O(n²)
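
The O(k) claim refers to mapping a position in the substituted text back to the original using the k substitution records instead of rescanning the text. A minimal sketch of the lookup (hypothetical names, not the PR's PositionMapper; positions falling inside a replaced span would need extra handling):

// Each entry: position in the substituted text where a replacement starts,
// and the cumulative length delta accrued up to and including it.
case class Shift(substitutedPos: Int, cumulativeDelta: Int)

def mapToOriginal(pos: Int, shifts: IndexedSeq[Shift]): Int = {
  // Binary search for the last substitution at or before pos: O(log k)
  // per lookup after an O(k) build of the shift table.
  var lo = 0; var hi = shifts.length - 1; var delta = 0
  while (lo <= hi) {
    val mid = (lo + hi) / 2
    if (shifts(mid).substitutedPos <= pos) { delta = shifts(mid).cumulativeDelta; lo = mid + 1 }
    else hi = mid - 1
  }
  pos - delta
}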

Quality Assurance:

  • All scalastyle checks pass
  • No compilation warnings or deprecation issues
  • Thread safety verified through concurrent testing
  • Memory usage profiling confirms optimization effectiveness

@srielau srielau changed the title [WIP][SQL] Use Pre-processor for generalized parameter marker handling [SPARK-53573][WIP][SQL] Use Pre-processor for generalized parameter marker handling Sep 13, 2025
@srielau srielau marked this pull request as draft September 13, 2025 21:51
@srielau srielau mentioned this pull request Sep 16, 2025
@srielau srielau marked this pull request as ready for review September 30, 2025 15:04
@srielau srielau changed the title [SPARK-53573][WIP][SQL] Use Pre-processor for generalized parameter marker handling [SPARK-53573][SQL] Use Pre-processor for generalized parameter marker handling Sep 30, 2025
This commit implements a comprehensive parameter substitution system for Spark SQL
that provides detailed error context for EXECUTE IMMEDIATE statements, similar to
how views handle errors.

Key features:
- Pre-parser approach with position mapping for accurate error reporting
- Thread-local parameter context for parser-aware error handling
- Support for both named and positional parameters across all SQL APIs
- Optimized position mapping algorithm (O(n²) → O(k) where k = substitutions)
- Comprehensive test coverage including edge cases and error scenarios
- Backward compatibility with legacy parameter substitution mode

The implementation includes:
- ParameterHandler for unified parameter processing
- PositionMapper for efficient error position translation
- LiteralToSqlConverter for type-safe SQL generation (sketched below)
- Integration with SparkSqlParser, SparkSession, and EXECUTE IMMEDIATE
- Enhanced error messages showing both outer and inner query context

This addresses the user request for detailed error context in EXECUTE IMMEDIATE
statements while maintaining performance and compatibility.
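
To make the LiteralToSqlConverter bullet concrete: it renders a pre-evaluated literal back into SQL text. A hypothetical sketch of the idea (not the actual implementation; escaping embedded quotes is what keeps the substituted text inside the literal):

// Hypothetical sketch: render a typed value as SQL literal text.
def literalToSql(value: Any): String = value match {
  case null                    => "NULL"
  case s: String               => "'" + s.replace("'", "''") + "'"  // escape quotes
  case b: Boolean              => b.toString
  case i: Int                  => i.toString
  case l: Long                 => l.toString + "L"
  case d: java.math.BigDecimal => d.toPlainString + "BD"
  case t: java.sql.Timestamp   => "TIMESTAMP'" + t.toString + "'"
  case dt: java.sql.Date       => "DATE'" + dt.toString + "'"
  case other => throw new IllegalArgumentException(s"unsupported literal: $other")
}
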
@dtenedor dtenedor left a comment


Did another full review -- this approach is looking a lot simpler without the callbacks :)

@srielau srielau force-pushed the preparser-squashed branch from 44ff2f0 to d46165d on October 11, 2025 00:11

@dtenedor dtenedor left a comment


Thanks for all the work on this, it will make our SQL language more usable.

@srielau srielau force-pushed the preparser-squashed branch from d46165d to 289f6c0 on October 11, 2025 06:10
: STRING_LITERAL
| {!double_quoted_identifiers}? DOUBLEQUOTED_STRING
: stringLitWithoutMarker #stringLiteralInContext
| parameterMarker #parameterStringValue
Contributor


Correct me if I'm wrong: the pre-parser only allows parameter markers in integerValue and stringLit, and we still need the previous framework to bind parameters at analysis time for arbitrary expressions.

Contributor Author


You are wrong :-) These grammar changes provide coverage for all (hopefully) places where literals are allowed.
For example: wherever we allow a DOUBLE or DECIMAL we also allow INTEGER, and thus have coverage for parameter markers. Wherever we allow a DATE/TIMESTAMP we also allow a STRING, and thus allow parameter markers. And of course expressions include literals, so any generic expression already allows parameter markers in the grammar.

Note that I have specifically added test cases for "interesting" places. We can add more if you feel there is a gap.

* Collect information about a positional parameter in a literal context. Note: The return value
* is not used; this method operates via side effects.
*/
override def visitPosParameterLiteral(ctx: PosParameterLiteralContext): AnyRef =
Contributor


A high-level question: for parameter markers that already work today (in the expression parser rule), do we need to handle them in the pre-parser? Or do we leave them and still let them be bound at analysis time?

Contributor Author


It is easiest to let them all be handled in the pre-parser. The original code remains for now for safety, but I see no reason to keep it. If at some point we want to move into plan sharing, this code gets more interesting again. But then we would introduce templates/markers that weren't even given by the user.

stackTrace: Option[Array[StackTraceElement]] = None,
pysparkErrorContext: Option[(String, String)] = None) {
pysparkErrorContext: Option[(String, String)] = None,
parameterSubstitutionInfo: Option[ParameterSubstitutionInfo] = None) {
Contributor


why do we need this new field? In the pre-parser, we can update the sqlText and other position/index fields and produce a new Origin.

Contributor


I suggested that Serge add this new field to avoid creating a new ThreadLocal. If we can produce the new Origin and store it in some existing place, e.g. CurrentOrigin, instead of this new field, that sounds OK too -- up to you guys.

* @note Only supports Literal expressions - all parameter values must be pre-evaluated.
* @see [[ParameterHandler]] for the main parameter handling entry point
*/
object LiteralToSqlConverter {
Contributor


I'm a bit worried about putting complex SQL strings into the original SQL statement. The previous way of parameter binding happens at the analysis phase and changes the plan tree directly, which is 100% safe from SQL injection. The pre-parser accepts parameter markers in two more places: integer literals and string literals. I think it's better to handle only these two new places in the pre-parser, and still use the previous analysis-time parameter binding for the others. It's safe to do SQL string manipulation only for integer and string literals, but arbitrary SQL generated from complex literals looks risky to me.

cc @dtenedor as well

Contributor Author


I do not understand the issue. A literal is a literal. E.g., is TIMESTAMP'2025-12-01 00:00:00' a complex literal?
Where do we draw the line, and why?
Can you give an example of SQL injection through a literal?
Or are you worried that what is passed here is not actually a literal?

Contributor


@srielau @cloud-fan Are we re-parsing the injected literal string or just using it for error message generation?

import sessionHolder.session.toRichColumn

private[connect] def parser = session.sessionState.sqlParser
private val parameterHandler = new ParameterHandler()
Contributor


shall we create a new instance for each SQL?

Contributor Author


The handler is stateless. What would we gain from creating new instances?

val parsedPlan = if (args.nonEmpty) {
// Use parameter context directly for parsing
val paramContext = PositionalParameterContext(args.map(lit(_).expr).toSeq)
val parsed = sessionState.sqlParser.parsePlanWithParameters(sqlText, paramContext)
Contributor


when shall we call parsePlanWithParameters and when shall we create ParameterHandler?

Contributor Author


SparkSession needs parsePlanWithParameters.
Spark Connect has a more complex protocol and needs the ParameterHandler separately.
It also allows for modular unit testing.
