
Conversation


@srielau srielau commented Sep 13, 2025

What changes were proposed in this pull request?

This PR introduces a comprehensive parameter substitution system that significantly expands parameter marker support and implements major performance optimizations. The current implementation of parameter markers (:param and ?) is restricted to expressions and queries. This PR enables parameter markers to work universally across all SQL constructs.

Key Changes

Enhanced Parameter Substitution Architecture:

  • Added ParameterHandler class with centralized, optimized parameter substitution logic
  • Integrated parameter substitution directly into SparkSqlParser with parser-aware error context
  • Implemented ParameterSubstitutionContext with thread-local substitution information
  • Added sophisticated position mapping for accurate error reporting in original SQL text (sketched after this list)
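
As a sketch of the rewrite-plus-record idea (illustrative only: the names below are hypothetical, not the PR's actual ParameterHandler API, and a real pre-parser must also skip markers inside string literals and comments):

// Record where each replacement happened so error positions can be
// mapped back to the original SQL text.
case class Substitution(origStart: Int, origLen: Int, newLen: Int)

def substituteNamed(sql: String, params: Map[String, String]): (String, Seq[Substitution]) = {
  val out = new StringBuilder
  val subs = scala.collection.mutable.ArrayBuffer.empty[Substitution]
  val marker = """:(\w+)""".r
  var last = 0
  for (m <- marker.findAllMatchIn(sql)) {
    out.append(sql.substring(last, m.start))   // copy text before the marker
    val replacement = params(m.group(1))       // pre-rendered literal text
    subs += Substitution(m.start, m.end - m.start, replacement.length)
    out.append(replacement)
    last = m.end
  }
  out.append(sql.substring(last))
  (out.toString, subs.toSeq)
}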

EXECUTE IMMEDIATE Integration & Error Context:

  • Enhanced ResolveExecuteImmediate with comprehensive parameter validation and local variable isolation
  • Added HybridParameterContext supporting both named and positional parameters with strict validation
  • Implemented parser-aware error context that correctly maps errors back to original SQL before parameter substitution
  • Added "firewall" mechanism preventing EXECUTE IMMEDIATE from accessing outer SQL scripting variables

Legacy Mode Support:

  • Maintained full backward compatibility with spark.sql.legacy.parameterSubstitution.constantsOnly
  • In legacy mode: parameter substitution disabled, analyzer-based binding preserved
  • In new mode: full text-based parameter substitution with enhanced validation (mode toggle sketched below)
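
A minimal sketch of switching modes (this assumes the flag can be set on a live session; the config key is the one named above):

// Legacy mode: text substitution disabled, analyzer-based binding only.
spark.conf.set("spark.sql.legacy.parameterSubstitution.constantsOnly", "true")
spark.sql("SELECT :p", Map("p" -> 5))  // still bound by the analyzer

// New mode (default): text-based substitution before parsing.
spark.conf.set("spark.sql.legacy.parameterSubstitution.constantsOnly", "false")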

Why are the changes needed?

Functional Limitations:

  1. Limited scope: Parameter markers only worked in expressions/queries, not DDL/utility statements
  2. Analyzer dependency: parameters were bound during analysis, but many literal positions are consumed directly by the parser and never reach the analyzer, making parameter expansion there impossible

The enhanced architecture solves these by:

  • Universal parameter support through pre-parsing substitution
  • Parser-aware error context maintaining accurate error positions
  • High-performance algorithms optimized for large SQL texts
  • Comprehensive validation preventing common parameter usage errors

Does this PR introduce any user-facing changes?

Yes - Significantly Expanded Functionality:

New Parameter Support:

  • DDL statements: CREATE VIEW v(c1) AS (SELECT :p)
  • Utility statements: SHOW TABLES FROM schema LIKE :pattern
  • Type definitions: DECIMAL(?, ?)
  • Table properties: TBLPROPERTIES (':key' = ':value')
  • All SQL constructs where literals are valid (runnable sketches below)
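
Runnable sketches of the new spots (the database, table, and column names here are placeholders):

// Utility statement: marker in a LIKE pattern
spark.sql("SHOW TABLES FROM mydb LIKE :pattern", Map("pattern" -> "t%"))

// DDL: marker inside a view body
spark.sql("CREATE VIEW v(c1) AS (SELECT :p)", Map("p" -> 42))

// Type definition: positional markers for precision and scale
spark.sql("SELECT CAST(col AS DECIMAL(?, ?)) FROM tbl", Array(10, 2))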

Full Backward Compatibility:

  • All existing parameter usage continues working unchanged
  • Legacy mode behavior completely preserved
  • No breaking changes to existing APIs

How was this patch tested?

Comprehensive Unit Testing:

  • Enhanced ParametersSuite with 50+ test cases covering all parameter scenarios
  • Added ParameterSubstitutionSuite for algorithm-specific testing
  • SqlScriptingExecutionSuite tests for EXECUTE IMMEDIATE isolation
  • Position mapping accuracy tests ensuring correct error reporting

Integration Testing:

  • Cross-validated legacy vs modern mode behavior consistency
  • EXECUTE IMMEDIATE parameter validation across all scenarios
  • Error context accuracy for nested parameter substitution
  • Thread safety validation for concurrent parameter usage

Manual Verification:

  • DDL parameter substitution: CREATE TABLE :name (id INT)
  • Complex parameter scenarios: EXECUTE IMMEDIATE 'SELECT typeof(:p), :p' USING 5::INT AS p
  • Error position accuracy in multi-line SQL with parameters
  • Performance benchmarking on large SQL scripts

Example Enhanced Capabilities:

// Universal parameter support
spark.sql("CREATE TABLE :table_name (id :type)", 
  Map("table_name" -> "users", "type" -> "BIGINT"))

// Enhanced validation with accurate error context
spark.sql("EXECUTE IMMEDIATE 'SELECT :p' USING 1, 2 AS p") 
// Fails with ALL_PARAMETERS_MUST_BE_NAMED, showing exact position

// Performance: handles large SQL with many parameters efficiently
spark.sql(largeSqlWithManyParameters, parameterMap) // Now O(k) instead of O(n²)
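
The O(k) claim refers to mapping a position in the substituted text back to the original using the k substitution records instead of rescanning the text. A minimal sketch of the lookup (hypothetical names, not the PR's PositionMapper; positions falling inside a replaced span would need extra handling):

// Each entry: position in the substituted text where a replacement starts,
// and the cumulative length delta accrued up to and including it.
case class Shift(substitutedPos: Int, cumulativeDelta: Int)

def mapToOriginal(pos: Int, shifts: IndexedSeq[Shift]): Int = {
  // Binary search for the last substitution at or before pos: O(log k)
  // per lookup after an O(k) build of the shift table.
  var lo = 0; var hi = shifts.length - 1; var delta = 0
  while (lo <= hi) {
    val mid = (lo + hi) / 2
    if (shifts(mid).substitutedPos <= pos) { delta = shifts(mid).cumulativeDelta; lo = mid + 1 }
    else hi = mid - 1
  }
  pos - delta
}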

Quality Assurance:

  • All scalastyle checks pass
  • No compilation warnings or deprecation issues
  • Thread safety verified through concurrent testing
  • Memory usage profiling confirms optimization effectiveness

@srielau srielau changed the title [WIP][SQL] Use Pre-processor for generalized parameter marker handling [SPARK-53573][WIP][SQL] Use Pre-processor for generalized parameter marker handling Sep 13, 2025
@srielau srielau marked this pull request as draft September 13, 2025 21:51
@srielau srielau mentioned this pull request Sep 16, 2025
@srielau srielau marked this pull request as ready for review September 30, 2025 15:04
@srielau srielau changed the title [SPARK-53573][WIP][SQL] Use Pre-processor for generalized parameter marker handling [SPARK-53573][SQL] Use Pre-processor for generalized parameter marker handling Sep 30, 2025
This commit implements a comprehensive parameter substitution system for Spark SQL
that provides detailed error context for EXECUTE IMMEDIATE statements, similar to
how views handle errors.

Key features:
- Pre-parser approach with position mapping for accurate error reporting
- Thread-local parameter context for parser-aware error handling
- Support for both named and positional parameters across all SQL APIs
- Optimized position mapping algorithm (O(n²) → O(k) where k = substitutions)
- Comprehensive test coverage including edge cases and error scenarios
- Backward compatibility with legacy parameter substitution mode

The implementation includes:
- ParameterHandler for unified parameter processing
- PositionMapper for efficient error position translation
- LiteralToSqlConverter for type-safe SQL generation (sketched below)
- Integration with SparkSqlParser, SparkSession, and EXECUTE IMMEDIATE
- Enhanced error messages showing both outer and inner query context

This addresses the user request for detailed error context in EXECUTE IMMEDIATE
statements while maintaining performance and compatibility.
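
To make the LiteralToSqlConverter bullet concrete: it renders a pre-evaluated literal back into SQL text. A hypothetical sketch of the idea (not the actual implementation; escaping embedded quotes is what keeps the substituted text inside the literal):

// Hypothetical sketch: render a typed value as SQL literal text.
def literalToSql(value: Any): String = value match {
  case null                    => "NULL"
  case s: String               => "'" + s.replace("'", "''") + "'"  // escape quotes
  case b: Boolean              => b.toString
  case i: Int                  => i.toString
  case l: Long                 => l.toString + "L"
  case d: java.math.BigDecimal => d.toPlainString + "BD"
  case t: java.sql.Timestamp   => "TIMESTAMP'" + t.toString + "'"
  case dt: java.sql.Date       => "DATE'" + dt.toString + "'"
  case other => throw new IllegalArgumentException(s"unsupported literal: $other")
}
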
@dtenedor dtenedor left a comment


Did another full review -- this approach is looking a lot simpler without the callbacks :)

@srielau srielau force-pushed the preparser-squashed branch from 44ff2f0 to d46165d on October 11, 2025 00:11

@dtenedor dtenedor left a comment


Thanks for all the work on this, it will make our SQL language more usable.

@srielau srielau force-pushed the preparser-squashed branch from d46165d to 289f6c0 on October 11, 2025 06:10
: STRING_LITERAL
| {!double_quoted_identifiers}? DOUBLEQUOTED_STRING
: stringLitWithoutMarker #stringLiteralInContext
| parameterMarker #parameterStringValue
Contributor


Correct me if I'm wrong: the pre-parser only allows parameter markers in integerValue and stringLit, and we still need the previous framework to bind parameters at analysis time for arbitrary expressions.

Contributor Author


You are wrong :-) These grammar changes provide coverage for all (hopefully) places where literals are allowed.
For example: wherever we allow a DOUBLE or DECIMAL we also allow INTEGER, and thus have coverage for parameter markers. Wherever we allow a DATE/TIMESTAMP we also allow a STRING, and thus allow parameter markers. And of course expressions include literals, so any generic expression already allows parameter markers in the grammar.

Note that I have specifically added test cases for "interesting" places. We can add more if you feel there is a gap.

* Collect information about a positional parameter in a literal context. Note: The return value
* is not used; this method operates via side effects.
*/
override def visitPosParameterLiteral(ctx: PosParameterLiteralContext): AnyRef =
Contributor


A high-level question: for parameter markers that already work today (in the expression parser rule), do we need to handle them in the pre-parser? Or do we leave them and still let them be bound at analysis time?

Contributor Author


It is easiest to let them all be handled in the pre-parser. The original code remains for now for safety, but I see no reason to keep it. If at some point we want to move into plan sharing, this code gets more interesting again. But then we would introduce templates/markers that weren't even given by the user.

stackTrace: Option[Array[StackTraceElement]] = None,
pysparkErrorContext: Option[(String, String)] = None) {
pysparkErrorContext: Option[(String, String)] = None,
parameterSubstitutionInfo: Option[ParameterSubstitutionInfo] = None) {
Contributor


why do we need this new field? In the pre-parser, we can update the sqlText and other position/index fields and produce a new Origin.

Contributor


I suggested that Serge add this new field to avoid creating a new ThreadLocal. If we can produce the new Origin and store it in some existing place, e.g. CurrentOrigin, instead of this new field, that sounds OK too -- up to you guys.

* @note Only supports Literal expressions - all parameter values must be pre-evaluated.
* @see [[ParameterHandler]] for the main parameter handling entry point
*/
object LiteralToSqlConverter {
Contributor


I'm a bit worried about putting complex SQL strings into the original SQL statement. The previous way of parameter binding happens at the analysis phase and changes the plan tree directly, which is 100% safe from SQL injection. The pre-parser accepts parameter markers in two more places: integer literals and string literals. I think it's better to handle only these two new places in the pre-parser, and still use the previous analysis-time parameter binding for the others. It's safe to do SQL string manipulation only for integer and string literals, but arbitrary SQL generated from complex literals looks risky to me.

cc @dtenedor as well

Contributor Author


I do not understand the issue. A literal is a literal. E.g., is TIMESTAMP'2025-12-01 00:00:00' a complex literal?
Where do we draw the line, and why?
Can you give an example of SQL injection through a literal?
Or are you worried that what is passed here is not actually a literal?

Contributor


@srielau @cloud-fan Are we re-parsing the injected literal string or just using it for error message generation?

import sessionHolder.session.toRichColumn

private[connect] def parser = session.sessionState.sqlParser
private val parameterHandler = new ParameterHandler()
Contributor


shall we create a new instance for each SQL?

Contributor Author


The handler is stateless. What would we gain from creating new instances?

val parsedPlan = if (args.nonEmpty) {
// Use parameter context directly for parsing
val paramContext = PositionalParameterContext(args.map(lit(_).expr).toSeq)
val parsed = sessionState.sqlParser.parsePlanWithParameters(sqlText, paramContext)
Contributor


when shall we call parsePlanWithParameters and when shall we create ParameterHandler?

Contributor Author


SparkSession needs parsePlanWithParameters.
Spark Connect has a more complex protocol and needs the ParameterHandler separately.
It also allows for modular unit testing.
