[SPARK-53573][SQL] Use Pre-processor for generalized parameter marker handling #52334
This commit implements a comprehensive parameter substitution system for Spark SQL that provides detailed error context for EXECUTE IMMEDIATE statements, similar to how views handle errors.

Key features:
- Pre-parser approach with position mapping for accurate error reporting
- Thread-local parameter context for parser-aware error handling
- Support for both named and positional parameters across all SQL APIs
- Optimized position mapping algorithm (O(n²) → O(k) where k = substitutions)
- Comprehensive test coverage including edge cases and error scenarios
- Backward compatibility with legacy parameter substitution mode

The implementation includes:
- ParameterHandler for unified parameter processing
- PositionMapper for efficient error position translation
- LiteralToSqlConverter for type-safe SQL generation
- Integration with SparkSqlParser, SparkSession, and EXECUTE IMMEDIATE
- Enhanced error messages showing both outer and inner query context

This addresses the user request for detailed error context in EXECUTE IMMEDIATE statements while maintaining performance and compatibility.
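For intuition, here is a minimal sketch of the position-mapping idea (hypothetical names, not the PR's actual PositionMapper): each substitution is recorded once, so translating an error offset in the substituted text back to the user's original text walks only the k substitution records instead of rescanning the whole string.

```scala
// Sketch only: one record per parameter substitution, so a reverse position
// lookup costs O(k) in the number of substitutions.
case class Substitution(startInSubstituted: Int, substitutedLen: Int, originalLen: Int)

class PositionMapperSketch(subs: Seq[Substitution]) {
  /** Map an offset in the substituted SQL text back to the original SQL text. */
  def toOriginal(pos: Int): Int = {
    var delta = 0
    // Only substitutions that end at or before `pos` shift it.
    for (s <- subs if s.startInSubstituted + s.substitutedLen <= pos) {
      delta += s.substitutedLen - s.originalLen
    }
    pos - delta
  }
}

// Example: "SELECT :param" becomes "SELECT 42"; an error at offset 9 in the
// substituted text maps back to offset 13 in the original:
// new PositionMapperSketch(Seq(Substitution(7, 2, 6))).toOriginal(9) == 13
```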
Did another full review -- this approach is looking a lot simpler without the callbacks :)
Thanks for all the work on this, it will make our SQL language more usable.
```diff
-    : STRING_LITERAL
-    | {!double_quoted_identifiers}? DOUBLEQUOTED_STRING
+    : stringLitWithoutMarker #stringLiteralInContext
+    | parameterMarker #parameterStringValue
```
Correct me if I'm wrong: the pre-parser only allows parameter markers in integerValue and stringLit, and we still need the previous framework to bind parameters at analysis time for arbitrary expressions.
You are wrong :-) These grammar changes provide coverage for all (hopefully) places where literals are allowed.
For example: wherever we allow a DOUBLE or DECIMAL we also allow an INTEGER, and thus have coverage for parameter markers. Wherever we allow a DATE/TIMESTAMP we also allow a STRING, and thus allow parameter markers. And of course expressions include literals, so any generic expression already allows parameter markers in the grammar.
Note that I have specifically added test cases for "interesting" places. We can add more if you feel there is a gap.
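As a hedged illustration of that coverage, via the existing spark.sql(sqlText, args) overloads (assumes a running SparkSession and that the referenced objects exist; the exact set of supported positions is defined by the PR's grammar, not this snippet):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Marker where an integer literal is allowed (type parameters),
// bound positionally:
spark.sql("SELECT CAST(1.5 AS DECIMAL(?, ?))", Array(10, 2))

// Marker where a string literal is allowed (metadata command pattern),
// bound by name:
spark.sql("SHOW TABLES FROM schema LIKE :pattern", Map("pattern" -> "t%"))
```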
```scala
 * Collect information about a positional parameter in a literal context. Note: The return value
 * is not used; this method operates via side effects.
 */
override def visitPosParameterLiteral(ctx: PosParameterLiteralContext): AnyRef =
```
A high-level question: for parameter markers that already work today (in the expression parser rule), do we need to handle them in the pre-parser? Or do we leave them to be bound at analysis time?
It is easiest to let them all be handled in the pre-parser. The original code remains for now for safety, but I see no reason to keep it. If at some point we want to move into plan sharing, this code gets more interesting again. But then we would introduce templates/markers that weren't even given by the user.
```diff
     stackTrace: Option[Array[StackTraceElement]] = None,
-    pysparkErrorContext: Option[(String, String)] = None) {
+    pysparkErrorContext: Option[(String, String)] = None,
+    parameterSubstitutionInfo: Option[ParameterSubstitutionInfo] = None) {
```
Why do we need this new field? In the pre-parser, we can update the sqlText and other position/index fields and produce a new Origin.
I suggested that Serge add this new field to avoid creating a new ThreadLocal. If we can produce the new Origin and store it in some existing place, e.g. CurrentOrigin, instead of this new field, that sounds OK too -- up to you guys.
```scala
 * @note Only supports Literal expressions - all parameter values must be pre-evaluated.
 * @see [[ParameterHandler]] for the main parameter handling entry point
 */
object LiteralToSqlConverter {
```
I'm a bit worried about putting a complex SQL string into the original SQL statement. The previous way of parameter binding happens at the analysis phase and changes the plan tree directly, which is 100% safe from SQL injection. The pre-parser accepts parameter markers in two more places: integer literals and string literals. I think it's better to handle only these two new places in the pre-parser, and still use the previous analysis-time parameter binding for the others. It's safe to do SQL string manipulation for integer and string literals, but arbitrary SQL generated from complex literals looks risky to me.
cc @dtenedor as well
I do not understand the issue. A literal is a literal. E.g., is TIMESTAMP'2025-12-01 00:00:00' a complex literal?
Where do we draw the line, and why?
Can you give an example of SQL injection through a literal?
Or are you worried that what is passed here is not actually a literal?
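(For illustration of the escaping question at stake, a hypothetical helper, not the PR's LiteralToSqlConverter: restricting rendering to Literal values and escaping string contents is what keeps the substitution injection-free, since no user-controlled text is ever spliced in unquoted.)

```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.{IntegerType, StringType}

// Hypothetical sketch: only Literal values are rendered, and string contents
// are escaped before being wrapped in quotes, so a value cannot break out of
// its literal. A real converter must also handle backslashes, since Spark
// processes escape sequences in string literals by default.
def literalToSqlSketch(l: Literal): String = l.dataType match {
  case IntegerType => l.value.toString
  case StringType =>
    val escaped = l.value.toString.replace("'", "''") // double embedded quotes
    s"'$escaped'"
  case other =>
    throw new IllegalArgumentException(s"sketch does not cover type: $other")
}
```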
@srielau @cloud-fan Are we re-parsing the injected literal string or just using it for error message generation?
```diff
     import sessionHolder.session.toRichColumn

     private[connect] def parser = session.sessionState.sqlParser
+    private val parameterHandler = new ParameterHandler()
```
shall we create a new instance for each SQL?
The handler is stateless. What would we gain from creating new instances?
```scala
val parsedPlan = if (args.nonEmpty) {
  // Use parameter context directly for parsing
  val paramContext = PositionalParameterContext(args.map(lit(_).expr).toSeq)
  val parsed = sessionState.sqlParser.parsePlanWithParameters(sqlText, paramContext)
```
When shall we call parsePlanWithParameters, and when shall we create a ParameterHandler?
SparkSession needs parsePlanWithParameters. Spark Connect has a more complex protocol and needs the ParameterHandler separately.
It also allows for modular unit testing.
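To make the split concrete, a sketch of the two context shapes involved (PositionalParameterContext appears in the diff above; the named variant and these exact fields are assumptions for illustration):

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}

// Assumed shapes, by analogy with PositionalParameterContext above.
sealed trait ParameterContextSketch
case class PositionalCtx(args: Seq[Expression]) extends ParameterContextSketch
case class NamedCtx(args: Map[String, Expression]) extends ParameterContextSketch

// SparkSession wraps user args into a context and calls
// parsePlanWithParameters; Spark Connect builds the context from its own
// protocol messages and drives the shared, stateless ParameterHandler --
// which is also what makes the handler easy to unit test in isolation.
val positional: ParameterContextSketch = PositionalCtx(Seq(Literal(1), Literal("abc")))
val named: ParameterContextSketch = NamedCtx(Map("p" -> Literal(5)))
```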
What changes were proposed in this pull request?
This PR introduces a comprehensive parameter substitution system that significantly expands parameter marker support and implements major performance optimizations. The current implementation of parameter markers (:param and ?) is restricted to expressions and queries. This PR enables parameter markers to work universally across all SQL constructs.

Key Changes

Enhanced Parameter Substitution Architecture:
- ParameterHandler class with centralized, optimized parameter substitution logic
- SparkSqlParser with parser-aware error context
- ParameterSubstitutionContext with thread-local substitution information

EXECUTE IMMEDIATE Integration & Error Context:
- ResolveExecuteImmediate with comprehensive parameter validation and local variable isolation
- HybridParameterContext supporting both named and positional parameters with strict validation
- Prevents EXECUTE IMMEDIATE from accessing outer SQL scripting variables

Legacy Mode Support:
- spark.sql.legacy.parameterSubstitution.constantsOnly flag to restore the previous constants-only substitution behavior
Why are the changes needed?
Functional Limitations: parameter markers are currently accepted only in expressions and queries, so other SQL constructs (DDL, metadata commands, type parameters) cannot be parameterized. The enhanced architecture solves these by substituting parameters in a pre-processing pass before parsing, so markers work anywhere the grammar allows a literal.
Does this PR introduce any user-facing changes?
Yes - Significantly Expanded Functionality.

New Parameter Support:
- DDL: CREATE VIEW v(c1) AS (SELECT :p)
- Metadata commands: SHOW TABLES FROM schema LIKE :pattern
- Type parameters: DECIMAL(?, ?)
- Property values: TBLPROPERTIES (':key' = ':value')

Full Backward Compatibility: existing parameter usage continues to work unchanged, and the previous constants-only behavior remains available via spark.sql.legacy.parameterSubstitution.constantsOnly.
How was this patch tested?
Comprehensive Unit Testing:
- ParametersSuite with 50+ test cases covering all parameter scenarios
- ParameterSubstitutionSuite for algorithm-specific testing
- SqlScriptingExecutionSuite tests for EXECUTE IMMEDIATE isolation

Integration Testing:

Manual Verification:
- CREATE TABLE :name (id INT)
- EXECUTE IMMEDIATE 'SELECT typeof(:p), :p' USING 5::INT AS p
Example Enhanced Capabilities:
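A sketch based on the examples listed above (assumes a running SparkSession; the exact syntax coverage is defined by the PR itself, not this snippet):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Parameter marker inside a DDL statement body, bound by name:
spark.sql("CREATE VIEW v(c1) AS (SELECT :p)", Map("p" -> 5))

// EXECUTE IMMEDIATE with a typed named argument (from the manual
// verification list above):
spark.sql("EXECUTE IMMEDIATE 'SELECT typeof(:p), :p' USING 5::INT AS p")
```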
Quality Assurance: