Is your feature request related to a problem?
OpenSearch 3.0 introduces a new PPL (Piped Processing Language) engine built on Apache Calcite, which unifies logical planning and enables advanced query optimization across multiple execution backends. However, this engine currently exists only within the OpenSearch plugin. As a result, users running PPL queries in Spark (e.g., via Flint) are limited to a legacy PPL implementation that diverges in behavior and lacks recent improvements.
This inconsistency creates several challenges:
- Semantic divergence: Query results and behaviors may differ between OpenSearch and Spark for the same PPL query due to differences in parsing, type coercion, null handling, and exception propagation.
- Engineering duplication: Maintaining two independent implementations of PPL logic (in OpenSearch and Spark) increases development effort and risk of regression.
- Feature lag: New PPL features added to OpenSearch's Calcite engine are not automatically available in Spark, limiting functionality and user adoption.
- Optimization gaps: Advanced query rewrites and pushdowns enabled by Calcite's rule-based planning are not leveraged in Spark, reducing performance potential.
What solution would you like?
High-Level Design Considerations
- Approach 1 – Reuse Spark SQL Catalyst Optimizer: Leverage Spark SQL’s Catalyst optimizer and execution engine to minimize integration overhead. More details in the design discussion.
Detailed Design Breakdown
- Deployment Model: Option A – Embedded Mode, which embeds Calcite directly within Spark’s runtime to simplify integration and enable reuse of Calcite classes like `OpenSearchSchema`.
- Integration Interface: Option A – Spark SQL Interface, the lowest-effort path, routing PPL queries through Spark SQL and leveraging Calcite’s `RelToSqlConverter` (see the sketch after this list).
- Integration Layer: Option A – Logical Plan Integration, to enable faster iteration on language unification while deferring OpenSearch-specific execution unification to Phase 2.
See further design context in this comment.
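To make the Option A flow concrete, below is a minimal sketch of routing a PPL query through Spark SQL: the PPL-Calcite library produces a Calcite `RelNode`, `RelToSqlConverter` emits Spark-compatible SQL text, and Catalyst handles optimization and execution. The `pplToRelNode` entry point is a hypothetical placeholder for the actual PPL-Calcite library API, and the exact `RelToSqlConverter` calls may vary with the Calcite version in use.

```scala
import org.apache.calcite.rel.RelNode
import org.apache.calcite.rel.rel2sql.RelToSqlConverter
import org.apache.calcite.sql.dialect.SparkSqlDialect
import org.apache.spark.sql.{DataFrame, SparkSession}

object PplViaSparkSql {

  // Hypothetical entry point into the PPL-Calcite library: parse a PPL query
  // string and return its Calcite logical plan. The real API lives in the
  // opensearch-project/sql PPL-Calcite engine and may differ.
  def pplToRelNode(ppl: String): RelNode = ???

  /** Lower a PPL query to Spark SQL text and execute it through Catalyst. */
  def runPpl(spark: SparkSession, ppl: String): DataFrame = {
    val rel: RelNode = pplToRelNode(ppl)

    // RelToSqlConverter turns the Calcite logical plan back into SQL in a
    // Spark-compatible dialect, so Catalyst handles optimization/execution.
    val dialect = SparkSqlDialect.DEFAULT
    val sqlNode = new RelToSqlConverter(dialect).visitRoot(rel).asStatement()
    val sqlText = sqlNode.toSqlString(dialect).getSql

    spark.sql(sqlText)
  }
}
```

The appeal of this path is the small integration surface: Spark only ever sees SQL text, so no Catalyst changes are needed, at the cost of limiting expressiveness and pushdown to what the generated SQL can carry.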
Milestones
- Phase 1: PPL language unification
  - M1: PPL-Calcite library ready for reuse
  - M2: PPL-Calcite integration with Spark
  - M3: Spark PPL compatibility
    - [FEATURE] Evaluate semantic compatibility between PPL in Spark and OpenSearch #1208
    - Unify UDT/UDF/UDAF support
- Phase 2: OpenSearch-specific unification [TBD]
  - OpenSearch schema, DSL pushdown, and index scan logic unification
Dependencies
- PPL-Calcite backport to the 2.19-dev branch: Required to resolve the versioning conflicts highlighted in [FEATURE] Make PPL-Calcite engine library reusable in Spark (sql#3598).
- PPL Compatibility Test Framework: Needed to verify behavioral consistency between PPL on OpenSearch and on Spark, ideally through a formal framework: [FEATURE] Create a PPL compatibility verification framework (piped-processing-language#32).
- Calcite Function Contract: Ensure that every PPL-Calcite function in the library (see the sketch after this list):
  - Can be translated into a valid SQL expression
  - Has a Java static method definition that supports registration in Spark SQL’s function catalog
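As a rough illustration of the function contract, the sketch below registers a PPL built-in in Spark SQL’s function catalog by wrapping a Java-static-style method in a Scala function. The `PplBuiltInFunctions` object, the `cidrMatch` signature, and the registered name `cidrmatch` are hypothetical placeholders for whatever the PPL-Calcite function library actually exposes; only the `spark.udf.register` call is standard Spark API.

```scala
import org.apache.spark.sql.SparkSession

object PplFunctionRegistration {

  // Hypothetical stand-in for a PPL-Calcite library class that exposes the
  // function body as a Java static method, per the contract above.
  object PplBuiltInFunctions {
    def cidrMatch(ip: String, cidr: String): Boolean = ???
  }

  /** Register a PPL function in Spark SQL's function catalog as a UDF. */
  def register(spark: SparkSession): Unit = {
    // Wrapping the static-style method in a Scala function lets Spark infer
    // the input/output types and add the function to the session catalog.
    spark.udf.register("cidrmatch",
      (ip: String, cidr: String) => PplBuiltInFunctions.cidrMatch(ip, cidr))
  }
}
```

Once registered, a query such as `spark.sql("SELECT cidrmatch(ip, '10.0.0.0/8') FROM logs")` resolves the function through the session’s function catalog, which is what the second bullet of the contract requires.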
What alternatives have you considered?
- Continue extending the legacy PPL implementation in Spark: This approach involves adding new commands or functions directly to the existing PPL Spark codebase. However, it results in duplicated logic and a growing risk of divergence from the Calcite-based PPL engine in OpenSearch. As that engine continues to evolve, maintaining consistency and semantic alignment would become increasingly difficult and error-prone.
Do you have any additional context?