
[FEATURE] PPL Unification in Spark #1136

Open
0 of 6 issues completed
@dai-chen


Is your feature request related to a problem?

OpenSearch 3.0 introduces a new PPL (Piped Processing Language) engine built on Apache Calcite, which unifies logical planning and enables advanced query optimization across multiple execution backends. However, this engine currently exists only within the OpenSearch plugin. As a result, users running PPL queries in Spark (e.g., via Flint) are limited to a legacy PPL implementation that diverges in behavior and lacks recent improvements.

This inconsistency creates several challenges:

  • Semantic divergence: Query results and behaviors may differ between OpenSearch and Spark for the same PPL query due to differences in parsing, type coercion, null handling, and exception propagation.
  • Engineering duplication: Maintaining two independent implementations of PPL logic (in OpenSearch and Spark) increases development effort and risk of regression.
  • Feature lag: New PPL features added to OpenSearch's Calcite engine are not automatically available in Spark, limiting functionality and user adoption.
  • Optimization gaps: Advanced query rewrites and pushdowns enabled by Calcite's rule-based planning are not leveraged in Spark, reducing performance potential.

What solution would you like?

High-Level Design Considerations

  • Approach 1 – Reuse Spark SQL Catalyst Optimizer: Leverage Spark SQL’s Catalyst optimizer and execution engine to minimize integration overhead. More details in design discussion.

Detailed Design Breakdown

  • Deployment Model: Option A – Embedded Mode, which embeds Calcite directly within Spark’s runtime to simplify integration and enable reuse of Calcite classes such as OpenSearchSchema.
  • Integration Interface: Option A – Spark SQL Interface, the lowest-effort path: PPL queries are routed through Spark SQL, with Calcite’s RelToSqlConverter producing the SQL.
  • Integration Layer: Option A – Logical Plan Integration, which enables faster iteration on language unification while deferring OpenSearch-specific execution unification to Phase 2.
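
To make the Spark SQL Interface option concrete, here is a minimal, dependency-free sketch of the "PPL text, then logical plan, then SQL string" flow. It is only a toy stand-in: in the proposed design the plan would be a Calcite logical plan and the SQL text would come from Calcite's RelToSqlConverter. The class name `PplToSqlSketch`, the `Plan` type, and the three-command PPL subset are assumptions made up for illustration and are not part of any existing codebase.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: NOT the real PPL grammar and NOT Calcite's RelToSqlConverter.
final class PplToSqlSketch {

    /** Hypothetical minimal logical plan: one table, filters, projected fields. */
    static final class Plan {
        final String table;
        final List<String> filters;
        final List<String> fields;

        Plan(String table, List<String> filters, List<String> fields) {
            this.table = table;
            this.filters = filters;
            this.fields = fields;
        }
    }

    /** Parses a tiny subset: source=<table> | where <cond> | fields <a>, <b>. */
    static Plan parse(String ppl) {
        String table = null;
        List<String> filters = new ArrayList<>();
        List<String> fields = new ArrayList<>();
        for (String stage : ppl.split("\\|")) {
            String s = stage.trim();
            if (s.startsWith("source=")) {
                table = s.substring("source=".length()).trim();
            } else if (s.startsWith("where ")) {
                // The flat SELECT below is only valid if filters precede the projection.
                if (!fields.isEmpty()) {
                    throw new IllegalArgumentException("where after fields is not supported");
                }
                // Parenthesize so that several filters can be AND-ed safely.
                filters.add("(" + s.substring("where ".length()).trim() + ")");
            } else if (s.startsWith("fields ")) {
                for (String f : s.substring("fields ".length()).split(",")) {
                    fields.add(f.trim());
                }
            } else {
                throw new IllegalArgumentException("Unsupported command: " + s);
            }
        }
        if (table == null) {
            throw new IllegalArgumentException("Missing source=<table>");
        }
        return new Plan(table, filters, fields);
    }

    /** Renders the plan as a SQL string that could be handed to Spark SQL. */
    static String toSql(Plan plan) {
        String select = plan.fields.isEmpty() ? "*" : String.join(", ", plan.fields);
        StringBuilder sql = new StringBuilder("SELECT " + select + " FROM " + plan.table);
        if (!plan.filters.isEmpty()) {
            sql.append(" WHERE ").append(String.join(" AND ", plan.filters));
        }
        return sql.toString();
    }
}
```

For example, `PplToSqlSketch.toSql(PplToSqlSketch.parse("source=people | where age > 30 | fields name, age"))` yields `SELECT name, age FROM people WHERE (age > 30)`. The point of the interface option is that Spark only ever sees a SQL string, so no Spark-side planner changes are needed for the language layer.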

See further design context in this comment.

Milestones

Dependencies

  1. PPL-Calcite backport to the 2.19-dev branch: Required to resolve the versioning conflicts highlighted in [FEATURE] Make PPL-Calcite engine library reusable in Spark sql#3598.
  2. PPL Compatibility Test Framework: Needed to verify behavioral consistency between PPL on OpenSearch and PPL on Spark. Ideally, this would be the formal framework proposed in [FEATURE] Create a PPL compatibility verification framework piped-processing-language#32.
  3. Calcite Function Contract: Ensure every function in the PPL-Calcite library:
    • Can be translated into a valid SQL expression
    • Has a Java static method definition that supports registration in SparkSQL’s function catalog
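
The two conditions of the function contract above could be checked mechanically. The sketch below is one possible shape for such a check, using only Java reflection. Everything named here (`ExamplePplFunctions`, `strLen`, `FunctionContractCheck`, and the SQL name `STR_LEN`) is hypothetical and chosen for illustration. How a function is actually registered in Spark SQL's function catalog is not specified in this issue and is not shown.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hypothetical function holder; the name and semantics are illustrative only.
final class ExamplePplFunctions {
    private ExamplePplFunctions() {}

    /** Null-safe string length, standing in for a PPL function implementation. */
    public static Integer strLen(String s) {
        return s == null ? null : s.length();
    }
}

// Hypothetical contract check, not part of any existing library.
final class FunctionContractCheck {
    private FunctionContractCheck() {}

    /** True if clazz declares a public static, non-void method with this name. */
    static boolean hasStaticEntryPoint(Class<?> clazz, String methodName) {
        for (Method m : clazz.getDeclaredMethods()) {
            int mod = m.getModifiers();
            if (m.getName().equals(methodName)
                    && Modifier.isPublic(mod)
                    && Modifier.isStatic(mod)
                    && m.getReturnType() != void.class) {
                return true;
            }
        }
        return false;
    }

    /** Renders the SQL call expression the function would be translated into. */
    static String toSqlCall(String sqlName, String... args) {
        return sqlName + "(" + String.join(", ", args) + ")";
    }
}
```

A contract test could then assert `FunctionContractCheck.hasStaticEntryPoint(ExamplePplFunctions.class, "strLen")` for every library function and compare `toSqlCall("STR_LEN", "name")` against the SQL text the translation step emits.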

What alternatives have you considered?

  • Continue extending the legacy PPL implementation in Spark: This approach involves adding new commands or functions directly to the existing PPL Spark codebase. However, it results in duplicated logic and a growing risk of divergence from the Calcite-based PPL engine in OpenSearch. As the Calcite-based engine continues to evolve, maintaining consistency and semantic alignment would become increasingly difficult and error-prone.

Do you have any additional context?


Labels

Meta (Meta issue, not directly linked to a PR), enhancement (New feature or request)
