[FEATURE] PPL Unification in Spark

**Is your feature request related to a problem?**

OpenSearch 3.0 introduces a new PPL (Piped Processing Language) engine built on Apache Calcite, which unifies logical planning and enables advanced query optimization across multiple execution backends. However, this engine currently exists only within the OpenSearch plugin. As a result, users running PPL queries in Spark (e.g., via Flint) are limited to a legacy PPL implementation that diverges in behavior and lacks recent improvements.

This inconsistency creates several challenges:

- **Semantic divergence**: Query results and behaviors may differ between OpenSearch and Spark for the same PPL query due to differences in parsing, type coercion, null handling, and exception propagation.
- **Engineering duplication**: Maintaining two independent implementations of PPL logic (in OpenSearch and Spark) increases development effort and risk of regression.
- **Feature lag**: New PPL features added to OpenSearch's Calcite engine are not automatically available in Spark, limiting functionality and user adoption.
- **Optimization gaps**: Advanced query rewrites and pushdowns enabled by Calcite's rule-based planning are not leveraged in Spark, reducing performance potential.

**What solution would you like?**

#### High-Level Design Considerations

- **Approach 1 – Reuse Spark SQL Catalyst Optimizer**: Leverage Spark SQL’s Catalyst optimizer and execution engine to minimize integration overhead. More details in [design discussion](https://github.com/opensearch-project/opensearch-spark/issues/1136#issuecomment-2941041521).

#### Detailed Design Breakdown

- **Deployment Model**: Option A – Embedded Mode, which embeds Calcite directly within Spark’s runtime to simplify integration and enable reuse of Calcite classes such as `OpenSearchSchema`.
- **Integration Interface**: Option A – Spark SQL Interface, the lowest-effort path: route PPL queries through Spark SQL by converting the Calcite plan to SQL with Calcite’s `RelToSqlConverter`.
- **Integration Layer**: Option A – Logical Plan Integration, which enables faster iteration on language unification while deferring OpenSearch-specific execution unification to Phase 2.

See further design context in [this comment](https://github.com/opensearch-project/opensearch-spark/issues/1136#issuecomment-2941064860).
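To make the Spark SQL Interface option concrete, here is a deliberately tiny sketch of the end-to-end shape: PPL text goes in, a SQL string that Spark SQL could run comes out. It is a toy, string-level stand-in and not the proposed implementation. In the design above, the SQL would be produced from a Calcite logical plan by `RelToSqlConverter`, not by splitting strings. The class name `PplToSqlSketch` and the three-command subset (`source`, `where`, `fields`) are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy illustration only: maps a tiny PPL subset to a SQL string.
 * Hypothetical class; the real design goes through a Calcite logical plan.
 */
final class PplToSqlSketch {

    /** Translates "source=<table> | where <cond> | fields <cols>" into one SELECT. */
    static String translate(String ppl) {
        String table = null;
        String projection = "*";
        List<String> predicates = new ArrayList<>();

        // Naive split: does not handle '|' inside string literals, and it
        // ignores stage order (a real planner must respect pipeline order).
        for (String raw : ppl.split("\\|")) {
            String stage = raw.trim();
            if (stage.startsWith("source=")) {
                table = stage.substring("source=".length()).trim();
            } else if (stage.startsWith("where ")) {
                predicates.add("(" + stage.substring("where ".length()).trim() + ")");
            } else if (stage.startsWith("fields ")) {
                projection = stage.substring("fields ".length()).trim();
            } else {
                throw new IllegalArgumentException("Unsupported PPL stage: " + stage);
            }
        }
        if (table == null) {
            throw new IllegalArgumentException("PPL query needs a source=<table> stage");
        }

        StringBuilder sql = new StringBuilder("SELECT ")
                .append(projection).append(" FROM ").append(table);
        if (!predicates.isEmpty()) {
            sql.append(" WHERE ").append(String.join(" AND ", predicates));
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        // prints: SELECT host, status FROM logs WHERE (status >= 500)
        System.out.println(translate("source=logs | where status >= 500 | fields host, status"));
    }
}
```

The limitations noted in the comments are part of why the design favors a shared logical plan: ordering, quoting, and type rules are then resolved once in the planner rather than re-implemented per backend.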

#### Milestones

- Phase 1: PPL language unification
    - [ ] M1: PPL-Calcite library ready for reuse
        - [x] https://github.com/opensearch-project/sql/issues/3598
        - [x] https://github.com/opensearch-project/sql/issues/3734
    - [ ] M2: PPL-Calcite integration with Spark
        - [ ] https://github.com/opensearch-project/opensearch-spark/issues/1202
        - [ ] https://github.com/opensearch-project/opensearch-spark/issues/1203
    - [ ] M3: Spark PPL compatibility
        - [ ] https://github.com/opensearch-project/opensearch-spark/issues/1208
        - [ ] Unify UDT/UDF/UDAF support
- Phase 2: OpenSearch-specific unification [TBD]
    - [ ] OpenSearch schema, DSL pushdown and index scan logic unification.
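
One way to picture the M3 compatibility goal is a check that runs the same PPL query on both engines and compares the rows. The sketch below is an assumption-laden illustration, not an existing framework: `PplEngine`, `row`, and `sameResult` are hypothetical names, and the two engines are in-memory stubs. It compares rows strictly, so a type-coercion difference (an `Integer` on one side, a `Long` on the other) is reported as a mismatch.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a cross-engine PPL result comparison. */
final class PplCompatibilitySketch {

    /** Hypothetical engine abstraction: accepts PPL text, returns result rows. */
    interface PplEngine {
        List<List<Object>> execute(String ppl);
    }

    /** Builds one result row as a fixed-size list of cells. */
    static List<Object> row(Object... cells) {
        return Arrays.asList(cells);
    }

    /** True only if both engines return equal rows in the same order. */
    static boolean sameResult(PplEngine reference, PplEngine candidate, String ppl) {
        return reference.execute(ppl).equals(candidate.execute(ppl));
    }

    public static void main(String[] args) {
        String ppl = "source=logs | where status >= 500 | fields host, status";

        // Stub engines standing in for OpenSearch and Spark (assumed data).
        PplEngine openSearchStub = q -> List.of(row("web-1", 500));
        PplEngine sparkStub = q -> List.of(row("web-1", 500L)); // Long, not Integer

        System.out.println(sameResult(openSearchStub, openSearchStub, ppl)); // prints: true
        System.out.println(sameResult(openSearchStub, sparkStub, ppl));      // prints: false
    }
}
```

A real harness would also need a policy for unordered results and for acceptable numeric widening; strict equality is used here only because it makes divergences visible.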

#### Dependencies

1. **PPL-Calcite backport to 2.19-dev branch**: Required to resolve the versioning conflicts highlighted in https://github.com/opensearch-project/sql/issues/3598.
2. **PPL Compatibility Test Framework**: Needed to verify behavioral consistency between PPL on OpenSearch and PPL on Spark. Ideally, this would be a formal framework, as proposed in https://github.com/opensearch-project/piped-processing-language/issues/32.
3. **Calcite Function Contract**: Ensure every function in the PPL-Calcite library:
   - Can be translated into a valid SQL expression
   - Has a Java static method definition that supports registration in SparkSQL’s function catalog
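
The two parts of the function contract can be sketched as simple checks. This is a minimal illustration under stated assumptions: `SampleFunctions.concatWs` is a made-up function, `toSqlCall` renders only a plain call expression, and the reflection check merely confirms that a public static method with the given name exists. The actual registration in Spark SQL's function catalog is not shown.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

/** Hypothetical sketch of the two function-contract checks. */
final class FunctionContractSketch {

    /** Made-up example of a PPL function implemented as a static method. */
    public static final class SampleFunctions {
        public static String concatWs(String sep, String left, String right) {
            return left + sep + right;
        }
    }

    /** Contract part 1: the function call can be written as a SQL expression. */
    static String toSqlCall(String sqlName, String... args) {
        return sqlName + "(" + String.join(", ", args) + ")";
    }

    /** Contract part 2: a public static Java method with this name exists. */
    static boolean hasPublicStaticMethod(Class<?> owner, String methodName) {
        for (Method m : owner.getMethods()) { // getMethods() returns public methods only
            if (m.getName().equals(methodName) && Modifier.isStatic(m.getModifiers())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // prints: concat_ws('-', a, b)
        System.out.println(toSqlCall("concat_ws", "'-'", "a", "b"));
        // prints: true
        System.out.println(hasPublicStaticMethod(SampleFunctions.class, "concatWs"));
    }
}
```

Running a check like this over the whole function library in CI would surface functions that break the contract before they reach the Spark integration.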

---
**What alternatives have you considered?**

- **Continue extending the legacy PPL implementation in Spark**: This approach involves adding new commands or functions directly to the existing PPL Spark codebase. However, it results in duplicated logic and a growing risk of divergence from the Calcite-based PPL engine in OpenSearch. As the Calcite-based engine continues to evolve, maintaining consistency and semantic alignment between the two would become increasingly difficult and error-prone.

**Do you have any additional context?**

- Proofs of concept
    - https://github.com/opensearch-project/opensearch-spark/issues/1136#issuecomment-2845378212
    - https://github.com/opensearch-project/opensearch-spark/pull/993
