Description
Is your feature request related to a problem?
Yes. PPL queries in both Spark and OpenSearch will soon be handled through the shared PPL-Calcite frontend, but we lack validation that their semantics are consistent across the two backends. Because Calcite plans are translated to SparkSQL, we assume semantic parity for common SQL operators. However, differences in type systems, function behavior, or error handling may still lead to inconsistencies.
What solution would you like?
Design and execute a test suite that evaluates the semantic compatibility of PPL queries run in Spark through two paths: Legacy PPL Spark (the current implementation) and the new PPL Spark via Calcite.
Goal
The outcome of this task will be a documented set of compatibility findings that serves as input for unifying UDTs/UDFs/UDAFs (TODO: create GitHub issue).
Deliverable
A compatibility findings document covering:
- A list of standard PPL functions whose behavior differs in SparkSQL (e.g., semantics, return type, null handling).
- Identification of missing functions in SparkSQL that are supported in OpenSearch.
- Notes on whether enabling those functions requires user-defined type (UDT) support in SparkSQL.
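
One possible way to keep the findings uniform is a small record type. This is only a sketch: the `Finding` and `Difference` names, the fields, and the table layout are assumptions made for illustration, and `example_fn` is a placeholder rather than a real PPL function.

```python
from dataclasses import dataclass
from enum import Enum


class Difference(Enum):
    """Kinds of divergence named in the deliverable list (hypothetical enum)."""
    SEMANTICS = "semantics"
    RETURN_TYPE = "return type"
    NULL_HANDLING = "null handling"
    MISSING_IN_SPARK = "missing in SparkSQL"


@dataclass
class Finding:
    function: str           # PPL function under test
    difference: Difference  # how the behavior diverges
    requires_udt: bool      # whether SparkSQL would need UDT support
    note: str = ""

    def to_row(self) -> str:
        """Render the finding as one markdown table row."""
        udt = "yes" if self.requires_udt else "no"
        return f"| {self.function} | {self.difference.value} | {udt} | {self.note} |"


# Placeholder entry, not an actual result.
row = Finding("example_fn", Difference.NULL_HANDLING, False, "placeholder").to_row()
print(row)
```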
Tasks
- Create a test plan outlining the goal, scope and expectations.
- Run a set of representative PPL queries against both the Spark and OpenSearch backends and compare the results to identify any inconsistencies in output. Either leverage the test framework developed in piped-processing-language#32 or define a standalone test suite.
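
If a standalone suite is chosen, the comparison step could look roughly like the sketch below. It is not the framework from piped-processing-language#32. The `Runner` signature (PPL text in, rows out), the status names, and the stub runners are all assumptions for illustration. It compares results order-insensitively and reports error, value, and type differences separately.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Sequence, Tuple

Row = Tuple[Any, ...]
Runner = Callable[[str], Sequence[Row]]  # hypothetical: PPL text in, rows out


@dataclass
class Comparison:
    query: str
    status: str  # "match", "value_mismatch", "type_mismatch", "error_mismatch"
    detail: str = ""


def run_query(runner: Runner, query: str) -> Tuple[Optional[List[Row]], Optional[str]]:
    """Return (rows, None) on success or (None, exception type name) on failure."""
    try:
        return [tuple(r) for r in runner(query)], None
    except Exception as exc:  # error behavior is part of what is being compared
        return None, type(exc).__name__


def compare(query: str, baseline: Runner, candidate: Runner) -> Comparison:
    rows_a, err_a = run_query(baseline, query)
    rows_b, err_b = run_query(candidate, query)
    if err_a or err_b:
        if err_a == err_b:
            return Comparison(query, "match", f"both raised {err_a}")
        return Comparison(query, "error_mismatch", f"baseline={err_a}, candidate={err_b}")
    # Order-insensitive comparison; sort by repr so None values can be ordered.
    a, b = sorted(rows_a, key=repr), sorted(rows_b, key=repr)
    if a != b:
        return Comparison(query, "value_mismatch", f"baseline={a}, candidate={b}")
    # Equal values can still differ in type (1 == 1.0 in Python).
    types_a = [tuple(type(v).__name__ for v in row) for row in a]
    types_b = [tuple(type(v).__name__ for v in row) for row in b]
    if types_a != types_b:
        return Comparison(query, "type_mismatch", f"baseline={types_a}, candidate={types_b}")
    return Comparison(query, "match")


# Stub runners standing in for real backends (illustrative data only).
baseline = lambda q: [(1, "a"), (2, None)]
candidate = lambda q: [(2, None), (1.0, "a")]
result = compare("source=t | fields x, y", baseline, candidate)
print(result.status)  # type_mismatch
```

Keeping type mismatches separate from value mismatches matters here because return-type differences are one of the listed deliverables. Queries with an explicit sort would need an order-sensitive variant.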
What alternatives have you considered?
N/A
Do you have any additional context?
- Functions dependent on OpenSearch-specific UDTs, such as:
  - IP-related functions
  - Geo-point functions
  - Other domain-specific types not supported in Spark

These will be evaluated in a future phase for OpenSearch-specific unification.