[SPARK-53845] SDP Sinks #52563

JiaqiWang18 · 2025-10-09T23:09:10Z

What changes were proposed in this pull request?

Add the API and implementation that let users define a sink in an SDP pipeline, as outlined in the SPIP. A sink is an external location that the pipeline can write data to.

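from pyspark import pipelines as dp

# Declare a Parquet sink; the path placeholder is filled in by the test harness.
dp.create_sink(
    "myParquetSink",
    format="parquet",
    options={"path": "${dir.getPath}"},
)

# Append the streaming rows read from table "src" to the sink declared above.
@dp.append_flow(
    target="myParquetSink",
)
def mySinkFlow():
    return spark.readStream.table("src")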

Why are the changes needed?

New Feature

Does this PR introduce any user-facing change?

New API for unreleased SDP

How was this patch tested?

New and existing tests verify that sinks work end to end

Was this patch authored or co-authored using generative AI tooling?

No

JiaqiWang18 · 2025-10-10T05:28:14Z

@sryza ready for review

sryza · 2025-10-10T13:55:46Z

sql/pipelines/src/main/scala/org/apache/spark/sql/pipelines/graph/elements.scala

+   identifier: TableIdentifier,
+   format: String,
+   options: Map[String, String],
+   normalizedPath: Option[String],


Does a sink need a normalizedPath?

yeah not needed, removed

sql/pipelines/src/main/scala/org/apache/spark/sql/pipelines/graph/FlowExecution.scala

sryza · 2025-10-10T13:57:44Z

...pipelines/src/main/scala/org/apache/spark/sql/pipelines/graph/GraphRegistrationContext.scala

    override def toString: String = "VIEW"
  }
+
+  private object SinkType extends DatasetType {


Sinks aren't datasets. I.e. the hierarchy is something like:

A table is a dataset

A view is a dataset

A dataset is an output

A sink is an output

Yeah I think we can just rename DatasetType to OutputType since the usage of this object in assertOutputIdentifierIsUnique does not distinguish between an output and a dataset.
Also renamed error message PIPELINE_DUPLICATE_IDENTIFIERS.DATASET to PIPELINE_DUPLICATE_IDENTIFIERS.OUTPUT since we also need to check for sink duplicate id.
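
To make the hierarchy above concrete, here is a minimal Python sketch, not the Scala code in this PR. The class names, the `DuplicateOutputError` exception, and the `assert_output_identifiers_unique` helper are all hypothetical stand-ins; the only points taken from the discussion are that tables and views are datasets, that datasets and sinks are both outputs, and that the uniqueness check runs over outputs as a whole.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Output:
    """Anything a flow can write to (hypothetical base type)."""
    identifier: str


@dataclass
class Dataset(Output):
    """An output that lives inside the pipeline, such as a table or a view."""


@dataclass
class Table(Dataset):
    pass


@dataclass
class View(Dataset):
    pass


@dataclass
class Sink(Output):
    """An external write target: an output, but not a dataset."""
    format: str = "parquet"
    options: Dict[str, str] = field(default_factory=dict)


class DuplicateOutputError(ValueError):
    """Hypothetical analogue of PIPELINE_DUPLICATE_IDENTIFIERS.OUTPUT."""


def assert_output_identifiers_unique(outputs: List[Output]) -> None:
    # The check treats tables, views, and sinks alike: it only looks at
    # the identifier, which is why one "output" category is enough here.
    seen: Dict[str, Output] = {}
    for out in outputs:
        if out.identifier in seen:
            raise DuplicateOutputError(
                f"Duplicate output identifier: {out.identifier}"
            )
        seen[out.identifier] = out
```

In this sketch a sink and a table that share an identifier are rejected by the same check, which matches the reasoning for using an output-level name rather than a dataset-level one.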

sql/pipelines/src/test/scala/org/apache/spark/sql/pipelines/graph/SystemMetadataSuite.scala

…aph/SystemMetadataSuite.scala Co-authored-by: Sandy Ryza <[email protected]>

sryza · 2025-10-10T19:53:26Z

python/pyspark/pipelines/api.py

    get_active_graph_element_registry().register_output(table)
+
+
+def create_sink(


Oops missed this on first pass: this needs docstring

sryza

LGTM!

jackywang-db added 11 commits October 9, 2025 15:16

proto

96d2da7

backend

3759e6a

python

fb1b856

wip

522e226

e2e test

929e558

merge

f08bb2a

use define dataset

cd4d0a8

fix merge

72e1668

python test

0255ae7

fmt

b62ac58

unneeded changes

1048ef5

github-actions bot added SQL PYTHON CONNECT labels Oct 9, 2025

e2e test multi format

4522a37

sryza reviewed Oct 10, 2025

Update sql/pipelines/src/test/scala/org/apache/spark/sql/pipelines/gr…

eee35a9

…aph/SystemMetadataSuite.scala Co-authored-by: Sandy Ryza <[email protected]>

sryza reviewed Oct 10, 2025

jackywang-db added 3 commits October 10, 2025 12:55

address feedback

e4ffe87

doc str

76db7b5

comment

400d2eb

JiaqiWang18 requested a review from sryza October 10, 2025 20:20

sryza approved these changes Oct 12, 2025

sryza closed this in e38a651 Oct 13, 2025
