[SPARK-53845] SDP Sinks #52563
Conversation
@sryza ready for review

identifier: TableIdentifier,
format: String,
options: Map[String, String],
normalizedPath: Option[String],
Does a sink need a normalizedPath?
yeah not needed, removed
sql/pipelines/src/main/scala/org/apache/spark/sql/pipelines/graph/FlowExecution.scala
override def toString: String = "VIEW"
}

private object SinkType extends DatasetType {
Sinks aren't datasets. I.e. the hierarchy is something like:
- A table is a dataset
- A view is a dataset
- A dataset is an output
- A sink is an output
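The hierarchy described above can be sketched with plain classes; this is an illustrative Python sketch of the concepts in the comment, not the PR's actual Scala types:

```python
# Hypothetical sketch of the output hierarchy from the review comment.
# Only the concepts (output, dataset, table, view, sink) come from the
# discussion; the class shapes are illustrative.

class Output:
    """Anything a pipeline can produce or write to."""

class Dataset(Output):
    """An output whose contents the pipeline manages."""

class Table(Dataset):
    """A dataset materialized as a table."""

class View(Dataset):
    """A dataset defined as a view."""

class Sink(Output):
    """An external write location; an output, but not a dataset."""

# A sink is an output but not a dataset:
assert issubclass(Sink, Output)
assert not issubclass(Sink, Dataset)
```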
Yeah, I think we can just rename DatasetType to OutputType, since the usage of this object in assertOutputIdentifierIsUnique does not distinguish between an output and a dataset. Also renamed the error message PIPELINE_DUPLICATE_IDENTIFIERS.DATASET to PIPELINE_DUPLICATE_IDENTIFIERS.OUTPUT, since we also need to check for duplicate sink identifiers.
sql/pipelines/src/test/scala/org/apache/spark/sql/pipelines/graph/SystemMetadataSuite.scala
…aph/SystemMetadataSuite.scala Co-authored-by: Sandy Ryza <[email protected]>
get_active_graph_element_registry().register_output(table)


def create_sink(
Oops, missed this on the first pass: this needs a docstring.
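For illustration, a docstring for the new function might look something like the stub below. This is a hedged sketch: the parameter names are guesses based on the Scala-side fields shown earlier (identifier, format, options), not the PR's final signature, and the body is a placeholder.

```python
def create_sink(name, format, options=None):
    """Define a sink: an external location this pipeline can write data to.

    Illustrative stub only, not the actual implementation; parameter
    names are assumptions based on the Scala fields in this PR.

    :param name: Identifier for the sink within the pipeline graph.
    :param format: Output format of the sink (e.g. "parquet").
    :param options: Optional dict of format-specific writer options.
    """
    # Placeholder body; the real function would register the sink with
    # the active graph element registry.
    return {"name": name, "format": format, "options": options or {}}
```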
LGTM!
What changes were proposed in this pull request?
Create the API and implementation to allow users to define a sink in an SDP pipeline, as outlined in the SPIP. A sink is an external location the pipeline can write data to.
Why are the changes needed?
New Feature
Does this PR introduce any user-facing change?
New API for the unreleased SDP feature.
How was this patch tested?
New and existing tests to ensure sinks work end-to-end.
Was this patch authored or co-authored using generative AI tooling?
No