[KYUUBI #7126][LINEAGE] Support merge into syntax in row level catalog #7127

Open
wants to merge 5 commits into
base: master

Contributor

@yabola yabola commented Jul 7, 2025

Why are the changes needed?

In catalogs that support the row-level operation interface (Iceberg, etc.), a rule rewrites the MERGE INTO syntax into a WriteDelta or ReplaceData operator. We should support lineage extraction for plans of this type.
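To make the goal concrete, here is a minimal sketch of per-column lineage for a merge with two instructions. It uses plain Scala stand-ins rather than Spark classes: `Expr`, `Col`, `Add`, `columnLineage`, and the table and column names are all hypothetical and are not taken from this patch. It assumes each instruction yields one expression per target column, in the same order as the target columns.

```scala
import scala.collection.immutable.ListMap

object MergeLineageSketch {
  // Hypothetical stand-ins for Catalyst expressions; these are not Spark classes.
  sealed trait Expr { def references: Set[String] }
  final case class Col(name: String) extends Expr {
    def references: Set[String] = Set(name)
  }
  final case class Add(left: Expr, right: Expr) extends Expr {
    def references: Set[String] = left.references ++ right.references
  }

  /** For each target column, union the columns referenced at the same position in every instruction. */
  def columnLineage(
      targetColumns: Seq[String],
      instructionOutputs: Seq[Seq[Expr]]): ListMap[String, Set[String]] =
    ListMap(targetColumns.zipWithIndex.map { case (column, index) =>
      column -> instructionOutputs.flatMap(_(index).references).toSet
    }: _*)

  // Illustrative statement (assumed, not from the PR):
  //   MERGE INTO t USING s ON t.id = s.id
  //   WHEN MATCHED THEN UPDATE SET t.amount = s.amount + s.bonus
  //   WHEN NOT MATCHED THEN INSERT (id, amount) VALUES (s.id, s.amount)
  val matchedUpdate: Seq[Expr] = Seq(Col("t.id"), Add(Col("s.amount"), Col("s.bonus")))
  val notMatchedInsert: Seq[Expr] = Seq(Col("s.id"), Col("s.amount"))

  // Lineage keyed by target column, in target column order.
  val lineage: ListMap[String, Set[String]] =
    columnLineage(Seq("t.id", "t.amount"), Seq(matchedUpdate, notMatchedInsert))
}
```

In this toy model, a target column picks up every column that any instruction could write into it, which is the relationship the lineage plugin would need to report for a rewritten merge plan.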

How was this patch tested?

Added new tests for the row-level catalog.

Was this patch authored or co-authored using generative AI tooling?

no


private def extractInstructionOutputs(instruction: Expression): Seq[Expression] = {
instruction match {
case p if p.nodeName == "Split" => getField[Seq[Expression]](p, "otherOutput")

@@ -46,6 +47,9 @@ trait LineageParser {
val SUBQUERY_COLUMN_IDENTIFIER = "__subquery__"
val AGGREGATE_COUNT_COLUMN_IDENTIFIER = "__count__"
val LOCAL_TABLE_IDENTIFIER = "__local__"
val METADATA_COL_ATTR_KEY = "__metadata_col"
val ORIGINAL_ROW_ID_VALUE_PREFIX: String = "__original_row_id_"
private val LOG = LoggerFactory.getLogger(classOf[LineageParser])
Member

import org.apache.spark.internal.Logging

... with Logging

Contributor Author

Sorry, I didn't use this LOG, so I deleted it. Also, the trait LineageParser doesn't seem to work with Logging.

override def catalogName: String = {
if (SPARK_RUNTIME_VERSION <= "3.1") {
"org.apache.spark.sql.connector.InMemoryTableCatalog"
} else if (SPARK_RUNTIME_VERSION <= "3.2") {
Member

We only support Spark 3.3 and above now.

Contributor Author

Kyuubi Spark Listener Extension

I see that the lineage plugin can support Spark 3.1.

Member

We have officially removed support for Spark versions prior to 3.2. Some code and docs are left over and still need cleaning up; such code is unreachable since its CI was removed.
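For readers unfamiliar with version gates like the `SPARK_RUNTIME_VERSION <= "3.1"` check in the diff above, here is a minimal sketch of a numeric major.minor comparison. `SimpleVersion` is a hypothetical helper written for illustration; it is not Kyuubi's implementation, and how `SPARK_RUNTIME_VERSION` actually compares versions is not shown in this thread. The point is only that a plain lexicographic string comparison would order "3.10" before "3.2", so a gate of this kind needs to compare the numeric parts.

```scala
object VersionGateSketch {
  // Hypothetical helper; not the type behind Kyuubi's SPARK_RUNTIME_VERSION.
  final case class SimpleVersion(major: Int, minor: Int) {
    /** True when this version is at or below the given "major.minor[.patch]" string. */
    def <=(other: String): Boolean = {
      val that = SimpleVersion.parse(other)
      major < that.major || (major == that.major && minor <= that.minor)
    }
  }

  object SimpleVersion {
    /** Parses the first two numeric components; any patch component is ignored. */
    def parse(text: String): SimpleVersion = {
      val parts = text.split('.')
      require(parts.length >= 2, s"expected major.minor, got '$text'")
      SimpleVersion(parts(0).toInt, parts(1).toInt)
    }
  }
}
```

This sketch does not handle suffixes such as `-SNAPSHOT` in the minor component; a real gate would need to account for them.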

@pan3793 pan3793 requested a review from wForget July 7, 2025 10:25
@yabola yabola force-pushed the master-listener branch from 4b1933d to 295941a Compare July 7, 2025 12:11
@yabola yabola force-pushed the master-listener branch from 295941a to 85de5b0 Compare July 7, 2025 12:43
codecov-commenter commented Jul 7, 2025

Codecov Report

Attention: Patch coverage is 0% with 34 lines in your changes missing coverage. Please review.

Project coverage is 0.00%. Comparing base (cad5a39) to head (00660f8).
Report is 19 commits behind head on master.

Files with missing lines Patch % Lines
...in/lineage/helper/SparkSQLLineageParseHelper.scala 0.00% 34 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master   #7127    +/-   ##
=======================================
  Coverage    0.00%   0.00%            
=======================================
  Files         697     700     +3     
  Lines       43203   43411   +208     
  Branches     5854    5886    +32     
=======================================
- Misses      43203   43411   +208     

@@ -307,7 +310,35 @@ trait LineageParser {
extractColumnsLineage(getQuery(plan), parentColumnsLineage).map { case (k, v) =>
k.withName(s"$table.${k.name}") -> v
}
case p if p.nodeName == "MergeRows" =>
val instructionsOutputs =

}.collect {
case (keyAttr: Attribute, instructionsOutput)
if instructionsOutput
.exists(!_.references.isEmpty) =>
Member

.exists(_.references.nonEmpty)
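The suggestion replaces a negated `isEmpty` with `nonEmpty`; the two predicates are equivalent. The sketch below illustrates this with plain Scala collections. Each instruction output is reduced to the set of column names it references, which is a simplification assumed for this example, and the column names are made up.

```scala
object NonEmptyReferencesSketch {
  // Hypothetical stand-in: an instruction output is modelled only by the columns it references.
  type References = Set[String]

  val candidates: Seq[(String, Seq[References])] = Seq(
    "t.id" -> Seq(Set("s.id"), Set.empty[String]),
    "t.flag" -> Seq(Set.empty[String], Set.empty[String]) // e.g. literals only
  )

  // Form used in the patch: negated isEmpty.
  val original: Seq[String] = candidates.collect {
    case (key, outputs) if outputs.exists(!_.isEmpty) => key
  }

  // Form suggested in review: nonEmpty.
  val suggested: Seq[String] = candidates.collect {
    case (key, outputs) if outputs.exists(_.nonEmpty) => key
  }
}
```

Both versions keep only the keys for which at least one instruction output references a column, so keys fed purely by literals are dropped.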

.map(extractInstructionOutputs)
val nextColumnsLineage = ListMap(p.output.indices.map { index =>
val keyAttr = p.output(index)
val instructionOutputs = instructionsOutputs.map(_(index))
Member

Is this index match always correct?

Contributor Author

Sorry, I am busy these days. I'll confirm later.
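The question is left open in this thread. One defensive option, sketched here under assumptions, is to check arity before indexing, so that a mismatch surfaces as an explicit result instead of an `IndexOutOfBoundsException`. `alignByPosition` is a hypothetical helper and not part of the patch. An arity check also cannot prove that position `i` of an instruction really corresponds to output column `i`; that still has to be confirmed against how Spark builds the node.

```scala
object AlignedOutputsSketch {
  /**
   * Groups instruction outputs by position, but only when every instruction
   * produces exactly one expression per output column.
   * Returns None on any arity mismatch instead of indexing blindly.
   */
  def alignByPosition[K, E](
      output: Seq[K],
      instructionsOutputs: Seq[Seq[E]]): Option[Seq[(K, Seq[E])]] =
    if (instructionsOutputs.forall(_.length == output.length)) {
      Some(output.zipWithIndex.map { case (key, index) =>
        key -> instructionsOutputs.map(_(index))
      })
    } else {
      None
    }
}
```

A caller could treat `None` as a signal to skip lineage extraction for the node or to log the unexpected plan shape.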

4 participants