
[BUG] MERGE operationMetrics numTargetRowsInserted reports inflated count compared to actual rows inserted #5864

@zhukovgreen

Bug Description

During a MERGE operation, the numTargetRowsInserted metric in operationMetrics reports significantly more rows than were actually inserted into the target table. The row count difference between the table versions before and after the merge shows that the correct number of rows was inserted; only the metric is wrong.

Environment

  • Delta Lake version: 2.4.0
  • Spark version: 3.4.1
  • Databricks Runtime version: 13.3 LTS

Steps to Reproduce

This is hard to reproduce because I can't share the real data. It has also happened only once, after many successful runs of the same job.

  1. Execute a MERGE operation with whenMatchedUpdateAll() and whenNotMatchedInsertAll() clauses on a large table (~185M rows)
  2. Compare operationMetrics.numTargetRowsInserted with actual row count difference between versions
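The comparison in step 2 can be sketched roughly as below. This is only an illustration, not the job's actual code: the helper names (`row_count`, `reported_inserts`, `insert_discrepancy`) are hypothetical, `spark` is assumed to be an active SparkSession, and the row-count delta equals the number of inserts only when the merge deleted nothing (true here, since numTargetRowsDeleted is "0").

```python
def row_count(spark, table, version):
    """Count rows of `table` at a given Delta version using time travel."""
    rows = spark.sql(
        f"SELECT COUNT(*) AS n FROM {table} VERSION AS OF {version}"
    ).collect()
    return rows[0]["n"]


def reported_inserts(spark, table, version):
    """Read numTargetRowsInserted for `version` from DESCRIBE HISTORY."""
    for row in spark.sql(f"DESCRIBE HISTORY {table}").collect():
        if row["version"] == version:
            # operationMetrics values are strings, so convert explicitly.
            return int(row["operationMetrics"]["numTargetRowsInserted"])
    raise ValueError(f"version {version} not found in history of {table}")


def insert_discrepancy(spark, table, version):
    """Reported inserts minus actual row growth for a MERGE commit.

    Assumes the commit at `version` deleted no rows, so the growth
    between `version - 1` and `version` is the number of inserted rows.
    """
    actual = row_count(spark, table, version) - row_count(spark, table, version - 1)
    return reported_inserts(spark, table, version) - actual
```

With the numbers from this report, `insert_discrepancy(spark, "table", 751)` would be expected to return 71310.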

Observed Results

MERGE operation metrics from DESCRIBE HISTORY:

operationMetrics:
  numTargetRowsCopied: "0"
  numTargetRowsDeleted: "0"
  numTargetRowsUpdated: "0"
  numTargetRowsMatchedUpdated: "0"
  numTargetRowsInserted: "805044"      <-- INCORRECT
  numOutputRows: "805044"
  numSourceRows: "733735"
  numTargetFilesAdded: "15"
  numTargetFilesRemoved: "0"
  numTargetRowsNotMatchedBySourceDeleted: "0"
  numTargetRowsNotMatchedBySourceUpdated: "0"

Merge predicate:

predicate: "(ParsedResumeJsonId#4703 = ParsedResumeJsonId#4437)"
matchedPredicates: "[{\"predicate\":\"NOT (sha1(...) = sha1(...))\",\"actionType\":\"update\"}]"
notMatchedPredicates: "[{\"actionType\":\"insert\"}]"
notMatchedBySourcePredicates: "[]"

Actual row counts (verified via SQL):

-- Version 751 (after merge): 186,204,997 rows
SELECT COUNT(*) FROM table;

-- Version 750 (before merge): 185,471,263 rows  
SELECT COUNT(*) FROM table VERSION AS OF 750;

-- Actual rows inserted: 186,204,997 - 185,471,263 = 733,734

Expected vs Actual

| Metric | Reported Value | Actual Value | Discrepancy |
|---|---|---|---|
| numSourceRows | 733,735 | 733,735 | ✅ Correct |
| numTargetRowsInserted | 805,044 | ~733,734 | +71,310 inflated (~9.7%) |
| numOutputRows | 805,044 | ~733,734 | +71,310 inflated (~9.7%) |

Additional Verification Performed

  • ✅ No duplicates in source (bronze) table on merge key
  • ✅ No duplicates in target table on merge key after merge
  • ✅ No concurrent writes to source during merge
  • ✅ No duplicate merge operations in table history
  • ✅ Schema matches between source and target tables
  • ✅ Merge key (ParsedResumeJsonId) has same data type in both tables
  • ✅ Source table is a regular Delta table (not DLT streaming table)
  • ✅ Source table was written before merge operation started

Impact

This bug causes false positives in data quality checks that rely on the relationship:

numTargetRowsInserted <= numSourceRows

Since numTargetRowsInserted > numSourceRows is mathematically impossible in a standard MERGE (each source row can only be inserted once), monitoring systems flag this as a "duplicates in target" issue when no actual data problem exists.
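As a possible workaround on the monitoring side, the check could fall back to the real row counts before raising a duplicates alert. The sketch below is hypothetical (the function name `classify_merge` and its return labels are made up for illustration), and it assumes numTargetRowsDeleted covers every row the merge deleted.

```python
def classify_merge(metrics, rows_before, rows_after):
    """Classify a MERGE commit from its operationMetrics and real row counts.

    `metrics` is the operationMetrics map from DESCRIBE HISTORY
    (string keys and string values). Returns one of:
    "ok", "metric_inflated", "possible_duplicates".
    """
    inserted = int(metrics["numTargetRowsInserted"])
    source = int(metrics["numSourceRows"])
    # Assumption: this metric counts all rows deleted by the merge.
    deleted = int(metrics.get("numTargetRowsDeleted", "0"))

    # The invariant from this report holds, nothing to flag.
    if inserted <= source:
        return "ok"

    # The metric violates the invariant; cross-check against the table.
    # Actual inserts = net row growth plus rows deleted by the merge.
    actual_inserted = (rows_after - rows_before) + deleted
    if actual_inserted <= source:
        return "metric_inflated"
    return "possible_duplicates"
```

Fed the values from this issue (versions 750 and 751), the function would return "metric_inflated" rather than flagging duplicates in the target.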
