Bug Description
During a MERGE operation, the numTargetRowsInserted metric in operationMetrics reports significantly more rows than were actually inserted into the target table. The actual row count difference between table versions confirms the correct number of rows were inserted, but the metric is incorrect.
Environment
- Delta Lake version: 2.4.0
- Spark version: 3.4.1
- Databricks Runtime version: 13.3 LTS
Steps to Reproduce
This is hard to reproduce because I can't share the real data. It also happened only once, after many previously successful runs.
- Execute a MERGE operation with whenMatchedUpdateAll() and whenNotMatchedInsertAll() clauses on a large table (~185M rows)
- Compare operationMetrics.numTargetRowsInserted with the actual row count difference between versions
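The two steps above can be sketched in PySpark roughly as follows. This is a minimal sketch, not the job that produced the report: it assumes `pyspark` and `delta-spark` are installed, that `spark`, `target_name`, and `source_df` are supplied by the caller, and that no other writer commits to the table while it runs. The helper names `merge_condition` and `merge_and_compare` are hypothetical, and the conditional update predicate from the report is left out because its `sha1(...)` arguments are elided there.

```python
def merge_condition(key, target_alias="t", source_alias="s"):
    """Build the equality predicate on the merge key."""
    return f"{target_alias}.{key} = {source_alias}.{key}"


def merge_and_compare(spark, target_name, source_df, key):
    """Run the MERGE, then return (reported inserts, observed row growth)."""
    # Imported lazily so merge_condition() works without delta-spark installed.
    from delta.tables import DeltaTable

    target = DeltaTable.forName(spark, target_name)
    (
        target.alias("t")
        .merge(source_df.alias("s"), merge_condition(key))
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

    # Assumption: the latest history entry is the merge that just ran.
    last = target.history(1).collect()[0]
    version = last["version"]
    reported = int(last["operationMetrics"]["numTargetRowsInserted"])

    def count_at(v):
        # Same time-travel query used in the report's verification step.
        query = f"SELECT COUNT(*) FROM {target_name} VERSION AS OF {v}"
        return spark.sql(query).collect()[0][0]

    # Row growth equals the insert count only if the merge deleted no rows.
    return reported, count_at(version) - count_at(version - 1)
```

If the two returned numbers differ while the merge deleted nothing, the metric and the table disagree in the way described in this issue.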
Observed Results
MERGE operation metrics from DESCRIBE HISTORY:
operationMetrics:
numTargetRowsCopied: "0"
numTargetRowsDeleted: "0"
numTargetRowsUpdated: "0"
numTargetRowsMatchedUpdated: "0"
numTargetRowsInserted: "805044" <-- INCORRECT
numOutputRows: "805044"
numSourceRows: "733735"
numTargetFilesAdded: "15"
numTargetFilesRemoved: "0"
numTargetRowsNotMatchedBySourceDeleted: "0"
numTargetRowsNotMatchedBySourceUpdated: "0"
Merge predicate:
predicate: "(ParsedResumeJsonId#4703 = ParsedResumeJsonId#4437)"
matchedPredicates: "[{\"predicate\":\"NOT (sha1(...) = sha1(...))\",\"actionType\":\"update\"}]"
notMatchedPredicates: "[{\"actionType\":\"insert\"}]"
notMatchedBySourcePredicates: "[]"
Actual row counts (verified via SQL):
-- Version 751 (after merge): 186,204,997 rows
SELECT COUNT(*) FROM table;
-- Version 750 (before merge): 185,471,263 rows
SELECT COUNT(*) FROM table VERSION AS OF 750;
-- Actual rows inserted: 186,204,997 - 185,471,263 = 733,734
Expected vs Actual
| Metric | Reported Value | Actual Value | Discrepancy |
|---|---|---|---|
| numSourceRows | 733,735 | 733,735 | ✅ Correct |
| numTargetRowsInserted | 805,044 | ~733,734 | ❌ +71,310 inflated (~9.7%) |
| numOutputRows | 805,044 | ~733,734 | ❌ +71,310 inflated (~9.7%) |
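The arithmetic behind the table can be rechecked with a few lines of Python. This is a minimal sketch using only the numbers reported above. The function name `inserted_discrepancy` is hypothetical, and it assumes the merge deleted no rows (numTargetRowsDeleted is "0" here), so the row count difference equals the number of inserted rows.

```python
def inserted_discrepancy(metrics, rows_before, rows_after):
    """Compare the reported insert metric with the observed row count change."""
    # operationMetrics values are strings in the DESCRIBE HISTORY output.
    reported = int(metrics["numTargetRowsInserted"])
    # Valid as the insert count only when the merge deleted nothing.
    actual = rows_after - rows_before
    excess = reported - actual
    # Guard against a merge that changed no rows.
    percent = round(100.0 * excess / actual, 1) if actual else None
    return reported, actual, excess, percent


metrics = {"numTargetRowsInserted": "805044", "numSourceRows": "733735"}
print(inserted_discrepancy(metrics, 185_471_263, 186_204_997))
# prints (805044, 733734, 71310, 9.7)
```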
Additional Verification Performed
- ✅ No duplicates in source (bronze) table on merge key
- ✅ No duplicates in target table on merge key after merge
- ✅ No concurrent writes to source during merge
- ✅ No duplicate merge operations in table history
- ✅ Schema matches between source and target tables
- ✅ Merge key (ParsedResumeJsonId) has same data type in both tables
- ✅ Source table is a regular Delta table (not DLT streaming table)
- ✅ Source table was written before merge operation started
Impact
This bug causes false positives in data quality checks that rely on the relationship:
numTargetRowsInserted <= numSourceRows
Since numTargetRowsInserted > numSourceRows is mathematically impossible in a standard MERGE (each source row can only be inserted once), monitoring systems flag this as a "duplicates in target" issue when no actual data problem exists.
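A monitoring job could reduce these false positives by cross-checking the metric against the row count difference before raising a duplicate alert. The sketch below is one possible shape for such a check, not the monitoring code actually in use. `classify_merge` and its return labels are hypothetical, and it assumes the row counts for the versions before and after the merge have already been queried.

```python
def classify_merge(metrics, rows_before, rows_after):
    """Classify a MERGE commit using its metrics and the observed row counts."""
    inserted = int(metrics["numTargetRowsInserted"])
    source = int(metrics["numSourceRows"])
    deleted = int(metrics.get("numTargetRowsDeleted", "0"))

    # The invariant the data quality check relies on.
    if inserted <= source:
        return "ok"

    # The metric violates the invariant; check whether the table itself does.
    # Rows actually added = net growth plus any rows the merge deleted.
    rows_added = rows_after - rows_before + deleted
    if rows_added <= source:
        return "metric_inflated"
    return "possible_duplicates"


reported = {
    "numTargetRowsInserted": "805044",
    "numSourceRows": "733735",
    "numTargetRowsDeleted": "0",
}
print(classify_merge(reported, 185_471_263, 186_204_997))
# prints metric_inflated
```

With the values from this report the table grew by fewer rows than the source contained, so the check attributes the violation to the metric, not to the data.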