Move implementation of upsert from Table to Transaction #1817
Conversation
I think since the transaction wrapper has been moved out, there should be a unit test added that does a partial upsert, then throws an error, and ensures the rollback occurs so we are not left in a state where a partial upsert succeeded. Example:
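A minimal sketch of such a test (my own illustration, not from the thread), assuming pytest, a `catalog` fixture, the `arrow_schema` from the example further down, and that the transaction context manager discards staged changes when an exception escapes:

```python
import pyarrow as pa
import pytest

def test_upsert_rolls_back_on_error(catalog, arrow_schema):
    tbl = catalog.create_table("default.upsert_rollback", schema=arrow_schema)
    tbl.append(pa.Table.from_pylist([{"id": 1, "name": "Alice"}], schema=arrow_schema))

    df = pa.Table.from_pylist(
        [{"id": 1, "name": "Alicia"}, {"id": 2, "name": "Bob"}], schema=arrow_schema
    )

    # Fail after the upsert has staged its changes but before the transaction
    # commits; nothing staged should survive the rollback.
    with pytest.raises(RuntimeError):
        with tbl.transaction() as txn:
            txn.upsert(df, join_cols=["id"])
            raise RuntimeError("simulated failure before commit")

    # The table must still contain only the original row.
    result = tbl.scan().to_arrow()
    assert result.num_rows == 1
    assert result["name"].to_pylist() == ["Alice"]
```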
Just my thoughts 😃. Thanks!
Agree! I will work on the test. By "update" you mean "delete", right?
Hey, sorry, just saw this. By "update" I mean it invokes an "overwrite" operation, which I believe is what deletes also trigger under the covers. 😀
There is a nice edge case here:

```python
tbl = catalog.create_table(identifier, schema=schema)

# Define exact schema: required int32 and required string
arrow_schema = pa.schema([
    pa.field("id", pa.int32(), nullable=False),
    pa.field("name", pa.string(), nullable=False),
])

tbl.append(pa.Table.from_pylist([{"id": 1, "name": "Alice"}], schema=arrow_schema))

df = pa.Table.from_pylist([{"id": 2, "name": "Bob"}, {"id": 1, "name": "Alicia"}], schema=arrow_schema)

with tbl.transaction() as txn:
    txn.upsert(df, join_cols=["id"])

    # This will re-insert Bob, instead of reading the uncommitted changes and ignoring Bob
    txn.upsert(df, join_cols=["id"])
```

@Fokko should it be possible to read uncommitted changes?
Now that #1903 is merged, could you rebase this PR?
Force-pushed from 56888d3 to f336c0b.
```python
matched_predicate = upsert_util.create_match_filter(df, join_cols)

# We must use Transaction.table_metadata for the scan. This includes all uncommitted - but relevant - changes.
matched_iceberg_table = DataScan(
```
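The excerpt cuts off at the constructor call. A hedged reconstruction of the scan it builds — the argument names follow my reading of pyiceberg's `DataScan`, not the diff itself:

```python
# Sketch: build the scan from the transaction's own metadata so that staged
# (uncommitted) deletes/appends are visible to the upsert's match query.
matched_iceberg_table = DataScan(
    table_metadata=self.table_metadata,  # Transaction.table_metadata, not the committed snapshot
    io=self._table.io,
    row_filter=matched_predicate,
    case_sensitive=case_sensitive,
).to_arrow()
```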
Most important change. Requires #1903.
Nice, thanks for pointing out, and also covering this in the tests 👍
```python
txn.delete(delete_filter="id = 1")
txn.append(df)

# This should read the uncommitted changes
```
Test uncommitted changes are read
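The shape of that test as I read the excerpt — the upsert call and the final assertion are assumed, not shown in the diff:

```python
with tbl.transaction() as txn:
    txn.delete(delete_filter="id = 1")
    txn.append(df)

    # The upsert must see the staged delete/append above, so it should not
    # treat id = 1 as both an existing row and a new row.
    txn.upsert(df, join_cols=["id"])

# Assumed assertion: exactly one row for id = 1 after the commit.
assert tbl.scan(row_filter="id = 1").to_arrow().num_rows == 1
```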
Looks great @koenvo
### Summary

This PR updates the upsert logic to use batch processing. The main goal is to prevent out-of-memory (OOM) issues when updating large tables by avoiding loading all data at once.

**Note:** This has only been tested against the unit tests—no real-world datasets have been evaluated yet.

This PR partially depends on functionality introduced in [#1817](apache/iceberg#1817).

---

### Notes

- Duplicate detection across multiple batches is **not** possible with this approach.
- ~All data is read sequentially, which may be slower than the parallel read used by `to_arrow`.~ Fixed using the `concurrent_tasks` parameter.

---

### Performance Comparison

In setups with many small files, network and metadata overhead become the dominant factor. This impacts batch reading performance, as each file contributes relatively more overhead than payload. In the test setup used here, metadata access was the largest cost.

#### Using `to_arrow_batch_reader` (sequential)

- **Scan:** 9993.50 ms
- **To list:** 19811.09 ms

#### Using `to_arrow` (parallel)

- **Scan:** 10607.88 ms

Co-authored-by: Fokko Driesprong <[email protected]>
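For illustration, a minimal sketch of the batch-reading idea using pyiceberg's `DataScan.to_arrow_batch_reader()`; the per-batch comparison logic and the new `concurrent_tasks` knob from that PR are out of scope here:

```python
import pyarrow as pa

def iter_matched_batches(tbl, matched_predicate):
    """Stream matched rows batch by batch instead of materializing the whole
    table with to_arrow(), keeping peak memory bounded by the batch size."""
    reader = tbl.scan(row_filter=matched_predicate).to_arrow_batch_reader()
    for batch in reader:  # each item is a pyarrow.RecordBatch
        # Convert to a Table so it can be joined/filtered against the incoming
        # dataframe; only this batch is held in memory at a time.
        yield pa.Table.from_batches([batch])
```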
Rationale for this change

Previously, the upsert functionality was implemented at the table level, which meant it always initiated a new Transaction. This change moves the upsert implementation to the Transaction level while keeping `table.upsert(...)` as an entry point.

With this refactor, end users now have the flexibility to call upsert in two ways:

- `table.upsert(...)` – which still starts a new transaction.
- `transaction.upsert(...)` – allowing upserts within an existing transaction.

Are these changes tested?
Using existing tests.
Are there any user-facing changes?
Yes. This change enables users to perform upserts within an existing transaction using `transaction.upsert(...)`, in addition to the existing `table.upsert(...)` method.
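A minimal usage sketch of the two entry points, reusing `tbl`, `df`, and `arrow_schema` from the edge-case example above (the staged delete is just an illustrative second change):

```python
import pyarrow as pa

df = pa.Table.from_pylist([{"id": 1, "name": "Alicia"}], schema=arrow_schema)

# 1. Table-level entry point: starts and commits its own transaction.
tbl.upsert(df, join_cols=["id"])

# 2. Transaction-level entry point: the upsert is staged alongside other
#    changes, and everything commits atomically when the block exits.
with tbl.transaction() as txn:
    txn.delete(delete_filter="id = 99")
    txn.upsert(df, join_cols=["id"])
```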