
Metadata entries table breaks when the table is configured as Merge-on-Read and has Delete Files #1884

Closed
1 of 3 tasks
guptaakashdeep opened this issue Apr 6, 2025 · 1 comment · Fixed by #1902

Comments

@guptaakashdeep
Contributor

Apache Iceberg version

0.9.0 (latest release)

Please describe the bug 🐞

Issue:

table.inspect.entries() fails when the table is a merge-on-read (MOR) table and has Delete Files present in it. The Iceberg MOR table is created via Apache Spark 3.5.0 with Iceberg 1.5.0 and is read via PyIceberg 0.9.0 using StaticTable.from_metadata().

Stacktrace:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[2], line 1
----> 1 table.inspect.entries()

File ~/Documents/project-repos/git-repos/lakehouse-health-analyzer/venv/lib/python3.12/site-packages/pyiceberg/table/inspect.py:208, in InspectTable.entries(self, snapshot_id)
    188         partition = entry.data_file.partition
    189         partition_record_dict = {
    190             field.name: partition[pos]
    191             for pos, field in enumerate(self.tbl.metadata.specs()[manifest.partition_spec_id].fields)
    192         }
    194         entries.append(
    195             {
    196                 "status": entry.status.value,
    197                 "snapshot_id": entry.snapshot_id,
    198                 "sequence_number": entry.sequence_number,
    199                 "file_sequence_number": entry.file_sequence_number,
    200                 "data_file": {
    201                     "content": entry.data_file.content,
    202                     "file_path": entry.data_file.file_path,
    203                     "file_format": entry.data_file.file_format,
    204                     "partition": partition_record_dict,
    205                     "record_count": entry.data_file.record_count,
    206                     "file_size_in_bytes": entry.data_file.file_size_in_bytes,
    207                     "column_sizes": dict(entry.data_file.column_sizes),
--> 208                     "value_counts": dict(entry.data_file.value_counts),
    209                     "null_value_counts": dict(entry.data_file.null_value_counts),
    210                     "nan_value_counts": dict(entry.data_file.nan_value_counts),
    211                     "lower_bounds": entry.data_file.lower_bounds,
    212                     "upper_bounds": entry.data_file.upper_bounds,
    213                     "key_metadata": entry.data_file.key_metadata,
    214                     "split_offsets": entry.data_file.split_offsets,
    215                     "equality_ids": entry.data_file.equality_ids,
    216                     "sort_order_id": entry.data_file.sort_order_id,
    217                     "spec_id": entry.data_file.spec_id,
    218                 },
    219                 "readable_metrics": readable_metrics,
    220             }
    221         )
    223 return pa.Table.from_pylist(
    224     entries,
    225     schema=entries_schema,
    226 )

TypeError: 'NoneType' object is not iterable

Replication

This issue can be replicated by following the instructions below:

  1. Create an Iceberg MOR table using Spark 3.5.0 with Iceberg 1.5.0
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, array, rand

DW_PATH='../warehouse'
spark = SparkSession.builder \
    .master("local[4]") \
    .appName("iceberg-mor-test") \
    .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0,org.apache.spark:spark-avro_2.12:3.5.0')\
    .config('spark.sql.extensions','org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')\
    .config('spark.sql.catalog.local','org.apache.iceberg.spark.SparkCatalog') \
    .config('spark.sql.catalog.local.type','hadoop') \
    .config('spark.sql.catalog.local.warehouse',DW_PATH) \
    .getOrCreate()

t1 = spark.range(10000).withColumn("year", lit(2023))
t1 = t1.withColumn("business_vertical", 
                array(lit("Retail"), lit("SME"), lit("Cor"), lit("Analytics"))\
                        .getItem((rand()*4).cast("int")))

t1.coalesce(1).writeTo('local.db.pyic_mor_test').partitionedBy('year').using('iceberg')\
    .tableProperty('format-version','2')\
    .tableProperty('write.delete.mode','merge-on-read')\
    .tableProperty('write.update.mode','merge-on-read')\
    .tableProperty('write.merge.mode','merge-on-read')\
    .create()
  2. Checking the table properties to make sure the table is MOR
spark.sql("SHOW TBLPROPERTIES local.db.pyic_mor_test").show(truncate=False)
+-------------------------------+-------------------+
|key                            |value              |
+-------------------------------+-------------------+
|current-snapshot-id            |2543645387796664537|
|format                         |iceberg/parquet    |
|format-version                 |2                  |
|write.delete.mode              |merge-on-read      |
|write.merge.mode               |merge-on-read      |
|write.parquet.compression-codec|zstd               |
|write.update.mode              |merge-on-read      |
+-------------------------------+-------------------+
  3. Running an UPDATE statement to generate a Delete File
spark.sql(f"UPDATE local.db.pyic_mor_test SET business_vertical = 'DataEngineering' WHERE id > 7000")
  4. Checking if the Delete File is generated
spark.table(f"local.db.pyic_mor_test.delete_files").show()
+-------+--------------------+-----------+-------+---------+------------+------------------+--------------------+------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
|content|           file_path|file_format|spec_id|partition|record_count|file_size_in_bytes|        column_sizes|value_counts|null_value_counts|nan_value_counts|        lower_bounds|        upper_bounds|key_metadata|split_offsets|equality_ids|sort_order_id|    readable_metrics|
+-------+--------------------+-----------+-------+---------+------------+------------------+--------------------+------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
|      1|/Users/akashdeepg...|    PARQUET|      0|   {2023}|        2999|              4878|{2147483546 -> 21...|        NULL|             NULL|            NULL|{2147483546 -> [2...|{2147483546 -> [2...|        NULL|          [4]|        NULL|         NULL|{{NULL, NULL, NUL...|
+-------+--------------------+-----------+-------+---------+------------+------------------+--------------------+------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+

Reading the Spark-created table from PyIceberg

from pyiceberg.table import StaticTable

# Using latest metadata.json path
metadata_path = "./warehouse/db/pyic_mor_test/metadata/v2.metadata.json"

table = StaticTable.from_metadata(metadata_path)

# This will break with the stacktrace provided above
table.inspect.entries()

Issue found after debugging

I did some debugging and figured out that inspect.entries() breaks for MOR tables while reading the *-delete.parquet files present in the table.

While reading the Delete file, value_counts comes back as null. I can see that ManifestEntryStatus is ADDED and the DataFile content is DataFileContent.POSITION_DELETES, which seems correct.
I looked further into the manifest.avro file that holds the entry for the delete parquet files, and value_counts is NULL there as well. That is the reason entry.data_file.value_counts comes back as null.

value_counts being null can also be seen above in the output of the delete_files table query.
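
For reference, a minimal sketch of how this can be confirmed directly from the manifests via PyIceberg (assuming the Snapshot.manifests() and ManifestFile.fetch_manifest_entry() APIs that inspect.py itself uses; the metadata path is the one from the replication steps above):

from pyiceberg.manifest import DataFileContent
from pyiceberg.table import StaticTable

# Load the same table used in the replication steps above.
table = StaticTable.from_metadata("./warehouse/db/pyic_mor_test/metadata/v2.metadata.json")
snapshot = table.current_snapshot()

# Walk every manifest entry and print value_counts for positional delete files.
for manifest in snapshot.manifests(table.io):
    for entry in manifest.fetch_manifest_entry(table.io):
        if entry.data_file.content == DataFileContent.POSITION_DELETES:
            # For the *-delete.parquet entry this prints None, matching the
            # NULL value_counts column in the delete_files output above.
            print(entry.data_file.file_path, entry.data_file.value_counts)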

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
@guptaakashdeep
Contributor Author

guptaakashdeep commented Apr 6, 2025

@kevinjqliu Please let me know if any more details need to be added here.

I looked further into the code to fix the issue, and it seems to be a simple fix in inspect.py in the InspectTable.entries method. I believe we need NULL handling while appending to the entries list here: inspect.py

In addition to value_counts, I believe we need null handling for null_value_counts and nan_value_counts as well.
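
For illustration, a minimal sketch of the kind of null handling meant here (the _dict_or_none helper name is hypothetical, and this is not necessarily the exact change made in #1902):

def _dict_or_none(value):
    # Delete-file entries can carry None for the per-column metric maps,
    # so only convert to a plain dict when the field is actually present.
    return dict(value) if value is not None else None

# Inside InspectTable.entries(), the affected fields would then be built as:
#     "column_sizes": _dict_or_none(entry.data_file.column_sizes),
#     "value_counts": _dict_or_none(entry.data_file.value_counts),
#     "null_value_counts": _dict_or_none(entry.data_file.null_value_counts),
#     "nan_value_counts": _dict_or_none(entry.data_file.nan_value_counts),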


Fokko pushed a commit that referenced this issue Apr 16, 2025
#1902)

Closes #1884 

# Rationale for this change
table.inspect.entries() fails when the table is a MOR table and has Delete
Files present in it. The Iceberg MOR table is created via Apache Spark 3.5.0
with Iceberg 1.5.0 and is read via PyIceberg 0.9.0 using
StaticTable.from_metadata().


# Are these changes tested?
Yes

# Are there any user-facing changes?
No

Fokko pushed a commit that referenced this issue Apr 17, 2025