
Metadata entries table breaks when the table is configured as Merge-on-Read and has Delete Files #1884

Closed
1 of 3 tasks
guptaakashdeep opened this issue Apr 6, 2025 · 1 comment · Fixed by #1902

Comments

@guptaakashdeep
Contributor

Apache Iceberg version

0.9.0 (latest release)

Please describe the bug 🐞

Issue:

table.inspect.entries() fails when the table is a merge-on-read (MOR) table and has Delete Files present in it. The Iceberg MOR table is created via Apache Spark 3.5.0 with Iceberg 1.5.0 and is read via PyIceberg 0.9.0 using StaticTable.from_metadata().

Stacktrace:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[2], line 1
----> 1 table.inspect.entries()

File ~/Documents/project-repos/git-repos/lakehouse-health-analyzer/venv/lib/python3.12/site-packages/pyiceberg/table/inspect.py:208, in InspectTable.entries(self, snapshot_id)
    188         partition = entry.data_file.partition
    189         partition_record_dict = {
    190             field.name: partition[pos]
    191             for pos, field in enumerate(self.tbl.metadata.specs()[manifest.partition_spec_id].fields)
    192         }
    194         entries.append(
    195             {
    196                 "status": entry.status.value,
    197                 "snapshot_id": entry.snapshot_id,
    198                 "sequence_number": entry.sequence_number,
    199                 "file_sequence_number": entry.file_sequence_number,
    200                 "data_file": {
    201                     "content": entry.data_file.content,
    202                     "file_path": entry.data_file.file_path,
    203                     "file_format": entry.data_file.file_format,
    204                     "partition": partition_record_dict,
    205                     "record_count": entry.data_file.record_count,
    206                     "file_size_in_bytes": entry.data_file.file_size_in_bytes,
    207                     "column_sizes": dict(entry.data_file.column_sizes),
--> 208                     "value_counts": dict(entry.data_file.value_counts),
    209                     "null_value_counts": dict(entry.data_file.null_value_counts),
    210                     "nan_value_counts": dict(entry.data_file.nan_value_counts),
    211                     "lower_bounds": entry.data_file.lower_bounds,
    212                     "upper_bounds": entry.data_file.upper_bounds,
    213                     "key_metadata": entry.data_file.key_metadata,
    214                     "split_offsets": entry.data_file.split_offsets,
    215                     "equality_ids": entry.data_file.equality_ids,
    216                     "sort_order_id": entry.data_file.sort_order_id,
    217                     "spec_id": entry.data_file.spec_id,
    218                 },
    219                 "readable_metrics": readable_metrics,
    220             }
    221         )
    223 return pa.Table.from_pylist(
    224     entries,
    225     schema=entries_schema,
    226 )

TypeError: 'NoneType' object is not iterable

Replication

This issue can be replicated by following the instructions below:

  1. Create an Iceberg MOR table using Spark 3.5.0 with Iceberg 1.5.0
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, array, rand

DW_PATH='../warehouse'
spark = SparkSession.builder \
    .master("local[4]") \
    .appName("iceberg-mor-test") \
    .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0,org.apache.spark:spark-avro_2.12:3.5.0')\
    .config('spark.sql.extensions','org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')\
    .config('spark.sql.catalog.local','org.apache.iceberg.spark.SparkCatalog') \
    .config('spark.sql.catalog.local.type','hadoop') \
    .config('spark.sql.catalog.local.warehouse',DW_PATH) \
    .getOrCreate()

t1 = spark.range(10000).withColumn("year", lit(2023))
t1 = t1.withColumn("business_vertical", 
                array(lit("Retail"), lit("SME"), lit("Cor"), lit("Analytics"))\
                        .getItem((rand()*4).cast("int")))

t1.coalesce(1).writeTo('local.db.pyic_mor_test').partitionedBy('year').using('iceberg')\
    .tableProperty('format-version','2')\
    .tableProperty('write.delete.mode','merge-on-read')\
    .tableProperty('write.update.mode','merge-on-read')\
    .tableProperty('write.merge.mode','merge-on-read')\
    .create()
  2. Checking the table properties to make sure the table is MOR
spark.sql("SHOW TBLPROPERTIES local.db.pyic_mor_test").show(truncate=False)
+-------------------------------+-------------------+
|key                            |value              |
+-------------------------------+-------------------+
|current-snapshot-id            |2543645387796664537|
|format                         |iceberg/parquet    |
|format-version                 |2                  |
|write.delete.mode              |merge-on-read      |
|write.merge.mode               |merge-on-read      |
|write.parquet.compression-codec|zstd               |
|write.update.mode              |merge-on-read      |
+-------------------------------+-------------------+
  3. Running an UPDATE statement to generate a Delete File
spark.sql(f"UPDATE local.db.pyic_mor_test SET business_vertical = 'DataEngineering' WHERE id > 7000")
  4. Checking if the Delete File is generated
spark.table(f"local.db.pyic_mor_test.delete_files").show()
+-------+--------------------+-----------+-------+---------+------------+------------------+--------------------+------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
|content|           file_path|file_format|spec_id|partition|record_count|file_size_in_bytes|        column_sizes|value_counts|null_value_counts|nan_value_counts|        lower_bounds|        upper_bounds|key_metadata|split_offsets|equality_ids|sort_order_id|    readable_metrics|
+-------+--------------------+-----------+-------+---------+------------+------------------+--------------------+------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
|      1|/Users/akashdeepg...|    PARQUET|      0|   {2023}|        2999|              4878|{2147483546 -> 21...|        NULL|             NULL|            NULL|{2147483546 -> [2...|{2147483546 -> [2...|        NULL|          [4]|        NULL|         NULL|{{NULL, NULL, NUL...|
+-------+--------------------+-----------+-------+---------+------------+------------------+--------------------+------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+

Reading the Spark-created table from PyIceberg

from pyiceberg.table import StaticTable

# Using latest metadata.json path
metadata_path = "./warehouse/db/pyic_mor_test/metadata/v2.metadata.json"

table = StaticTable.from_metadata(metadata_path)

# This will break with the stacktrace provided above
table.inspect.entries()

Issue found after debugging

I did some debugging and figured out that inspect.entries() breaks for MOR tables while reading the *-delete.parquet files present in the table.

While reading the Delete file, value_counts comes back as null. I can see that ManifestEntryStatus is ADDED and the DataFile content is DataFileContent.POSITION_DELETES, which seems correct.
I looked further into the manifest.avro file that holds the entry for the delete parquet files, and value_counts is NULL there as well. That is the reason entry.data_file.value_counts comes back as null.

value_counts being null can also be seen above in the output of the delete_files table query.
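
For reference, a minimal sketch of how this can be confirmed directly from the manifests via PyIceberg (assuming the Snapshot.manifests() and ManifestFile.fetch_manifest_entry() APIs that inspect.py itself uses; the metadata path is the one from the replication steps above):

from pyiceberg.manifest import DataFileContent
from pyiceberg.table import StaticTable

# Load the same table used in the replication steps above.
table = StaticTable.from_metadata("./warehouse/db/pyic_mor_test/metadata/v2.metadata.json")
snapshot = table.current_snapshot()

# Walk every manifest entry and print value_counts for positional delete files.
for manifest in snapshot.manifests(table.io):
    for entry in manifest.fetch_manifest_entry(table.io):
        if entry.data_file.content == DataFileContent.POSITION_DELETES:
            # For the *-delete.parquet entry this prints None, matching the
            # NULL value_counts column in the delete_files output above.
            print(entry.data_file.file_path, entry.data_file.value_counts)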

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
@guptaakashdeep
Contributor Author

guptaakashdeep commented Apr 6, 2025

@kevinjqliu Please let me know if any more details need to be added here.

I looked further into the code to fix the issue, and it seems to be a simple fix in inspect.py in the InspectTable.entries method. I believe we need NULL handling while appending to the entries list here: inspect.py

In addition to value_counts, I believe we need null handling for null_value_counts and nan_value_counts as well.
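
For illustration, a minimal sketch of the kind of null handling meant here (the _dict_or_none helper name is hypothetical, and this is not necessarily the exact change made in #1902):

def _dict_or_none(value):
    # Delete-file entries can carry None for the per-column metric maps,
    # so only convert to a plain dict when the field is actually present.
    return dict(value) if value is not None else None

# Inside InspectTable.entries(), the affected fields would then be built as:
#     "column_sizes": _dict_or_none(entry.data_file.column_sizes),
#     "value_counts": _dict_or_none(entry.data_file.value_counts),
#     "null_value_counts": _dict_or_none(entry.data_file.null_value_counts),
#     "nan_value_counts": _dict_or_none(entry.data_file.nan_value_counts),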


Fokko pushed a commit that referenced this issue Apr 16, 2025
#1902)

Closes #1884 

# Rationale for this change
table.inspect.entries() fails when the table is a MOR table and has Delete
Files present in it. The Iceberg MOR table is created via Apache Spark 3.5.0
with Iceberg 1.5.0 and is read via PyIceberg 0.9.0 using
StaticTable.from_metadata().


# Are these changes tested?
Yes

# Are there any user-facing changes?
No

Fokko pushed a commit that referenced this issue Apr 17, 2025