
Generated checkpoint files are unexpectedly large and potentially broken #187

@igor-lobanov-maersk

Description


We use the kafka-delta-ingest daemon to write data from a Kafka topic to a Delta Lake table on an ADLS Gen2 account, and then use Dremio to process the data onwards. About 10 GB of data is saved daily, and the table size is currently approaching 1 TB. Until recently we did not have metadata checkpointing enabled on the table, which led to queries becoming progressively slower.

First we tried checkpointing the table with Spark, and it worked well. The checkpoint parquet file generated by Spark is around 45 MB, and query performance improved considerably.

Then we enabled checkpointing in kafka-delta-ingest. As expected, we observed new checkpoint files being created every 10th commit. However, for some reason each checkpoint file is more than 500 MB. More crucially, Dremio seems to have a problem with these checkpoint files and is unable to see most of the data in the table.

The Dremio issue will be addressed elsewhere, but here I am trying to understand what is causing such a size difference. Here are the metadata snapshots:

Spark checkpoint file metadata:

<pyarrow._parquet.FileMetaData object at 0xffffa93dbd30>
  created_by: parquet-mr version 1.13.1 (build db4183109d5b734ec5930d870cdae161e408ddba)
  num_columns: 55
  num_rows: 12316
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 13028

Kafka-delta-ingest checkpoint file metadata:

<pyarrow._parquet.FileMetaData object at 0xffff96be4bd0>
  created_by: parquet-rs version 49.0.0
  num_columns: 187
  num_rows: 12437
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 129335
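
For reference, both snapshots were obtained by reading the parquet footer with pyarrow; a minimal sketch (the checkpoint file name is a hypothetical placeholder):

import pyarrow.parquet as pq

# Hypothetical path; substitute the actual checkpoint file from _delta_log.
meta = pq.read_metadata("00000000000000000120.checkpoint.parquet")
print(meta)  # prints created_by, num_columns, num_rows, serialized_size, ...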

Comparing the schemas, it appears that kafka-delta-ingest writes the file stats as stats_parsed instead of respecting the default of storing the stats as a JSON string in stats, which explains why there are so many more columns. However, it does not explain the order-of-magnitude difference in file size. If anything, the JSON stats in the Spark-generated checkpoint file should take more space than the parsed stats. The extra leaf columns can be listed with pyarrow, as in the sketch below.
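
To make the comparison concrete, here is a rough sketch of listing the stats-related leaf columns in each file (the path is again a placeholder):

import pyarrow.parquet as pq

pf = pq.ParquetFile("checkpoint.parquet")  # hypothetical path
schema = pf.schema  # flattened Parquet schema, one entry per leaf column
for i in range(pf.metadata.num_columns):
    path = schema.column(i).path
    if "stats" in path:
        # Spark emits a single add.stats JSON column; kafka-delta-ingest
        # emits many add.stats_parsed.* leaves, one per table column.
        print(path)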

What could be contributing to such a difference in checkpoint file size, and is there something that can be done to reduce it?
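
In case it helps with the diagnosis, here is a sketch of how to see which column chunks dominate the file size, using the row-group metadata that pyarrow exposes (path is a placeholder):

import pyarrow.parquet as pq

meta = pq.read_metadata("checkpoint.parquet")  # hypothetical path
rg = meta.row_group(0)  # both files above contain a single row group
sizes = {}
for i in range(rg.num_columns):
    col = rg.column(i)
    sizes[col.path_in_schema] = col.total_compressed_size
# Print the ten largest column chunks to see what dominates the 500 MB.
for path, size in sorted(sizes.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{size:>12}  {path}")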
