
Generated checkpoint files are unexpectedly large and potentially broken #187

@igor-lobanov-maersk

Description


We use the kafka-delta-ingest daemon to write data from a Kafka topic to a Delta Lake table on an ADLS Gen2 account, and then use Dremio to process the data onwards. About 10 GB of data is saved daily, and the table size is currently approaching 1 TB. Until recently we did not have metadata checkpointing enabled on the table, which led to queries becoming progressively slower.

First we tried checkpointing the table with Spark, and it worked well. The checkpoint parquet file generated by Spark is around 45 MB, and query performance improved considerably.

Then we enabled checkpointing in kafka-delta-ingest. As expected, we observed new checkpoint files being created every 10th commit. However, for some reason each checkpoint file is more than 500 MB. More crucially, Dremio seems to have a problem with these checkpoint files and is unable to see most of the data in the table.

The Dremio issue will be addressed elsewhere, but here I am trying to understand what is causing such a size difference. Here are the metadata snapshots:

Spark checkpoint file metadata:

<pyarrow._parquet.FileMetaData object at 0xffffa93dbd30>
  created_by: parquet-mr version 1.13.1 (build db4183109d5b734ec5930d870cdae161e408ddba)
  num_columns: 55
  num_rows: 12316
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 13028

Kafka-delta-ingest checkpoint file metadata:

<pyarrow._parquet.FileMetaData object at 0xffff96be4bd0>
  created_by: parquet-rs version 49.0.0
  num_columns: 187
  num_rows: 12437
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 129335
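
For reference, both snapshots were obtained by reading the parquet footer with pyarrow; a minimal sketch (the checkpoint file name is a hypothetical placeholder):

import pyarrow.parquet as pq

# Hypothetical path; substitute the actual checkpoint file from _delta_log.
meta = pq.read_metadata("00000000000000000120.checkpoint.parquet")
print(meta)  # prints created_by, num_columns, num_rows, serialized_size, ...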

Comparing the schemas, it appears that kafka-delta-ingest writes the file stats as stats_parsed instead of respecting the default of storing the stats as a JSON string in stats, which explains why there are so many more columns. However, it does not explain the order-of-magnitude difference in file size. If anything, the JSON stats in the Spark-generated checkpoint file should take more space than the parsed stats. The extra leaf columns can be listed with pyarrow, as in the sketch below.
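
To make the comparison concrete, here is a rough sketch of listing the stats-related leaf columns in each file (the path is again a placeholder):

import pyarrow.parquet as pq

pf = pq.ParquetFile("checkpoint.parquet")  # hypothetical path
schema = pf.schema  # flattened Parquet schema, one entry per leaf column
for i in range(pf.metadata.num_columns):
    path = schema.column(i).path
    if "stats" in path:
        # Spark emits a single add.stats JSON column; kafka-delta-ingest
        # emits many add.stats_parsed.* leaves, one per table column.
        print(path)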

What could be contributing to such a difference in checkpoint file size, and is there something that can be done to reduce it?
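
In case it helps with the diagnosis, here is a sketch of how to see which column chunks dominate the file size, using the row-group metadata that pyarrow exposes (path is a placeholder):

import pyarrow.parquet as pq

meta = pq.read_metadata("checkpoint.parquet")  # hypothetical path
rg = meta.row_group(0)  # both files above contain a single row group
sizes = {}
for i in range(rg.num_columns):
    col = rg.column(i)
    sizes[col.path_in_schema] = col.total_compressed_size
# Print the ten largest column chunks to see what dominates the 500 MB.
for path, size in sorted(sizes.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{size:>12}  {path}")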
