Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: configurable column encoding for parquet checkpoint files #3214

Merged
merged 2 commits into from
Feb 15, 2025

Conversation

dmunch
Copy link
Contributor

@dmunch dmunch commented Feb 13, 2025

Description

This PR adds a table configuration option to enable or disable run length encoding for checkpoint files.

Note: I'm unsure if the table option is the right way to go - In the original issue it was propose to expose writerProperties on create_checkpoint, however, after evaluating this, I figured this has a few downsides:

  • The automatic checkpointing wouldn't be aware of those custom writerProperties
  • writerProperties would expose too much control, i.e. I don't really need that level of control over the checkpoint writing
  • bindings would need to be updated (I'm using python and C# bindings, so that would add additional work)

Instead I went down the route of table properties, unsure however if the Delta Lake spec allows implementation specific table properties, or if these should be a well-defined set of properties. Since in doubt, I've prefixed the table property with delta-rs instead of delta. Happy to discuss!

If the approach is validated I can update/add documentation as well.

Related Issue(s)

Documentation

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Feb 13, 2025
@dmunch dmunch force-pushed the feat/checkpoint-file-rle-encoding branch from 0a8c1ef to 4ddbb45 Compare February 13, 2025 14:13
Copy link

codecov bot commented Feb 13, 2025

Codecov Report

Attention: Patch coverage is 92.30769% with 1 line in your changes missing coverage. Please review.

Project coverage is 72.29%. Comparing base (1befab9) to head (07f070b).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
crates/core/src/protocol/checkpoints.rs 90.90% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3214      +/-   ##
==========================================
+ Coverage   72.14%   72.29%   +0.15%     
==========================================
  Files         138      138              
  Lines       45394    45398       +4     
  Branches    45394    45398       +4     
==========================================
+ Hits        32751    32822      +71     
+ Misses      10562    10489      -73     
- Partials     2081     2087       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ion-elgreco
Copy link
Collaborator

Makes sense, and I think it's fine to use custom properties. Other connectors also have specific configurations, however the put them under delta, for example Trino: https://trino.io/docs/current/connector/delta-lake.html#delta-lake-general-configuration-properties

@ion-elgreco ion-elgreco force-pushed the feat/checkpoint-file-rle-encoding branch from 433027b to 07f070b Compare February 15, 2025 09:57
@ion-elgreco ion-elgreco added this pull request to the merge queue Feb 15, 2025
Merged via the queue into delta-io:main with commit 2f1a5ac Feb 15, 2025
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Configurable column encoding for parquet checkpoint files to address Fabric limitation
2 participants