Draft: introduce Compact Row Codec #2414

Draft
wants to merge 19 commits into
base: main

stevenschlansker (Contributor) commented Jul 15, 2025

What does this PR do?

Introduce an alternate "compact" row encoding that makes better use of knowledge about fixed-size types and sacrifices alignment to save space.

Introduce a new Builder pattern to avoid an explosion of static methods on Encoders as more features are added to the row format.

Optimizations include:

  • struct stores fixed-size fields (e.g. Int128, FixedSizeBinary) inline in the fixed-data area without offset + size
  • struct of all fixed-size fields is itself considered fixed-size, so it can be stored inline in another struct or array
  • struct skips the null bitmap if all fields are non-nullable
  • struct sorts fields by fixed size for best-effort (but not guaranteed) alignment
  • struct can use fewer than 8 bytes for small data (int, short, etc.)
  • struct null bitmap is stored at the end of the struct to borrow alignment padding if possible
  • array stores fixed-size elements inline in the fixed-data area without offset + size
  • array header uses 4 bytes for size (since Collection and array sizes are only int-sized) and leaves the remaining 4 bytes for the start of the null bitmap
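
To make the two array items above concrete, here is a minimal sketch of one possible encoding for an array of nullable 4-byte ints. It is not the PR's implementation: the class name `CompactIntArraySketch`, the little-endian byte order, the "set bit means null" convention, and the rule that the bitmap spills past the header only beyond 32 elements are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical sketch of a compact array encoding; not the codec from this PR. */
final class CompactIntArraySketch {

    /** Layout: [int32 element count][null bitmap][4-byte values stored inline]. */
    static byte[] encode(Integer[] values) {
        int n = values.length;
        // Assumption: the bitmap starts in the 4 bytes left over in an 8-byte
        // header and grows past it only when there are more than 32 elements.
        int bitmapBytes = Math.max(4, (n + 7) / 8);
        int valuesStart = 4 + bitmapBytes;
        ByteBuffer buf = ByteBuffer.allocate(valuesStart + 4 * n)
                .order(ByteOrder.LITTLE_ENDIAN); // byte order is an assumption
        buf.putInt(0, n);
        for (int i = 0; i < n; i++) {
            if (values[i] == null) {
                // Assumption: a set bit marks a null element; its slot stays zeroed.
                int bytePos = 4 + i / 8;
                buf.put(bytePos, (byte) (buf.get(bytePos) | (1 << (i % 8))));
            } else {
                // Fixed-size element written inline, with no offset + size pair.
                buf.putInt(valuesStart + 4 * i, values[i]);
            }
        }
        return buf.array();
    }
}
```

Under these assumptions a three-element array takes 20 bytes: 4 for the count, 4 for the bitmap, and 12 for the inline values.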

Fixups include:

  • toString better handles varbinary / fixed-binary (hex dump of first 256 bytes)
  • start making Javadoc for row format
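
The toString change for binary values could be sketched roughly as follows. This is only an illustration of a "hex dump of the first 256 bytes": the class name `HexPreview`, the lowercase digits, and the truncation suffix are assumptions, not details taken from the PR.

```java
/** Hypothetical helper illustrating a bounded hex dump; not the PR's code. */
final class HexPreview {
    static final int MAX_BYTES = 256;

    /** Returns lowercase hex for at most the first MAX_BYTES bytes of data. */
    static String preview(byte[] data) {
        int n = Math.min(data.length, MAX_BYTES);
        StringBuilder sb = new StringBuilder(2 * n);
        for (int i = 0; i < n; i++) {
            int b = data[i] & 0xFF; // treat the byte as unsigned
            sb.append(Character.forDigit(b >> 4, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        if (n < data.length) {
            // Assumed marker so a reader can tell the dump was cut short.
            sb.append("... (").append(data.length).append(" bytes total)");
        }
        return sb.toString();
    }
}
```

Capping the dump keeps toString output readable for large varbinary values while still showing enough of the payload to debug with.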

Compromises:

  • reduced alignment could increase access time, but the format is opt-in, and I think this is not a big deal on modern processors
  • offset lookup becomes more complex; the plan is to pre-compute offsets in an array where possible and to use StableValue once it is GA
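
One way to keep offset lookup cheap despite the sorted, unpadded struct layout described under the optimizations is to derive every field offset once per schema and store it in an array indexed by declaration order. The sketch below assumes all fields are fixed-size, sorts widest first, and uses one bitmap bit per field; the class name `CompactStructLayout` and these rules are illustrative assumptions, and alignment padding is not modeled.

```java
import java.util.Arrays;

/** Hypothetical per-schema layout cache; not the PR's actual classes. */
final class CompactStructLayout {
    final int[] offsets;    // offsets[i] = byte offset of the i-th declared field
    final int bitmapOffset; // -1 when no field is nullable (bitmap skipped)
    final int totalSize;    // bytes used by the struct, with no padding modeled

    CompactStructLayout(int[] widths, boolean[] nullable) {
        int n = widths.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Assumption: widest fields first. Arrays.sort on objects is stable,
        // so fields of equal width keep their declaration order.
        Arrays.sort(order, (a, b) -> Integer.compare(widths[b], widths[a]));

        offsets = new int[n];
        int pos = 0;
        for (int idx : order) {
            offsets[idx] = pos; // field stored inline, no offset + size pair
            pos += widths[idx];
        }

        boolean anyNullable = false;
        for (boolean b : nullable) {
            anyNullable |= b;
        }
        // Null bitmap goes at the end and is omitted when nothing is nullable.
        bitmapOffset = anyNullable ? pos : -1;
        totalSize = anyNullable ? pos + (n + 7) / 8 : pos;
    }
}
```

For fields declared as a nullable byte, a long, an int, and a short, this places the long at offset 0, the int at 8, the short at 12, and the byte at 14, with a one-byte bitmap at 15, for 16 bytes in total. Reading a field is then a single array lookup, `offsets[i]`, rather than a walk over the schema.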

TODO:
[ ] figure out how the top-level array and map encoders fit into the new builder class

The compact encoding is not compatible with the existing row format.

Related issues

Fixes #2337

Does this PR introduce any user-facing change?

New API for new Compact codec. Existing codec unchanged.

@stevenschlansker stevenschlansker added enhancement New feature or request java labels Jul 15, 2025
stevenschlansker and others added 14 commits July 14, 2025 17:23
compact encoding will store fixed sized fields inline,
relaxes alignment considerations to preserve alignment where possible
but using space more effectively
@stevenschlansker stevenschlansker marked this pull request as draft July 15, 2025 03:48
Development

Successfully merging this pull request may close these issues.

Row format better data packing