Sorting and encoding of dictionary pages #8778
Unanswered
mobiusklein
asked this question in
Q&A
Replies: 1 comment
Dictionary pages must be encoded using PLAIN. Being able to sort the dictionary would be a nice addition, but I'm not sure what the level of effort would be. Please feel free to create an issue to request this feature.
I am working on a schema for storing multi-dimensional signal data (float32/float64) from a mass spectrometer in Parquet, using Rust for the prototype implementation. In some cases, dictionary encoding has helped because one dimension's values repeat many times while another dimension varies.
Sometimes, that dictionary can be quite big, and it may not compress well on its own if stored with the plain encoding, based upon experiments where I collected the set of all unique values and just compressed them with Zstd. If I sorted them and byte shuffled them (as in the BYTE_STREAM_SPLIT encoding), the compression improves substantially. Reading https://parquet.apache.org/docs/file-format/metadata/#page-header, it looks like there is a flag on the DictionaryPageHeader that says if the dictionary is sorted, and that the dictionary page can be encoded.
So far as I can tell, this implementation doesn't support writing out a sorted dictionary, or using any encoding other than PLAIN on that page, correct? Is this something that is technically supported but not implemented here, or not supported by the Parquet format definition?
If I understand what'd be involved, sorting the dictionary page wouldn't force new behavior on a reader, but using an encoding other than PLAIN might?
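The sort-then-byte-shuffle experiment described in the question can be sketched in standalone Rust using only the standard library. This is an illustration under stated assumptions, not the parquet crate's API and not a dictionary page writer: the function names `build_sorted_dictionary`, `byte_stream_split`, and `byte_stream_unsplit` are hypothetical, values are assumed to be f32, and the Zstd step is left out because it needs an external crate.

```rust
/// Deduplicate `values` into a sorted dictionary and map every input value
/// to its index in that dictionary.
/// Ordering and equality follow `f32::total_cmp` / bit patterns, so -0.0 and
/// 0.0 (and NaNs with different payloads) count as distinct entries.
fn build_sorted_dictionary(values: &[f32]) -> (Vec<f32>, Vec<u32>) {
    let mut dict: Vec<f32> = values.to_vec();
    dict.sort_by(|a, b| a.total_cmp(b));
    dict.dedup_by_key(|v| v.to_bits());

    let indices = values
        .iter()
        .map(|v| {
            dict.binary_search_by(|d| d.total_cmp(v))
                .expect("every input value is present in the dictionary") as u32
        })
        .collect();
    (dict, indices)
}

/// Byte shuffle in the style of BYTE_STREAM_SPLIT: byte k of every
/// little-endian value goes into stream k, and the 4 streams are concatenated.
fn byte_stream_split(values: &[f32]) -> Vec<u8> {
    let n = values.len();
    let mut out = vec![0u8; n * 4];
    for (i, v) in values.iter().enumerate() {
        for (k, b) in v.to_le_bytes().iter().enumerate() {
            out[k * n + i] = *b;
        }
    }
    out
}

/// Inverse of `byte_stream_split`: regroup the 4 streams into f32 values.
fn byte_stream_unsplit(bytes: &[u8]) -> Vec<f32> {
    assert!(bytes.len() % 4 == 0, "length must be a multiple of 4");
    let n = bytes.len() / 4;
    (0..n)
        .map(|i| {
            f32::from_le_bytes([bytes[i], bytes[n + i], bytes[2 * n + i], bytes[3 * n + i]])
        })
        .collect()
}
```

Passing the output of `byte_stream_split(&dict)` to a general-purpose compressor would reproduce the comparison from the experiment. Sorting only changes which index each value receives, so index lookups on the read side work the same way, whereas the shuffled bytes have to go through something like `byte_stream_unsplit` before the dictionary is usable.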