-
Notifications
You must be signed in to change notification settings - Fork 3.8k
GH-47085 [C++][Parquet] increase default compression level for zstd #47086
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
An issue here is that the change in default compression level will also apply to IPC buffer compression. The Arrow IPC format has a different usage model than Parquet. For Parquet, it is a reasonable decision to favor disk footprint and read performance. For Arrow IPC which is typically used for data streams (for example over Flight), write performance can be as important as read performance. Therefore, making writing an IPC stream 3x slower can be a big regression. Thoughts @lidavidm @paleolimbot @zeroshade ? |
At least for Flight, I've never found compression to be helpful at all and wouldn't recommend enabling it without benchmarking your use case carefully in the first place |
But, the defaults for Parquet and IPC can't be separated? |
They definitely could, but someone has to take the care to do that and make them fit in the current APIs (or add new APIs). |
I mostly agree with @lidavidm on this. I haven't generally seen much benefit to flight using the compression except for very specific situations. My preference would be to figure out how to detangle the defaults. |
Ok, perhaps we can define an enum: /// An abstract, codec-independent, compression level
///
/// These enum members are mapped to codec-specific
/// concrete compression levels.
enum class CompressionLevel { kFastest, kFaster, kBalanced, kBetter, kBest }; And change class ARROW_EXPORT CodecOptions {
public:
explicit CodecOptions(int compression_level = kUseDefaultCompressionLevel)
: compression_level(compression_level) {}
explicit CodecOptions(CompressionLevel compression_level)
: compression_level(compression_level) {}
virtual ~CodecOptions() = default;
/// An abstract compression level, or a codec-specific compression level
std::variant<CompressionLevel, int> compression_level;
}; And then have Parquet use (naming can be improved) |
It's definitely surprising from the Parquet side...this came from a discussion in the GeoParquet community where we were trying to figure out what compression to recommend. I agree that changing the IPC default will yield surprising/unintended results (although unfortunately is probably much more difficult!) |
Thanks for all the discussion, I was unaware this would affect IPC as well. Up until a few weeks ago GDAL did not have a As this is now sounding like a big change, would the best option be to close this PR and leave the issue open and hopefully someone could look at doing the more advanced fix later? |
Rationale for this change
The default value for ZSTD compression is very low (level 1) this number is not aligned to the other compression types for example GZIP and Brotli both have relatively high levels of level 9 and 8 respectively
Are these changes tested?
I have been manually testing the compression levels with osgeo/gdal's
compression_level
flagAre there any user-facing changes?
This will change the default compression speed for ZSTD to be slightly slower (500ms -> 1.5 seconds compared with brotli's 6 seconds), but give significantly smaller files (approx 30-40% smaller than level 1) on my 110MB test parquet
Fixes: OSGeo/gdal#12750
Fixes: #47085