Skip to content

GH-47085 [C++][Parquet] increase default compression level for zstd #47086

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

blacha
Copy link

@blacha blacha commented Jul 12, 2025

Rationale for this change

The default value for ZSTD compression is very low (level 1) this number is not aligned to the other compression types for example GZIP and Brotli both have relatively high levels of level 9 and 8 respectively

Are these changes tested?

I have been manually testing the compression levels with osgeo/gdal's compression_level flag

Are there any user-facing changes?

This will change the default compression speed for ZSTD to be slightly slower (500ms -> 1.5 seconds compared with brotli's 6 seconds), but give significantly smaller files (approx 30-40% smaller than level 1) on my 110MB test parquet

Fixes: OSGeo/gdal#12750
Fixes: #47085

Copy link

⚠️ GitHub issue #47085 has been automatically assigned in GitHub to PR creator.

@pitrou
Copy link
Member

pitrou commented Jul 16, 2025

An issue here is that the change in default compression level will also apply to IPC buffer compression.

The Arrow IPC format has a different usage model than Parquet. For Parquet, it is a reasonable decision to favor disk footprint and read performance. For Arrow IPC which is typically used for data streams (for example over Flight), write performance can be as important as read performance. Therefore, making writing an IPC stream 3x slower can be a big regression.

Thoughts @lidavidm @paleolimbot @zeroshade ?

@lidavidm
Copy link
Member

At least for Flight, I've never found compression to be helpful at all and wouldn't recommend enabling it without benchmarking your use case carefully in the first place

@lidavidm
Copy link
Member

But, the defaults for Parquet and IPC can't be separated?

@pitrou
Copy link
Member

pitrou commented Jul 16, 2025

They definitely could, but someone has to take the care to do that and make them fit in the current APIs (or add new APIs).

@zeroshade
Copy link
Member

I mostly agree with @lidavidm on this. I haven't generally seen much benefit to flight using the compression except for very specific situations. My preference would be to figure out how to detangle the defaults.

@pitrou
Copy link
Member

pitrou commented Jul 17, 2025

Ok, perhaps we can define an enum:

/// An abstract, codec-independent, compression level
///
/// These enum members are mapped to codec-specific
/// concrete compression levels.
enum class CompressionLevel { kFastest, kFaster, kBalanced, kBetter, kBest };

And change CodecOptions to something like:

class ARROW_EXPORT CodecOptions {
 public:
  explicit CodecOptions(int compression_level = kUseDefaultCompressionLevel)
      : compression_level(compression_level) {}
  explicit CodecOptions(CompressionLevel compression_level)
      : compression_level(compression_level) {}

  virtual ~CodecOptions() = default;

  /// An abstract compression level, or a codec-specific compression level
  std::variant<CompressionLevel, int> compression_level;
};

And then have Parquet use kBalanced or kBetter by default, while IPC would use kFaster by default.

(naming can be improved)

@paleolimbot
Copy link
Member

It's definitely surprising from the Parquet side...this came from a discussion in the GeoParquet community where we were trying to figure out what compression to recommend. I agree that changing the IPC default will yield surprising/unintended results (although unfortunately is probably much more difficult!)

@blacha
Copy link
Author

blacha commented Jul 19, 2025

Thanks for all the discussion, I was unaware this would affect IPC as well.

Up until a few weeks ago GDAL did not have a compresssion_level parameter so all parquet files with compression set zstd were being compressed to level=1, this had lead some people to believe that zstd compression was not good fit for geoparquet, this has now been fixed and a default added to GDAL of level=9 for geoparquet.

As this is now sounding like a big change, would the best option be to close this PR and leave the issue open and hopefully someone could look at doing the more advanced fix later?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Parquet] ZSTD parquet compression level defaults to level 1 ZSTD Parquet creation defaults to level 1?
5 participants