@hhhizzz hhhizzz commented Oct 14, 2025

Which issue does this PR close?

Related to:

Rationale for this change

Improve the performance of ParquetRecordBatchReader, especially when the RowSelectors are short.

  • By changing a hash map to an enum array
  • By updating the counting method in the decoder
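
To illustrate the first point, here is a minimal sketch of one way to read "enum array": a fixed-size array indexed by the enum's discriminant, so a lookup is a bounds-checked index instead of a hash computation. The `Encoding` variants, `ENCODING_COUNT`, and `EncodingSlots` are hypothetical names made up for this sketch; they are not the types used in the PR.

```rust
// Hypothetical stand-in for an encoding enum; the variants and their
// order are assumptions made for this sketch only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Encoding {
    Plain,
    RleDictionary,
    DeltaBinaryPacked,
}

// Must match the number of variants above.
const ENCODING_COUNT: usize = 3;

impl Encoding {
    // Fieldless enums cast to their discriminant: 0, 1, 2, ...
    fn index(self) -> usize {
        self as usize
    }
}

/// Fixed-size table with one optional slot per enum variant.
/// Plays the role of a `HashMap<Encoding, T>` without any hashing.
struct EncodingSlots<T> {
    slots: [Option<T>; ENCODING_COUNT],
}

impl<T> EncodingSlots<T> {
    fn new() -> Self {
        // `from_fn` builds the array without requiring `T: Copy`.
        Self {
            slots: std::array::from_fn(|_| None),
        }
    }

    /// Stores `value` for `encoding`, returning the previous value if any.
    fn insert(&mut self, encoding: Encoding, value: T) -> Option<T> {
        self.slots[encoding.index()].replace(value)
    }

    fn get(&self, encoding: Encoding) -> Option<&T> {
        self.slots[encoding.index()].as_ref()
    }

    fn get_mut(&mut self, encoding: Encoding) -> Option<&mut T> {
        self.slots[encoding.index()].as_mut()
    }
}
```

The trade-off is a small, fixed amount of memory per table in exchange for cheaper lookups, which matters most when the per-call work is small and the lookup cost is a larger share of it.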

What changes are included in this PR?

  1. In parquet/src/column/reader/decoder.rs, replace the hash map with an enum array.
  2. In parquet/src/arrow/array_reader/cached_array_reader.rs, update the hash function.
  3. In parquet/src/column/reader/decoder.rs, update the counting logic: the decoder now returns read_value_num directly, which saves the time previously spent counting over the byte array.
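
The counting change in item 3 can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the PR's code: "decoding" is simulated by copying already-decoded `i16` levels, and the function names `decode_then_count` and `decode_and_count` are hypothetical. The idea is that the decoder tallies values while it writes levels, so the caller does not need a second scan of the output buffer.

```rust
/// Two-pass approach: write the levels, then scan the written range again
/// to count how many equal `max_level` (i.e. how many values are present).
fn decode_then_count(encoded: &[i16], max_level: i16, out: &mut Vec<i16>) -> usize {
    let start = out.len();
    out.extend_from_slice(encoded);
    out[start..].iter().filter(|&&level| level == max_level).count()
}

/// Single-pass approach: count while writing and return both numbers,
/// as `(levels_read, values_read)`.
fn decode_and_count(encoded: &[i16], max_level: i16, out: &mut Vec<i16>) -> (usize, usize) {
    let mut values_read = 0usize;
    out.reserve(encoded.len());
    for &level in encoded {
        // `usize::from(bool)` yields 1 for true and 0 for false.
        values_read += usize::from(level == max_level);
        out.push(level);
    }
    (encoded.len(), values_read)
}
```

Both functions produce the same buffer and the same count; the second simply avoids revisiting the data, which is where the saving described above would come from.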

Are these changes tested?

The hash maps are already covered by existing tests.
Updated the related tests in definition_levels.rs.
Also tested manually by reading Parquet files.

Are there any user-facing changes?

No

Performance results in arrow_reader_row_filter.rs

Measured on my 3950X:

| Benchmark | Change | Verdict |
|---|---|---|
| int64 == 9999 / all_columns / async | -0.28% | ⚪ Within noise |
| int64 == 9999 / all_columns / sync | -4.08% | 🟢 Improved |
| int64 == 9999 / exclude_filter_column / async | -1.88% | 🟢 Improved |
| int64 == 9999 / exclude_filter_column / sync | -3.45% | 🟢 Improved |
| float64 > 99.0 / all_columns / async | -8.77% | 🟢 Improved |
| float64 > 99.0 / all_columns / sync | -9.67% | 🟢 Improved |
| float64 > 99.0 / exclude_filter_column / async | -11.53% | 🟢 Improved |
| float64 > 99.0 / exclude_filter_column / sync | -11.47% | 🟢 Improved |
| ts >= 9000 / all_columns / async | -5.91% | 🟢 Improved |
| ts >= 9000 / all_columns / sync | -5.26% | 🟢 Improved |
| ts >= 9000 / exclude_filter_column / async | +2.65% | 🔴 Regressed |
| ts >= 9000 / exclude_filter_column / sync | -0.65% | ⚪ Within noise |
| int64 > 90 / all_columns / async | -11.91% | 🟢 Improved |
| int64 > 90 / all_columns / sync | -15.94% | 🟢 Improved |
| int64 > 90 / exclude_filter_column / async | -13.84% | 🟢 Improved |
| int64 > 90 / exclude_filter_column / sync | -19.12% | 🟢 Improved |
| float64 <= 99.0 / all_columns / async | -5.78% | 🟢 Improved |
| float64 <= 99.0 / all_columns / sync | -10.48% | 🟢 Improved |
| float64 <= 99.0 / exclude_filter_column / async | -9.12% | 🟢 Improved |
| float64 <= 99.0 / exclude_filter_column / sync | -4.36% | 🟢 Improved |
| ts < 9000 / all_columns / async | -0.01% | ⚪ No change |
| ts < 9000 / all_columns / sync | +2.68% | 🔴 Regressed |
| ts < 9000 / exclude_filter_column / async | -2.34% | 🟢 Improved |
| ts < 9000 / exclude_filter_column / sync | +0.42% | ⚪ Within noise |
| utf8View <> '' / all_columns / async | -8.83% | 🟢 Improved |
| utf8View <> '' / all_columns / sync | -14.84% | 🟢 Improved |
| utf8View <> '' / exclude_filter_column / async | -11.46% | 🟢 Improved |
| utf8View <> '' / exclude_filter_column / sync | -12.59% | 🟢 Improved |
| float64 > 99.0 AND ts >= 9000 / all_columns / async | +0.25% | ⚪ Within noise |
| float64 > 99.0 AND ts >= 9000 / all_columns / sync | -0.03% | ⚪ No change |
| float64 > 99.0 AND ts >= 9000 / exclude_filter_column / async | -0.73% | ⚪ Within noise |
| float64 > 99.0 AND ts >= 9000 / exclude_filter_column / sync | +0.69% | ⚪ Within noise |

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 14, 2025
@hhhizzz hhhizzz marked this pull request as ready for review October 14, 2025 15:44