Conversation
| Benchmark | Sbbf | ArrowSbbf | Delta |
|---|---|---|---|
| i8 | 1.51 ns | 7.38 ns | +5.87 ns |
| i32 | 3.86 ns | 7.15 ns | +3.29 ns |
| Decimal128(5,2) | 1.73 ns | 7.69 ns | +5.96 ns |
| Decimal128(15,2) | 1.73 ns | 8.20 ns | +6.48 ns |
| Decimal128(30,2) | 1.73 ns | 5.85 ns | +4.12 ns |
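For context on where the delta likely comes from: per the PR description, `ArrowSbbf` coerces the Arrow value into its Parquet physical representation before calling `Sbbf::check`. A minimal standalone sketch (plain Rust, hypothetical value, not the arrow-rs API) of why that coercion is necessary at all:

```rust
// Illustration only (not the arrow-rs API): an Arrow Int8 value must be
// widened to Parquet's INT32 physical type before probing the bloom filter,
// because the filter was built from the coerced 4-byte encoding.
fn main() {
    let v: i8 = 42; // hypothetical value
    let arrow_bytes = v.to_le_bytes();            // [42]
    let parquet_bytes = (v as i32).to_le_bytes(); // [42, 0, 0, 0]
    // Hashing the 1-byte form would probe different filter bits than the
    // 4-byte form the writer inserted, producing false negatives.
    assert_ne!(arrow_bytes.len(), parquet_bytes.len());
    println!("arrow={arrow_bytes:?} parquet={parquet_bytes:?}");
}
```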
So this means that casting the values before checking the bloom filter is slower?
alamb left a comment
Thank you @mr-brobot -- this is a nice contribution. I left some comments. Let me know what you think
```diff
 /// Check if an [AsBytes] value is probably present or definitely absent in the filter
-pub fn check<T: AsBytes>(&self, value: &T) -> bool {
+pub fn check<T: AsBytes + ?Sized>(&self, value: &T) -> bool {
```
```rust
        Self { sbbf, arrow_type }
    }

    /// Check if a value might be present in the bloom filter
```
What is the expected format of the bytes? It appears to be the arrow representation 🤔
This code looks slightly different than what is in DataFusion. Not sure if that is good/bad 🤔
```rust
//! match column_chunk.column_type() {
//!     ParquetType::INT32 => {
//!         // Date64 was coerced to Date32 - convert milliseconds to days
//!         let date32_value = (date64_value / MILLISECONDS_IN_DAY) as i32;
```
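As a quick numeric check of the conversion in the doc snippet above (the timestamp is made up for illustration; `MILLISECONDS_IN_DAY` uses its conventional value):

```rust
// Worked example of the Date64 -> Date32 conversion shown above:
// milliseconds since the Unix epoch divided down to whole days.
fn main() {
    const MILLISECONDS_IN_DAY: i64 = 86_400_000;
    let date64_value: i64 = 1_700_000_000_000; // ms since the Unix epoch
    let date32_value = (date64_value / MILLISECONDS_IN_DAY) as i32;
    assert_eq!(date32_value, 19675); // days since the Unix epoch
    println!("date32_value = {date32_value}");
}
```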
how do you envision a user getting this date32_value?
I would expect for an Arrow usecase they would have a Date32Array 🤔
I wonder if the API would more cleanly be expressed as an array kernel? Something like
```rust
let boolean_array = ArrowSbbf::check(&date32_array)?;
```

Though I suppose for the common case where there is a single (constant) value this may be overkill
I do prefer the ergonomics of an array kernel. It applies nicely to DataFusion, which interacts with bloom filters exclusively via `BloomFilterStatistics::contained`.

Perhaps I can implement it as an array kernel and benchmark, then we can decide from there?
Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look
Which issue does this PR close?
Rationale for this change
Parquet types are a subset of Arrow types, so the Arrow writer must coerce to Parquet types. In some cases, this changes the physical representation. Therefore, passing Arrow data directly to `Sbbf::check` will produce false negatives. Correctness is only guaranteed when checking with the coerced Parquet value. This issue affects some integer and decimal types. It can also affect `Date64`.

What changes are included in this PR?
Introduces `ArrowSbbf` as an Arrow-aware interface to the Parquet `Sbbf`. This coerces incoming data if necessary and calls `Sbbf::check`.

Currently, `Date64` types can be written as either `INT32` (days since epoch) or `INT64` (milliseconds since epoch), depending on Arrow writer properties (`coerce_types`). Instead of requiring additional information to handle this special (non-default) case, this implementation instructs users to coerce `Date64` to `Date32` if the Parquet column type is `INT32`. I'm open to feedback on this decision.

Are these changes tested?
There are tests for integer, float, decimal, and date types. They are not exhaustive, but they cover all cases where coercion is necessary.
Are there any user-facing changes?
There is a new `ArrowSbbf` struct that most Arrow users should prefer over using `Sbbf` directly. Also, the `Sized` constraint was relaxed on the `Sbbf::check` function to support slices. This is consistent with `Sbbf::insert`.
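To illustrate the `Sized` relaxation mentioned above, here is a minimal standalone sketch. The trait and function below are stand-ins, not the parquet crate's actual definitions: the point is that a generic parameter `T: AsBytes` is implicitly `Sized`, so without `+ ?Sized` you could not pass unsized types like `str` or `[u8]` behind a reference.

```rust
// Stand-in for the crate's AsBytes trait, defined locally so this compiles
// on its own.
trait AsBytes {
    fn as_bytes(&self) -> &[u8];
}

impl AsBytes for str {
    fn as_bytes(&self) -> &[u8] {
        self.as_ref()
    }
}

impl AsBytes for [u8] {
    fn as_bytes(&self) -> &[u8] {
        self
    }
}

// Stand-in for Sbbf::check; it just inspects the byte length here. The
// `?Sized` bound is what lets `T` be `str` or `[u8]`.
fn check<T: AsBytes + ?Sized>(value: &T) -> usize {
    value.as_bytes().len()
}

fn main() {
    // Both calls only compile because of `?Sized`.
    assert_eq!(check("hello"), 5);
    assert_eq!(check(&[1u8, 2, 3][..]), 3);
    println!("ok");
}
```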