Conversation
| Benchmark | Sbbf | ArrowSbbf | Delta |
|---|---|---|---|
| i8 | 1.51 ns | 7.38 ns | +5.87 ns |
| i32 | 3.86 ns | 7.15 ns | +3.29 ns |
| Decimal128(5,2) | 1.73 ns | 7.69 ns | +5.96 ns |
| Decimal128(15,2) | 1.73 ns | 8.20 ns | +6.48 ns |
| Decimal128(30,2) | 1.73 ns | 5.85 ns | +4.12 ns |
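For context on where the delta likely comes from: per the PR description, `ArrowSbbf` coerces the Arrow value into its Parquet physical representation before calling `Sbbf::check`. A minimal standalone sketch (plain Rust, hypothetical value, not the arrow-rs API) of why that coercion is necessary at all:

```rust
// Illustration only (not the arrow-rs API): an Arrow Int8 value must be
// widened to Parquet's INT32 physical type before probing the bloom filter,
// because the filter was built from the coerced 4-byte encoding.
fn main() {
    let v: i8 = 42; // hypothetical value
    let arrow_bytes = v.to_le_bytes();            // [42]
    let parquet_bytes = (v as i32).to_le_bytes(); // [42, 0, 0, 0]
    // Hashing the 1-byte form would probe different filter bits than the
    // 4-byte form the writer inserted, producing false negatives.
    assert_ne!(arrow_bytes.len(), parquet_bytes.len());
    println!("arrow={arrow_bytes:?} parquet={parquet_bytes:?}");
}
```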
So this means that casting the values before checking the bloom filter is slower?
alamb left a comment
Thank you @mr-brobot -- this is a nice contribution. I left some comments. Let me know what you think
```diff
 /// Check if an [AsBytes] value is probably present or definitely absent in the filter
-pub fn check<T: AsBytes>(&self, value: &T) -> bool {
+pub fn check<T: AsBytes + ?Sized>(&self, value: &T) -> bool {
```
```rust
        Self { sbbf, arrow_type }
    }

    /// Check if a value might be present in the bloom filter
```
What is the expected format of the bytes? It appears to be the arrow representation 🤔
This code looks slightly different than what is in DataFusion. Not sure if that is good/bad 🤔
```rust
//! match column_chunk.column_type() {
//!     ParquetType::INT32 => {
//!         // Date64 was coerced to Date32 - convert milliseconds to days
//!         let date32_value = (date64_value / MILLISECONDS_IN_DAY) as i32;
```
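As a quick numeric check of the conversion in the doc snippet above (the timestamp is made up for illustration; `MILLISECONDS_IN_DAY` uses its conventional value):

```rust
// Worked example of the Date64 -> Date32 conversion shown above:
// milliseconds since the Unix epoch divided down to whole days.
fn main() {
    const MILLISECONDS_IN_DAY: i64 = 86_400_000;
    let date64_value: i64 = 1_700_000_000_000; // ms since the Unix epoch
    let date32_value = (date64_value / MILLISECONDS_IN_DAY) as i32;
    assert_eq!(date32_value, 19675); // days since the Unix epoch
    println!("date32_value = {date32_value}");
}
```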
how do you envision a user getting this date32_value?
I would expect for an Arrow usecase they would have a Date32Array 🤔
I wonder if the API would more cleanly be expressed as an array kernel? Something like
```rust
let boolean_array = ArrowSbbf::check(&date32_array)?;
```

Though I suppose for the common case where there is a single (constant) value this may be overkill
I do prefer the ergonomics of an array kernel. It applies nicely to DataFusion, which interacts with bloom filters exclusively via `BloomFilterStatistics::contained`.

Perhaps I can implement it as an array kernel and benchmark, then we can decide from there?
Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look
Which issue does this PR close?
Rationale for this change
Parquet types are a subset of Arrow types, so the Arrow writer must coerce to Parquet types. In some cases, this changes the physical representation. Therefore, passing Arrow data directly to `Sbbf::check` will produce false negatives. Correctness is only guaranteed when checking with the coerced Parquet value. This issue affects some integer and decimal types. It can also affect `Date64`.

What changes are included in this PR?
Introduces `ArrowSbbf` as an Arrow-aware interface to the Parquet `Sbbf`. This coerces incoming data if necessary and calls `Sbbf::check`.

Currently, `Date64` types can be written as either `INT32` (days since epoch) or `INT64` (milliseconds since epoch), depending on Arrow writer properties (`coerce_types`). Instead of requiring additional information to handle this special (non-default) case, this implementation instructs users to coerce `Date64` to `Date32` if the Parquet column type is `INT32`. I'm open to feedback on this decision.

Are these changes tested?
There are tests for integer, float, decimal, and date types. They are not exhaustive, but they cover all cases where coercion is necessary.
Are there any user-facing changes?
There is a new `ArrowSbbf` struct that most Arrow users should prefer over using `Sbbf` directly. Also, the `Sized` constraint was relaxed on the `Sbbf::check` function to support slices. This is consistent with `Sbbf::insert`.
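To illustrate the `Sized` relaxation mentioned above, here is a minimal standalone sketch. The trait and function below are stand-ins, not the parquet crate's actual definitions: the point is that a generic parameter `T: AsBytes` is implicitly `Sized`, so without `+ ?Sized` you could not pass unsized types like `str` or `[u8]` behind a reference.

```rust
// Stand-in for the crate's AsBytes trait, defined locally so this compiles
// on its own.
trait AsBytes {
    fn as_bytes(&self) -> &[u8];
}

impl AsBytes for str {
    fn as_bytes(&self) -> &[u8] {
        self.as_ref()
    }
}

impl AsBytes for [u8] {
    fn as_bytes(&self) -> &[u8] {
        self
    }
}

// Stand-in for Sbbf::check; it just inspects the byte length here. The
// `?Sized` bound is what lets `T` be `str` or `[u8]`.
fn check<T: AsBytes + ?Sized>(value: &T) -> usize {
    value.as_bytes().len()
}

fn main() {
    // Both calls only compile because of `?Sized`.
    assert_eq!(check("hello"), 5);
    assert_eq!(check(&[1u8, 2, 3][..]), 3);
    println!("ok");
}
```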