Adding try_append_value implementation to ByteViewBuilder #8594
alamb merged 7 commits into apache:main
Force-pushed from 73faf99 to 8859ff7
```rust
        .map(u32::from_le_bytes)
        .ok_or_else(|| {
            ArrowError::InvalidArgumentError(
                "String must be at least 4 bytes for non-inline view".to_string(),
```
This error is unreachable as we checked that the value is longer than MAX_INLINE_VIEW_LEN (12 bytes) above.
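For context, a simplified sketch of the view layout this refers to (illustrative only, not the arrow-rs internals): values of at most `MAX_INLINE_VIEW_LEN` (12) bytes are inlined into the 16-byte view, so any non-inline value is guaranteed to have at least 4 bytes available for its prefix.

```rust
// Simplified sketch of a 16-byte "view": values up to 12 bytes are inlined
// after the 4-byte length; longer values store a 4-byte prefix plus a
// buffer index and offset. Because non-inline values are always longer
// than 12 bytes, slicing the first 4 bytes for the prefix cannot fail.
const MAX_INLINE_VIEW_LEN: usize = 12;

fn make_view(value: &[u8], buffer_index: u32, offset: u32) -> [u8; 16] {
    let mut view = [0u8; 16];
    view[0..4].copy_from_slice(&(value.len() as u32).to_le_bytes());
    if value.len() <= MAX_INLINE_VIEW_LEN {
        // Inline case: the whole value fits after the length field.
        view[4..4 + value.len()].copy_from_slice(value);
    } else {
        // Non-inline case: 4-byte prefix, then buffer index and offset.
        view[4..8].copy_from_slice(&value[..4]); // always >= 4 bytes here
        view[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        view[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    view
}
```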
```diff
-        let offset = self.in_progress.len() as u32;
+        let offset: u32 = self.in_progress.len().try_into().map_err(|_| {
+            ArrowError::InvalidArgumentError(format!(
+                "In-progress buffer length {} exceeds u32::MAX",
```
- I think the method can recover by starting a new in-progress buffer instead of returning an error here.
- I am unsure if this error is even reachable.
I think a new buffer would be allocated in the line immediately above this. Maybe we should do a checked add in `let required_cap = self.in_progress.len() + v.len();` 🤔
To error here, we would need a usize that doesn't fit into a u32. I think all platforms we care about have a usize that is at least a u32 (aka 32-bit architectures)
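The checked-add suggestion could look roughly like this (a standalone sketch with assumed names, not the actual builder code):

```rust
// Sketch of the suggestion: compute the required capacity with checked_add
// so an overflow surfaces as an error at the call site instead of failing
// later at the usize-to-u32 conversion.
fn try_required_cap(in_progress_len: usize, value_len: usize) -> Result<usize, String> {
    in_progress_len.checked_add(value_len).ok_or_else(|| {
        format!(
            "required capacity {} + {} overflows usize",
            in_progress_len, value_len
        )
    })
}
```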
> To error here, we would need a usize that doesn't fit into a u32. I think all platforms we care about have a usize that is at least a u32 (aka 32-bit architectures)

I think it would be the opposite: a usize on a 64-bit arch wouldn't fit into a u32? Anyway, I will review and update these changes over the weekend.
Yes, you are right -- thank you
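The platform behavior under discussion can be demonstrated directly; this is a standalone illustration, not code from the PR:

```rust
// On a 64-bit target, usize is 64 bits, so lengths above u32::MAX fail the
// conversion; on a 32-bit target usize and u32 have the same range and this
// error path is unreachable.
fn to_u32(len: usize) -> Result<u32, String> {
    len.try_into()
        .map_err(|_| format!("length {} exceeds u32::MAX", len))
}
```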
> I am unsure if this error is even reachable.

I think this is right. I wasn't able to trigger it. For the error to occur, `append_value` would have to skip the `flush_in_progress()` call that happens when the buffer reaches the required capacity.
On top of that, the `push_completed` used by `flush_in_progress` asserts on `block.len()` (see below). Therefore, the conversion in `let offset: u32 = self.in_progress.len().try_into()` couldn't fail:
`arrow-rs/arrow-array/src/builder/generic_bytes_view_builder.rs` (lines 273 to 276 in d49f017)
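The assert being referenced could look roughly like this (a sketch with assumed names and shape, not the exact arrow-rs code):

```rust
// Sketch of the invariant: when a completed block is pushed, its length is
// asserted to fit in a u32, so a later `in_progress.len() as u32` cast on a
// freshly started buffer cannot overflow.
fn push_completed(completed: &mut Vec<Vec<u8>>, block: Vec<u8>) -> u32 {
    assert!(
        block.len() <= u32::MAX as usize,
        "completed block length must fit in u32"
    );
    let index = completed.len() as u32;
    completed.push(block);
    index
}
```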
I'm proceeding by reverting the `map_err` in the offset initialization.
alamb left a comment
Thanks for this @samueleresca
alamb left a comment
Thank you @samueleresca and @ctsk -- this change makes sense to me.
I have kicked off some benchmark runs just to make sure this doesn't affect performance somehow. Assuming they look good I think we can merge this one in
🤖: Benchmark completed
# Which issue does this PR close?
N/A
# Rationale for this change
Not testing the correct length.
# What changes are included in this PR?
Remove `* 8`, as the length of the buffer is in bytes already.
# Are these changes tested?
Created tests to make sure they are failing before, AND created tests that make sure that ceil is used for future changes.
# Are there any user-facing changes?
Nope.
…tArray` (apache#8627)
# Which issue does this PR close?
- Closes apache#8610
# Rationale for this change
Since the fields of `VariantArray` impl `PartialEq`, this PR simply derives `PartialEq` for `VariantArray`. Based off of apache#8625.
…trings for error messages (apache#8636)
# Which issue does this PR close?
This is a small performance improvement for the thrift remodeling.
- Part of apache#5853.
# Rationale for this change
Some of the often-called methods in the thrift protocol implementation created `ParquetError` instances with a string message that had to be allocated and formatted. This formatting code, and probably also some drop glue, bloats these otherwise small methods and prevented inlining.
# What changes are included in this PR?
Introduce a separate error type `ThriftProtocolError` that is smaller than `ParquetError` and does not contain any allocated data. The `ReadThrift` trait is not changed, since its custom implementations actually require the more expressive `ParquetError`.
# Are these changes tested?
The success path is covered by existing tests. Testing the error paths would require crafting some actually malformed files, or using a fuzzer.
# Are there any user-facing changes?
`ThriftProtocolError` is crate-internal so there should be no API changes. Some error messages might differ slightly.
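The pattern described in that commit message can be sketched as follows (illustrative only, with assumed variants; not the parquet-rs code): a small, allocation-free error enum keeps the hot decoding functions compact and inlinable, and is converted into the richer error type only at the boundary.

```rust
// Small, Copy-able error with no heap data: cheap to construct and return
// from hot paths. Variants here are assumptions for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ThriftProtocolError {
    InvalidType(u8),
    UnexpectedEof,
}

// Richer, allocating error used at the API boundary (stand-in for the
// real ParquetError).
#[derive(Debug)]
enum ParquetError {
    General(String),
}

impl From<ThriftProtocolError> for ParquetError {
    fn from(e: ThriftProtocolError) -> Self {
        // Formatting and allocation happen only on the cold error path.
        ParquetError::General(format!("thrift protocol error: {:?}", e))
    }
}
```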
Force-pushed from df730fd to a0e79a4
I reviewed the benchmarks and conclude they don't show any meaningful performance change, as expected. Thanks again @samueleresca
Which issue does this PR close?
- `concat_elements_utf8view` panics with large buffer on 64bit machines datafusion#17857

Rationale for this change
These changes add a safer version of `append_value` in `ByteViewBuilder` that handles panics, called `try_append_value`. DataFusion will consume the API and handle the `Result` coming back from the function.

What changes are included in this PR?

Are these changes tested?
The method is already covered by existing tests.

Are there any user-facing changes?
No breaking changes, as the original `append_value` method hasn't changed.
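A minimal sketch of the fallible-append pattern this PR introduces, using a toy builder rather than the real `ByteViewBuilder` API (names and error type are assumptions for illustration):

```rust
// Toy builder demonstrating the pattern: the fallible method returns a
// Result the caller can handle, and the infallible method delegates to it.
struct ToyBuilder {
    in_progress: Vec<u8>,
}

impl ToyBuilder {
    fn try_append_value(&mut self, v: &[u8]) -> Result<(), String> {
        // Checked capacity computation: overflow becomes an error, not a panic.
        let required = self
            .in_progress
            .len()
            .checked_add(v.len())
            .ok_or_else(|| "required capacity overflows usize".to_string())?;
        self.in_progress.reserve(required - self.in_progress.len());
        self.in_progress.extend_from_slice(v);
        Ok(())
    }

    fn append_value(&mut self, v: &[u8]) {
        // The original infallible API is preserved by panicking on error.
        self.try_append_value(v).expect("append failed")
    }
}
```

A caller such as DataFusion would match on the `Result` from `try_append_value` and propagate the error instead of aborting the process.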