GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice #46730
base: main
Conversation
@mapleFU is the currently implemented interface what you expected?
cpp/src/arrow/array/builder_binary.h
Outdated
    return AppendBlock(value.data(), static_cast<int64_t>(value.size()));
  }

  Status AppendViewFromBuffer(int32_t buffer_id, int32_t buffer_offset, int32_t start,
naming: from buffer or from block?
Personally, either is fine with me. I prefer Buffer since the variable is named buffer_index.
cpp/src/arrow/array/builder_binary.h
Outdated
@@ -645,6 +657,28 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder {
    UnsafeAppend(value.data(), static_cast<int64_t>(value.size()));
  }

  Result<std::pair<int32_t, int32_t>> AppendBlock(const uint8_t* value,
Can we use a more specific name rather than pair<i32, i32>?
Can I directly use BinaryViewType::c_type, since it already contains the two pieces of information we need?
The syntax is a bit weird here? Append a BinaryView and then append the sub-slice of the view?
@@ -100,6 +100,13 @@ void BinaryViewBuilder::Reset() {
  data_heap_builder_.Reset();
}

Result<std::pair<int32_t, int32_t>> BinaryViewBuilder::AppendBlock(const uint8_t* value,
                                                                   const int64_t length) {
  DCHECK_GT(length, TypeClass::kInlineSize);
If length <= kInlineSize, should this return false or ok? Why just DCHECK here?
cpp/src/arrow/array/builder_binary.h
Outdated
  c_type GetViewFromBlock(int32_t block_id, int32_t block_offset, int32_t offset,
                          int32_t length) const {
    const auto* value = blocks_.at(block_id)->data_as<uint8_t>() + block_offset + offset;
    if (length <= BinaryViewType::kInlineSize) {
Use ToBinaryView?
Should we rename Maybe aligning with arrow-rs's impl is fine.. |
Some personal thoughts:
Not sure why these 2 CI jobs always fail.
To implement an interface aligned with what arrow-rs did, I have to change some behavior of More specific, before If this change is unacceptable, please let me know and I'll try to find another way.
Generally LGTM. Also cc @pitrou for your advice on this interface. It would be used to read binary data as StringView from the byte array type faster.
cpp/src/arrow/array/builder_binary.h
Outdated
    return AppendBuffer(reinterpret_cast<const uint8_t*>(value), length);
  }

  Result<int32_t> AppendBuffer(const std::string& value) {
Can std::string_view be used rather than const std::string&?
Can we remove this one since std::string_view is added here?
cpp/src/arrow/array/builder_binary.h
Outdated
  void UnsafeAppendViewFromBuffer(const int32_t buffer_idx, const int32_t start,
                                  const int32_t length) {
    UnsafeAppendToBitmap(true);
    const auto v = data_heap_builder_.GetViewFromBuffer<false>(buffer_idx, start, length);
- const auto v = data_heap_builder_.GetViewFromBuffer<false>(buffer_idx, start, length);
+ const auto v = data_heap_builder_.GetViewFromBuffer</*Safe=*/false>(buffer_idx, start, length);
cpp/src/arrow/array/builder_binary.h
Outdated
                              const int32_t length) {
    ARROW_RETURN_NOT_OK(Reserve(1));
    UnsafeAppendToBitmap(true);
    ARROW_ASSIGN_OR_RAISE(const auto v, data_heap_builder_.GetViewFromBuffer<true>(
- ARROW_ASSIGN_OR_RAISE(const auto v, data_heap_builder_.GetViewFromBuffer<true>(
+ ARROW_ASSIGN_OR_RAISE(const auto v, data_heap_builder_.GetViewFromBuffer</*Safe=*/true>(
Gentle ping @pitrou
Isn't this approach wasteful? If you have lots of strings <= 12 bytes, you will still store their contents in a data buffer, even though they are already inlined in the string views.
@pitrou I suppose this is used to append a whole Parquet page and add a buffer for it.
Is it a win, though? If most Parquet strings are <= 12 bytes, we would pointlessly waste space and CPU time.
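To make the concern concrete, here is a minimal sketch of the 16-byte view layout and the 12-byte inline threshold. `View`, `MakeView`, `IsInline` and `RedundantBytes` are hypothetical stand-ins written for this illustration, not Arrow's actual types or the code in this PR. It only shows that values of 12 bytes or fewer live entirely inside the view, so any copy of them in a data buffer is never referenced.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a 16-byte binary view; not the Arrow type.
constexpr int32_t kInlineSize = 12;

struct View {
  int32_t size;
  // size <= 12: the value bytes themselves.
  // size  > 12: 4-byte prefix, then int32 buffer index, then int32 offset.
  uint8_t payload[12];
};
static_assert(sizeof(View) == 16, "a view is 16 bytes");

bool IsInline(const View& v) { return v.size <= kInlineSize; }

View MakeView(std::string_view value, int32_t buffer_index, int32_t offset) {
  assert(value.size() <= static_cast<std::size_t>(INT32_MAX));
  View v{};
  v.size = static_cast<int32_t>(value.size());
  if (IsInline(v)) {
    // Short value: the view is self-contained, no data buffer is needed.
    if (!value.empty()) std::memcpy(v.payload, value.data(), value.size());
  } else {
    // Long value: keep a prefix and point into a data buffer.
    std::memcpy(v.payload, value.data(), 4);
    std::memcpy(v.payload + 4, &buffer_index, sizeof(buffer_index));
    std::memcpy(v.payload + 8, &offset, sizeof(offset));
  }
  return v;
}

// Bytes that a bulk copy would place in a data buffer although every view
// covering them is inline and never reads the buffer.
int64_t RedundantBytes(const std::vector<std::string_view>& values) {
  int64_t redundant = 0;
  for (std::string_view value : values) {
    if (static_cast<int64_t>(value.size()) <= kInlineSize) {
      redundant += static_cast<int64_t>(value.size());
    }
  }
  return redundant;
}
```

For a page where most values are short, `RedundantBytes` approaches the full page size, which is the wasted space and memcpy cost raised above.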
A good question. I agree that when most Parquet strings are <= 12 bytes, memory would be wasted because a huge memcpy is applied. But reading large binary values would benefit a lot from this. I think one large memcpy is usually much faster than many small non-contiguous memcpy calls. Maybe we can pick this path only when the average length is large enough?
That's true, but another cost is to create the views themselves. It would be nice if a prototype could tell us which speedup we can expect.
Yes, that's definitely a possibility.
So can we start reviewing this? We could take this path only when the average length is above a threshold, for example > 12 or > 20.
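A rough sketch of what such a cut-off could look like. `ShouldBulkAppend` and `kMinAverageLengthForBulkAppend` are hypothetical names, and the default of 20 is only one of the two values floated in this thread; a benchmark would have to pick the real number.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical policy knob; 12 and 20 are the candidates mentioned in the thread.
constexpr int64_t kMinAverageLengthForBulkAppend = 20;

// Returns true when the average value length is strictly above the cut-off,
// i.e. when one large copy plus per-value views is expected to pay off.
bool ShouldBulkAppend(int64_t total_value_bytes, int64_t num_values,
                      int64_t min_average = kMinAverageLengthForBulkAppend) {
  assert(total_value_bytes >= 0 && num_values >= 0);
  if (num_values == 0) return false;
  // average > min_average  <=>  total > min_average * num_values (no division).
  return total_value_bytes > min_average * num_values;
}
```

A reader could call this once per page, using the page's total byte-array payload and value count, and fall back to per-value appends when it returns false.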
@mapleFU @pitrou 1- API and Handling of the Last Buffer 2-
In this pull request, I proposed a method that could help avoid memory bloat when buffers are shared. Additionally, in this issue, I think this metadata could help determine when CompactArray should be called. Overall, my suggestion is to either modify this pull request or create a new API to support buffer sharing. It is possible to decide whether a created array should be compacted based on some metadata, in order to avoid memory bloat.
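One way such a metadata-driven decision could look, as a sketch only. The comment above does not say which metadata is meant, so `BufferUsage`, `ShouldCompact` and the 0.5 utilization cut-off are all assumptions made for illustration and are not part of this PR or of Arrow.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bookkeeping a builder could keep while views are appended.
struct BufferUsage {
  int64_t buffer_bytes;      // total bytes held by the data buffers
  int64_t referenced_bytes;  // bytes covered by at least one out-of-line view
};

// Suggest compaction when only a small fraction of the held bytes is used.
// The default cut-off of 0.5 is an arbitrary placeholder.
bool ShouldCompact(const BufferUsage& usage, double min_utilization = 0.5) {
  assert(usage.buffer_bytes >= 0 && usage.referenced_bytes >= 0);
  if (usage.buffer_bytes == 0) return false;  // nothing to reclaim
  return static_cast<double>(usage.referenced_bytes) <
         min_utilization * static_cast<double>(usage.buffer_bytes);
}
```

Note that summing view lengths over-counts when views overlap or share a slice, so `referenced_bytes` would need to be tracked per distinct byte range for the ratio to be meaningful.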
Rationale for this change
see #46677
What changes are included in this PR?
see #46677
Are these changes tested?
Yes
Are there any user-facing changes?
No