Skip to content

GH-46677: [C++] Expose an BinaryViewBuilder interface for append a binary and multiple subslice #46730

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

IndifferentArea
Copy link
Contributor

@IndifferentArea IndifferentArea commented Jun 6, 2025

Rationale for this change

see #46677

What changes are included in this PR?

see #46677

Are these changes tested?

Yes

Are there any user-facing changes?

No

@IndifferentArea
Copy link
Contributor Author

@mapleFU is currently implemented interface expected?

return AppendBlock(value.data(), static_cast<int64_t>(value.size()));
}

Status AppendViewFromBuffer(int32_t buffer_id, int32_t buffer_offset, int32_t start,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming: from buffer or from block?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally both is ok for me, I prefer Buffer since a variable is buffer_index

@@ -645,6 +657,28 @@ class ARROW_EXPORT BinaryViewBuilder : public ArrayBuilder {
UnsafeAppend(value.data(), static_cast<int64_t>(value.size()));
}

Result<std::pair<int32_t, int32_t>> AppendBlock(const uint8_t* value,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use more specific name rather than pair<i32, i32>?

Copy link
Contributor Author

@IndifferentArea IndifferentArea Jun 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can i directly use BinaryViewType::c_type since it already contains these two info we need?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The syntax is a bit weird here? Append a BinaryView and then append the sub-slice of the view?

@@ -100,6 +100,13 @@ void BinaryViewBuilder::Reset() {
data_heap_builder_.Reset();
}

Result<std::pair<int32_t, int32_t>> BinaryViewBuilder::AppendBlock(const uint8_t* value,
const int64_t length) {
DCHECK_GT(length, TypeClass::kInlineSize);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If length <= kInlineSize, should this return false or ok? Why just DCHECK here?

c_type GetViewFromBlock(int32_t block_id, int32_t block_offset, int32_t offset,
int32_t length) const {
const auto* value = blocks_.at(block_id)->data_as<uint8_t>() + block_offset + offset;
if (length <= BinaryViewType::kInlineSize) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uses ToBinaryView?

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 7, 2025
@IndifferentArea
Copy link
Contributor Author

IndifferentArea commented Jun 7, 2025

Should we rename AppendBuffer or redesign the interface's semantics? since currently we don't really append a buffer/block, we will directly append to the last one if remaining size is enough.. I think It may introduce confusion.

Maybe aligning with arrow-rs's impl is fine..

@mapleFU
Copy link
Member

mapleFU commented Jun 7, 2025

Some personal thoughts:

  1. AppendBuffer ( and etc ) which returns a StringView is a bit weird, Block is not a StringView
  2. Now aligned with arrow-rs also a good way

@IndifferentArea
Copy link
Contributor Author

Not sure why these 2 ci always failed..

@IndifferentArea IndifferentArea marked this pull request as ready for review June 8, 2025 12:44
@IndifferentArea
Copy link
Contributor Author

IndifferentArea commented Jun 14, 2025

To implement interface aligned with what arrow-rs did, i have to change some behavior of StringHeapBuild::FinishLastBlock() and mark it as public.

More specific, before FinishLastBlock() just resize the last block. Now FinishLastBlock() reset internal states, including current_offset_, current_out_buffer_ and current_remaining_bytes_. I believe it's more safe and reasonable.
The interface was private before so don't mind external usage, for current internal usage of FinishLastBlock(), they always reset or change all these status.

If this change is unacceptable, plz let me know, i'll try to find another way.

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General LGTM. Also cc @pitrou for your advices for this interface. This interface would be used for read binary as stringView from bytearray type faster

return AppendBuffer(reinterpret_cast<const uint8_t*>(value), length);
}

Result<int32_t> AppendBuffer(const std::string& value) {
Copy link
Member

@mapleFU mapleFU Jun 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can std::string_view being used rather than const std::string&?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove this one since std::string_view is added here?

@IndifferentArea IndifferentArea requested a review from mapleFU June 18, 2025 12:21
void UnsafeAppendViewFromBuffer(const int32_t buffer_idx, const int32_t start,
const int32_t length) {
UnsafeAppendToBitmap(true);
const auto v = data_heap_builder_.GetViewFromBuffer<false>(buffer_idx, start, length);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const auto v = data_heap_builder_.GetViewFromBuffer<false>(buffer_idx, start, length);
const auto v = data_heap_builder_.GetViewFromBuffer</*Safe=*/false>(buffer_idx, start, length);

const int32_t length) {
ARROW_RETURN_NOT_OK(Reserve(1));
UnsafeAppendToBitmap(true);
ARROW_ASSIGN_OR_RAISE(const auto v, data_heap_builder_.GetViewFromBuffer<true>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ARROW_ASSIGN_OR_RAISE(const auto v, data_heap_builder_.GetViewFromBuffer<true>(
ARROW_ASSIGN_OR_RAISE(const auto v, data_heap_builder_.GetViewFromBuffer</*Safe=*/true>(

@mapleFU mapleFU requested a review from pitrou June 20, 2025 15:38
@mapleFU
Copy link
Member

mapleFU commented Jun 26, 2025

Gentle ping @pitrou

@pitrou
Copy link
Member

pitrou commented Jun 26, 2025

Isn't this approach wasteful? If you have lots of strings <= 12 bytes, you will still store their contents in a data buffer, while they're inlined in the string views.

@mapleFU
Copy link
Member

mapleFU commented Jun 26, 2025

Isn't this approach wasteful? If you have lots of strings <= 12 bytes, you will still store their contents in a data buffer, while they're inlined in the string views.

@pitrou I suppose this is used to append a whole parquet page and add buffer for it

@pitrou
Copy link
Member

pitrou commented Jun 26, 2025

Is it a win, though? If most Parquet strings are <= 12 bytes we would pointlessly waste space and CPU time.

@mapleFU
Copy link
Member

mapleFU commented Jun 26, 2025

Is it a win, though? If most Parquet strings are <= 12 bytes we would pointlessly waste space and CPU time.

A nice question. I agree when most Parquet strings are <= 12 bytes, it would be memory wasted because a huge memcpy is applied. But when read large binary it would benefit a lot from this. I think usally a large memcpy might much faster than little un-continogous memcpy

Maybe we can also try to pick this way when average len is huge enough?

@pitrou
Copy link
Member

pitrou commented Jun 26, 2025

A nice question. I agree when most Parquet strings are <= 12 bytes, it would be memory wasted because a huge memcpy is applied. But when read large binary it would benefit a lot from this. I think usally a large memcpy might much faster than little un-continogous memcpy

That's true, but another cost is to create the views themselves. It would be nice if a prototype could tell us which speedup we can expect.

Maybe we can also try to pick this way when average len is huge enough?

Yes, that's definitely a possibility.

@mapleFU
Copy link
Member

mapleFU commented Jun 26, 2025

So can we start to review this? We can set a ratio when average length > 12 or > 20

@andishgar
Copy link
Contributor

@mapleFU @pitrou
I believe this pull request is related to several other PRs I've submitted. Here's a summary:

1- API and Handling of the Last Buffer
In this pull request, I demonstrated that it’s possible to share buffers without copying or finalizing the last buffer. This avoids relocating the buffer to remove blank space, which can be a costly operation when the unused space exceeds 64 bytes.

2-

Is it a win, though? If most Parquet strings are <= 12 bytes we would pointlessly waste space and CPU time.

In this pull request, I proposed a method that could help avoid memory bloat when buffers are shared. Additionally, in this issue, I think this metadata could help determine when CompactArray should be called.

Overall, my suggestion is to either modify this pull request or create a new API to support buffer sharing. It is possible to decide whether a created array should be compacted based on some metadata, in order to avoid memory bloat.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants