Skip to content

GH-46177: [C++][Compute] Correct the behavior of cast compute functions regarding string types #46230

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

andishgar
Copy link
Contributor

@andishgar andishgar commented Apr 26, 2025

Rationale for this change
There are eight compute functions for casting between string types. All, except the fixed-to-offset string function, assume the second buffer is not preallocated. However, the second buffer is preallocated when the output does not involve String/Binary view types

What changes are included in this PR?
1-I changed the way the kernel is created not to allocate Buffer for the second buffers
2- Makes the behaviour of FixedSize to String types cast compute function likes the others.

Are these changes tested?
I run the relevant unit test.

Are there any user-facing changes?
No

Copy link

⚠️ GitHub issue #46177 has been automatically assigned in GitHub to PR creator.

@andishgar andishgar marked this pull request as ready for review April 26, 2025 13:51
@kou kou changed the title GH-46177: Correct the behavior of cast compute functions regarding string types GH-46177: [C++][Compute] Correct the behavior of cast compute functions regarding string types Apr 27, 2025
@bkietz
Copy link
Member

bkietz commented Apr 28, 2025

If we're not preallocating buffers which could be preallocated, I think it'd be better to correct the other kernels to do so rather than remove an optimization from what is probably the most-used kernel

@andishgar
Copy link
Contributor Author

andishgar commented Apr 29, 2025

If we're not preallocating buffers which could be preallocated, I think it'd be better to correct the other kernels to do so rather than remove an optimization from what is probably the most-used kernel

My changes impact 8 kernels. Based on your feedback, I understand I should rewrite the entire casting compute functions related binary to binary. Let me explain each change in detail.

1- Offset String -> Offset String
The allocated offset buffer is dropped in zero-copy casting

RETURN_NOT_OK(ZeroCopyCastExec(ctx, batch, out));

If the input and output offset types differ, a new buffer is allocated
ARROW_ASSIGN_OR_RAISE(
output->buffers[1],
ctx->Allocate((output->length + output->offset + 1) * sizeof(output_offset_type)));

ARROW_ASSIGN_OR_RAISE(output->buffers[1],
ctx->Allocate((output->length + output->offset + 1) *
sizeof(output_offset_type)));

2- String View -> Offset String
The implementation overlooks the allocation of a second buffer

OffsetBuilder offset_builder(ctx->memory_pool());

Additionally, there’s a risk of generating utf8() and binary() types with lengths exceeding 2 billion, as discussed here.

3- Offset String -> String View
This requires updates to the compute infrastructure first (issue). Additional challenges with slicing and casting are noted here.

4- String View -> String View
The code works correctly with the current logic in the compute function., however if we resolve this, MemAllocation::NO_PREALLOCATE should be used to prevent extra casting

5- Fixed -> String View
It has the same issue of Offset string-> String View

6- Fixed -> Offset string
Is it possible to have a separate path for scalar and array?

// Data buffer (index 1) for FWBinary becomes data buffer for VarBinary
// (index 2). After ARROW-16757, we need to copy this memory instead of
// zero-copy it because a Scalar value promoted to an ArraySpan may be
// referencing a temporary buffer whose scope does not extend beyond the
// kernel execution. In that scenario, the validity bitmap above can be
// zero-copied because it points to static memory (either a byte with a 1 or
// a 0 depending on whether the value is null or not).
std::shared_ptr<Buffer> input_data = input.GetBuffer(1);
if (input_data != nullptr) {
ARROW_ASSIGN_OR_RAISE(output->buffers[2], input_data->CopySlice(0, input_data->size(),
ctx->memory_pool()));
} else {
// TODO(wesm): it should already be nullptr, so we may be able to remove
// this
output->buffers[2] = nullptr;
}

7-Fixed -> Fixed
The second buffer is allocated, However, there is no need

8- String| String View -> Fixed
First : The second buffer is allocated but ignored by the implementation.
Second: Why the data buffer is copied for casting from offset string to FixedSizeString

RETURN_NOT_OK(VisitArraySpanInline<I>(
input,
[&](std::string_view v) {
if (v.size() != static_cast<size_t>(builder.byte_width())) {
return Status::Invalid("Failed casting from ", input.type->ToString(), " to ",
options.to_type.ToString(), ": widths must match");
}
builder.UnsafeAppend(v);
return Status::OK();
},
[&] {
builder.UnsafeAppendNull();
return Status::OK();
}));
return builder.FinishInternal(&std::get<std::shared_ptr<ArrayData>>(out->value));
}

@andishgar
Copy link
Contributor Author

Should I applay above changes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants