GH-46177: [C++][Compute] Correct the behavior of cast compute functions regarding string types #46230
base: main
If we're not preallocating buffers which could be preallocated, I think it'd be better to correct the other kernels to do so rather than remove an optimization from what is probably the most-used kernel.
My changes affect 8 kernels. Based on your feedback, I understand that I should rewrite all of the binary-to-binary cast compute functions. Let me explain each change in detail.
1- Offset String -> Offset String
If the input and output offset types differ, a new buffer is allocated:
arrow/cpp/src/arrow/compute/kernels/scalar_cast_string.cc Lines 251 to 253 in e6d0edc
arrow/cpp/src/arrow/compute/kernels/scalar_cast_string.cc Lines 279 to 281 in e6d0edc
2- String View -> Offset String
Additionally, there’s a risk of generating utf8() and binary() types with lengths exceeding 2 billion, as discussed here.
3- Offset String -> String View
4- String View -> String View
5- Fixed -> String View
6- Fixed -> Offset String
arrow/cpp/src/arrow/compute/kernels/scalar_cast_string.cc Lines 597 to 612 in e6d0edc
7- Fixed -> Fixed
8- String | String View -> Fixed
arrow/cpp/src/arrow/compute/kernels/scalar_cast_string.cc Lines 645 to 661 in e6d0edc
Should I apply the above changes?
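To make items 1 and 2 of the list above concrete, here is a minimal standalone sketch of copying offsets into a newly allocated buffer of a different width, with a check for values that do not fit (the "exceeding 2 billion" risk for 32-bit offsets). `CastOffsets` is a hypothetical helper written for illustration; it is not part of Arrow's API and does not reproduce the code in scalar_cast_string.cc.

```cpp
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

// Hypothetical helper (not Arrow's API): copy offsets into a freshly
// allocated buffer with a different offset width. Returns std::nullopt
// if any offset is negative or does not fit in the destination type.
template <typename OutOffset, typename InOffset>
std::optional<std::vector<OutOffset>> CastOffsets(const std::vector<InOffset>& in) {
  const int64_t out_max =
      static_cast<int64_t>(std::numeric_limits<OutOffset>::max());
  std::vector<OutOffset> out;
  out.reserve(in.size());
  for (InOffset value : in) {
    const int64_t wide = static_cast<int64_t>(value);
    if (wide < 0 || wide > out_max) {
      return std::nullopt;  // e.g. a 64-bit offset above the 32-bit limit
    }
    out.push_back(static_cast<OutOffset>(value));
  }
  return out;
}
```

Widening from 32-bit to 64-bit offsets always succeeds under this sketch, while narrowing can fail, which is why a cast into utf8() or binary() probably needs such a check before the output is produced.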
Rationale for this change
There are eight compute functions for casting between string types. All of them, except the fixed-to-offset string function, assume the second buffer is not preallocated. However, the second buffer is preallocated when the output does not involve String/Binary view types.
What changes are included in this PR?
1- Changed how the kernels are created so that the second buffer is not preallocated.
2- Made the FixedSize-to-String cast compute function behave like the others.
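As background for the FixedSize-to-String case, the sketch below shows one way the offsets for such a cast can be derived: every value has the same width, so offset `i` is simply `i * byte_width`. `FixedWidthOffsets` is a hypothetical name used only for illustration; this is an assumption-laden sketch with 32-bit offsets, not the kernel code changed in this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

// Hypothetical helper (not Arrow's API): build the 32-bit offsets for
// `length` fixed-width values of `byte_width` bytes each. Returns
// std::nullopt if the total data size would overflow a 32-bit offset.
std::optional<std::vector<int32_t>> FixedWidthOffsets(int64_t length,
                                                      int32_t byte_width) {
  const int64_t max_offset = std::numeric_limits<int32_t>::max();
  if (length < 0 || byte_width < 0) return std::nullopt;
  // Check length * byte_width <= max_offset without overflowing.
  if (byte_width > 0 && length > max_offset / byte_width) return std::nullopt;

  std::vector<int32_t> offsets(static_cast<std::size_t>(length) + 1);
  for (int64_t i = 0; i <= length; ++i) {
    offsets[static_cast<std::size_t>(i)] = static_cast<int32_t>(i * byte_width);
  }
  return offsets;
}
```

The overflow check runs before any allocation, so an oversized input is rejected without reserving memory for the offsets.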
Are these changes tested?
I ran the relevant unit tests.
Are there any user-facing changes?
No