Skip to content

GH-46578: [C++][Statistics] Fix the ownership handling of arrow::ArrayData::statistics in arrow::ArrayData::CopyTo and arrow::ArrayDataViewOrCopy #46625

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

andishgar
Copy link
Contributor

@andishgar andishgar commented May 28, 2025

Rationale for this change

Fix the ownership management of arrow::ArrayStatistics within arrow::ArrayData::ViewOrCopyTo .

What changes are included in this PR?

Correct ownership handling of arrow::ArrayStatistics in arrow::ArrayData::ViewOrCopyTo.
Add tests to verify whether arrow::ArrayData::statistics is shared or copied as expected.

Are these changes tested?

Yes, I run the relevant unit tests.

Are there any user-facing changes?

No

Copy link

⚠️ GitHub issue #46578 has been automatically assigned in GitHub to PR creator.

@andishgar andishgar marked this pull request as draft May 28, 2025 12:42
@andishgar andishgar changed the title GH-46578: [C++][Statistics]: Fix the ownership handling of arrow::ArrayData::statistics in arrow::ArrayData::CopyTo and arrow::ArrayDataViewOrCopy GH-46578: [C++][Statistics] Fix the ownership handling of arrow::ArrayData::statistics in arrow::ArrayData::CopyTo and arrow::ArrayDataViewOrCopy May 28, 2025
@andishgar
Copy link
Contributor Author

andishgar commented May 28, 2025

@kou
Before we begin the discussion, I’d like to share two notes:

1-While I’ve tried to follow the project’s coding guidelines, my primary goal was to provide an initial solution quickly to facilitate further discussion. Let’s first focus on whether the proposed solution is correct. Once we agree on that, we can refine the code structure. If anything is unclear in the code, please let me know—I’ll be happy to clarify it.

2-Currently, there are no existing tests for arrow::ArrayData::ViewOrCopyTo . This PR could also be seen as a starting point for introducing test coverage for this function too.

@andishgar
Copy link
Contributor Author

andishgar commented May 28, 2025

@kou
I have several questions related to our discussion:
1-The following code demonstrates a practical case where some buffers are copied while others are only viewed. In such a scenario, should arrow::ArrayStatistics be shared or Copied?

// Since it's not possible to get a host pointer from data_buffer,
// data_buffer is copied. (Bitmap buffer is viewed)
ASSERT_OK_AND_ASSIGN(auto copied_array_data, array_data->ViewOrCopyTo(cpu_mm_));
ASSERT_FALSE(is_statistics_shared(array_data, copied_array_data));

2-My solution might be incorrect on GPUs that do not support device attribute cudaDevAttrCanUseHostPointerForRegisteredMem, as noted in the code below.

// I ran this test on an RTX 3050 Mobile, which supports the device attribute
// cudaDevAttrCanUseHostPointerForRegisteredMem, and the assertion passed.
// On GPUs that do not support this attribute, the assertion is expected to fail.

Would it be possible to modify arrow::Buffer::ViewOrCopy to indicate whether the buffer was actually copied or not? Since it's clear within the method whether a copy occurs

Result<std::shared_ptr<Buffer>> Buffer::ViewOrCopy(
std::shared_ptr<Buffer> source, const std::shared_ptr<MemoryManager>& to) {
auto maybe_buffer = MemoryManager::ViewBuffer(source, to);
if (maybe_buffer.ok()) {
return maybe_buffer;
}
return MemoryManager::CopyBuffer(source, to);
}

@andishgar
Copy link
Contributor Author

andishgar commented May 28, 2025

@kou
I have several questions which is not related to our discussion.
1- Does Arrow Specification allow to have an Array which has buffers from several Device or MemoryManager?
1.1- if the answer is no, Should we consider the following validation as bug?

ASSERT_OK_AND_ASSIGN(auto bitmap_buffer, GenerateCPUBitmapBuffer(10));
ASSERT_OK_AND_ASSIGN(auto data_buffer, GenerateGPUtDataBufferPointToCpuMemory(
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}));
auto array_data = ArrayData::Make(int32(), 10, {});
array_data->buffers = {bitmap_buffer, data_buffer};
auto statistics = std::make_shared<ArrayStatistics>();
statistics->max = 10;
array_data->statistics = statistics;
array_data->null_count = 4;
ASSERT_TRUE(is_valid_as_array(array_data));

2- The question is located in the comment below.
ASSERT_OK_AND_ASSIGN(auto bitmap_buffer, GenerateGPUBitmapBuffer(10));
ASSERT_OK_AND_ASSIGN(auto data_buffer,
GenerateGPUtDataBuffer({1, 2, 3, 4, 5, 6, 7, 8, 9, 10}));
auto array_data = ArrayData::Make(int32(), 10, {});
array_data->buffers = {bitmap_buffer, data_buffer};
auto statistics = std::make_shared<ArrayStatistics>();
statistics->max = 10;
array_data->statistics = statistics;
array_data->null_count = 4;
// Why does the assertion below fail? Is the bitmap_buffer on the GPU invalid,
// even though the other GPU-resident buffers are valid?
// ASSERT_TRUE(is_valid_as_array(array_data));

3- The result of the following expression is false:
my_cuda_mm_->AllocateBuffer(10).ValueOrDie()->memory_manager() == my_cuda_mm_
3.1- Is this the expected behavior?
3.2- If the answer is no, should we consider the above line a bug?
// Despite using the same memory manager, CudaBuffers are still being copied.
// Should we consider this a bug?
ASSERT_OK_AND_ASSIGN(auto copied_array_data, array_data->ViewOrCopyTo(mm_));
ASSERT_FALSE(is_statistics_shared(array_data, copied_array_data));

4- arrow::CudaHostBuffer::FromVector and arrow::CudaBuffer::FromVector both call the following lines, which do not produce either a arrow::CudaHostBuffer or a CudaBuffer . Should this be considered a bug?
template <typename T>
static std::shared_ptr<Buffer> FromVector(std::vector<T> vec) {
static_assert(std::is_trivial_v<T>,
"Buffer::FromVector can only wrap vectors of trivial objects");
if (vec.empty()) {
return std::shared_ptr<Buffer>{new Buffer()};
}
auto* data = reinterpret_cast<uint8_t*>(vec.data());
auto size_in_bytes = static_cast<int64_t>(vec.size() * sizeof(T));
return std::shared_ptr<Buffer>{
new Buffer{data, size_in_bytes},
// Keep the vector's buffer alive inside the shared_ptr's destructor until after
// we have deleted the Buffer. Note we can't use this trick in FromString since
// std::string's data is inline for short strings so moving invalidates pointers
// into the string's buffer.
[vec = std::move(vec)](Buffer* buffer) { delete buffer; }};
}

@andishgar andishgar force-pushed the corret-ArrayStatistics-ownership branch from b3e4d22 to 96e68ef Compare May 28, 2025 18:36
@andishgar andishgar marked this pull request as ready for review May 28, 2025 18:54
@kou
Copy link
Member

kou commented May 29, 2025

Sorry for not responding this topic. I don't have enough time to consider this topic.

I think that we can always share with both of arrow::ArrayData::CopyTo() and arrow::ArrayData::ViewOrCopy(). Because both of them doesn't change data type.

arrow::Array::View() changes data type. It affects statistics. So we must discard statistics by arrow::Array::View().

But arrow::Array::CopyTo() and arrow::Array::ViewOrCopyTo() don't change data type. So we can share statistics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants