[Draft] Implemented casting for RunEnd Encoding (pt2)#8384
[Draft] Implemented casting for RunEnd Encoding (pt2)#8384Rich-T-kid wants to merge 4 commits intoapache:mainfrom
Conversation
Implement casting between REE arrays and other Arrow types. REE-to-REE casting validates run-end upcasts only (Int16→Int32, Int16→Int64, Int32→Int64) to prevent invalid sequences.
Implement casting between REE arrays and other Arrow types. REE-to-REE casting validates run-end upcasts only (Int16→Int32, Int16→Int64, Int32→Int64) to prevent invalid sequences. rebased changes
|
@Rich-T-kid I made some comments on the other PR #7713 |
| // Both non-null - use slice comparison as a basic approach | ||
| // This is a simplified implementation | ||
| cast_array.slice(i, 1).to_data() == cast_array.slice(i - 1, 1).to_data() |
There was a problem hiding this comment.
Addressing @brancz's comments about not copying #7713 (comment)
| // Both non-null - use slice comparison as a basic approach | |
| // This is a simplified implementation | |
| cast_array.slice(i, 1).to_data() == cast_array.slice(i - 1, 1).to_data() | |
| // Primitive types | |
| DataType::Boolean => { | |
| cast_array.as_boolean().value(i) == cast_array.as_boolean().value(i - 1) | |
| } | |
| DataType::Int8 => { | |
| cast_array.as_primitive::<Int8Type>().value(i) | |
| == cast_array.as_primitive::<Int8Type>().value(i - 1) | |
| } | |
| DataType::Int16 => { | |
| cast_array.as_primitive::<Int16Type>().value(i) | |
| == cast_array.as_primitive::<Int16Type>().value(i - 1) | |
| } | |
| DataType::Int32 => { | |
| cast_array.as_primitive::<Int32Type>().value(i) | |
| == cast_array.as_primitive::<Int32Type>().value(i - 1) | |
| } | |
| DataType::Int64 => { | |
| cast_array.as_primitive::<Int64Type>().value(i) | |
| == cast_array.as_primitive::<Int64Type>().value(i - 1) | |
| } | |
| DataType::UInt8 => { | |
| cast_array.as_primitive::<UInt8Type>().value(i) | |
| == cast_array.as_primitive::<UInt8Type>().value(i - 1) | |
| } | |
| DataType::UInt16 => { | |
| cast_array.as_primitive::<UInt16Type>().value(i) | |
| == cast_array.as_primitive::<UInt16Type>().value(i - 1) | |
| } | |
| DataType::UInt32 => { | |
| cast_array.as_primitive::<UInt32Type>().value(i) | |
| == cast_array.as_primitive::<UInt32Type>().value(i - 1) | |
| } | |
| DataType::UInt64 => { | |
| cast_array.as_primitive::<UInt64Type>().value(i) | |
| == cast_array.as_primitive::<UInt64Type>().value(i - 1) | |
| } | |
| DataType::Float16 => { | |
| cast_array.as_primitive::<Float16Type>().value(i) | |
| == cast_array.as_primitive::<Float16Type>().value(i - 1) | |
| } | |
| DataType::Float32 => { | |
| cast_array.as_primitive::<Float32Type>().value(i) | |
| == cast_array.as_primitive::<Float32Type>().value(i - 1) | |
| } | |
| DataType::Float64 => { | |
| cast_array.as_primitive::<Float64Type>().value(i) | |
| == cast_array.as_primitive::<Float64Type>().value(i - 1) | |
| } | |
| // String types | |
| DataType::Utf8 => { | |
| cast_array.as_string::<i32>().value(i) | |
| == cast_array.as_string::<i32>().value(i - 1) | |
| } | |
| DataType::LargeUtf8 => { | |
| cast_array.as_string::<i64>().value(i) | |
| == cast_array.as_string::<i64>().value(i - 1) | |
| } | |
| DataType::Utf8View => { | |
| cast_array.as_string_view().value(i) == cast_array.as_string_view().value(i - 1) | |
| } | |
| // Binary types | |
| DataType::Binary => { | |
| cast_array.as_binary::<i32>().value(i) | |
| == cast_array.as_binary::<i32>().value(i - 1) | |
| } | |
| DataType::LargeBinary => { | |
| cast_array.as_binary::<i64>().value(i) | |
| == cast_array.as_binary::<i64>().value(i - 1) | |
| } | |
| DataType::BinaryView => { | |
| cast_array.as_binary_view().value(i) == cast_array.as_binary_view().value(i - 1) | |
| } | |
| DataType::FixedSizeBinary(_) => { | |
| cast_array.as_fixed_size_binary().value(i) | |
| == cast_array.as_fixed_size_binary().value(i - 1) | |
| } | |
| // Temporal types | |
| DataType::Date32 => { | |
| cast_array.as_primitive::<Date32Type>().value(i) | |
| == cast_array.as_primitive::<Date32Type>().value(i - 1) | |
| } | |
| DataType::Date64 => { | |
| cast_array.as_primitive::<Date64Type>().value(i) | |
| == cast_array.as_primitive::<Date64Type>().value(i - 1) | |
| } | |
| DataType::Timestamp(time_unit, _) => match time_unit { | |
| TimeUnit::Second => { | |
| cast_array.as_primitive::<TimestampSecondType>().value(i) | |
| == cast_array | |
| .as_primitive::<TimestampSecondType>() | |
| .value(i - 1) | |
| } | |
| TimeUnit::Millisecond => { | |
| cast_array | |
| .as_primitive::<TimestampMillisecondType>() | |
| .value(i) | |
| == cast_array | |
| .as_primitive::<TimestampMillisecondType>() | |
| .value(i - 1) | |
| } | |
| TimeUnit::Microsecond => { | |
| cast_array | |
| .as_primitive::<TimestampMicrosecondType>() | |
| .value(i) | |
| == cast_array | |
| .as_primitive::<TimestampMicrosecondType>() | |
| .value(i - 1) | |
| } | |
| TimeUnit::Nanosecond => { | |
| cast_array | |
| .as_primitive::<TimestampNanosecondType>() | |
| .value(i) | |
| == cast_array | |
| .as_primitive::<TimestampNanosecondType>() | |
| .value(i - 1) | |
| } | |
| }, | |
| DataType::Time32(time_unit) => match time_unit { | |
| TimeUnit::Second => { | |
| cast_array.as_primitive::<Time32SecondType>().value(i) | |
| == cast_array.as_primitive::<Time32SecondType>().value(i - 1) | |
| } | |
| TimeUnit::Millisecond => { | |
| cast_array.as_primitive::<Time32MillisecondType>().value(i) | |
| == cast_array | |
| .as_primitive::<Time32MillisecondType>() | |
| .value(i - 1) | |
| } | |
| TimeUnit::Microsecond | TimeUnit::Nanosecond => { | |
| panic!("Time32 must have a TimeUnit of either seconds or milliseconds") | |
| } | |
| }, | |
| DataType::Time64(time_unit) => match time_unit { | |
| TimeUnit::Second | TimeUnit::Millisecond => { | |
| panic!("Time64 must have a TimeUnit of either microseconds or nanoseconds") | |
| } | |
| TimeUnit::Microsecond => { | |
| cast_array.as_primitive::<Time64MicrosecondType>().value(i) | |
| == cast_array | |
| .as_primitive::<Time64MicrosecondType>() | |
| .value(i - 1) | |
| } | |
| TimeUnit::Nanosecond => { | |
| cast_array.as_primitive::<Time64NanosecondType>().value(i) | |
| == cast_array | |
| .as_primitive::<Time64NanosecondType>() | |
| .value(i - 1) | |
| } | |
| }, | |
| DataType::Duration(time_unit) => match time_unit { | |
| TimeUnit::Second => { | |
| cast_array.as_primitive::<DurationSecondType>().value(i) | |
| == cast_array.as_primitive::<DurationSecondType>().value(i - 1) | |
| } | |
| TimeUnit::Millisecond => { | |
| cast_array | |
| .as_primitive::<DurationMillisecondType>() | |
| .value(i) | |
| == cast_array | |
| .as_primitive::<DurationMillisecondType>() | |
| .value(i - 1) | |
| } | |
| TimeUnit::Microsecond => { | |
| cast_array | |
| .as_primitive::<DurationMicrosecondType>() | |
| .value(i) | |
| == cast_array | |
| .as_primitive::<DurationMicrosecondType>() | |
| .value(i - 1) | |
| } | |
| TimeUnit::Nanosecond => { | |
| cast_array.as_primitive::<DurationNanosecondType>().value(i) | |
| == cast_array | |
| .as_primitive::<DurationNanosecondType>() | |
| .value(i - 1) | |
| } | |
| }, | |
| DataType::Interval(interval_unit) => match interval_unit { | |
| IntervalUnit::YearMonth => { | |
| cast_array.as_primitive::<IntervalYearMonthType>().value(i) | |
| == cast_array | |
| .as_primitive::<IntervalYearMonthType>() | |
| .value(i - 1) | |
| } | |
| IntervalUnit::DayTime => { | |
| cast_array.as_primitive::<IntervalDayTimeType>().value(i) | |
| == cast_array | |
| .as_primitive::<IntervalDayTimeType>() | |
| .value(i - 1) | |
| } | |
| IntervalUnit::MonthDayNano => { | |
| cast_array | |
| .as_primitive::<IntervalMonthDayNanoType>() | |
| .value(i) | |
| == cast_array | |
| .as_primitive::<IntervalMonthDayNanoType>() | |
| .value(i - 1) | |
| } | |
| }, | |
| // Decimal types | |
| DataType::Decimal32(_, _) => { | |
| cast_array.as_primitive::<Decimal32Type>().value(i) | |
| == cast_array.as_primitive::<Decimal32Type>().value(i - 1) | |
| } | |
| DataType::Decimal64(_, _) => { | |
| cast_array.as_primitive::<Decimal64Type>().value(i) | |
| == cast_array.as_primitive::<Decimal64Type>().value(i - 1) | |
| } | |
| DataType::Decimal128(_, _) => { | |
| cast_array.as_primitive::<Decimal128Type>().value(i) | |
| == cast_array.as_primitive::<Decimal128Type>().value(i - 1) | |
| } | |
| DataType::Decimal256(_, _) => { | |
| cast_array.as_primitive::<Decimal256Type>().value(i) | |
| == cast_array.as_primitive::<Decimal256Type>().value(i - 1) | |
| } | |
| // TODO: How to handle REE? | |
| DataType::RunEndEncoded(_, _) => todo!(), | |
| DataType::Null | |
| | DataType::List(_) | |
| | DataType::ListView(_) | |
| | DataType::FixedSizeList(_, _) | |
| | DataType::LargeList(_) | |
| | DataType::LargeListView(_) | |
| | DataType::Struct(_) | |
| | DataType::Union(_, _) | |
| | DataType::Dictionary(_, _) | |
| | DataType::Map(_, _) => false, |
vegarsti
left a comment
There was a problem hiding this comment.
These two changes together address @brancz's comments here #7713 (comment)
| let cast_array = if array.data_type() == value_type { | ||
| // No casting needed, use the array as-is | ||
| make_array(array.to_data()) | ||
| } else { | ||
| // Cast to the target value type | ||
| cast_with_options(array, value_type, cast_options)? | ||
| }; |
There was a problem hiding this comment.
| let cast_array = if array.data_type() == value_type { | |
| // No casting needed, use the array as-is | |
| make_array(array.to_data()) | |
| } else { | |
| // Cast to the target value type | |
| cast_with_options(array, value_type, cast_options)? | |
| }; | |
| let cast_array = if array.data_type() == value_type { | |
| array | |
| } else { | |
| &cast_with_options(array, value_type, cast_options)? | |
| }; |
| let indices = PrimitiveArray::<UInt32Type>::from_iter_values( | ||
| values_indices.iter().map(|&idx| idx as u32), | ||
| ); | ||
| let values_array = take(&cast_array, &indices, None)?; |
There was a problem hiding this comment.
| let values_array = take(&cast_array, &indices, None)?; | |
| let values_array = take(cast_array, &indices, None)?; |
|
@vegarsti you want to open a new PR that combines the changes (of course keeping @Rich-T-kid’s commits in tact but at this point you’ve done with that should be recognized as well :) )? |
Yeah, definitely! |
|
Opened #8589 cc @brancz @Rich-T-kid |
# Which issue does this PR close? - Contribues towards the RunEndEncoded (REE) epic #3520, but there is no specific issue for casting. - Replaces PRs #7713 and #8384. # Rationale for this change This PR implements casting support for RunEndEncoded arrays in Apache Arrow. # What changes are included in this PR? - `run_end_encoded_cast` in `arrow-cast/src/cast/run_array.rs` - `cast_to_run_end_encoded` in `arrow-cast/src/cast/run_array.rs` - Tests in `arrow-cast/src/cast/mod.rs` # Are these changes tested? Yes! # Are there any user-facing changes? No breaking changes, just new functionality --------- Co-authored-by: Richard Baah <[email protected]> Co-authored-by: Andrew Lamb <[email protected]>
|
Looks like this is superseded by #8589 |
Which issue does this PR close?
We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.
Rationale for this change
context
For whatever reason, I’m running into an authentication issue when trying to push to the branch I created on my Datadog account. I wanted to push some of the refactoring changes that Brancz mentioned on my previous PRs.
I’ve been very busy with school and work, so I don’t think I’ll be able to continue contributing to the project at the same level as I did during the summer. Hopefully, I can still help out with smaller tasks around the codebase, but I don’t have large chunks of time to devote to REE right now.
Sorry about that—I hope I can still make smaller contributions.