-
Notifications
You must be signed in to change notification settings - Fork 1k
Removal IntervalDtype/StructDtype inheritance #21114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
IntervalDtype now inherits from _BaseDtype instead of StructDtype. This change includes: - Updated class declaration - Store _fields directly in __init__ instead of calling super().__init__() - Added @Property fields that returns the stored _fields dict - Added @Property type that returns pd.Interval - Added @cached_property itemsize that computes size from fields - Removed outdated comment about subclassing StructDtype Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
The _recursively_replace_fields method converts dict with numeric/string keys
to {"left": ..., "right": ...} format. This is needed when results come from
pylibcudf without preserved nested field names.
The method:
- Converts dict keys (0, 1 or "0", "1") to proper field names ("left", "right")
- Handles nested StructDtype recursively
- Handles nested ListDtype recursively
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Update helper functions to handle IntervalDtype properly: - recursively_update_struct_names: Add handling for IntervalDtype to recursively update nested struct/list subtypes - _dtype_to_metadata: Add handling for IntervalDtype to generate proper ColumnMetadata for arrow conversion Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
IntervalColumn should explicitly raise NotImplementedError for __cuda_array_interface__ since intervals are structured types that cannot be represented as contiguous arrays. This matches the behavior of StructColumn and prevents incorrect usage.
When passing a list of pandas Interval objects to as_column (e.g., via IntervalIndex constructor), PyArrow cannot infer the interval type from the raw list. This commit adds pd.Interval to the special case handling that converts to pandas Series first, similar to pd.Timestamp and pd.Timedelta. The fix ensures that IntervalIndex([pd.Interval(0, 1)]) now correctly creates an IntervalColumn instead of a StructColumn. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
IntervalColumn now handles its own metadata via its own _with_type_metadata method, so this special case is no longer needed in StructColumn. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Now that IntervalDtype no longer inherits from StructDtype, we can use isinstance(dtype, StructDtype) instead of type(dtype) is not StructDtype. The exact type check was only needed to exclude the IntervalDtype subclass. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
StructColumn should create a regular Index, not IntervalIndex. Before the split, IntervalColumn was a subclass of StructColumn, so the tuple (IntervalColumn, StructColumn) was correct. Now that they are independent sibling classes, they need separate handling.
Add proper handling for pandas IntervalDtype in as_column to ensure it returns an IntervalColumn instead of a StructColumn. Changes: - In as_column: When converting pandas Series with IntervalDtype, wrap the result using _with_type_metadata with IntervalDtype - In StructColumn._with_type_metadata: Add dispatch logic to convert to IntervalColumn when given an IntervalDtype, similar to how IntervalColumn.from_arrow works This ensures that IntervalIndex constructor receives an IntervalColumn as expected, fixing the "data must be an iterable of Interval data" error. Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
IntervalDtype is stored as STRUCT at the pylibcudf level, so dtype_to_pylibcudf_type should return plc.TypeId.STRUCT for IntervalDtype, just like it does for StructDtype.
IntervalDtype has children (left and right fields) like StructDtype, so it needs the same special handling in column_empty to create child columns rather than trying to create a scalar.
When gathering columns (used by iloc, loc, etc.), apply type metadata to ensure the original column type is preserved. This is critical for IntervalColumn which is stored as STRUCT at the libcudf level but needs to be reconstituted as IntervalColumn with IntervalDtype after gather. Without this fix, operations like series.iloc[0] on interval series would return StructColumn instead of preserving the IntervalColumn type.
…ture - Remove _apply_child_metadata method from IntervalColumn - Update _get_sliced_child to use plc_column.num_children() - Update _with_type_metadata to use plc_column.children() - Remove children= parameter from _from_preprocessed calls - RESTORE IntervalDtype handling in StructColumn._with_type_metadata (needed for construction path: StructColumn -> IntervalColumn) - Add strict=True to zip() call The key insight is that StructColumn must retain the ability to convert itself to IntervalColumn via _with_type_metadata(IntervalDtype), even though IntervalColumn no longer inherits from StructColumn. This is because the construction path creates StructColumn first (from Arrow interval storage), then converts to IntervalColumn. All 2,711 interval tests now pass.
This method is not needed because: - IntervalColumn no longer inherits from StructColumn - The struct accessor only registers for StructDtype, not IntervalDtype - IntervalColumn's left/right properties directly access plc_column.children() All 2,711 interval tests still pass.
When finding a common type between two IntervalDtype with different subtypes (e.g., interval[int64, left] and interval[float64, left]), find_common_type was falling back to object dtype because numpy doesn't know about IntervalDtype. Now, when all dtypes are IntervalDtype with the same closed parameter, we: 1. Find the common type of their subtypes 2. Return IntervalDtype(common_subtype, closed) This fixes union() operations on empty IntervalIndexes with different subtypes, which was returning Index with dtype='object' instead of IntervalIndex. Fixes pandas compatibility test: tests/indexes/interval/test_setops.py::TestIntervalIndex::test_union_empty_result
mroeschke
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Big +1 with this decoupling
| # Wrap StructColumn as IntervalColumn with proper metadata | ||
| result = result._with_type_metadata( | ||
| IntervalDtype( | ||
| subtype=cudf.dtype(arbitrary.dtype.subtype), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: cudf.dtype will be called in the IntervalDtype constructor as it would be nice to have less places use cudf.dtype
| ) | ||
| # For pandas dtypes, store them directly in the column's dtype property | ||
| elif isinstance(dtype, pd.ArrowDtype) and isinstance( | ||
| dtype.pyarrow_dtype, pa.lib.StructType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably More Correct if dtype.pyarrow_dtype was an instance of ArrowIntervalType (we have similar handling for this in ColumnBase.from_arrow)
| # Check IntervalDtype first because it's a subclass of StructDtype | ||
| if isinstance(dtype, IntervalDtype): | ||
| # TODO: Rewrite this to avoid needing to round-trip via ColumnBase | ||
| # Dispatch to IntervalColumn when given IntervalDtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this entire branch just call return IntervalColumn._with_type_metadata?
1. Remove redundant cudf.dtype() call in column.py
- IntervalDtype constructor already calls cudf.dtype() internally
2. Check for ArrowIntervalType instead of pa.lib.StructType in interval.py
- More correct type check following pattern in ColumnBase.from_arrow
3. Simplify IntervalDtype handling in struct.py _with_type_metadata
- Instead of manually reconstructing IntervalColumn, create it with
current dtype and let IntervalColumn._with_type_metadata handle
the child conversion
- Reduces code duplication
All 2,711 interval tests pass.
Description
Checklist