v23.10.00
🚨 Breaking Changes
- Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
- Raise `MixedTypeError` when a column of mixed-dtype is being constructed (#14050) @galipremsagar
- Raise `NotImplementedError` for `MultiIndex.to_series` (#14049) @galipremsagar
- Create table_input_metadata from a table_metadata (#13920) @etseidl
- Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
- Change `NA` to `NaT` for `datetime` and `timedelta` types (#13868) @galipremsagar
- Fix `any`, `all` reduction behavior for `axis=None` and warn for other reductions (#13831) @galipremsagar
- Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
- Remove the libcudf cudf::offset_type type (#13788) @davidwendt
- Raise error when trying to join `datetime` and `timedelta` types with other types (#13786) @galipremsagar
- Update to Cython 3.0.0 (#13777) @vyasr
- Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
- Enforce deprecations in `23.10` (#13732) @galipremsagar
- Upgrade to arrow 12 (#13728) @galipremsagar
- Remove Arrow dependency from the `datasource.hpp` public header (#13698) @vuule
🐛 Bug Fixes
- Fix inaccurate ceil/floor and inaccurate rescaling casts of fixed-point values. (#14242) @bdice
- Fix inaccuracy in decimal128 rounding. (#14233) @bdice
- Workaround for illegal instruction error in sm90 for warp intrinsics with mask (#14201) @karthikeyann
- Fix pytorch related pytest (#14198) @galipremsagar
- Pin to `aws-sdk-cpp<1.11` (#14173) @pentschev
- Fix assert failure for range window functions (#14168) @mythrocks
- Fix Memcheck error found in JSON_TEST JsonReaderTest.ErrorStrings (#14164) @karthikeyann
- Fix calls to copy_bitmask to pass stream parameter (#14158) @davidwendt
- Fix DataFrame from Series with different CategoricalIndexes (#14157) @mroeschke
- Pin to numpy<1.25 and numba<0.58 to avoid errors and deprecation warnings-as-errors. (#14156) @bdice
- Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
- Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
- Fix DataFrame.values with no columns but index (#14134) @mroeschke
- Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
- Add support for nested dict in `DataFrame` constructor (#14119) @galipremsagar
- Restrict iterables of `DataFrame`'s as input to `DataFrame` constructor (#14118) @galipremsagar
- Allow `numeric_only=True` for reduction operations on numeric types (#14111) @galipremsagar
- Preserve name of the column while initializing a `DataFrame` (#14110) @galipremsagar
- Correct numerous 20054-D: dynamic initialization errors found on arm+12.2 (#14108) @robertmaynard
- Drop `kwargs` from `Series.count` (#14106) @galipremsagar
- Fix naming issues with `Index.to_frame` and `MultiIndex.to_frame` APIs (#14105) @galipremsagar
- Only use memory resources that haven't been freed (#14103) @robertmaynard
- Add support for `__round__` in `Series` and `DataFrame` (#14099) @galipremsagar
- Validate ignore_index type in drop_duplicates (#14098) @mroeschke
- Fix renaming `Series` and `Index` (#14080) @galipremsagar
- Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
- Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
- Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
- Use `conda mambabuild` rather than `mamba mambabuild` (#14067) @wence-
- Raise NotImplementedError in to_datetime with dayfirst without infer_format (#14058) @mroeschke
- Fix various issues in `Index.intersection` (#14054) @galipremsagar
- Fix `Index.difference` to match with pandas (#14053) @galipremsagar
- Fix empty string column construction (#14052) @galipremsagar
- Fix `IntervalIndex.union` to preserve type-metadata (#14051) @galipremsagar
- Raise `MixedTypeError` when a column of mixed-dtype is being constructed (#14050) @galipremsagar
- Raise `NotImplementedError` for `MultiIndex.to_series` (#14049) @galipremsagar
- Ignore compile_commands.json (#14048) @harrism
- Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
- Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
- Implement `sort_remaining` for `sort_index` (#14033) @wence-
- Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
- Temporary fix Parquet metadata with empty value string being ignored from writing (#14026) @ttnghia
- Preserve types of scalar being returned when possible in `quantile` (#14014) @galipremsagar
- Fix return type of `MultiIndex.difference` (#14009) @galipremsagar
- Raise an error when timezone subtypes are encountered in `pd.IntervalDtype` (#14006) @galipremsagar
- Fix map column can not be non-nullable for java (#14003) @res-life
- Fix `name` selection in `Index.difference` and `Index.intersection` (#13986) @galipremsagar
- Restore column type metadata with `dropna` to fix `factorize` API (#13980) @galipremsagar
- Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
- Fix `MultiIndex.to_numpy` to return numpy array with tuples (#13966) @galipremsagar
- Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
- Fix an issue with `IntervalIndex.repr` when null values are present (#13958) @galipremsagar
- Fix type metadata issue preservation with `Column.unique` (#13957) @galipremsagar
- Handle `Interval` scalars when passed in list-like inputs to `cudf.Index` (#13956) @galipremsagar
- Fix setting of categories order when `dtype` is passed to a `CategoricalColumn` (#13955) @galipremsagar
- Handle `as_index` in `GroupBy.apply` (#13951) @brandon-b-miller
- Raise error for string types in `nsmallest` and `nlargest` (#13946) @galipremsagar
- Fix `index` of `Groupby.apply` results when it is performed on empty objects (#13944) @galipremsagar
- Fix integer overflow in shim `device_sum` functions (#13943) @brandon-b-miller
- Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
- Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
- Fix construction of `Grouping` objects (#13932) @galipremsagar
- Fix an issue with `loc` when column names is `MultiIndex` (#13929) @galipremsagar
- Fix handling of typecasting in `searchsorted` (#13925) @galipremsagar
- Preserve index `name` in `reindex` (#13917) @galipremsagar
- Use `cudf::thread_index_type` in cuIO to prevent overflow in row indexing (#13910) @vuule
- Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
- Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
- Use cudf::thread_index_type in replace.cu. (#13905) @bdice
- Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
- Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
- Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
- Use `thread_index_type` to avoid index overflow in grid-stride loops (#13895) @PointKernel
- Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
- Raise error when trying to construct a `DataFrame` with mixed types (#13889) @galipremsagar
- Return `nan` when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
- Correctly detect the BOM mark in `read_csv` with compressed input (#13881) @vuule
- Check for the presence of all values in `MultiIndex.isin` (#13879) @galipremsagar
- Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
- Fix return type of `MultiIndex.levels` (#13870) @galipremsagar
- Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
- Disable construction of Index when `freq` is set in pandas-compatibility mode (#13857) @galipremsagar
- Fix an issue with fetching `NA` from a `TimedeltaColumn` (#13853) @galipremsagar
- Simplify implementation of interval_range() and fix behaviour for floating `freq` (#13844) @shwina
- Fix binary operations between `Series` and `Index` (#13842) @galipremsagar
- Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
- Fix read out of bounds in string concatenate (#13838) @pentschev
- Raise error for more cases when `timezone-aware` data is passed to `as_column` (#13835) @galipremsagar
- Fix `any`, `all` reduction behavior for `axis=None` and warn for other reductions (#13831) @galipremsagar
- Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
- Fix cuFile I/O factories (#13829) @vuule
- DataFrame with namedtuples uses ._field as column names (#13824) @mroeschke
- Branch 23.10 merge 23.08 (#13822) @vyasr
- Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
- No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
- Raise error when mixed types are being constructed (#13816) @galipremsagar
- Fix unbounded sequence issue in `DataFrame` constructor (#13811) @galipremsagar
- Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
- Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
- Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
- Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
- Update get_arrow to arrows 12 CMake target name of arrow::xsimd (#13790) @robertmaynard
- Raise error when trying to join `datetime` and `timedelta` types with other types (#13786) @galipremsagar
- Fix negative unary operation for boolean type (#13780) @galipremsagar
- Fix contains(`in`) method for `Series` (#13779) @galipremsagar
- Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
- Cast only time of day to nanos to avoid an overflow in Parquet INT96 write (#13776) @gerashegalov
- Preserve names of column object in various APIs (#13772) @galipremsagar
- Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
- Fix construction of DataFrames from dict when columns are provided (#13766) @wence-
- Provide our own Cython declaration for make_unique (#13746) @wence-
📖 Documentation
- Fix typo in docstring: metadata. (#14025) @bdice
- Fix typo in parquet/page_decode.cuh (#13849) @XinyuZeng
- Simplify Python doc configuration (#13826) @vyasr
- Update documentation to reflect recent changes in JSON reader and writer (#13791) @vuule
- Fix all warnings in Python docs (#13789) @vyasr
🚀 New Features
- [Java] Add JNI bindings for `integers_to_hex` (#14205) @razajafri
- Propagate errors from Parquet reader kernels back to host (#14167) @vuule
- JNI for `HISTOGRAM` and `MERGE_HISTOGRAM` aggregations (#14154) @ttnghia
- Expose streams in all public sorting APIs (#14146) @vyasr
- Enable direct ingestion and production of Arrow scalars (#14121) @vyasr
- Implement `GroupBy.value_counts` to match pandas API (#14114) @stmio
- Refactor parquet thrift reader (#14097) @etseidl
- Refactor `hash_reduce_by_row` (#14095) @ttnghia
- Support negative preceding/following for ROW window functions (#14093) @mythrocks
- Support for progressive parquet chunked reading. (#14079) @nvdbaranec
- Implement `HISTOGRAM` and `MERGE_HISTOGRAM` aggregations (#14045) @ttnghia
- Expose streams in public search APIs (#14034) @vyasr
- Expose streams in public replace APIs (#14010) @vyasr
- Add stream parameter to public cudf::strings::split APIs (#13997) @davidwendt
- Expose streams in public filling APIs (#13990) @vyasr
- Expose streams in public concatenate APIs (#13987) @vyasr
- Use HostMemoryAllocator in jni::allocate_host_buffer (#13975) @gerashegalov
- Enable fractional null probability for hashing benchmark (#13967) @Blonck
- Switch pylibcudf-enabled types to use enum class in Cython (#13931) @vyasr
- Add nvtext::tokenize_with_vocabulary API (#13930) @davidwendt
- Rewrite `DataFrame.stack` to support multi level column names (#13927) @isVoid
- Add HostMemoryAllocator interface (#13924) @gerashegalov
- Global stream pool (#13922) @etseidl
- Create table_input_metadata from a table_metadata (#13920) @etseidl
- Translate column size overflow exception to JNI (#13911) @mythrocks
- Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
- Exclude some tests from running with the compute sanitizer (#13872) @firestarman
- Expand statistics support in ORC writer (#13848) @vuule
- Register the memory mapped buffer in `datasource` to improve H2D throughput (#13814) @vuule
- Add cudf::strings::find function with target per row (#13808) @davidwendt
- Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
- Remove unnecessary pointer copying in JIT GroupBy Apply (#13792) @brandon-b-miller
- Add 'poll' function to custreamz kafka consumer (#13782) @jdye64
- Support `corr` in `GroupBy.apply` through the jit engine (#13767) @shwina
- Optionally write version 2 page headers in Parquet writer (#13751) @etseidl
- Support more numeric types in `Groupby.apply` with `engine='jit'` (#13729) @brandon-b-miller
- [FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader (#13637) @etseidl
- Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader (#13437) @PointKernel
🛠️ Improvements
- Pin `dask` and `distributed` for `23.10` release (#14225) @galipremsagar
- update rmm tag path (#14195) @AyodeAwe
- Disable `Recently Updated` Check (#14193) @ajschmidt8
- Move cpp/src/hash/hash_allocator.cuh to include/cudf/hashing/detail (#14163) @davidwendt
- Add Parquet reader benchmarks for row selection (#14147) @vuule
- Update image names (#14145) @AyodeAwe
- Support callables in DataFrame.assign (#14142) @wence-
- Reduce memory usage of as_categorical_column (#14138) @wence-
- Replace Python scalar conversions with libcudf (#14124) @vyasr
- Update to clang 16.0.6. (#14120) @bdice
- Fix type of empty `Index` and raise warning in `Series` constructor (#14116) @galipremsagar
- Add stream parameter to external dict APIs (#14115) @SurajAralihalli
- Add fallback matrix for nvcomp. (#14082) @bdice
- [Java] Add recoverWithNull to JSONOptions and pass to Table.readJSON (#14078) @andygrove
- Remove header tests (#14072) @ajschmidt8
- Refactor `contains_table` with cuco::static_set (#14064) @PointKernel
- Remove debug print in a Parquet test (#14063) @vuule
- Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
- Expose stream parameter in public strings find APIs (#14060) @davidwendt
- Update doxygen to 1.9.1 (#14059) @vyasr
- Remove the mr from the base fixture (#14057) @vyasr
- Expose streams in public strings case APIs (#14056) @davidwendt
- Refactor libcudf indexalator to typed normalator (#14043) @davidwendt
- Use cudf::make_empty_column instead of column_view constructor (#14030) @davidwendt
- Remove quadratic runtime due to accessing Frame._dtypes in loop (#14028) @wence-
- Explicitly depend on zlib in conda recipes (#14018) @wence-
- Use grid_stride for stride computations. (#13996) @bdice
- Fix an issue where casting null-array to `object` dtype will result in a failure (#13994) @galipremsagar
- Add tab as literal to cudf::test::to_string output (#13993) @davidwendt
- Enable `codes` dtype parity in pandas-compatibility mode for `factorize` API (#13982) @galipremsagar
- Fix `CategoricalIndex` ordering in `Groupby.agg` when pandas-compatibility mode is enabled (#13978) @galipremsagar
- Produce a fatal error if cudf is unable to find pyarrow include directory (#13976) @cwharris
- Use `thread_index_type` in `partitioning.cu` (#13973) @divyegala
- Use `cudf::thread_index_type` in `merge.cu` (#13972) @divyegala
- Use `copy-pr-bot` (#13970) @ajschmidt8
- Use cudf::thread_index_type in strings custom kernels (#13968) @davidwendt
- Add `bytes_per_second` to hash_partition benchmark (#13965) @Blonck
- Added pinned pool reservation API for java (#13964) @revans2
- Simplify wheel build scripts and allow alphas of RAPIDS dependencies (#13963) @vyasr
- Add `bytes_per_second` to copy_if_else benchmark (#13960) @Blonck
- Add pandas compatible output to `Series.unique` (#13959) @galipremsagar
- Add `bytes_per_second` to compiled binaryop benchmark (#13938) @Blonck
- Unpin `dask` and `distributed` for `23.10` development (#13935) @galipremsagar
- Make HostColumnVector.getRefCount public (#13934) @abellina
- Use cuco::static_set in JSON tree algorithm (#13928) @karthikeyann
- Add java API to get size of host memory needed to copy column view (#13919) @revans2
- Use cudf::size_type instead of int32 where appropriate in nvtext functions (#13915) @davidwendt
- Enable hugepage for arrow host allocations (#13914) @madsbk
- Improve performance of nvtext::edit_distance (#13912) @davidwendt
- Ensure cudf internals use pylibcudf in pure Python mode (#13909) @vyasr
- Use `empty()` instead of `size()` where possible (#13908) @vuule
- [JNI] Adds HostColumnVector.EventHandler for spillability checks (#13898) @abellina
- Return `Timestamp` & `Timedelta` for fetching scalars in `DatetimeIndex` & `TimedeltaIndex` (#13896) @galipremsagar
- Allow explicit `shuffle="p2p"` within dask-cudf API (#13893) @rjzamora
- Disable creation of `DatetimeIndex` when `freq` is passed to `cudf.date_range` (#13890) @galipremsagar
- Bring parity with pandas for `datetime` & `timedelta` comparison operations (#13877) @galipremsagar
- Change `NA` to `NaT` for `datetime` and `timedelta` types (#13868) @galipremsagar
- Raise error when `astype(object)` is called in pandas compatibility mode (#13862) @galipremsagar
- Fixes a performance regression in FST (#13850) @elstehle
- Set native handles to null on close in Java wrapper classes (#13818) @jlowe
- Avoid use of CUDF_EXPECTS in libcudf unit tests outside of helper functions with return values (#13812) @vuule
- Update `lists::contains` to experimental row comparator (#13810) @divyegala
- Reduce `lists::contains` dispatches for scalars (#13805) @divyegala
- Long string optimization for string column parsing in JSON reader (#13803) @karthikeyann
- Raise NotImplementedError for pd.SparseDtype (#13798) @mroeschke
- Remove the libcudf cudf::offset_type type (#13788) @davidwendt
- Move Spark-independent Table debug to cudf Java (#13783) @gerashegalov
- Update to Cython 3.0.0 (#13777) @vyasr
- Refactor Parquet reader handling of V2 page header info (#13775) @etseidl
- Branch 23.10 merge 23.08 (#13773) @vyasr
- Restructure JSON code to correctly reflect legacy/experimental status (#13757) @vuule
- Branch 23.10 merge 23.08 (#13753) @vyasr
- Enforce deprecations in `23.10` (#13732) @galipremsagar
- Upgrade to arrow 12 (#13728) @galipremsagar
- Refactors JSON reader's pushdown automaton (#13716) @elstehle
- Remove Arrow dependency from the `datasource.hpp` public header (#13698) @vuule