Release v23.10.00 · rapidsai/cudf

🚨 Breaking Changes

Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Create table_input_metadata from a table_metadata (#13920) @etseidl
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Update to Cython 3.0.0 (#13777) @vyasr
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

🐛 Bug Fixes

Fix inaccurate ceil/floor and inaccurate rescaling casts of fixed-point values. (#14242) @bdice
Fix inaccuracy in decimal128 rounding. (#14233) @bdice
Workaround for illegal instruction error in sm90 for warp intrinsics with mask (#14201) @karthikeyann
Fix pytorch related pytest (#14198) @galipremsagar
Pin to aws-sdk-cpp<1.11 (#14173) @pentschev
Fix assert failure for range window functions (#14168) @mythrocks
Fix Memcheck error found in JSON_TEST JsonReaderTest.ErrorStrings (#14164) @karthikeyann
Fix calls to copy_bitmask to pass stream parameter (#14158) @davidwendt
Fix DataFrame from Series with different CategoricalIndexes (#14157) @mroeschke
Pin to numpy<1.25 and numba<0.58 to avoid errors and deprecation warnings-as-errors. (#14156) @bdice
Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
Fix DataFrame.values with no columns but index (#14134) @mroeschke
Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
Add support for nested dict in DataFrame constructor (#14119) @galipremsagar
Restrict iterables of DataFrame's as input to DataFrame constructor (#14118) @galipremsagar
Allow numeric_only=True for reduction operations on numeric types (#14111) @galipremsagar
Preserve name of the column while initializing a DataFrame (#14110) @galipremsagar
Correct numerous 20054-D: dynamic initialization errors found on arm+12.2 (#14108) @robertmaynard
Drop kwargs from Series.count (#14106) @galipremsagar
Fix naming issues with Index.to_frame and MultiIndex.to_frame APIs (#14105) @galipremsagar
Only use memory resources that haven't been freed (#14103) @robertmaynard
Add support for __round__ in Series and DataFrame (#14099) @galipremsagar
Validate ignore_index type in drop_duplicates (#14098) @mroeschke
Fix renaming Series and Index (#14080) @galipremsagar
Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
Use conda mambabuild rather than mamba mambabuild (#14067) @wence-
Raise NotImplementedError in to_datetime with dayfirst without infer_format (#14058) @mroeschke
Fix various issues in Index.intersection (#14054) @galipremsagar
Fix Index.difference to match with pandas (#14053) @galipremsagar
Fix empty string column construction (#14052) @galipremsagar
Fix IntervalIndex.union to preserve type-metadata (#14051) @galipremsagar
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Ignore compile_commands.json (#14048) @harrism
Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
Implement sort_remaining for sort_index (#14033) @wence-
Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
Temporary fix for Parquet metadata with empty value string being ignored when writing (#14026) @ttnghia
Preserve types of scalar being returned when possible in quantile (#14014) @galipremsagar
Fix return type of MultiIndex.difference (#14009) @galipremsagar
Raise an error when timezone subtypes are encountered in pd.IntervalDtype (#14006) @galipremsagar
Fix issue where a map column cannot be non-nullable in Java (#14003) @res-life
Fix name selection in Index.difference and Index.intersection (#13986) @galipremsagar
Restore column type metadata with dropna to fix factorize API (#13980) @galipremsagar
Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
Fix MultiIndex.to_numpy to return numpy array with tuples (#13966) @galipremsagar
Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
Fix an issue with IntervalIndex.repr when null values are present (#13958) @galipremsagar
Fix type metadata preservation issue with Column.unique (#13957) @galipremsagar
Handle Interval scalars when passed in list-like inputs to cudf.Index (#13956) @galipremsagar
Fix setting of categories order when dtype is passed to a CategoricalColumn (#13955) @galipremsagar
Handle as_index in GroupBy.apply (#13951) @brandon-b-miller
Raise error for string types in nsmallest and nlargest (#13946) @galipremsagar
Fix index of Groupby.apply results when it is performed on empty objects (#13944) @galipremsagar
Fix integer overflow in shim device_sum functions (#13943) @brandon-b-miller
Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
Fix construction of Grouping objects (#13932) @galipremsagar
Fix an issue with loc when column names is MultiIndex (#13929) @galipremsagar
Fix handling of typecasting in searchsorted (#13925) @galipremsagar
Preserve index name in reindex (#13917) @galipremsagar
Use cudf::thread_index_type in cuIO to prevent overflow in row indexing (#13910) @vuule
Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
Use cudf::thread_index_type in replace.cu. (#13905) @bdice
Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
Use thread_index_type to avoid index overflow in grid-stride loops (#13895) @PointKernel
Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
Raise error when trying to construct a DataFrame with mixed types (#13889) @galipremsagar
Return nan when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
Correctly detect the BOM mark in read_csv with compressed input (#13881) @vuule
Check for the presence of all values in MultiIndex.isin (#13879) @galipremsagar
Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
Fix return type of MultiIndex.levels (#13870) @galipremsagar
Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
Disable construction of Index when freq is set in pandas-compatibility mode (#13857) @galipremsagar
Fix an issue with fetching NA from a TimedeltaColumn (#13853) @galipremsagar
Simplify implementation of interval_range() and fix behaviour for floating freq (#13844) @shwina
Fix binary operations between Series and Index (#13842) @galipremsagar
Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
Fix read out of bounds in string concatenate (#13838) @pentschev
Raise error for more cases when timezone-aware data is passed to as_column (#13835) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
Fix cuFile I/O factories (#13829) @vuule
DataFrame with namedtuples uses ._field as column names (#13824) @mroeschke
Branch 23.10 merge 23.08 (#13822) @vyasr
Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
Raise error when mixed types are being constructed (#13816) @galipremsagar
Fix unbounded sequence issue in DataFrame constructor (#13811) @galipremsagar
Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
Update get_arrow to Arrow 12's CMake target name of arrow::xsimd (#13790) @robertmaynard
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Fix negative unary operation for boolean type (#13780) @galipremsagar
Fix contains(in) method for Series (#13779) @galipremsagar
Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
Cast only time of day to nanos to avoid an overflow in Parquet INT96 write (#13776) @gerashegalov
Preserve names of column object in various APIs (#13772) @galipremsagar
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Fix construction of DataFrames from dict when columns are provided (#13766) @wence-
Provide our own Cython declaration for make_unique (#13746) @wence-

📖 Documentation

Fix typo in docstring: metadata. (#14025) @bdice
Fix typo in parquet/page_decode.cuh (#13849) @XinyuZeng
Simplify Python doc configuration (#13826) @vyasr
Update documentation to reflect recent changes in JSON reader and writer (#13791) @vuule
Fix all warnings in Python docs (#13789) @vyasr

🚀 New Features

[Java] Add JNI bindings for integers_to_hex (#14205) @razajafri
Propagate errors from Parquet reader kernels back to host (#14167) @vuule
JNI for HISTOGRAM and MERGE_HISTOGRAM aggregations (#14154) @ttnghia
Expose streams in all public sorting APIs (#14146) @vyasr
Enable direct ingestion and production of Arrow scalars (#14121) @vyasr
Implement GroupBy.value_counts to match pandas API (#14114) @stmio
Refactor parquet thrift reader (#14097) @etseidl
Refactor hash_reduce_by_row (#14095) @ttnghia
Support negative preceding/following for ROW window functions (#14093) @mythrocks
Support for progressive parquet chunked reading. (#14079) @nvdbaranec
Implement HISTOGRAM and MERGE_HISTOGRAM aggregations (#14045) @ttnghia
Expose streams in public search APIs (#14034) @vyasr
Expose streams in public replace APIs (#14010) @vyasr
Add stream parameter to public cudf::strings::split APIs (#13997) @davidwendt
Expose streams in public filling APIs (#13990) @vyasr
Expose streams in public concatenate APIs (#13987) @vyasr
Use HostMemoryAllocator in jni::allocate_host_buffer (#13975) @gerashegalov
Enable fractional null probability for hashing benchmark (#13967) @Blonck
Switch pylibcudf-enabled types to use enum class in Cython (#13931) @vyasr
Add nvtext::tokenize_with_vocabulary API (#13930) @davidwendt
Rewrite DataFrame.stack to support multi level column names (#13927) @isVoid
Add HostMemoryAllocator interface (#13924) @gerashegalov
Global stream pool (#13922) @etseidl
Create table_input_metadata from a table_metadata (#13920) @etseidl
Translate column size overflow exception to JNI (#13911) @mythrocks
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Exclude some tests from running with the compute sanitizer (#13872) @firestarman
Expand statistics support in ORC writer (#13848) @vuule
Register the memory mapped buffer in datasource to improve H2D throughput (#13814) @vuule
Add cudf::strings::find function with target per row (#13808) @davidwendt
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove unnecessary pointer copying in JIT GroupBy Apply (#13792) @brandon-b-miller
Add 'poll' function to custreamz kafka consumer (#13782) @jdye64
Support corr in GroupBy.apply through the jit engine (#13767) @shwina
Optionally write version 2 page headers in Parquet writer (#13751) @etseidl
Support more numeric types in Groupby.apply with engine='jit' (#13729) @brandon-b-miller
[FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader (#13637) @etseidl
Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader (#13437) @PointKernel

🛠️ Improvements

Pin dask and distributed for 23.10 release (#14225) @galipremsagar
update rmm tag path (#14195) @AyodeAwe
Disable Recently Updated Check (#14193) @ajschmidt8
Move cpp/src/hash/hash_allocator.cuh to include/cudf/hashing/detail (#14163) @davidwendt
Add Parquet reader benchmarks for row selection (#14147) @vuule
Update image names (#14145) @AyodeAwe
Support callables in DataFrame.assign (#14142) @wence-
Reduce memory usage of as_categorical_column (#14138) @wence-
Replace Python scalar conversions with libcudf (#14124) @vyasr
Update to clang 16.0.6. (#14120) @bdice
Fix type of empty Index and raise warning in Series constructor (#14116) @galipremsagar
Add stream parameter to external dict APIs (#14115) @SurajAralihalli
Add fallback matrix for nvcomp. (#14082) @bdice
[Java] Add recoverWithNull to JSONOptions and pass to Table.readJSON (#14078) @andygrove
Remove header tests (#14072) @ajschmidt8
Refactor contains_table with cuco::static_set (#14064) @PointKernel
Remove debug print in a Parquet test (#14063) @vuule
Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
Expose stream parameter in public strings find APIs (#14060) @davidwendt
Update doxygen to 1.9.1 (#14059) @vyasr
Remove the mr from the base fixture (#14057) @vyasr
Expose streams in public strings case APIs (#14056) @davidwendt
Refactor libcudf indexalator to typed normalator (#14043) @davidwendt
Use cudf::make_empty_column instead of column_view constructor (#14030) @davidwendt
Remove quadratic runtime due to accessing Frame._dtypes in loop (#14028) @wence-
Explicitly depend on zlib in conda recipes (#14018) @wence-
Use grid_stride for stride computations. (#13996) @bdice
Fix an issue where casting null-array to object dtype will result in a failure (#13994) @galipremsagar
Add tab as literal to cudf::test::to_string output (#13993) @davidwendt
Enable codes dtype parity in pandas-compatibility mode for factorize API (#13982) @galipremsagar
Fix CategoricalIndex ordering in Groupby.agg when pandas-compatibility mode is enabled (#13978) @galipremsagar
Produce a fatal error if cudf is unable to find pyarrow include directory (#13976) @cwharris
Use thread_index_type in partitioning.cu (#13973) @divyegala
Use cudf::thread_index_type in merge.cu (#13972) @divyegala
Use copy-pr-bot (#13970) @ajschmidt8
Use cudf::thread_index_type in strings custom kernels (#13968) @davidwendt
Add bytes_per_second to hash_partition benchmark (#13965) @Blonck
Added pinned pool reservation API for java (#13964) @revans2
Simplify wheel build scripts and allow alphas of RAPIDS dependencies (#13963) @vyasr
Add bytes_per_second to copy_if_else benchmark (#13960) @Blonck
Add pandas compatible output to Series.unique (#13959) @galipremsagar
Add bytes_per_second to compiled binaryop benchmark (#13938) @Blonck
Unpin dask and distributed for 23.10 development (#13935) @galipremsagar
Make HostColumnVector.getRefCount public (#13934) @abellina
Use cuco::static_set in JSON tree algorithm (#13928) @karthikeyann
Add java API to get size of host memory needed to copy column view (#13919) @revans2
Use cudf::size_type instead of int32 where appropriate in nvtext functions (#13915) @davidwendt
Enable hugepage for arrow host allocations (#13914) @madsbk
Improve performance of nvtext::edit_distance (#13912) @davidwendt
Ensure cudf internals use pylibcudf in pure Python mode (#13909) @vyasr
Use empty() instead of size() where possible (#13908) @vuule
[JNI] Adds HostColumnVector.EventHandler for spillability checks (#13898) @abellina
Return Timestamp & Timedelta for fetching scalars in DatetimeIndex & TimedeltaIndex (#13896) @galipremsagar
Allow explicit shuffle="p2p" within dask-cudf API (#13893) @rjzamora
Disable creation of DatetimeIndex when freq is passed to cudf.date_range (#13890) @galipremsagar
Bring parity with pandas for datetime & timedelta comparison operations (#13877) @galipremsagar
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Raise error when astype(object) is called in pandas compatibility mode (#13862) @galipremsagar
Fixes a performance regression in FST (#13850) @elstehle
Set native handles to null on close in Java wrapper classes (#13818) @jlowe
Avoid use of CUDF_EXPECTS in libcudf unit tests outside of helper functions with return values (#13812) @vuule
Update lists::contains to experimental row comparator (#13810) @divyegala
Reduce lists::contains dispatches for scalars (#13805) @divyegala
Long string optimization for string column parsing in JSON reader (#13803) @karthikeyann
Raise NotImplementedError for pd.SparseDtype (#13798) @mroeschke
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Move Spark-independent Table debug to cudf Java (#13783) @gerashegalov
Update to Cython 3.0.0 (#13777) @vyasr
Refactor Parquet reader handling of V2 page header info (#13775) @etseidl
Branch 23.10 merge 23.08 (#13773) @vyasr
Restructure JSON code to correctly reflect legacy/experimental status (#13757) @vuule
Branch 23.10 merge 23.08 (#13753) @vyasr
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Refactors JSON reader's pushdown automaton (#13716) @elstehle
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule
