Release v23.08.00 · rapidsai/cudf

🚨 Breaking Changes

Enforce deprecations and add clarifications around existing deprecations (#13710) @galipremsagar
Separate MurmurHash32 from hash_functions.cuh (#13681) @davidwendt
Avoid storing metadata in pointers in ORC and Parquet writers (#13648) @vuule
Expose streams in all public copying APIs (#13629) @vyasr
Remove deprecated cudf::strings::slice_strings (by delimiter) functions (#13628) @davidwendt
Remove deprecated cudf.set_allocator. (#13591) @bdice
Change build.sh to use pip install instead of setup.py (#13507) @vyasr
Remove unused max_rows_tensor parameter from subword tokenizer (#13463) @davidwendt
Fix decimal scale reductions in _get_decimal_type (#13224) @charlesbluca

🐛 Bug Fixes

Add CUDA version to cudf_kafka and libcudf-example build strings. (#13769) @bdice
Fix typo in wheels-test.yaml. (#13763) @bdice
Don't test strings shorter than the requested ngram size (#13758) @vyasr
Add CUDA version to custreamz build string. (#13754) @bdice
Fix writing of ORC files with empty child string columns (#13745) @vuule
Remove the erroneous "empty level" short-circuit from ORC reader (#13722) @vuule
Fix character counting when writing sliced tables into ORC (#13721) @vuule
Parquet uses row group row count if missing from header (#13712) @hyperbolic2346
Fix reading of RLE encoded boolean data from parquet files with V2 page headers (#13707) @etseidl
Fix a corner case of list lexicographic comparator (#13701) @ttnghia
Fix combined filtering and column projection in dask_cudf.read_parquet (#13697) @rjzamora
Revert fetch-rapids changes (#13696) @vyasr
Data generator - include offsets in the size estimate of list elements (#13688) @vuule
Add cuda-nvcc-impl to cudf for numba CUDA 12 (#13673) @jakirkham
Fix combined filtering and column projection in read_parquet (#13666) @rjzamora
Use thrust::identity as hash functions for byte pair encoding (#13665) @PointKernel
Fix loc-getitem ordering when index contains duplicate labels (#13659) @wence-
Introduce parity with pandas for MultiIndex.loc ordering & fix a bug in Groupby with as_index (#13657) @galipremsagar
Fix memcheck error found in nvtext tokenize functions (#13649) @davidwendt
Fix has_nonempty_nulls ignoring column offset (#13647) @ttnghia
[Java] Avoid double-free corruption in case of an Exception while creating a ColumnView (#13645) @razajafri
Fix memcheck error in ORC reader call to cudf::io::copy_uncompressed_kernel (#13643) @davidwendt
Fix CUDA 12 conda environment to remove cubinlinker and ptxcompiler. (#13636) @bdice
Fix inf/NaN comparisons for FLOAT orderby in window functions (#13635) @mythrocks
Refactor Index search to simplify code and increase correctness (#13625) @wence-
Fix compile warning for unused variable in split_re.cu (#13621) @davidwendt
Fix tz_localize for dask_cudf Series (#13610) @shwina
Fix issue with no decompressed data in ORC reader (#13609) @vuule
Fix floating point window range extents. (#13606) @mythrocks
Fix localize(None) for timezone-naive columns (#13603) @shwina
Fixed a memory leak caused by Exception thrown while constructing a ColumnView (#13597) @razajafri
Handle nullptr return value from bitmask_or in distinct_count (#13590) @wence-
Bring parity with pandas in Index.join (#13589) @galipremsagar
Fix cudf.melt when there are more than 255 columns (#13588) @hcho3
Fix memory issues in cuIO due to removal of memory padding (#13586) @ttnghia
Fix Parquet multi-file reading (#13584) @etseidl
Fix memcheck error found in LISTS_TEST (#13579) @davidwendt
Fix memcheck error found in STRINGS_TEST (#13578) @davidwendt
Fix memcheck error found in INTEROP_TEST (#13577) @davidwendt
Fix memcheck errors found in REDUCTION_TEST (#13574) @davidwendt
Preemptive fix for hive-partitioning change in dask (#13564) @rjzamora
Fix an issue with dask_cudf.read_csv when lines need to be skipped (#13555) @galipremsagar
Fix out-of-bounds memory write in cudf::dictionary::detail::concatenate (#13554) @davidwendt
Fix the null mask size in json reader (#13537) @karthikeyann
Fix cudf::strings::strip for all-empty input column (#13533) @davidwendt
Make sure to build without isolation or installing dependencies (#13524) @vyasr
Remove preload lib from CMake for now (#13519) @vyasr
Fix missing separator after null values in JSON writer (#13503) @karthikeyann
Ensure single_lane_block_sum_reduce is safe to call in a loop (#13488) @wence-
Update all versions in pyproject.toml files. (#13486) @bdice
Remove applying nvbench that doesn't exist in 23.08 (#13484) @robertmaynard
Fix chunked Parquet reader benchmark (#13482) @vuule
Update JNI JSON reader column compatibility for Spark (#13477) @revans2
Fix unsanitized output of scan with strings (#13455) @davidwendt
Reject functions without bytecode from _can_be_jitted in GroupBy Apply (#13429) @brandon-b-miller
Fix decimal scale reductions in _get_decimal_type (#13224) @charlesbluca

📖 Documentation

Fix doxygen groups for io data sources and sinks (#13718) @davidwendt
Add pandas compatibility note to DataFrame.query docstring (#13693) @beckernick
Add pylibcudf to developer guide (#13639) @vyasr
Fix repeated words in doxygen text (#13598) @karthikeyann
Update docs for top-level API. (#13592) @bdice
Fix the the doxygen text for cudf::concatenate and other places (#13561) @davidwendt
Document stream validation approach used in testing (#13556) @vyasr
Cleanup doc repetitions in libcudf (#13470) @karthikeyann

🚀 New Features

Support min and max aggregations for list type in groupby and reduction (#13676) @ttnghia
Add nvtext::jaccard_index API for strings columns (#13669) @davidwendt
Add read_parquet_metadata libcudf API (#13663) @karthikeyann
Expose streams in all public copying APIs (#13629) @vyasr
Add XXHash_64 hash function to cudf (#13612) @davidwendt
Java support: Floating point order-by columns for RANGE window functions (#13595) @mythrocks
Use cuco::static_map to build string dictionaries in ORC writer (#13580) @vuule
Add pylibcudf subpackage with gather implementation (#13562) @vyasr
Add JNI for lists::concatenate_list_elements (#13547) @ttnghia
Enable nested types for lists::concatenate_list_elements (#13545) @ttnghia
Add unicode encoding for string columns in JSON writer (#13539) @karthikeyann
Remove numba kernels from find_index_of_val (#13517) @brandon-b-miller
Floating point order-by columns for RANGE window functions (#13512) @mythrocks
Parse column chunk metadata statistics in parquet reader (#13472) @karthikeyann
Add abs function to apply (#13408) @brandon-b-miller
AST filtering in parquet reader (#13348) @karthikeyann
Adds option to recover from invalid JSON lines in JSON tokenizer (#13344) @elstehle
Ensure cccl packages don't clash with upstream version (#13235) @robertmaynard
Update struct_minmax_util to experimental row comparator (#13069) @divyegala
Add stream parameter to hashing APIs (#12090) @vyasr

🛠️ Improvements

Pin dask and distributed for 23.08 release (#13802) @galipremsagar
Relax protobuf pinnings. (#13770) @bdice
Switch fully unbounded window functions to use aggregations (#13727) @mythrocks
Switch to new wheel building pipeline (#13723) @vyasr
Revert CUDA 12.0 CI workflows to branch-23.08. (#13719) @bdice
Adding identify minimum version requirement (#13713) @hyperbolic2346
Enforce deprecations and add clarifications around existing deprecations (#13710) @galipremsagar
Optimize ORC reader performance for list data (#13708) @vyasr
Fix limit overflow message in a docstring (#13703) @ahmet-uyar
Alleviates JSON parser's need for multi-file sources to end with a newline (#13702) @elstehle
Update cython-lint and replace flake8 with ruff (#13699) @vyasr
Add __dask_tokenize__ definitions to cudf classes (#13695) @rjzamora
Convert libcudf hashing benchmarks to nvbench (#13694) @davidwendt
Separate MurmurHash32 from hash_functions.cuh (#13681) @davidwendt
Improve performance of cudf::strings::split on whitespace (#13680) @davidwendt
Allow ORC and Parquet writers to write nullable columns without nulls as non-nullable (#13675) @vuule
Raise a NotImplementedError in to_datetime when utc is passed (#13670) @shwina
Add rmm_mode parameter to nvbench base fixture (#13668) @davidwendt
Fix multiindex loc ordering in pandas-compat mode (#13660) @wence-
Add nvtext hash_character_ngrams function (#13654) @davidwendt
Avoid storing metadata in pointers in ORC and Parquet writers (#13648) @vuule
Acquire spill lock in to/from_arrow (#13646) @shwina
Expose stable versions of libcudf sort routines (#13634) @wence-
Separate out hash_test.cpp source for each hash API (#13633) @davidwendt
Remove deprecated cudf::strings::slice_strings (by delimiter) functions (#13628) @davidwendt
Create separate libcudf hash APIs for each supported hash function (#13626) @davidwendt
Add convert_dtypes API (#13623) @shwina
Clean up cupy in dependencies.yaml. (#13617) @bdice
Use cuda-version to constrain cudatoolkit. (#13615) @bdice
Add murmurhash3_x64_128 function to libcudf (#13604) @davidwendt
Performance improvement for cudf::strings::like (#13594) @davidwendt
Remove deprecated cudf.set_allocator. (#13591) @bdice
Clean up cudf device atomic with cuda::atomic_ref (#13583) @PointKernel
Add java bindings for distinct count (#13573) @revans2
Use nvcomp conda package. (#13566) @bdice
Add exception to string_scalar if input string exceeds size_type (#13560) @davidwendt
Add dispatch for cudf.DataFrame to/from pyarrow.Table conversion (#13558) @rjzamora
Get rid of cuco::pair_type aliases (#13553) @PointKernel
Introduce parity with pandas when sort=False in Groupby (#13551) @galipremsagar
Update CMake in docker to 3.26.4 (#13550) @NvTimLiu
Clarify source of error message in stream testing. (#13541) @bdice
Deprecate strings_to_categorical in cudf.read_parquet (#13540) @galipremsagar
Update to CMake 3.26.4 (#13538) @vyasr
S3 folder naming fix (#13536) @AyodeAwe
Implement iloc-getitem using parse-don't-validate approach (#13534) @wence-
Make synchronization explicit in the names of hostdevice_* copying APIs (#13530) @ttnghia
Add benchmark (Google Benchmark) dependency to conda packages. (#13528) @bdice
Add libcufile to dependencies.yaml. (#13523) @bdice
Fix some memoization logic in groupby/sort/sort_helper.cu (#13521) @davidwendt
Use sizes_to_offsets_iterator in cudf::gather for strings (#13520) @davidwendt
Use rapids-upload-docs script (#13518) @AyodeAwe
Support UTF-8 BOM in CSV reader (#13516) @davidwendt
Move stream-related test configuration to CMake (#13513) @vyasr
Implement cudf.option_context (#13511) @galipremsagar
Unpin dask and distributed for development (#13508) @galipremsagar
Change build.sh to use pip install instead of setup.py (#13507) @vyasr
Use test default stream (#13506) @vyasr
Remove documentation build scripts for Jenkins (#13495) @ajschmidt8
Use east const in include files (#13494) @karthikeyann
Use east const in src files (#13493) @karthikeyann
Use east const in tests files (#13492) @karthikeyann
Use east const in benchmarks files (#13491) @karthikeyann
Performance improvement for nvtext tokenize/token functions (#13480) @davidwendt
Add pd.Float*Dtype to Avro and ORC mappings (#13475) @mroeschke
Use pandas public APIs where available (#13467) @mroeschke
Allow pd.ArrowDtype in cudf.from_pandas (#13465) @mroeschke
Rework libcudf regex benchmarks with nvbench (#13464) @davidwendt
Remove unused max_rows_tensor parameter from subword tokenizer (#13463) @davidwendt
Separate io-text and nvtext pytests into different files (#13435) @davidwendt
Add a move_to function to cudf::string_view::const_iterator (#13428) @davidwendt
Allow newer scikit-build (#13424) @vyasr
Refactor sort_by_values to sort_values, drop indices from return values. (#13419) @bdice
Inline Cython exception handler (#13411) @vyasr
Init JNI version 23.08.0-SNAPSHOT (#13401) @pxLi
Refactor ORC reader (#13396) @ttnghia
JNI: Remove cleaned objects in memory cleaner (#13378) @res-life
Add tests of currently unsupported indexing (#13338) @wence-
Performance improvement for some libcudf regex functions for long strings (#13322) @davidwendt
Exposure Tracked Buffer (first step towards unifying copy-on-write and spilling) (#13307) @madsbk
Write string data directly to column_buffer in Parquet reader (#13302) @etseidl
Add stacktrace into cudf exception types (#13298) @ttnghia
cuDF: Build CUDA 12 packages (#12922) @bdice
