v23.08.00
🚨 Breaking Changes
- Enforce deprecations and add clarifications around existing deprecations (#13710) @galipremsagar
- Separate MurmurHash32 from hash_functions.cuh (#13681) @davidwendt
- Avoid storing metadata in pointers in ORC and Parquet writers (#13648) @vuule
- Expose streams in all public copying APIs (#13629) @vyasr
- Remove deprecated cudf::strings::slice_strings (by delimiter) functions (#13628) @davidwendt
- Remove deprecated cudf.set_allocator. (#13591) @bdice
- Change build.sh to use pip install instead of setup.py (#13507) @vyasr
- Remove unused max_rows_tensor parameter from subword tokenizer (#13463) @davidwendt
- Fix decimal scale reductions in `_get_decimal_type` (#13224) @charlesbluca
🐛 Bug Fixes
- Add CUDA version to cudf_kafka and libcudf-example build strings. (#13769) @bdice
- Fix typo in wheels-test.yaml. (#13763) @bdice
- Don't test strings shorter than the requested ngram size (#13758) @vyasr
- Add CUDA version to custreamz build string. (#13754) @bdice
- Fix writing of ORC files with empty child string columns (#13745) @vuule
- Remove the erroneous "empty level" short-circuit from ORC reader (#13722) @vuule
- Fix character counting when writing sliced tables into ORC (#13721) @vuule
- Parquet uses row group row count if missing from header (#13712) @hyperbolic2346
- Fix reading of RLE encoded boolean data from parquet files with V2 page headers (#13707) @etseidl
- Fix a corner case of list lexicographic comparator (#13701) @ttnghia
- Fix combined filtering and column projection in `dask_cudf.read_parquet` (#13697) @rjzamora
- Revert fetch-rapids changes (#13696) @vyasr
- Data generator - include offsets in the size estimate of list elements (#13688) @vuule
- Add `cuda-nvcc-impl` to `cudf` for `numba` CUDA 12 (#13673) @jakirkham
- Fix combined filtering and column projection in `read_parquet` (#13666) @rjzamora
- Use `thrust::identity` as hash functions for byte pair encoding (#13665) @PointKernel
- Fix loc-getitem ordering when index contains duplicate labels (#13659) @wence-
- [REVIEW] Introduce parity with pandas for `MultiIndex.loc` ordering & fix a bug in `Groupby` with `as_index` (#13657) @galipremsagar
- Fix memcheck error found in nvtext tokenize functions (#13649) @davidwendt
- Fix `has_nonempty_nulls` ignoring column offset (#13647) @ttnghia
- [Java] Avoid double-free corruption in case of an Exception while creating a ColumnView (#13645) @razajafri
- Fix memcheck error in ORC reader call to cudf::io::copy_uncompressed_kernel (#13643) @davidwendt
- Fix CUDA 12 conda environment to remove cubinlinker and ptxcompiler. (#13636) @bdice
- Fix inf/NaN comparisons for FLOAT orderby in window functions (#13635) @mythrocks
- Refactor `Index` search to simplify code and increase correctness (#13625) @wence-
- Fix compile warning for unused variable in split_re.cu (#13621) @davidwendt
- Fix tz_localize for dask_cudf Series (#13610) @shwina
- Fix issue with no decompressed data in ORC reader (#13609) @vuule
- Fix floating point window range extents. (#13606) @mythrocks
- Fix `localize(None)` for timezone-naive columns (#13603) @shwina
- Fixed a memory leak caused by Exception thrown while constructing a ColumnView (#13597) @razajafri
- Handle nullptr return value from bitmask_or in distinct_count (#13590) @wence-
- Bring parity with pandas in Index.join (#13589) @galipremsagar
- Fix cudf.melt when there are more than 255 columns (#13588) @hcho3
- Fix memory issues in cuIO due to removal of memory padding (#13586) @ttnghia
- Fix Parquet multi-file reading (#13584) @etseidl
- Fix memcheck error found in LISTS_TEST (#13579) @davidwendt
- Fix memcheck error found in STRINGS_TEST (#13578) @davidwendt
- Fix memcheck error found in INTEROP_TEST (#13577) @davidwendt
- Fix memcheck errors found in REDUCTION_TEST (#13574) @davidwendt
- Preemptive fix for hive-partitioning change in dask (#13564) @rjzamora
- Fix an issue with `dask_cudf.read_csv` when lines are needed to be skipped (#13555) @galipremsagar
- Fix out-of-bounds memory write in cudf::dictionary::detail::concatenate (#13554) @davidwendt
- Fix the null mask size in json reader (#13537) @karthikeyann
- Fix cudf::strings::strip for all-empty input column (#13533) @davidwendt
- Make sure to build without isolation or installing dependencies (#13524) @vyasr
- Remove preload lib from CMake for now (#13519) @vyasr
- Fix missing separator after null values in JSON writer (#13503) @karthikeyann
- Ensure `single_lane_block_sum_reduce` is safe to call in a loop (#13488) @wence-
- Update all versions in pyproject.toml files. (#13486) @bdice
- Remove applying nvbench that doesn't exist in 23.08 (#13484) @robertmaynard
- Fix chunked Parquet reader benchmark (#13482) @vuule
- Update JNI JSON reader column compatibility for Spark (#13477) @revans2
- Fix unsanitized output of scan with strings (#13455) @davidwendt
- Reject functions without bytecode from `_can_be_jitted` in GroupBy Apply (#13429) @brandon-b-miller
- Fix decimal scale reductions in `_get_decimal_type` (#13224) @charlesbluca
📖 Documentation
- Fix doxygen groups for io data sources and sinks (#13718) @davidwendt
- Add pandas compatibility note to DataFrame.query docstring (#13693) @beckernick
- Add pylibcudf to developer guide (#13639) @vyasr
- Fix repeated words in doxygen text (#13598) @karthikeyann
- Update docs for top-level API. (#13592) @bdice
- Fix the the doxygen text for cudf::concatenate and other places (#13561) @davidwendt
- Document stream validation approach used in testing (#13556) @vyasr
- Cleanup doc repetitions in libcudf (#13470) @karthikeyann
🚀 New Features
- Support `min` and `max` aggregations for list type in groupby and reduction (#13676) @ttnghia
- Add nvtext::jaccard_index API for strings columns (#13669) @davidwendt
- Add read_parquet_metadata libcudf API (#13663) @karthikeyann
- Expose streams in all public copying APIs (#13629) @vyasr
- Add XXHash_64 hash function to cudf (#13612) @davidwendt
- Java support: Floating point order-by columns for RANGE window functions (#13595) @mythrocks
- Use `cuco::static_map` to build string dictionaries in ORC writer (#13580) @vuule
- Add pylibcudf subpackage with gather implementation (#13562) @vyasr
- Add JNI for `lists::concatenate_list_elements` (#13547) @ttnghia
- Enable nested types for `lists::concatenate_list_elements` (#13545) @ttnghia
- Add unicode encoding for string columns in JSON writer (#13539) @karthikeyann
- Remove numba kernels from `find_index_of_val` (#13517) @brandon-b-miller
- Floating point order-by columns for RANGE window functions (#13512) @mythrocks
- Parse column chunk metadata statistics in parquet reader (#13472) @karthikeyann
- Add `abs` function to apply (#13408) @brandon-b-miller
- [FEA] AST filtering in parquet reader (#13348) @karthikeyann
- [FEA] Adds option to recover from invalid JSON lines in JSON tokenizer (#13344) @elstehle
- Ensure cccl packages don't clash with upstream version (#13235) @robertmaynard
- Update `struct_minmax_util` to experimental row comparator (#13069) @divyegala
- Add stream parameter to hashing APIs (#12090) @vyasr
🛠️ Improvements
- Pin `dask` and `distributed` for `23.08` release (#13802) @galipremsagar
- Relax protobuf pinnings. (#13770) @bdice
- Switch fully unbounded window functions to use aggregations (#13727) @mythrocks
- Switch to new wheel building pipeline (#13723) @vyasr
- Revert CUDA 12.0 CI workflows to branch-23.08. (#13719) @bdice
- Adding identify minimum version requirement (#13713) @hyperbolic2346
- Enforce deprecations and add clarifications around existing deprecations (#13710) @galipremsagar
- Optimize ORC reader performance for list data (#13708) @vyasr
- fix limit overflow message in a docstring (#13703) @ahmet-uyar
- Alleviates JSON parser's need for multi-file sources to end with a newline (#13702) @elstehle
- Update cython-lint and replace flake8 with ruff (#13699) @vyasr
- Add `__dask_tokenize__` definitions to cudf classes (#13695) @rjzamora
- Convert libcudf hashing benchmarks to nvbench (#13694) @davidwendt
- Separate MurmurHash32 from hash_functions.cuh (#13681) @davidwendt
- Improve performance of cudf::strings::split on whitespace (#13680) @davidwendt
- Allow ORC and Parquet writers to write nullable columns without nulls as non-nullable (#13675) @vuule
- Raise a NotImplementedError in to_datetime when utc is passed (#13670) @shwina
- Add rmm_mode parameter to nvbench base fixture (#13668) @davidwendt
- Fix multiindex loc ordering in pandas-compat mode (#13660) @wence-
- Add nvtext hash_character_ngrams function (#13654) @davidwendt
- Avoid storing metadata in pointers in ORC and Parquet writers (#13648) @vuule
- Acquire spill lock in to/from_arrow (#13646) @shwina
- Expose stable versions of libcudf sort routines (#13634) @wence-
- Separate out hash_test.cpp source for each hash API (#13633) @davidwendt
- Remove deprecated cudf::strings::slice_strings (by delimiter) functions (#13628) @davidwendt
- Create separate libcudf hash APIs for each supported hash function (#13626) @davidwendt
- Add convert_dtypes API (#13623) @shwina
- Clean up cupy in dependencies.yaml. (#13617) @bdice
- Use cuda-version to constrain cudatoolkit. (#13615) @bdice
- Add murmurhash3_x64_128 function to libcudf (#13604) @davidwendt
- Performance improvement for cudf::strings::like (#13594) @davidwendt
- Remove deprecated cudf.set_allocator. (#13591) @bdice
- Clean up cudf device atomic with `cuda::atomic_ref` (#13583) @PointKernel
- Add java bindings for distinct count (#13573) @revans2
- Use nvcomp conda package. (#13566) @bdice
- Add exception to string_scalar if input string exceeds size_type (#13560) @davidwendt
- Add dispatch for `cudf.Dataframe` to/from `pyarrow.Table` conversion (#13558) @rjzamora
- Get rid of `cuco::pair_type` aliases (#13553) @PointKernel
- Introduce parity with pandas when `sort=False` in `Groupby` (#13551) @galipremsagar
- Update CMake in docker to 3.26.4 (#13550) @NvTimLiu
- Clarify source of error message in stream testing. (#13541) @bdice
- Deprecate `strings_to_categorical` in `cudf.read_parquet` (#13540) @galipremsagar
- Update to CMake 3.26.4 (#13538) @vyasr
- s3 folder naming fix (#13536) @AyodeAwe
- Implement iloc-getitem using parse-don't-validate approach (#13534) @wence-
- Make synchronization explicit in the names of `hostdevice_*` copying APIs (#13530) @ttnghia
- Add benchmark (Google Benchmark) dependency to conda packages. (#13528) @bdice
- Add libcufile to dependencies.yaml. (#13523) @bdice
- Fix some memoization logic in groupby/sort/sort_helper.cu (#13521) @davidwendt
- Use sizes_to_offsets_iterator in cudf::gather for strings (#13520) @davidwendt
- use rapids-upload-docs script (#13518) @AyodeAwe
- Support UTF-8 BOM in CSV reader (#13516) @davidwendt
- Move stream-related test configuration to CMake (#13513) @vyasr
- Implement `cudf.option_context` (#13511) @galipremsagar
- Unpin `dask` and `distributed` for development (#13508) @galipremsagar
- Change build.sh to use pip install instead of setup.py (#13507) @vyasr
- Use test default stream (#13506) @vyasr
- Remove documentation build scripts for Jenkins (#13495) @ajschmidt8
- Use east const in include files (#13494) @karthikeyann
- Use east const in src files (#13493) @karthikeyann
- Use east const in tests files (#13492) @karthikeyann
- Use east const in benchmarks files (#13491) @karthikeyann
- Performance improvement for nvtext tokenize/token functions (#13480) @davidwendt
- Add pd.Float*Dtype to Avro and ORC mappings (#13475) @mroeschke
- Use pandas public APIs where available (#13467) @mroeschke
- Allow pd.ArrowDtype in cudf.from_pandas (#13465) @mroeschke
- Rework libcudf regex benchmarks with nvbench (#13464) @davidwendt
- Remove unused max_rows_tensor parameter from subword tokenizer (#13463) @davidwendt
- Separate io-text and nvtext pytests into different files (#13435) @davidwendt
- Add a move_to function to cudf::string_view::const_iterator (#13428) @davidwendt
- Allow newer scikit-build (#13424) @vyasr
- Refactor sort_by_values to sort_values, drop indices from return values. (#13419) @bdice
- Inline Cython exception handler (#13411) @vyasr
- Init JNI version 23.08.0-SNAPSHOT (#13401) @pxLi
- Refactor ORC reader (#13396) @ttnghia
- JNI: Remove cleaned objects in memory cleaner (#13378) @res-life
- Add tests of currently unsupported indexing (#13338) @wence-
- Performance improvement for some libcudf regex functions for long strings (#13322) @davidwendt
- Exposure Tracked Buffer (first step towards unifying copy-on-write and spilling) (#13307) @madsbk
- Write string data directly to column_buffer in Parquet reader (#13302) @etseidl
- Add stacktrace into cudf exception types (#13298) @ttnghia
- cuDF: Build CUDA 12 packages (#12922) @bdice