Release v24.02.01 · rapidsai/cudf

🚨 Breaking Changes

Remove **kwargs from astype (#14765) @mroeschke
Remove mimesis as a testing dependency (#14723) @mroeschke
Update to Dask's shuffle_method kwarg (#14708) @pentschev
Drop Pascal GPU support. (#14630) @bdice
Update to CCCL 2.2.0. (#14576) @bdice
Expunge as_frame conversions in Column algorithms (#14491) @wence-
Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
Include writer code and writerVersion in ORC files (#14458) @vuule
Remove null mask for zero nulls in json readers (#14451) @karthikeyann
REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
Move chars column to parent data buffer in strings column (#14202) @karthikeyann
Switch to scikit-build-core (#13531) @vyasr

🐛 Bug Fixes

[HOTFIX] Unpin numba<0.58 (#15031) @raydouglass
Exclude tests from builds (#14981) @vyasr
Fix the bounce buffer size in ORC writer (#14947) @vuule
Revert sum/product aggregation to always produce int64_t type (#14907) @SurajAralihalli
Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
Fix index difference to follow the pandas format (#14789) @amiralimi
Fix shared-workflows repo name (#14784) @raydouglass
Remove unparseable attributes from all nodes (#14780) @vyasr
Refactor and add validation to IntervalIndex.__init__ (#14778) @mroeschke
Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
Fix calls to deprecated strings factory API (#14771) @davidwendt
Fix ptx file discovery in editable installs (#14767) @vyasr
Revise shuffle deprecation to align with dask/dask (#14762) @rjzamora
Enable intermediate proxies to be picklable (#14752) @shwina
Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
Fix CMake args (#14746) @vyasr
Fix logic bug introduced in #14730 (#14742) @wence-
[Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
Fix Groupby.get_group (#14728) @rjzamora
Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
Split cuda versions for notebook testing (#14722) @raydouglass
Fix to_numeric not preserving Series index and name (#14718) @mroeschke
Update dask-cudf wheel name (#14713) @raydouglass
Fix strings::contains matching end of string target (#14711) @davidwendt
Update to Dask's shuffle_method kwarg (#14708) @pentschev
Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
Potential fix for performance regression in #14415 (#14706) @etseidl
Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
Skip numba test that fails on ARM (#14702) @brandon-b-miller
Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
Add BaseOffset as a final proxy type to pass instancechecks for offsets against BaseOffset (#14678) @shwina
Add row conversion code from spark-rapids-jni (#14664) @ttnghia
Unconditionally export the CCCL path (#14656) @vyasr
Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
Fix invalid memory access in Parquet reader (#14637) @etseidl
Use column_empty over as_column([]) (#14632) @mroeschke
Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
Remove cuda::proclaim_return_type from nested lambda (#14607) @ttnghia
Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
Address potential race conditions in Parquet reader (#14602) @etseidl
Fix DataFrame.reindex removing column name (#14601) @mroeschke
Remove unsanitized input test data from copy gtests (#14600) @davidwendt
Fix race detected in Parquet writer (#14598) @etseidl
Correct invalid or missing return types (#14587) @robertmaynard
Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
Fix unsanitized nulls produced by cudf::clamp APIs (#14580) @davidwendt
Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
Fixes a symbol group lookup table issue (#14561) @elstehle
Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
Improve memory footprint of isin by using contains (#14478) @wence-
Move creation of env.yaml outside the current directory (#14476) @davidwendt
Enable pd.Timestamp objects to be picklable when cudf.pandas is active (#14474) @shwina
Correct dtype of count aggregations on empty dataframes (#14473) @wence-
Avoid DataFrame conversion in MultiIndex.from_pandas (#14470) @mroeschke
JSON writer: avoid default stream use in string_scalar constructors (#14444) @vuule
Fix default stream use in the CSV reader (#14443) @vuule
Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
Defer PTX file load to runtime (#13690) @brandon-b-miller

📖 Documentation

Disable parallel build (#14796) @vyasr
Add pylibcudf to the docs (#14791) @vyasr
Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
More doxygen fixes (#14639) @vyasr
Enable doxygen XML generation and fix issues (#14477) @vyasr
Some doxygen improvements (#14469) @vyasr
Remove warning in dask-cudf docs (#14454) @wence-
Update README links with redirects. (#14378) @bdice
Add pip install instructions to README (#13677) @shwina

🚀 New Features

Add ci check for external kernels (#14768) @robertmaynard
JSON single quote normalization API (#14729) @shrshi
Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
Don't constrain numba<0.58 (#14616) @brandon-b-miller
Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
JSON quote normalization (#14545) @shrshi
Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
Implement more copying APIs in pylibcudf (#14508) @vyasr
Include writer code and writerVersion in ORC files (#14458) @vuule
Parquet sub-rowgroup reading. (#14360) @nvdbaranec
Move chars column to parent data buffer in strings column (#14202) @karthikeyann
PARQUET-2261 Size Statistics (#14000) @etseidl
Improve GroupBy JIT error handling (#13854) @brandon-b-miller
Generate unified Python/C++ docs (#13846) @vyasr
Expand JIT groupby test suite (#13813) @brandon-b-miller

🛠️ Improvements

Pin pytest<8 (#14920) @galipremsagar
Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
Clean up TimedeltaIndex.__init__ constructor (#14775) @mroeschke
Clean up DatetimeIndex.__init__ constructor (#14774) @mroeschke
Some frame.py typing, move seldom used methods in frame.py (#14766) @mroeschke
Remove **kwargs from astype (#14765) @mroeschke
fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
Add pynvjitlink as a dependency (#14763) @brandon-b-miller
Resolve degenerate performance in create_structs_data (#14761) @SurajAralihalli
Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
Pin pytest-cases<3.8.2 (#14756) @mroeschke
Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
Consolidate cudf object handling in as_column (#14754) @mroeschke
Reduce execution time of Parquet C++ tests (#14750) @vuule
Implement to_datetime(..., utc=True) (#14749) @mroeschke
Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
Implement cudf.MultiIndex.from_arrays (#14740) @mroeschke
Remove unused/single use methods (#14739) @mroeschke
refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
Remove unneeded methods in Column (#14730) @mroeschke
Clean up base column methods (#14725) @mroeschke
Ensure column.fillna signatures are consistent (#14724) @mroeschke
Remove mimesis as a testing dependency (#14723) @mroeschke
Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
Use offsetalator in gather_chars (#14700) @davidwendt
Use make_strings_children for fill() specialization logic (#14697) @davidwendt
Change io::detail::orc namespace into io::orc::detail (#14696) @ttnghia
Fix call to deprecated factory function (#14695) @davidwendt
Use as_column instead of arange for range like inputs (#14689) @mroeschke
Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
Split parquet test into multiple files (#14663) @etseidl
Custom error messages for IO with nonexistent files (#14662) @vuule
Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
Basic validation in reader benchmarks (#14647) @vuule
Update dependencies.yaml to support CUDA 12.*. (#14644) @bdice
Consolidate memoryview handling in as_column (#14643) @mroeschke
Convert FieldType to scoped enum (#14642) @vuule
Use isinstance over is_foo_dtype (#14641) @mroeschke
Use isinstance over is_foo_dtype internally (#14638) @mroeschke
Remove unnecessary **kwargs in function signatures (#14635) @mroeschke
Drop nvbench patch for nvml. (#14631) @bdice
Drop Pascal GPU support. (#14630) @bdice
Add cpp/doxygen/xml to .gitignore (#14613) @davidwendt
Create strings-specific make_offsets_child_column for multiple offset types (#14612) @davidwendt
Use the offsetalator in cudf::concatenate for strings (#14611) @davidwendt
Make Parquet ColumnIndex null_counts optional (#14596) @etseidl
Support freq in DatetimeIndex (#14593) @shwina
Remove legacy benchmarks for cuDF-python (#14591) @osidekyle
Remove WORKSPACE env var from cudf_test temp_directory class (#14588) @davidwendt
Use exceptions instead of return values to handle errors in CompactProtocolReader (#14582) @vuule
Use cuda::proclaim_return_type on device lambdas. (#14577) @bdice
Update to CCCL 2.2.0. (#14576) @bdice
Update dependencies.yaml to new pip index (#14575) @vyasr
Simplify Python CMake (#14565) @vyasr
Java expose parquet pass_read_limit (#14564) @revans2
Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_* macros (#14559) @SurajAralihalli
Use cudf_test temp_directory class for nvtext::subword_tokenize gbenchmark (#14558) @davidwendt
Fix return type of prefix increment overloads (#14544) @vuule
Make bpe_merge_pairs_impl member private (#14543) @davidwendt
Small clean up in io::statistics (#14542) @vuule
Change json gtest environment variable to compile-time definition (#14541) @davidwendt
Remove extra total chars size calculation from cudf::concatenate (#14540) @davidwendt
Refactor IndexedFrame.hash_values to use cudf::hashing functions, add xxhash64 to cudf Python. (#14538) @bdice
Move non-templated inline function definitions from table_view.hpp to table_view.cpp (#14535) @davidwendt
Add JNI for strings::code_points (#14533) @thirtiseven
Add a test for issue 12773 (#14529) @vyasr
Split libarrow build dependencies. (#14506) @bdice
Implement IndexedFrame.duplicated with distinct_indices + scatter (#14493) @wence-
Expunge as_frame conversions in Column algorithms (#14491) @wence-
Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
Refactor Parquet kernel_error (#14464) @etseidl
Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
Expose stream parameter in public nvtext APIs (#14456) @davidwendt
Include encode type in the error message when unsupported Parquet encoding is detected (#14453) @ZelboK
Remove null mask for zero nulls in json readers (#14451) @karthikeyann
Refactor cudf.Series.__init__ (#14450) @mroeschke
Remove the use of volatile in Parquet (#14448) @vuule
REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
Testing stream pool implementation (#14437) @shrshi
Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
Use isinstance(..., cudf.IntervalDtype) instead of is_interval_dtype (#14424) @mroeschke
Use isinstance(..., cudf.CategoricalDtype) instead of is_categorical_dtype (#14423) @mroeschke
Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
REF: Remove instances of pd.core (#14421) @mroeschke
Expose streams in public filling APIs for label_bins (#14401) @ZelboK
Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded (#14392) @etseidl
Add SHA-1 and SHA-2 hash functions. (#14391) @bdice
Expose streams in Parquet reader and writer APIs (#14359) @shrshi
Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
Replace default stream for scalars and column factories usages (because of defaulted arguments) (#14354) @karthikeyann
Expose streams in ORC reader and writer APIs (#14350) @shrshi
Convert compression and io to string axis type in IO benchmarks (#14347) @SurajAralihalli
Add cuDF devcontainers (#14015) @trxcllnt
Refactoring of Buffers (last step towards unifying COW and Spilling) (#13801) @madsbk
Switch to scikit-build-core (#13531) @vyasr
Simplify null count checking in column equality comparator (#13312) @vyasr
