v24.02.01
🚨 Breaking Changes
- Remove **kwargs from astype (#14765) @mroeschke
- Remove mimesis as a testing dependency (#14723) @mroeschke
- Update to Dask's shuffle_method kwarg (#14708) @pentschev
- Drop Pascal GPU support. (#14630) @bdice
- Update to CCCL 2.2.0. (#14576) @bdice
- Expunge as_frame conversions in Column algorithms (#14491) @wence-
- Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
- Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
- Include writer code and writerVersion in ORC files (#14458) @vuule
- Remove null mask for zero nulls in json readers (#14451) @karthikeyann
- REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
- Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
- Move chars column to parent data buffer in strings column (#14202) @karthikeyann
- Switch to scikit-build-core (#13531) @vyasr
🐛 Bug Fixes
- [HOTFIX] Unpin numba<0.58 (#15031) @raydouglass
- Exclude tests from builds (#14981) @vyasr
- Fix the bounce buffer size in ORC writer (#14947) @vuule
- Revert sum/product aggregation to always produce int64_t type (#14907) @SurajAralihalli
- Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
- Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
- Fix index difference to follow the pandas format (#14789) @amiralimi
- Fix shared-workflows repo name (#14784) @raydouglass
- Remove unparseable attributes from all nodes (#14780) @vyasr
- Refactor and add validation to IntervalIndex.__init__ (#14778) @mroeschke
- Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
- Fix calls to deprecated strings factory API (#14771) @davidwendt
- Fix ptx file discovery in editable installs (#14767) @vyasr
- Revise shuffle deprecation to align with dask/dask (#14762) @rjzamora
- Enable intermediate proxies to be picklable (#14752) @shwina
- Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
- Fix CMake args (#14746) @vyasr
- Fix logic bug introduced in #14730 (#14742) @wence-
- [Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
- Fix Groupby.get_group (#14728) @rjzamora
- Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
- Split cuda versions for notebook testing (#14722) @raydouglass
- Fix to_numeric not preserving Series index and name (#14718) @mroeschke
- Update dask-cudf wheel name (#14713) @raydouglass
- Fix strings::contains matching end of string target (#14711) @davidwendt
- Update to Dask's shuffle_method kwarg (#14708) @pentschev
- Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
- Potential fix for performance regression in #14415 (#14706) @etseidl
- Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
- Skip numba test that fails on ARM (#14702) @brandon-b-miller
- Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
- Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
- Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
- Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
- Add BaseOffset as a final proxy type to pass instancechecks for offsets against BaseOffset (#14678) @shwina
- Add row conversion code from spark-rapids-jni (#14664) @ttnghia
- Unconditionally export the CCCL path (#14656) @vyasr
- Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
- Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
- Fix invalid memory access in Parquet reader (#14637) @etseidl
- Use column_empty over as_column([]) (#14632) @mroeschke
- Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
- Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
- Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
- Remove cuda::proclaim_return_type from nested lambda (#14607) @ttnghia
- Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
- Address potential race conditions in Parquet reader (#14602) @etseidl
- Fix DataFrame.reindex removing column name (#14601) @mroeschke
- Remove unsanitized input test data from copy gtests (#14600) @davidwendt
- Fix race detected in Parquet writer (#14598) @etseidl
- Correct invalid or missing return types (#14587) @robertmaynard
- Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
- Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
- Fix unsanitized nulls produced by cudf::clamp APIs (#14580) @davidwendt
- Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
- Fixes a symbol group lookup table issue (#14561) @elstehle
- Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
- REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
- Improve memory footprint of isin by using contains (#14478) @wence-
- Move creation of env.yaml outside the current directory (#14476) @davidwendt
- Enable pd.Timestamp objects to be picklable when cudf.pandas is active (#14474) @shwina
- Correct dtype of count aggregations on empty dataframes (#14473) @wence-
- Avoid DataFrame conversion in MultiIndex.from_pandas (#14470) @mroeschke
- JSON writer: avoid default stream use in string_scalar constructors (#14444) @vuule
- Fix default stream use in the CSV reader (#14443) @vuule
- Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
- Defer PTX file load to runtime (#13690) @brandon-b-miller
📖 Documentation
- Disable parallel build (#14796) @vyasr
- Add pylibcudf to the docs (#14791) @vyasr
- Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
- Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
- More doxygen fixes (#14639) @vyasr
- Enable doxygen XML generation and fix issues (#14477) @vyasr
- Some doxygen improvements (#14469) @vyasr
- Remove warning in dask-cudf docs (#14454) @wence-
- Update README links with redirects. (#14378) @bdice
- Add pip install instructions to README (#13677) @shwina
🚀 New Features
- Add ci check for external kernels (#14768) @robertmaynard
- JSON single quote normalization API (#14729) @shrshi
- Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
- Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
- Don't constrain numba<0.58 (#14616) @brandon-b-miller
- Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
- JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
- JSON quote normalization (#14545) @shrshi
- Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
- Implement more copying APIs in pylibcudf (#14508) @vyasr
- Include writer code and writerVersion in ORC files (#14458) @vuule
- Parquet sub-rowgroup reading. (#14360) @nvdbaranec
- Move chars column to parent data buffer in strings column (#14202) @karthikeyann
- PARQUET-2261 Size Statistics (#14000) @etseidl
- Improve GroupBy JIT error handling (#13854) @brandon-b-miller
- Generate unified Python/C++ docs (#13846) @vyasr
- Expand JIT groupby test suite (#13813) @brandon-b-miller
🛠️ Improvements
- Pin pytest<8 (#14920) @galipremsagar
- Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
- Clean up TimedeltaIndex.__init__ constructor (#14775) @mroeschke
- Clean up DatetimeIndex.__init__ constructor (#14774) @mroeschke
- Some frame.py typing, move seldom used methods in frame.py (#14766) @mroeschke
- Remove **kwargs from astype (#14765) @mroeschke
- fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
- Add pynvjitlink as a dependency (#14763) @brandon-b-miller
- Resolve degenerate performance in create_structs_data (#14761) @SurajAralihalli
- Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
- Pin pytest-cases<3.8.2 (#14756) @mroeschke
- Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
- Consolidate cudf object handling in as_column (#14754) @mroeschke
- Reduce execution time of Parquet C++ tests (#14750) @vuule
- Implement to_datetime(..., utc=True) (#14749) @mroeschke
- Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
- Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
- Implement cudf.MultiIndex.from_arrays (#14740) @mroeschke
- Remove unused/single use methods (#14739) @mroeschke
- refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
- Remove unneeded methods in Column (#14730) @mroeschke
- Clean up base column methods (#14725) @mroeschke
- Ensure column.fillna signatures are consistent (#14724) @mroeschke
- Remove mimesis as a testing dependency (#14723) @mroeschke
- Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
- Use offsetalator in gather_chars (#14700) @davidwendt
- Use make_strings_children for fill() specialization logic (#14697) @davidwendt
- Change io::detail::orc namespace into io::orc::detail (#14696) @ttnghia
- Fix call to deprecated factory function (#14695) @davidwendt
- Use as_column instead of arange for range like inputs (#14689) @mroeschke
- Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
- Split parquet test into multiple files (#14663) @etseidl
- Custom error messages for IO with nonexistent files (#14662) @vuule
- Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
- Basic validation in reader benchmarks (#14647) @vuule
- Update dependencies.yaml to support CUDA 12.*. (#14644) @bdice
- Consolidate memoryview handling in as_column (#14643) @mroeschke
- Convert FieldType to scoped enum (#14642) @vuule
- Use instance over is_foo_dtype (#14641) @mroeschke
- Use isinstance over is_foo_dtype internally (#14638) @mroeschke
- Remove unnecessary **kwargs in function signatures (#14635) @mroeschke
- Drop nvbench patch for nvml. (#14631) @bdice
- Drop Pascal GPU support. (#14630) @bdice
- Add cpp/doxygen/xml to .gitignore (#14613) @davidwendt
- Create strings-specific make_offsets_child_column for multiple offset types (#14612) @davidwendt
- Use the offsetalator in cudf::concatenate for strings (#14611) @davidwendt
- Make Parquet ColumnIndex null_counts optional (#14596) @etseidl
- Support freq in DatetimeIndex (#14593) @shwina
- Remove legacy benchmarks for cuDF-python (#14591) @osidekyle
- Remove WORKSPACE env var from cudf_test temp_directory class (#14588) @davidwendt
- Use exceptions instead of return values to handle errors in CompactProtocolReader (#14582) @vuule
- Use cuda::proclaim_return_type on device lambdas. (#14577) @bdice
- Update to CCCL 2.2.0. (#14576) @bdice
- Update dependencies.yaml to new pip index (#14575) @vyasr
- Simplify Python CMake (#14565) @vyasr
- Java expose parquet pass_read_limit (#14564) @revans2
- Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_* macros (#14559) @SurajAralihalli
- Use cudf_test temp_directory class for nvtext::subword_tokenize gbenchmark (#14558) @davidwendt
- Fix return type of prefix increment overloads (#14544) @vuule
- Make bpe_merge_pairs_impl member private (#14543) @davidwendt
- Small clean up in io::statistics (#14542) @vuule
- Change json gtest environment variable to compile-time definition (#14541) @davidwendt
- Remove extra total chars size calculation from cudf::concatenate (#14540) @davidwendt
- Refactor IndexedFrame.hash_values to use cudf::hashing functions, add xxhash64 to cudf Python. (#14538) @bdice
- Move non-templated inline function definitions from table_view.hpp to table_view.cpp (#14535) @davidwendt
- Add JNI for strings::code_points (#14533) @thirtiseven
- Add a test for issue 12773 (#14529) @vyasr
- Split libarrow build dependencies. (#14506) @bdice
- Implement IndexedFrame.duplicated with distinct_indices + scatter (#14493) @wence-
- Expunge as_frame conversions in Column algorithms (#14491) @wence-
- Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
- Refactor Parquet kernel_error (#14464) @etseidl
- Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
- Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
- Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
- Expose stream parameter in public nvtext APIs (#14456) @davidwendt
- Include encode type in the error message when unsupported Parquet encoding is detected (#14453) @ZelboK
- Remove null mask for zero nulls in json readers (#14451) @karthikeyann
- Refactor cudf.Series.__init__ (#14450) @mroeschke
- Remove the use of volatile in Parquet (#14448) @vuule
- REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
- Testing stream pool implementation (#14437) @shrshi
- Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
- Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
- Use isinstance(..., cudf.IntervalDtype) instead of is_interval_dtype (#14424) @mroeschke
- Use isinstance(..., cudf.CategoricalDtype) instead of is_categorical_dtype (#14423) @mroeschke
- Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
- REF: Remove instances of pd.core (#14421) @mroeschke
- Expose streams in public filling APIs for label_bins (#14401) @ZelboK
- Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
- Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded (#14392) @etseidl
- Add SHA-1 and SHA-2 hash functions. (#14391) @bdice
- Expose streams in Parquet reader and writer APIs (#14359) @shrshi
- Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
- Replace default stream for scalars and column factories usages (because of defaulted arguments) (#14354) @karthikeyann
- Expose streams in ORC reader and writer APIs (#14350) @shrshi
- Convert compression and io to string axis type in IO benchmarks (#14347) @SurajAralihalli
- Add cuDF devcontainers (#14015) @trxcllnt
- Refactoring of Buffers (last step towards unifying COW and Spilling) (#13801) @madsbk
- Switch to scikit-build-core (#13531) @vyasr
- Simplify null count checking in column equality comparator (#13312) @vyasr