v24.06.01
🚨 Breaking Changes
- Deprecate `Groupby.collect` (#15808) @galipremsagar
- Raise FileNotFoundError when a literal JSON string that looks like a json filename is passed (#15806) @lithomas1
- Support filtered I/O in `chunked_parquet_reader` and simplify the use of `parquet_reader_options` (#15764) @mhaseeb123
- Raise errors for unsupported operations on certain types (#15712) @galipremsagar
- Support `DurationType` in cudf parquet reader via `arrow:schema` (#15617) @mhaseeb123
- Remove protobuf and use parsed ORC statistics from libcudf (#15564) @bdice
- Remove legacy JSON reader from Python (#15538) @bdice
- Removing all batching code from parquet writer (#15528) @mhaseeb123
- Convert libcudf resource parameters to rmm::device_async_resource_ref (#15507) @harrism
- Remove deprecated strings offsets_begin (#15454) @davidwendt
- Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
- Bind `read_parquet_metadata` API to libcudf instead of pyarrow and extract `RowGroup` information (#15398) @mhaseeb123
- Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
- Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
- [FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
- Align date_range defaults with pandas, support tz (#15139) @mroeschke
🐛 Bug Fixes
- Backport: Use size_t to allow large conditional joins (#16127) (#16133) @bdice
- Backport #16045 to 24.06 (#16102) @vyasr
- Backport #16038 to 24.06 (#16101) @vyasr
- Backport: Fix segfault in conditional join (#16094) (#16100) @bdice
- Add patch for incorrect cuco noexcept clauses (#16077) @vyasr
- Revert "Fix docs for IO readers and strings_convert" (#15872) @vyasr
- Remove problematic call of index setter to unblock dask-cuda CI (#15844) @charlesbluca
- Use rapids_cpm_nvtx3 to get same nvtx3 target state as rmm (#15840) @robertmaynard
- Return boolean from config_host_memory_resource instead of throwing (#15815) @abellina
- Add temporary dask-cudf workaround for categorical sorting (#15801) @rjzamora
- Fix row group alignment in ORC writer (#15789) @vuule
- Raise error when sorting by categorical column in dask-cudf (#15788) @rjzamora
- Upgrade `arrow` to 16.1 (#15787) @galipremsagar
- Add support for `PandasArray` for `pandas<2.1.0` (#15786) @galipremsagar
- Limit runtime dependency to `libarrow>=16.0.0,<16.1.0a0` (#15782) @pentschev
- Fix cat.as_ordered not propagating correct size (#15780) @mroeschke
- Handle mixed-like homogeneous types in `isin` (#15771) @galipremsagar
- Fix id_vars and value_vars not accepting string scalars in melt (#15765) @mroeschke
- Fix `DatetimeIndex.loc` for all types of ordering cases (#15761) @galipremsagar
- Fix arrow versioning logic (#15755) @vyasr
- Avoid running sanitizer on Java test designed to cause an error (#15753) @jlowe
- Handle empty dataframe object with index present in setitem of `loc` (#15752) @galipremsagar
- Eliminate circular reference in DataFrame/Series.iloc/loc (#15749) @mroeschke
- Cap the absolute row index per pass in parquet chunked reader. (#15735) @nvdbaranec
- Fix `Index.repeat` for `datetime64` types (#15722) @galipremsagar
- Fix multibyte check for case convert for large strings (#15721) @davidwendt
- Fix `get_loc` to properly fetch results from an index that is in decreasing order (#15719) @galipremsagar
- Return same type as the original index for `.loc` operations (#15717) @galipremsagar
- Correct static builds + static arrow (#15715) @robertmaynard
- Raise errors for unsupported operations on certain types (#15712) @galipremsagar
- Fix ColumnAccessor caching of nrows if empty previously (#15710) @mroeschke
- Allow `None` when `nan_as_null=False` in column constructor (#15709) @galipremsagar
- Refine `CudaTest.testCudaException` in case throwing wrong type of CudaError under aarch64 (#15706) @sperlingxx
- Fix maxima of categorical column (#15701) @rjzamora
- Add proxy for inplace operations in `cudf.pandas` (#15695) @galipremsagar
- Make `nan_as_null` behavior consistent across all APIs (#15692) @galipremsagar
- Fix CI s3 api command to fetch latest results (#15687) @galipremsagar
- Add `NumpyExtensionArray` proxy type in `cudf.pandas` (#15686) @galipremsagar
- Properly implement binaryops for proxy types (#15684) @galipremsagar
- Fix copy assignment and the comparison operator of `rmm_host_allocator` (#15677) @vuule
- Fix multi-source reading in JSON byte range reader (#15671) @shrshi
- Return `int64` when pandas compatible mode is turned on for `get_indexer` (#15659) @galipremsagar
- Fix Index contains for error validations and float vs int comparisons (#15657) @galipremsagar
- Preserve sub-second data for time scalars in column construction (#15655) @galipremsagar
- Check row limit size in cudf::strings::join_strings (#15643) @davidwendt
- Enable sorting on column with nulls using query-planning (#15639) @rjzamora
- Fix operator precedence problem in Parquet reader (#15638) @etseidl
- Fix decoding of dictionary encoded FIXED_LEN_BYTE_ARRAY data in Parquet reader (#15601) @etseidl
- Fix debug warnings/errors in from_arrow_device_test.cpp (#15596) @davidwendt
- Add "collect" aggregation support to dask-cudf (#15593) @rjzamora
- Fix categorical-accessor support and testing in dask-cudf (#15591) @rjzamora
- Disable compute-sanitizer usage in CI tests with CUDA<11.6 (#15584) @davidwendt
- Preserve RangeIndex.step in to_arrow/from_arrow (#15581) @mroeschke
- Ignore new cupy warning (#15574) @vyasr
- Add cuda-sanitizer-api dependency for test-cpp matrix 11.4 (#15573) @davidwendt
- Allow apply udf to reference global modules in cudf.pandas (#15569) @mroeschke
- Fix deprecation warnings for json legacy reader (#15563) @davidwendt
- Fix millisecond resampling in cudf Python (#15560) @mroeschke
- Rename JSON_READER_OPTION to JSON_READER_OPTION_NVBENCH. (#15553) @bdice
- Fix a JNI bug in JSON parsing fixup (#15550) @revans2
- Remove conda channel setup from wheel CI image script. (#15539) @bdice
- cudf.pandas: Series dt accessor is CombinedDatetimelikeProperties (#15523) @wence-
- Fix for some compiler warnings in parquet/page_decode.cuh (#15518) @etseidl
- Fix exponent overflow in strings-to-double conversion (#15517) @davidwendt
- nanoarrow uses package override for proper pinned versions generation (#15515) @robertmaynard
- Remove index name overrides in dask-cudf pyarrow table dispatch (#15514) @charlesbluca
- Fix async synchronization issues in json_column.cu (#15497) @karthikeyann
- Add new patch to hide more CCCL APIs (#15493) @vyasr
- Make improvements in pandas-test reporting (#15485) @galipremsagar
- Fixed page data truncation in parquet writer under certain conditions. (#15474) @nvdbaranec
- Only use data_type constructor with scale for decimal types (#15472) @wence-
- Avoid "p2p" shuffle as a default when `dask_cudf` is imported (#15469) @rjzamora
- Fix debug build errors from to_arrow_device_test.cpp (#15463) @davidwendt
- Fix base_normalator::integer_sizeof_fn integer dispatch (#15457) @davidwendt
- Allow consumers of static builds to find nanoarrow (#15456) @robertmaynard
- Allow jit compilation when using a splayed CUDA toolkit (#15451) @robertmaynard
- Handle case of scan aggregation in groupby-transform (#15450) @wence-
- Test static builds in CI and fix nanoarrow configure (#15437) @vyasr
- Fixes potential race in JSON parser when parsing JSON lines format and when recovering from invalid lines (#15419) @elstehle
- Fix errors in chunked ORC writer when no tables were (successfully) written (#15393) @vuule
- Support implicit array conversion with query-planning enabled (#15378) @rjzamora
- Fix arrow-based round trip of empty dataframes (#15373) @wence-
- Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
- Remove boundscheck=False setting in cython files (#15362) @wence-
- Patch dask-expr `var` logic in dask-cudf (#15347) @rjzamora
- Fix for logical and syntactical errors in libcudf c++ examples (#15346) @mhaseeb123
- Disable dask-expr in docs builds. (#15343) @bdice
- Apply the cuFile error work around to data_sink as well (#15335) @vuule
- Fix parquet predicate filtering with column projection (#15113) @karthikeyann
- Check column type equality, handling nested types correctly. (#14531) @bdice
📖 Documentation
- Fix docs for IO readers and strings_convert (#15842) @bdice
- Update cudf.pandas docs for GA (#15744) @beckernick
- Add contributing warning about circular imports (#15691) @er-eis
- Update libcudf developer guide for strings offsets column (#15661) @davidwendt
- Update developer guide with device_async_resource_ref guidelines (#15562) @harrism
- DOC: add pandas intersphinx mapping (#15531) @raybellwaves
- rm-dup-doc in frame.py (#15530) @raybellwaves
- Update CONTRIBUTING.md to use latest cuda env (#15467) @raybellwaves
- Doc: interleave columns pandas compat (#15383) @raybellwaves
- Simplified README Examples (#15338) @wkaisertexas
- Add debug tips section to libcudf developer guide (#15329) @davidwendt
- Fix and clarify notes on result ordering (#13255) @shwina
🚀 New Features
- Add JNI bindings for zstd compression of NVCOMP. (#15729) @firestarman
- Fix spaces around CSV quoted strings (#15727) @thabetx
- Add default pinned pool that falls back to new pinned allocations (#15665) @vuule
- Overhaul ops-codeowners coverage (#15660) @raydouglass
- Concatenate dictionary of objects along axis=1 (#15623) @er-eis
- Construct `pylibcudf` columns from objects supporting `__cuda_array_interface__` (#15615) @brandon-b-miller
- Expose some Parquet per-column configuration options via the python API (#15613) @etseidl
- Migrate string `find` operations to `pylibcudf` (#15604) @brandon-b-miller
- Round trip FIXED_LEN_BYTE_ARRAY data properly in Parquet writer (#15600) @etseidl
- Reading multi-line JSON in string columns using runtime configurable delimiter (#15556) @shrshi
- Remove public gtest dependency from libcudf conda package (#15534) @robertmaynard
- Fea/move to latest nanoarrow (#15526) @robertmaynard
- Migrate string `case` operations to `pylibcudf` (#15489) @brandon-b-miller
- Add Parquet encoding statistics to column chunk metadata (#15452) @etseidl
- Implement JNI for chunked ORC reader (#15446) @ttnghia
- Add some missing optional fields to the Parquet RowGroup metadata (#15421) @etseidl
- Adding parquet transcoding example (#15420) @mhaseeb123
- Add fields to Parquet Statistics structure that were added in parquet-format 2.10 (#15412) @etseidl
- Add option to Parquet writer to skip compressing individual columns (#15411) @etseidl
- Add BYTE_STREAM_SPLIT support to Parquet (#15311) @etseidl
- Introduce benchmark suite for JSON reader options (#15124) @shrshi
- Implement ORC chunked reader (#15094) @ttnghia
- Extend cudf devcontainers to specify jitify2 kernel cache (#15068) @robertmaynard
- Add `to_arrow_device` function to cudf interop using nanoarrow (#15047) @zeroshade
- Add JSON option to prune columns (#14996) @karthikeyann
🛠️ Improvements
- Deprecate `Groupby.collect` (#15808) @galipremsagar
- Raise FileNotFoundError when a literal JSON string that looks like a json filename is passed (#15806) @lithomas1
- Deprecate `divisions='quantile'` support in `set_index` (#15804) @rjzamora
- Improve performance of Series.to_numpy/to_cupy (#15792) @mroeschke
- Access `self.index` instead of `self._index` where possible (#15781) @mroeschke
- Support filtered I/O in `chunked_parquet_reader` and simplify the use of `parquet_reader_options` (#15764) @mhaseeb123
- Avoid index-to-column conversion in some DataFrame ops (#15763) @mroeschke
- Fix `chunked_parquet_reader` behavior when input has no more rows to read (#15757) @mhaseeb123
- [JNI] Expose java API for cudf::io::config_host_memory_resource (#15745) @abellina
- Migrate all cpp pxd files into pylibcudf (#15740) @vyasr
- Validate and materialize iterators earlier in as_column (#15739) @mroeschke
- Push some as_column arrow logic to ColumnBase.from_arrow (#15738) @mroeschke
- Expose stream parameter in public reduction APIs (#15737) @srinivasyadav18
- remove unnecessary 'setuptools' host dependency, simplify dependencies.yaml (#15736) @jameslamb
- Defer to C++ equality and hashing for pylibcudf DataType and Aggregation objects (#15732) @wence-
- Implement null-aware NOT_EQUALS binop (#15731) @wence-
- Fix split-record result list column offset type (#15707) @davidwendt
- Upgrade `arrow` to `16` (#15703) @galipremsagar
- Remove experimental namespace from make_strings_children (#15702) @davidwendt
- Rework get_json_object benchmark to use nvbench (#15698) @davidwendt
- Rework some python tests of Parquet delta encodings (#15693) @etseidl
- Skeleton cudf polars package (#15688) @wence-
- Upgrade pre commit hooks (#15685) @wence-
- Allow `fillna` to validate for `CategoricalColumn.fillna` (#15683) @galipremsagar
- Misc Column cleanups (#15682) @mroeschke
- Reducing runtime of JSON reader options benchmark (#15681) @shrshi
- Add `Timestamp` and `Timedelta` proxy types (#15680) @galipremsagar
- Remove host_parse_nested_json. (#15674) @bdice
- Reduce runtime for ParquetChunkedReaderInputLimitTest gtests (#15672) @davidwendt
- Add large-strings gtest for cudf::interleave_columns (#15669) @davidwendt
- Use experimental make_strings_children for multi-replace_re (#15667) @davidwendt
- Enabled `Holiday` types in `cudf.pandas` (#15664) @galipremsagar
- Remove obsolete `XFAIL` markers for query-planning (#15662) @rjzamora
- Clean up join benchmarks (#15644) @PointKernel
- Enable warnings as errors in custreamz (#15642) @mroeschke
- Improve distinct join with set `retrieve` (#15636) @PointKernel
- Fix -Werror=type-limits. (#15635) @bdice
- Enable FutureWarnings/DeprecationWarnings as errors for dask_cudf (#15634) @mroeschke
- Remove NVBench SHA override. (#15633) @alliepiper
- Add support for large string columns to Parquet reader and writer (#15632) @etseidl
- Large strings support in MD5 and SHA hashers (#15631) @davidwendt
- Fix make_offsets_child_column usage in cudf::strings::detail::shift (#15630) @davidwendt
- Use experimental make_strings_children for strings convert (#15629) @davidwendt
- Forward-merge branch-24.04 to branch-24.06 (#15627) @bdice
- Avoid accessing attributes via `_column` if not needed (#15624) @mroeschke
- Make ColumnBase.cuda_array_interface opt out instead of opt in (#15622) @mroeschke
- Large strings support for cudf::gather (#15621) @davidwendt
- Remove jni-docker-build workflow (#15619) @bdice
- Support `DurationType` in cudf parquet reader via `arrow:schema` (#15617) @mhaseeb123
- Drop Centos7 support (#15608) @NvTimLiu
- Use experimental make_strings_children for json/csv writers (#15599) @davidwendt
- Use experimental make_strings_children for strings join/url_encode/slice (#15598) @davidwendt
- Use experimental make_strings_children in nvtext APIs (#15595) @davidwendt
- Migrate to `{{ stdlib("c") }}` (#15594) @hcho3
- Deprecate `to/from_dask_dataframe` APIs in dask-cudf (#15592) @rjzamora
- Minor fixups for future NumPy 2 compatibility (#15590) @seberg
- Delay materializing RangeIndex in .reset_index (#15588) @mroeschke
- Use experimental make_strings_children for capitalize/case/pad functions (#15587) @davidwendt
- Use experimental make_strings_children for strings replace/filter/translate (#15586) @davidwendt
- Add multithreaded parquet reader benchmarks. (#15585) @nvdbaranec
- Don't materialize column during RangeIndex methods (#15582) @mroeschke
- Improve performance for cudf::strings::count_re (#15578) @davidwendt
- Replace RangeIndex._start/_stop/_step with _range (#15576) @mroeschke
- add --rm and --name to devcontainer run args (#15572) @trxcllnt
- Change the default dictionary policy in Parquet writer from `ALWAYS` to `ADAPTIVE` (#15570) @mhaseeb123
- Rename experimental JSON tests. (#15568) @bdice
- Refactor JNI native dependency loading to allow returning of library path (#15566) @jlowe
- Remove protobuf and use parsed ORC statistics from libcudf (#15564) @bdice
- Deprecate legacy JSON reader options. (#15558) @bdice
- Use same .clang-format in cuDF JNI (#15557) @bdice
- Large strings support for cudf::fill (#15555) @davidwendt
- Upgrade upper bound pinning to `pandas-2.2.2` (#15554) @galipremsagar
- Work around issues with cccl main (#15552) @miscco
- Enable pandas plotting unit tests for cudf.pandas (#15547) @mroeschke
- Move timezone conversion logic to `DatetimeColumn` (#15545) @mroeschke
- Large strings support for cudf::interleave_columns (#15544) @davidwendt
- [skip ci] Switch back to 24.06 branch for pandas tests (#15543) @galipremsagar
- Remove checks dependency from static-configure test job. (#15542) @bdice
- Remove legacy JSON reader from Python (#15538) @bdice
- Enable more ignored pandas unit tests for cudf.pandas (#15535) @mroeschke
- Large strings support for cudf::clamp (#15533) @davidwendt
- Remove version hard-coding (#15529) @galipremsagar
- Removing all batching code from parquet writer (#15528) @mhaseeb123
- Make some private class properties not settable (#15527) @mroeschke
- Large strings support in regex replace APIs (#15524) @davidwendt
- Skip pandas unit tests that crash pytest workers in `cudf.pandas` (#15521) @mroeschke
- Preserve column metadata during more DataFrame operations (#15519) @mroeschke
- Move pandas-tests to a dedicated workflow file and trigger it from branch.yaml (#15516) @galipremsagar
- Large strings gtest fixture and utilities (#15513) @davidwendt
- Convert libcudf resource parameters to rmm::device_async_resource_ref (#15507) @harrism
- Relax protobuf lower bound to 3.20. (#15506) @bdice
- Clean up index methods (#15496) @mroeschke
- Update strings contains benchmarks to nvbench (#15495) @davidwendt
- Update NVBench fixture to use new hooks, fix pinned memory segfault. (#15492) @alliepiper
- Enable tests/scalar and test/series in cudf.pandas tests (#15486) @mroeschke
- Clean up cuda_array_interface handling in as_column (#15477) @mroeschke
- Avoid .ordered and .categories from being settable in CategoricalColumn and CategoricalDtype (#15475) @mroeschke
- Ignore pandas tests for cudf.pandas that need motoserver (#15468) @mroeschke
- Use cached_property for NumericColumn.nan_count instead of ._nan_count variable (#15466) @mroeschke
- Add to_arrow_device() functions that accept views (#15465) @davidwendt
- Add custom status check workflow (#15464) @galipremsagar
- Disable pandas 2.x clipboard tests in cudf.pandas tests (#15462) @mroeschke
- Enable tests/strings/test_api.py and tests/io/pytables in cudf.pandas tests (#15461) @mroeschke
- Enable test_parsing in cudf.pandas tests (#15460) @mroeschke
- Add `from_arrow_device` function to cudf interop using nanoarrow (#15458) @zeroshade
- Remove deprecated strings offsets_begin (#15454) @davidwendt
- Enable tests/windows/ in cudf.pandas tests (#15444) @mroeschke
- Enable tests/interchange/test_impl.py in cudf.pandas tests (#15443) @mroeschke
- Enable tests/io/test_user_agent.py in cudf pandas tests (#15442) @mroeschke
- Performance improvement in libcudf case conversion for long strings (#15441) @davidwendt
- Remove prior test skipping in run-pandas-tests with testing 2.2.1 (#15440) @mroeschke
- Support orc and text IO with dask-expr using legacy conversion (#15439) @rjzamora
- Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
- Unify Copy-On-Write and Spilling (#15436) @madsbk
- Enable `dask_cudf` json and s3 tests with query-planning on (#15408) @rjzamora
- Bump ruff and codespell pre-commit checks (#15407) @mroeschke
- Enable all tests for `arm` arch (#15402) @galipremsagar
- Bind `read_parquet_metadata` API to libcudf instead of pyarrow and extract `RowGroup` information (#15398) @mhaseeb123
- Optimizing multi-source byte range reading in JSON reader (#15396) @shrshi
- add correct labels to pandas_function_request.md (#15381) @raybellwaves
- Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
- Large strings support in cudf::merge (#15374) @davidwendt
- Enable test-reporting for pandas pytests in CI (#15369) @galipremsagar
- Use logical types in Parquet reader (#15365) @etseidl
- Add experimental make_strings_children utility (#15363) @davidwendt
- Forward-merge branch-24.04 to branch-24.06 (#15349) @bdice
- Fix CMake files in libcudf C++ examples to use existing libcudf build if present (#15348) @mhaseeb123
- Use ruff pydocstyle over pydocstyle pre-commit hook (#15345) @mroeschke
- Refactor stream mode setup for gtests (#15337) @davidwendt
- Benchmark decimal <--> floating conversions. (#15334) @pmattione-nvidia
- Avoid duplicate dask-cudf testing (#15333) @rjzamora
- Skip decode steps in Parquet reader when nullable columns have no nulls (#15332) @etseidl
- Update udf_cpp to use rapids_cpm_cccl. (#15331) @bdice
- Forward-merge branch-24.04 into branch-24.06 [skip ci] (#15330) @rapids-bot[bot]
- Allow `numeric_only=True` for simple groupby reductions (#15326) @rjzamora
- Drop CentOS 7 support. (#15323) @bdice
- Rework cudf::find_and_replace_all to use gather-based make_strings_column (#15305) @davidwendt
- First pass at adding testing for pylibcudf (#15300) @vyasr
- [FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
- Rework cudf::replace_nulls to use strings::detail::copy_if_else (#15286) @davidwendt
- Clean up special casing in `as_column` for non-typed input (#15276) @mroeschke
- Large strings support in cudf::concatenate (#15195) @davidwendt
- Use less _is_categorical_dtype (#15148) @mroeschke
- Align date_range defaults with pandas, support tz (#15139) @mroeschke
- `ModuleAccelerator` performance: cache the result of checking if a caller is in the denylist (#15056) @shwina
- Use offsetalator in cudf::strings::replace functions (#14824) @davidwendt
- Cleanup some timedelta/datetime column logic (#14715) @mroeschke
- Refactor numpy array input in as_column (#14651) @mroeschke
- Refactor joins for conditional semis and antis (#14646) @DanialJavady96
- Eagerly populate the class dict for cudf.pandas proxy types (#14534) @shwina
- Some additional kernel thread index refactoring. (#14107) @bdice