Release v24.10.00 · rapidsai/cudf

🚨 Breaking Changes

Whitespace normalization of nested column coerced as string column in JSONL inputs (#16759) @shrshi
Add libcudf wrappers around current_device_resource functions. (#16679) @harrism
Fix empty cluster handling in tdigest merge (#16675) @jihoonson
Remove java ColumnView.copyWithBooleanColumnAsValidity (#16660) @revans2
Support reading multiple PQ sources with mismatching nullability for columns (#16639) @mhaseeb123
Remove arrow_io_source (#16607) @vyasr
Remove legacy Arrow interop APIs (#16590) @vyasr
Remove NativeFile support from cudf Python (#16589) @vyasr
Revert "Make proxy NumPy arrays pass isinstance check in cudf.pandas" (#16586) @Matt711
Align public utility function signatures with pandas 2.x (#16565) @mroeschke
Disallow cudf.Index accepting column in favor of ._from_column (#16549) @mroeschke
Refactor dictionary encoding in PQ writer to migrate to the new cuco::static_map (#16541) @mhaseeb123
Change IPv4 convert APIs to support UINT32 instead of INT64 (#16489) @davidwendt
enable list to be forced as string in JSON reader. (#16472) @karthikeyann
Disallow cudf.Series to accept column in favor of ._from_column (#16454) @mroeschke
Align groupby APIs with pandas 2.x (#16403) @mroeschke
Align misc DataFrame and MultiIndex methods with pandas 2.x (#16402) @mroeschke
Align Index APIs with pandas 2.x (#16361) @mroeschke
Add stream param to stream compaction APIs (#16295) @JayjeetAtGithub
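
The IPv4 conversion change (#16489) is easiest to see from the value range involved. The following is a minimal sketch in plain Python using only the standard library `ipaddress` module. It does not call libcudf or cudf, and the helper names `ipv4_to_uint32` and `uint32_to_ipv4` are hypothetical. It only illustrates that every IPv4 address maps to an integer in [0, 2^32 - 1], which fits an unsigned 32-bit type but can exceed a signed 32-bit one.

```python
import ipaddress

UINT32_MAX = 2**32 - 1  # largest value an unsigned 32-bit integer can hold
INT32_MAX = 2**31 - 1   # largest value a signed 32-bit integer can hold


def ipv4_to_uint32(address: str) -> int:
    """Convert a dotted-quad IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(address))


def uint32_to_ipv4(value: int) -> str:
    """Convert an integer in [0, 2**32 - 1] back to a dotted-quad string."""
    if not 0 <= value <= UINT32_MAX:
        raise ValueError(f"{value} is outside the unsigned 32-bit range")
    return str(ipaddress.IPv4Address(value))


value = ipv4_to_uint32("192.168.0.1")
print(value)                  # 3232235521
print(value > INT32_MAX)      # True: too large for a signed 32-bit integer
print(value <= UINT32_MAX)    # True: fits in an unsigned 32-bit integer
print(uint32_to_ipv4(value))  # 192.168.0.1
```

Callers that previously stored converted addresses in a 64-bit signed column would presumably need to adjust to the narrower unsigned type. See the linked PR for the exact API changes.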

🐛 Bug Fixes

Add license to the pylibcudf wheel (#16976) @raydouglass
Parse newline as whitespace character while tokenizing JSONL inputs with non-newline delimiter (#16950) @shrshi
Add dask-cudf workaround for missing rename_axis support in cudf (#16899) @rjzamora
Update oldest deps for pyarrow & numpy (#16883) @galipremsagar
Update labeler for pylibcudf (#16868) @vyasr
Revert "Refactor mixed_semi_join using cuco::static_set" (#16855) @mhaseeb123
Fix metadata after implicit array conversion from Dask cuDF (#16842) @rjzamora
Add cudf.pandas dependencies.yaml to update-version.sh (#16840) @raydouglass
Use cupy 12.2.0 as oldest dependency pinning on CUDA 12 ARM (#16808) @bdice
Revert "Fix empty cluster handling in tdigest merge (#16675)" (#16800) @jihoonson
Intentionally leak thread_local CUDA resources to avoid crash (part 1) (#16787) @kingcrimsontianyu
Fix cov/corr bug in dask-cudf (#16786) @rjzamora
Fix slice_strings wide strings logic with multi-byte characters (#16777) @davidwendt
Fix nvbench output for sha512 (#16773) @davidwendt
Allow read_csv(header=None) to return int column labels in mode.pandas_compatible (#16769) @mroeschke
Whitespace normalization of nested column coerced as string column in JSONL inputs (#16759) @shrshi
Fix DataFrame.drop(columns=cudf.Series/Index, axis=1) (#16712) @mroeschke
Use merge base when calculating changed files (#16709) @KyleFromNVIDIA
Ensure we pass the has_nulls tparam to mixed_join kernels (#16708) @abellina
Add boost-devel to Java CI Docker image (#16707) @jlowe
[BUG] Add gpu node type to cudf-pandas 3rd-party integration nightly CI job (#16704) @Matt711
Fix typo in column_factories.hpp comment from 'depth 1' to 'depth 2' (#16700) @a-hirota
Fix Series.to_frame(name=None) setting a None name (#16698) @mroeschke
Disable gtests/ERROR_TEST during compute-sanitizer memcheck test (#16691) @davidwendt
Enable batched multi-source reading of JSONL files with large records (#16687) @shrshi
Handle ordered parameter in CategoricalIndex.__repr__ (#16683) @galipremsagar
Fix loc/iloc.setitem[:, loc] with non cupy types (#16677) @mroeschke
Fix empty cluster handling in tdigest merge (#16675) @jihoonson
Fix cudf::rank not getting enough params (#16666) @JayjeetAtGithub
Fix slowdown in CategoricalIndex.__repr__ (#16665) @galipremsagar
Remove java ColumnView.copyWithBooleanColumnAsValidity (#16660) @revans2
Fix slowdown in DataFrame repr in jupyter notebook (#16656) @galipremsagar
Preserve Series name in duplicated method. (#16655) @bdice
Fix interval_range right child non-zero offset (#16651) @mroeschke
fix libcudf wheel publishing, make package-type explicit in wheel publishing (#16650) @jameslamb
Revert "Hide all gtest symbols in cudftestutil (#16546)" (#16644) @robertmaynard
Fix integer overflow in indexalator pointer logic (#16643) @davidwendt
Allow for binops between two differently sized DecimalDtypes (#16638) @mroeschke
Move pragma once in rolling/jit/operation.hpp. (#16636) @bdice
Fix overflow bug in low-memory JSON reader (#16632) @shrshi
Add the missing num_aggregations axis for groupby_max_cardinality (#16630) @PointKernel
Fix strings::detail::copy_range when target contains nulls (#16626) @davidwendt
Fix function parameters with common dependency modified during their evaluation (#16620) @ttnghia
bug-fix: Don't enable the CUDA language if testing was requested when finding cudf (#16615) @cryos
bug-fix: cudf/io/json.hpp use after move (#16609) @NicolasDenoyelle
Remove CUDA whole compilation ODR violations (#16603) @robertmaynard
MAINT: Adapt to numpy hiding flagsobject away (#16593) @seberg
Revert "Make proxy NumPy arrays pass isinstance check in cudf.pandas" (#16586) @Matt711
Switch python version to 3.10 in cudf.pandas pandas test scripts (#16559) @galipremsagar
Hide all gtest symbols in cudftestutil (#16546) @robertmaynard
Update the java code to properly deal with lists being returned as strings (#16536) @revans2
Register read_parquet and read_csv with dask-expr (#16535) @rjzamora
Change cudf::empty_like to not include offsets for empty strings columns (#16529) @davidwendt
Fix DataFrame reductions with median returning scalar instead of Series (#16527) @mroeschke
Allow DataFrame.sort_values(by=) to select an index level (#16519) @mroeschke
Fix date_range(start, end, freq) when end-start is divisible by freq (#16516) @mroeschke
Preserve array name in MultiIndex.from_arrays (#16515) @mroeschke
Disallow indexing by selecting duplicate labels (#16514) @mroeschke
Fix .replace(Index, Index) raising a TypeError (#16513) @mroeschke
Check index bounds in compact protocol reader. (#16493) @bdice
Fix build failures with GCC 13 (#16488) @PointKernel
Fix all-empty input column for strings split APIs (#16466) @davidwendt
Fix segmented-sort overlapped input/output indices (#16463) @davidwendt
Fix merge conflict for auto merge 16447 (#16449) @davidwendt
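
For the `date_range(start, end, freq)` entry (#16516), the pandas behavior that cudf aims to match is that both endpoints are included by default, so `end` appears in the result whenever `end - start` is an exact multiple of `freq`. The sketch below reproduces that rule with the standard library only. The function name `inclusive_date_range` is hypothetical, and this is not cudf's implementation.

```python
from datetime import datetime, timedelta


def inclusive_date_range(start: datetime, end: datetime, freq: timedelta) -> list:
    """Return start, start + freq, ... up to and including end if it lies on the grid."""
    if freq <= timedelta(0):
        raise ValueError("freq must be positive")
    # Integer number of whole steps that fit between start and end.
    steps = (end - start) // freq
    return [start + i * freq for i in range(steps + 1)]


day = timedelta(days=1)
start = datetime(2024, 1, 1)

# end - start is exactly 2 * freq, so the end point is part of the result.
exact = inclusive_date_range(start, datetime(2024, 1, 3), day)
print(len(exact), exact[-1])

# end - start is 2.5 * freq, so the last element is the final grid point before end.
partial = inclusive_date_range(start, datetime(2024, 1, 3, 12), day)
print(len(partial), partial[-1])
```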

📖 Documentation

Fix links in Dask cuDF documentation (#16929) @rjzamora
Improve aggregation documentation (#16822) @PointKernel
Add best practices page to Dask cuDF docs (#16821) @rjzamora
[DOC] Update Pylibcudf doc strings (#16810) @Matt711
Recommending miniforge for conda install (#16782) @mmccarty
Add labeling pylibcudf doc pages (#16779) @mroeschke
Migrate dask-cudf README improvements to dask-cudf sphinx docs (#16765) @rjzamora
[DOC] Remove out of date section from cudf.pandas docs (#16697) @Matt711
Add performance tips to cudf.pandas FAQ. (#16693) @bdice
Update documentation for Dask cuDF (#16671) @rjzamora
Add missing pylibcudf strings docs (#16471) @brandon-b-miller
DOC: Refresh pylibcudf guide (#15856) @lithomas1

🚀 New Features

Build cudf-polars with build.sh (#16898) @brandon-b-miller
Add polars to "all" dependency list. (#16875) @bdice
nvCOMP GZIP integration (#16770) @vuule
[FEA] Add support for cudf.NamedAgg (#16744) @Matt711
Add experimental filesystem="arrow" support in dask_cudf.read_parquet (#16684) @rjzamora
Relax Arrow pin (#16681) @vyasr
Add libcudf wrappers around current_device_resource functions. (#16679) @harrism
Move NDS-H examples into benchmarks (#16663) @JayjeetAtGithub
[FEA] Add third-party library integration testing of cudf.pandas to cudf (#16645) @Matt711
Make isinstance check pass for proxy ndarrays (#16601) @Matt711
[FEA] Add an environment variable to fail on fallback in cudf.pandas (#16562) @Matt711
[FEA] Add support for cudf.unique (#16554) @Matt711
[FEA] Support named aggregations in df.groupby().agg() (#16528) @Matt711
Change IPv4 convert APIs to support UINT32 instead of INT64 (#16489) @davidwendt
enable list to be forced as string in JSON reader. (#16472) @karthikeyann
Remove cuDF dependency from pylibcudf column from_device tests (#16441) @brandon-b-miller
Enable cudf.pandas REPL and -c command support (#16428) @bdice
Setup pylibcudf package (#16299) @lithomas1
Add a libcudf/thrust-based TPC-H derived datagen (#16294) @JayjeetAtGithub
Make proxy NumPy arrays pass isinstance check in cudf.pandas (#16286) @Matt711
Add skiprows and nrows to parquet reader (#16214) @lithomas1
Upgrade to nvcomp 4.0.1 (#16076) @vuule
Migrate ORC reader to pylibcudf (#16042) @lithomas1
JSON reader validation of values (#15968) @karthikeyann
Implement exposed null mask APIs in pylibcudf (#15908) @charlesbluca
Word-based nvtext::minhash function (#15368) @davidwendt
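
As background for the word-based `nvtext::minhash` entry (#15368), the general MinHash idea over words can be sketched as follows: hash every word of a row once per seed and keep the minimum hash for each seed. This is a conceptual illustration in plain Python, not the libcudf implementation. The function names and the choice of BLAKE2b as the seeded hash are assumptions made for the example.

```python
import hashlib


def seeded_hash(token: str, seed: int) -> int:
    """Hash a token to a 32-bit integer; the seed is passed as the BLAKE2b salt."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=4,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def word_minhash(words: list, seeds: list) -> list:
    """Return one minimum hash per seed over all words in a row."""
    if not words:
        raise ValueError("at least one word is required")
    return [min(seeded_hash(word, seed) for word in words) for seed in seeds]


seeds = [1, 2, 3, 4]
row_a = ["the", "quick", "brown", "fox"]
row_b = ["fox", "brown", "quick", "the"]  # same words, different order

sig_a = word_minhash(row_a, seeds)
sig_b = word_minhash(row_b, seeds)
print(sig_a == sig_b)  # True: the signature depends only on the set of words
```

Rows that share many words tend to agree in more signature positions, which is what makes such signatures useful for approximate similarity.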

🛠️ Improvements

Make tests deterministic (#16910) @galipremsagar
Update update-version.sh to use packaging lib (#16891) @AyodeAwe
Pin polars for 24.10 and update polars test suite xfail list (#16886) @wence-
Add in support for setting delim when parsing JSON through java (#16867) (#16880) @revans2
Remove unnecessary flag from build.sh (#16879) @vyasr
Ignore numba warning specific to ARM runners (#16872) @galipremsagar
Display deltas for cudf.pandas test summary (#16864) @galipremsagar
Switch to using native traceback (#16851) @galipremsagar
JSON tree algorithm code reorg (#16836) @karthikeyann
Add string.repeats API to pylibcudf (#16834) @mroeschke
Use CI workflow branch 'branch-24.10' again (#16832) @jameslamb
Rename the NDS-H benchmark binaries (#16831) @JayjeetAtGithub
Add string.findall APIs to pylibcudf (#16825) @mroeschke
Add string.extract APIs to pylibcudf (#16823) @mroeschke
use get-pr-info from nv-gha-runners (#16819) @AyodeAwe
Add string.contains APIs to pylibcudf (#16814) @mroeschke
Forward-merge branch-24.08 to branch-24.10 (#16813) @bdice
Add io_type axis with default PINNED_BUFFER to nvbench PQ multithreaded reader (#16809) @mhaseeb123
Update fmt (to 11.0.2) and spdlog (to 1.14.1). (#16806) @jameslamb
Add ability to set parquet row group max #rows and #bytes in java (#16805) @pmattione-nvidia
Add in option for Java JSON APIs to do column pruning in CUDF (#16796) @revans2
Support drop_first in get_dummies (#16795) @mroeschke
Exposed stream-ordering to join API (#16793) @lamarrr
Add string.attributes APIs to pylibcudf (#16785) @mroeschke
Java: Make ColumnVector.fromViewWithContiguousAllocation public (#16784) @jlowe
Add partitioning APIs to pylibcudf (#16781) @mroeschke
Optimization of tdigest merge aggregation. (#16780) @nvdbaranec
use libkvikio wheels in wheel builds (#16778) @jameslamb
Exposed stream-ordering to datetime API (#16774) @lamarrr
Add io/timezone APIs to pylibcudf (#16771) @mroeschke
Remove MultiIndex._poplevel inplace implementation. (#16767) @mroeschke
allow pandas patch version to float in cudf-pandas unit tests (#16763) @jameslamb
Simplify the nvCOMP adapter (#16762) @vuule
Add labeling APIs to pylibcudf (#16761) @mroeschke
Add transform APIs to pylibcudf (#16760) @mroeschke
Add a benchmark to study Parquet reader's performance for wide tables (#16751) @mhaseeb123
Change the Parquet writer's default_row_group_size_bytes from 128MB to inf (#16750) @mhaseeb123
Add transpose API to pylibcudf (#16749) @mroeschke
Add support for Python 3.12, update Kafka dependencies to 2.5.x (#16745) @jameslamb
Generate GPU vs CPU usage metrics per pytest file in pandas testsuite for cudf.pandas (#16739) @galipremsagar
Refactor cudf pandas integration tests CI (#16728) @Matt711
Remove ERROR_TEST gtest from libcudf (#16722) @davidwendt
Use Series._from_column more consistently to avoid validation (#16716) @mroeschke
remove some unnecessary libcudf nightly builds (#16714) @jameslamb
Remove xfail from torch-cudf.pandas integration test (#16705) @Matt711
Add return type annotations to MultiIndex (#16696) @mroeschke
Add type annotations to Index classes, utilize _from_column more (#16695) @mroeschke
Have interval_range use IntervalIndex.from_breaks, remove column_empty_same_mask (#16694) @mroeschke
Increase timeouts for couple of tests (#16692) @galipremsagar
Replace raw device_memory_resource pointer in pylibcudf Cython (#16674) @harrism
switch from typing.Callable to collections.abc.Callable (#16670) @jameslamb
Update rapidsai/pre-commit-hooks (#16669) @KyleFromNVIDIA
Multi-file and Parquet-aware prefetching from remote storage (#16657) @rjzamora
Access Frame attributes instead of ColumnAccessor attributes when available (#16652) @mroeschke
Use non-mangled type names in nvbench output (#16649) @davidwendt
Add pylibcudf build dir in build.sh for clean (#16648) @galipremsagar
Prune workflows based on changed files (#16642) @KyleFromNVIDIA
Remove arrow dependency (#16640) @vyasr
Support reading multiple PQ sources with mismatching nullability for columns (#16639) @mhaseeb123
Drop Python 3.9 support (#16637) @jameslamb
Support DecimalDtype meta in dask_cudf (#16634) @mroeschke
Add num_multiprocessors utility (#16628) @PointKernel
Annotate ColumnAccessor._data labels as Hashable (#16623) @mroeschke
Remove build_categorical_column in favor of CategoricalColumn constructor (#16617) @mroeschke
Move apply_boolean_mask benchmark to nvbench (#16616) @davidwendt
Revise get_reader_filepath_or_buffer to handle a list of data sources (#16613) @rjzamora
do not install cudf in cudf_polars wheel tests (#16612) @jameslamb
remove streamz git dependency, standardize build dependency names, consolidate some dependency lists (#16611) @jameslamb
Fix C++ and Cython io types (#16610) @vyasr
Remove arrow_io_source (#16607) @vyasr
Remove thrust::optional from expression evaluator (#16604) @bdice
Add stricter typing and validation to ColumnAccessor (#16602) @mroeschke
make more use of YAML anchors in dependencies.yaml (#16597) @jameslamb
Enable testing cudf.pandas unit tests for all minor versions of pandas (#16595) @galipremsagar
Extend the Parquet writer's dictionary encoding benchmark. (#16591) @mhaseeb123
Remove legacy Arrow interop APIs (#16590) @vyasr
Remove NativeFile support from cudf Python (#16589) @vyasr
Add build job for pylibcudf (#16587) @vyasr
Add public qualifier for some member functions in Java class Schema (#16583) @ttnghia
Enable gtests previously disabled for compute-sanitizer bug (#16581) @davidwendt
[FEA] Add filesystem argument to cudf.read_parquet (#16577) @rjzamora
Ensure size is always passed to NumericalColumn (#16576) @mroeschke
standardize and consolidate wheel installations in testing scripts (#16575) @jameslamb
Performance improvement for strings::slice for wide strings (#16574) @davidwendt
Add ToCudfBackend expression to dask-cudf (#16573) @rjzamora
CI: Test against old versions of key dependencies (#16570) @seberg
Replace NativeFile dependency in dask-cudf Parquet reader (#16569) @rjzamora
Align public utility function signatures with pandas 2.x (#16565) @mroeschke
Move libcudf reduction google-benchmarks to nvbench (#16564) @davidwendt
Rework strings::slice benchmark to use nvbench (#16563) @davidwendt
Reenable arrow tests (#16556) @vyasr
Clean up reshaping ops (#16553) @mroeschke
Disallow cudf.Index accepting column in favor of ._from_column (#16549) @mroeschke
Rewrite remaining Python Arrow interop conversions using the C Data Interface (#16548) @vyasr
[REVIEW] JSON host tree algorithms (#16545) @shrshi
Refactor dictionary encoding in PQ writer to migrate to the new cuco::static_map (#16541) @mhaseeb123
Remove hardcoded versions from workflows. (#16540) @bdice
Ensure comparisons with pyints and integer series always succeed (#16532) @seberg
Remove unneeded output size parameter from internal count_matches utility (#16531) @davidwendt
Remove invalid column_view usage in string-scalar-to-column function (#16530) @davidwendt
Raise NotImplementedError for Series.rename that's not a scalar (#16525) @mroeschke
Remove deprecated public APIs from libcudf (#16524) @davidwendt
Return Interval object in pandas compat mode for IntervalIndex reductions (#16523) @mroeschke
Update json normalization to take device_buffer (#16520) @karthikeyann
Rework cudf::io::text::byte_range_info class member functions (#16518) @davidwendt
Remove unneeded pair-iterator benchmark (#16511) @davidwendt
Update pre-commit hooks (#16510) @KyleFromNVIDIA
Improve update-version.sh (#16506) @bdice
Use tool.scikit-build.cmake.version, set scikit-build-core minimum-version (#16503) @jameslamb
Pass batch size to JSON reader using environment variable (#16502) @shrshi
Remove a deprecated multibyte_split API (#16501) @davidwendt
Add interop example for arrow::StringViewArray to cudf::column (#16498) @JayjeetAtGithub
Add keep option to distinct nvbench (#16497) @bdice
Use more idiomatic cudf APIs in dask_cudf meta generation (#16487) @mroeschke
Fix typo in dispatch_row_equal. (#16473) @bdice
Use explicit construction of column subclass instead of build_column when type is known (#16470) @mroeschke
Move exception handler into pylibcudf from cudf (#16468) @lithomas1
Make StructColumn.__init__ strict (#16467) @mroeschke
Make ListColumn.__init__ strict (#16465) @mroeschke
Make Timedelta/DatetimeColumn.__init__ strict (#16464) @mroeschke
Make NumericalColumn.__init__ strict (#16457) @mroeschke
Make CategoricalColumn.__init__ strict (#16456) @mroeschke
Disallow cudf.Series to accept column in favor of ._from_column (#16454) @mroeschke
Expose stream param in transform APIs (#16452) @JayjeetAtGithub
Add upper bound pin for polars (#16442) @wence-
Make (Indexed)Frame.__init__ require data (and index) (#16430) @mroeschke
Add Java APIs to copy column data to host asynchronously (#16429) @jlowe
Update docs of the TPC-H derived examples (#16423) @JayjeetAtGithub
Use RMM adaptor constructors instead of factories. (#16414) @bdice
Align ewm APIs with pandas 2.x (#16413) @mroeschke
Remove checking for specific tests in memcheck script (#16412) @davidwendt
Add stream parameter to reshape APIs (#16410) @davidwendt
Align groupby APIs with pandas 2.x (#16403) @mroeschke
Align misc DataFrame and MultiIndex methods with pandas 2.x (#16402) @mroeschke
update some branch references in GitHub Actions configs (#16397) @jameslamb
Support reading matching projected and filter cols from Parquet files with otherwise mismatched schemas (#16394) @mhaseeb123
Merge branch-24.08 into branch-24.10 (#16393) @jameslamb
Add query 10 to the TPC-H suite (#16392) @JayjeetAtGithub
Use make_host_vector instead of make_std_vector to facilitate pinned memory optimizations (#16386) @vuule
Fix some issues with deprecated / removed cccl facilities (#16377) @miscco
Align IntervalIndex APIs with pandas 2.x (#16371) @mroeschke
Align CategoricalIndex APIs with pandas 2.x (#16369) @mroeschke
Align TimedeltaIndex APIs with pandas 2.x (#16368) @mroeschke
Align DatetimeIndex APIs with pandas 2.x (#16367) @mroeschke
fix [tool.setuptools] reference in custreamz config (#16365) @jameslamb
Align Index APIs with pandas 2.x (#16361) @mroeschke
Rebuild for & Support NumPy 2 (#16300) @jakirkham
Add stream param to stream compaction APIs (#16295) @JayjeetAtGithub
Added batch memset to memset data and validity buffers in parquet reader (#16281) @sdrp713
Deduplicate decimal32/decimal64 to decimal128 conversion function (#16236) @mhaseeb123
Refactor mixed_semi_join using cuco::static_set (#16230) @srinivasyadav18
Improve performance of hash_character_ngrams using warp-per-string kernel (#16212) @davidwendt
Add environment variable to log cudf.pandas fallback calls (#16161) @mroeschke
Add libcudf example with large strings (#15983) @davidwendt
JSON tree algorithms refactor I: CSR data structure for column tree (#15979) @shrshi
Support multiple new-line characters in regex APIs (#15961) @davidwendt
adding wheel build for libcudf (#15483) @msarahan
Replace usages of thrust::optional with std::optional (#15091) @miscco
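
For the `drop_first` entry in `get_dummies` (#16795), the pandas semantics being matched are that `drop_first=True` removes the first category level, producing k-1 indicator columns for k levels. Below is a minimal pure-Python sketch of that rule. The helper `one_hot` is hypothetical and returns a plain dict of lists rather than a DataFrame.

```python
def one_hot(values: list, drop_first: bool = False) -> dict:
    """Build one boolean indicator column per sorted level, optionally skipping the first."""
    levels = sorted(set(values))
    if drop_first:
        # Dropping one level avoids a redundant column: it is implied
        # when all remaining indicators in a row are False.
        levels = levels[1:]
    return {level: [value == level for value in values] for level in levels}


values = ["b", "a", "c", "a"]

full = one_hot(values)
reduced = one_hot(values, drop_first=True)

print(list(full))     # ['a', 'b', 'c']
print(list(reduced))  # ['b', 'c']
print(reduced["b"])   # [True, False, False, False]
```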
