v24.10.00
🚨 Breaking Changes
- Whitespace normalization of nested column coerced as string column in JSONL inputs (#16759) @shrshi
- Add libcudf wrappers around current_device_resource functions. (#16679) @harrism
- Fix empty cluster handling in tdigest merge (#16675) @jihoonson
- Remove java ColumnView.copyWithBooleanColumnAsValidity (#16660) @revans2
- Support reading multiple PQ sources with mismatching nullability for columns (#16639) @mhaseeb123
- Remove arrow_io_source (#16607) @vyasr
- Remove legacy Arrow interop APIs (#16590) @vyasr
- Remove NativeFile support from cudf Python (#16589) @vyasr
- Revert "Make proxy NumPy arrays pass isinstance check in `cudf.pandas`" (#16586) @Matt711
- Align public utility function signatures with pandas 2.x (#16565) @mroeschke
- Disallow cudf.Index accepting column in favor of ._from_column (#16549) @mroeschke
- Refactor dictionary encoding in PQ writer to migrate to the new `cuco::static_map` (#16541) @mhaseeb123
- Change IPv4 convert APIs to support UINT32 instead of INT64 (#16489) @davidwendt
- enable list to be forced as string in JSON reader. (#16472) @karthikeyann
- Disallow cudf.Series to accept column in favor of `._from_column` (#16454) @mroeschke
- Align groupby APIs with pandas 2.x (#16403) @mroeschke
- Align misc DataFrame and MultiIndex methods with pandas 2.x (#16402) @mroeschke
- Align Index APIs with pandas 2.x (#16361) @mroeschke
- Add `stream` param to stream compaction APIs (#16295) @JayjeetAtGithub
🐛 Bug Fixes
- Add license to the pylibcudf wheel (#16976) @raydouglass
- Parse newline as whitespace character while tokenizing JSONL inputs with non-newline delimiter (#16950) @shrshi
- Add dask-cudf workaround for missing `rename_axis` support in cudf (#16899) @rjzamora
- Update oldest deps for `pyarrow` & `numpy` (#16883) @galipremsagar
- Update labeler for pylibcudf (#16868) @vyasr
- Revert "Refactor mixed_semi_join using cuco::static_set" (#16855) @mhaseeb123
- Fix metadata after implicit array conversion from Dask cuDF (#16842) @rjzamora
- Add cudf.pandas dependencies.yaml to update-version.sh (#16840) @raydouglass
- Use cupy 12.2.0 as oldest dependency pinning on CUDA 12 ARM (#16808) @bdice
- Revert "Fix empty cluster handling in tdigest merge (#16675)" (#16800) @jihoonson
- Intentionally leak thread_local CUDA resources to avoid crash (part 1) (#16787) @kingcrimsontianyu
- Fix `cov`/`corr` bug in dask-cudf (#16786) @rjzamora
- Fix slice_strings wide strings logic with multi-byte characters (#16777) @davidwendt
- Fix nvbench output for sha512 (#16773) @davidwendt
- Allow read_csv(header=None) to return int column labels in `mode.pandas_compatible` (#16769) @mroeschke
- Whitespace normalization of nested column coerced as string column in JSONL inputs (#16759) @shrshi
- Fix DataFrame.drop(columns=cudf.Series/Index, axis=1) (#16712) @mroeschke
- Use merge base when calculating changed files (#16709) @KyleFromNVIDIA
- Ensure we pass the has_nulls tparam to mixed_join kernels (#16708) @abellina
- Add boost-devel to Java CI Docker image (#16707) @jlowe
- [BUG] Add gpu node type to cudf-pandas 3rd-party integration nightly CI job (#16704) @Matt711
- Fix typo in column_factories.hpp comment from 'depth 1' to 'depth 2' (#16700) @a-hirota
- Fix Series.to_frame(name=None) setting a None name (#16698) @mroeschke
- Disable gtests/ERROR_TEST during compute-sanitizer memcheck test (#16691) @davidwendt
- Enable batched multi-source reading of JSONL files with large records (#16687) @shrshi
- Handle `ordered` parameter in `CategoricalIndex.__repr__` (#16683) @galipremsagar
- Fix `loc/iloc.__setitem__[:, loc]` with non cupy types (#16677) @mroeschke
- Fix empty cluster handling in tdigest merge (#16675) @jihoonson
- Fix `cudf::rank` not getting enough params (#16666) @JayjeetAtGithub
- Fix slowdown in `CategoricalIndex.__repr__` (#16665) @galipremsagar
- Remove java ColumnView.copyWithBooleanColumnAsValidity (#16660) @revans2
- Fix slowdown in DataFrame repr in jupyter notebook (#16656) @galipremsagar
- Preserve Series name in duplicated method. (#16655) @bdice
- Fix interval_range right child non-zero offset (#16651) @mroeschke
- fix libcudf wheel publishing, make package-type explicit in wheel publishing (#16650) @jameslamb
- Revert "Hide all gtest symbols in cudftestutil (#16546)" (#16644) @robertmaynard
- Fix integer overflow in indexalator pointer logic (#16643) @davidwendt
- Allow for binops between two differently sized DecimalDtypes (#16638) @mroeschke
- Move pragma once in rolling/jit/operation.hpp. (#16636) @bdice
- Fix overflow bug in low-memory JSON reader (#16632) @shrshi
- Add the missing `num_aggregations` axis for `groupby_max_cardinality` (#16630) @PointKernel
- Fix strings::detail::copy_range when target contains nulls (#16626) @davidwendt
- Fix function parameters with common dependency modified during their evaluation (#16620) @ttnghia
- bug-fix: Don't enable the CUDA language if testing was requested when finding cudf (#16615) @cryos
- bug-fix: cudf/io/json.hpp use after move (#16609) @NicolasDenoyelle
- Remove CUDA whole compilation ODR violations (#16603) @robertmaynard
- MAINT: Adapt to numpy hiding flagsobject away (#16593) @seberg
- Revert "Make proxy NumPy arrays pass isinstance check in `cudf.pandas`" (#16586) @Matt711
- Switch python version to `3.10` in `cudf.pandas` pandas test scripts (#16559) @galipremsagar
- Hide all gtest symbols in cudftestutil (#16546) @robertmaynard
- Update the java code to properly deal with lists being returned as strings (#16536) @revans2
- Register `read_parquet` and `read_csv` with dask-expr (#16535) @rjzamora
- Change cudf::empty_like to not include offsets for empty strings columns (#16529) @davidwendt
- Fix DataFrame reductions with median returning scalar instead of Series (#16527) @mroeschke
- Allow DataFrame.sort_values(by=) to select an index level (#16519) @mroeschke
- Fix `date_range(start, end, freq)` when end-start is divisible by freq (#16516) @mroeschke
- Preserve array name in MultiIndex.from_arrays (#16515) @mroeschke
- Disallow indexing by selecting duplicate labels (#16514) @mroeschke
- Fix `.replace(Index, Index)` raising a TypeError (#16513) @mroeschke
- Check index bounds in compact protocol reader. (#16493) @bdice
- Fix build failures with GCC 13 (#16488) @PointKernel
- Fix all-empty input column for strings split APIs (#16466) @davidwendt
- Fix segmented-sort overlapped input/output indices (#16463) @davidwendt
- Fix merge conflict for auto merge 16447 (#16449) @davidwendt
📖 Documentation
- Fix links in Dask cuDF documentation (#16929) @rjzamora
- Improve aggregation documentation (#16822) @PointKernel
- Add best practices page to Dask cuDF docs (#16821) @rjzamora
- [DOC] Update Pylibcudf doc strings (#16810) @Matt711
- Recommending `miniforge` for conda install (#16782) @mmccarty
- Add labeling pylibcudf doc pages (#16779) @mroeschke
- Migrate dask-cudf README improvements to dask-cudf sphinx docs (#16765) @rjzamora
- [DOC] Remove out of date section from cudf.pandas docs (#16697) @Matt711
- Add performance tips to cudf.pandas FAQ. (#16693) @bdice
- Update documentation for Dask cuDF (#16671) @rjzamora
- Add missing pylibcudf strings docs (#16471) @brandon-b-miller
- DOC: Refresh pylibcudf guide (#15856) @lithomas1
🚀 New Features
- Build `cudf-polars` with `build.sh` (#16898) @brandon-b-miller
- Add polars to "all" dependency list. (#16875) @bdice
- nvCOMP GZIP integration (#16770) @vuule
- [FEA] Add support for `cudf.NamedAgg` (#16744) @Matt711
- Add experimental `filesystem="arrow"` support in `dask_cudf.read_parquet` (#16684) @rjzamora
- Relax Arrow pin (#16681) @vyasr
- Add libcudf wrappers around current_device_resource functions. (#16679) @harrism
- Move NDS-H examples into benchmarks (#16663) @JayjeetAtGithub
- [FEA] Add third-party library integration testing of cudf.pandas to cudf (#16645) @Matt711
- Make isinstance check pass for proxy ndarrays (#16601) @Matt711
- [FEA] Add an environment variable to fail on fallback in `cudf.pandas` (#16562) @Matt711
- [FEA] Add support for `cudf.unique` (#16554) @Matt711
- [FEA] Support named aggregations in `df.groupby().agg()` (#16528) @Matt711
- Change IPv4 convert APIs to support UINT32 instead of INT64 (#16489) @davidwendt
- enable list to be forced as string in JSON reader. (#16472) @karthikeyann
- Remove cuDF dependency from pylibcudf column from_device tests (#16441) @brandon-b-miller
- Enable cudf.pandas REPL and -c command support (#16428) @bdice
- Setup pylibcudf package (#16299) @lithomas1
- Add a libcudf/thrust-based TPC-H derived datagen (#16294) @JayjeetAtGithub
- Make proxy NumPy arrays pass isinstance check in `cudf.pandas` (#16286) @Matt711
- Add skiprows and nrows to parquet reader (#16214) @lithomas1
- Upgrade to nvcomp 4.0.1 (#16076) @vuule
- Migrate ORC reader to pylibcudf (#16042) @lithomas1
- JSON reader validation of values (#15968) @karthikeyann
- Implement exposed null mask APIs in pylibcudf (#15908) @charlesbluca
- Word-based nvtext::minhash function (#15368) @davidwendt
🛠️ Improvements
- Make tests deterministic (#16910) @galipremsagar
- Update update-version.sh to use packaging lib (#16891) @AyodeAwe
- Pin polars for 24.10 and update polars test suite xfail list (#16886) @wence-
- Add in support for setting delim when parsing JSON through java (#16867) (#16880) @revans2
- Remove unnecessary flag from build.sh (#16879) @vyasr
- Ignore numba warning specific to ARM runners (#16872) @galipremsagar
- Display deltas for `cudf.pandas` test summary (#16864) @galipremsagar
- Switch to using native `traceback` (#16851) @galipremsagar
- JSON tree algorithm code reorg (#16836) @karthikeyann
- Add string.repeats API to pylibcudf (#16834) @mroeschke
- Use CI workflow branch 'branch-24.10' again (#16832) @jameslamb
- Rename the NDS-H benchmark binaries (#16831) @JayjeetAtGithub
- Add string.findall APIs to pylibcudf (#16825) @mroeschke
- Add string.extract APIs to pylibcudf (#16823) @mroeschke
- use get-pr-info from nv-gha-runners (#16819) @AyodeAwe
- Add string.contains APIs to pylibcudf (#16814) @mroeschke
- Forward-merge branch-24.08 to branch-24.10 (#16813) @bdice
- Add io_type axis with default `PINNED_BUFFER` to nvbench PQ multithreaded reader (#16809) @mhaseeb123
- Update fmt (to 11.0.2) and spdlog (to 1.14.1). (#16806) @jameslamb
- Add ability to set parquet row group max #rows and #bytes in java (#16805) @pmattione-nvidia
- Add in option for Java JSON APIs to do column pruning in CUDF (#16796) @revans2
- Support drop_first in get_dummies (#16795) @mroeschke
- Exposed stream-ordering to join API (#16793) @lamarrr
- Add string.attributes APIs to pylibcudf (#16785) @mroeschke
- Java: Make ColumnVector.fromViewWithContiguousAllocation public (#16784) @jlowe
- Add partitioning APIs to pylibcudf (#16781) @mroeschke
- Optimization of tdigest merge aggregation. (#16780) @nvdbaranec
- use libkvikio wheels in wheel builds (#16778) @jameslamb
- Exposed stream-ordering to datetime API (#16774) @lamarrr
- Add io/timezone APIs to pylibcudf (#16771) @mroeschke
- Remove `MultiIndex._poplevel` inplace implementation. (#16767) @mroeschke
- allow pandas patch version to float in cudf-pandas unit tests (#16763) @jameslamb
- Simplify the nvCOMP adapter (#16762) @vuule
- Add labeling APIs to pylibcudf (#16761) @mroeschke
- Add transform APIs to pylibcudf (#16760) @mroeschke
- Add a benchmark to study Parquet reader's performance for wide tables (#16751) @mhaseeb123
- Change the Parquet writer's `default_row_group_size_bytes` from 128MB to inf (#16750) @mhaseeb123
- Add transpose API to pylibcudf (#16749) @mroeschke
- Add support for Python 3.12, update Kafka dependencies to 2.5.x (#16745) @jameslamb
- Generate GPU vs CPU usage metrics per pytest file in pandas testsuite for `cudf.pandas` (#16739) @galipremsagar
- Refactor cudf pandas integration tests CI (#16728) @Matt711
- Remove ERROR_TEST gtest from libcudf (#16722) @davidwendt
- Use Series._from_column more consistently to avoid validation (#16716) @mroeschke
- remove some unnecessary libcudf nightly builds (#16714) @jameslamb
- Remove xfail from torch-cudf.pandas integration test (#16705) @Matt711
- Add return type annotations to MultiIndex (#16696) @mroeschke
- Add type annotations to Index classes, utilize _from_column more (#16695) @mroeschke
- Have interval_range use IntervalIndex.from_breaks, remove column_empty_same_mask (#16694) @mroeschke
- Increase timeouts for couple of tests (#16692) @galipremsagar
- Replace raw device_memory_resource pointer in pylibcudf Cython (#16674) @harrism
- switch from typing.Callable to collections.abc.Callable (#16670) @jameslamb
- Update rapidsai/pre-commit-hooks (#16669) @KyleFromNVIDIA
- Multi-file and Parquet-aware prefetching from remote storage (#16657) @rjzamora
- Access Frame attributes instead of ColumnAccessor attributes when available (#16652) @mroeschke
- Use non-mangled type names in nvbench output (#16649) @davidwendt
- Add pylibcudf build dir in build.sh for `clean` (#16648) @galipremsagar
- Prune workflows based on changed files (#16642) @KyleFromNVIDIA
- Remove arrow dependency (#16640) @vyasr
- Support reading multiple PQ sources with mismatching nullability for columns (#16639) @mhaseeb123
- Drop Python 3.9 support (#16637) @jameslamb
- Support DecimalDtype meta in dask_cudf (#16634) @mroeschke
- Add `num_multiprocessors` utility (#16628) @PointKernel
- Annotate `ColumnAccessor._data` labels as `Hashable` (#16623) @mroeschke
- Remove build_categorical_column in favor of CategoricalColumn constructor (#16617) @mroeschke
- Move apply_boolean_mask benchmark to nvbench (#16616) @davidwendt
- Revise `get_reader_filepath_or_buffer` to handle a list of data sources (#16613) @rjzamora
- do not install cudf in cudf_polars wheel tests (#16612) @jameslamb
- remove streamz git dependency, standardize build dependency names, consolidate some dependency lists (#16611) @jameslamb
- Fix C++ and Cython io types (#16610) @vyasr
- Remove arrow_io_source (#16607) @vyasr
- Remove thrust::optional from expression evaluator (#16604) @bdice
- Add stricter typing and validation to ColumnAccessor (#16602) @mroeschke
- make more use of YAML anchors in dependencies.yaml (#16597) @jameslamb
- Enable testing `cudf.pandas` unit tests for all minor versions of pandas (#16595) @galipremsagar
- Extend the Parquet writer's dictionary encoding benchmark. (#16591) @mhaseeb123
- Remove legacy Arrow interop APIs (#16590) @vyasr
- Remove NativeFile support from cudf Python (#16589) @vyasr
- Add build job for pylibcudf (#16587) @vyasr
- Add `public` qualifier for some member functions in Java class `Schema` (#16583) @ttnghia
- Enable gtests previously disabled for compute-sanitizer bug (#16581) @davidwendt
- [FEA] Add filesystem argument to `cudf.read_parquet` (#16577) @rjzamora
- Ensure size is always passed to NumericalColumn (#16576) @mroeschke
- standardize and consolidate wheel installations in testing scripts (#16575) @jameslamb
- Performance improvement for strings::slice for wide strings (#16574) @davidwendt
- Add `ToCudfBackend` expression to dask-cudf (#16573) @rjzamora
- CI: Test against old versions of key dependencies (#16570) @seberg
- Replace `NativeFile` dependency in dask-cudf Parquet reader (#16569) @rjzamora
- Align public utility function signatures with pandas 2.x (#16565) @mroeschke
- Move libcudf reduction google-benchmarks to nvbench (#16564) @davidwendt
- Rework strings::slice benchmark to use nvbench (#16563) @davidwendt
- Reenable arrow tests (#16556) @vyasr
- Clean up reshaping ops (#16553) @mroeschke
- Disallow cudf.Index accepting column in favor of ._from_column (#16549) @mroeschke
- Rewrite remaining Python Arrow interop conversions using the C Data Interface (#16548) @vyasr
- [REVIEW] JSON host tree algorithms (#16545) @shrshi
- Refactor dictionary encoding in PQ writer to migrate to the new `cuco::static_map` (#16541) @mhaseeb123
- Remove hardcoded versions from workflows. (#16540) @bdice
- Ensure comparisons with pyints and integer series always succeed (#16532) @seberg
- Remove unneeded output size parameter from internal count_matches utility (#16531) @davidwendt
- Remove invalid column_view usage in string-scalar-to-column function (#16530) @davidwendt
- Raise NotImplementedError for Series.rename that's not a scalar (#16525) @mroeschke
- Remove deprecated public APIs from libcudf (#16524) @davidwendt
- Return Interval object in pandas compat mode for IntervalIndex reductions (#16523) @mroeschke
- Update json normalization to take device_buffer (#16520) @karthikeyann
- Rework cudf::io::text::byte_range_info class member functions (#16518) @davidwendt
- Remove unneeded pair-iterator benchmark (#16511) @davidwendt
- Update pre-commit hooks (#16510) @KyleFromNVIDIA
- Improve update-version.sh (#16506) @bdice
- Use tool.scikit-build.cmake.version, set scikit-build-core minimum-version (#16503) @jameslamb
- Pass batch size to JSON reader using environment variable (#16502) @shrshi
- Remove a deprecated multibyte_split API (#16501) @davidwendt
- Add interop example for `arrow::StringViewArray` to `cudf::column` (#16498) @JayjeetAtGithub
- Add keep option to distinct nvbench (#16497) @bdice
- Use more idiomatic cudf APIs in dask_cudf meta generation (#16487) @mroeschke
- Fix typo in dispatch_row_equal. (#16473) @bdice
- Use explicit construction of column subclass instead of `build_column` when type is known (#16470) @mroeschke
- Move exception handler into pylibcudf from cudf (#16468) @lithomas1
- Make `StructColumn.__init__` strict (#16467) @mroeschke
- Make `ListColumn.__init__` strict (#16465) @mroeschke
- Make `Timedelta/DatetimeColumn.__init__` strict (#16464) @mroeschke
- Make `NumericalColumn.__init__` strict (#16457) @mroeschke
- Make `CategoricalColumn.__init__` strict (#16456) @mroeschke
- Disallow cudf.Series to accept column in favor of `._from_column` (#16454) @mroeschke
- Expose `stream` param in transform APIs (#16452) @JayjeetAtGithub
- Add upper bound pin for polars (#16442) @wence-
- Make `(Indexed)Frame.__init__` require data (and index) (#16430) @mroeschke
- Add Java APIs to copy column data to host asynchronously (#16429) @jlowe
- Update docs of the TPC-H derived examples (#16423) @JayjeetAtGithub
- Use RMM adaptor constructors instead of factories. (#16414) @bdice
- Align ewm APIs with pandas 2.x (#16413) @mroeschke
- Remove checking for specific tests in memcheck script (#16412) @davidwendt
- Add stream parameter to reshape APIs (#16410) @davidwendt
- Align groupby APIs with pandas 2.x (#16403) @mroeschke
- Align misc DataFrame and MultiIndex methods with pandas 2.x (#16402) @mroeschke
- update some branch references in GitHub Actions configs (#16397) @jameslamb
- Support reading matching projected and filter cols from Parquet files with otherwise mismatched schemas (#16394) @mhaseeb123
- Merge branch-24.08 into branch-24.10 (#16393) @jameslamb
- Add query 10 to the TPC-H suite (#16392) @JayjeetAtGithub
- Use `make_host_vector` instead of `make_std_vector` to facilitate pinned memory optimizations (#16386) @vuule
- Fix some issues with deprecated / removed cccl facilities (#16377) @miscco
- Align IntervalIndex APIs with pandas 2.x (#16371) @mroeschke
- Align CategoricalIndex APIs with pandas 2.x (#16369) @mroeschke
- Align TimedeltaIndex APIs with pandas 2.x (#16368) @mroeschke
- Align DatetimeIndex APIs with pandas 2.x (#16367) @mroeschke
- fix [tool.setuptools] reference in custreamz config (#16365) @jameslamb
- Align Index APIs with pandas 2.x (#16361) @mroeschke
- Rebuild for & Support NumPy 2 (#16300) @jakirkham
- Add `stream` param to stream compaction APIs (#16295) @JayjeetAtGithub
- Added batch memset to memset data and validity buffers in parquet reader (#16281) @sdrp713
- Deduplicate decimal32/decimal64 to decimal128 conversion function (#16236) @mhaseeb123
- Refactor mixed_semi_join using cuco::static_set (#16230) @srinivasyadav18
- Improve performance of hash_character_ngrams using warp-per-string kernel (#16212) @davidwendt
- Add environment variable to log cudf.pandas fallback calls (#16161) @mroeschke
- Add libcudf example with large strings (#15983) @davidwendt
- JSON tree algorithms refactor I: CSR data structure for column tree (#15979) @shrshi
- Support multiple new-line characters in regex APIs (#15961) @davidwendt
- adding wheel build for libcudf (#15483) @msarahan
- Replace usages of `thrust::optional` with `std::optional` (#15091) @miscco