Releases: rapidsai/cudf
v24.04.01
🚨 Breaking Changes
- Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
- Change exceptions thrown by copying APIs (#15319) @vyasr
- Change strings_column_view::char_size to return int64 (#15197) @davidwendt
- Upgrade to `arrow-14.0.2` (#15108) @galipremsagar
- Add support for `pandas-2.2` in `cudf` (#15100) @galipremsagar
- Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
- Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
- Raise an error on import for unsupported GPUs. (#15053) @bdice
- Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
- Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
- Add `future_stack` to `DataFrame.stack` (#15015) @galipremsagar
- Deprecate groupby fillna (#15000) @mroeschke
- Deprecate replace with categorical columns (#14988) @mroeschke
- Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
- Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
- Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
- Add `pandas-2.x` support in `cudf` (#14916) @galipremsagar
- Use cuco::static_set in the hash-based groupby (#14813) @PointKernel
🐛 Bug Fixes
- Fix an issue with creating a series from scalar when `dtype='category'` (#15476) @galipremsagar
- Update pre-commit-hooks to v0.0.3 (#15355) @KyleFromNVIDIA
- [BUG][JNI] Trigger MemoryBuffer.onClosed after memory is freed (#15351) @abellina
- Fix an issue with multiple short list rowgroups using the Parquet chunked reader. (#15342) @nvdbaranec
- Avoid importing dask-expr if "query-planning" config is `False` (#15340) @rjzamora
- Fix gtests/ERROR_TEST errors when run in Debug (#15317) @davidwendt
- Fix OOB read in `inflate_kernel` (#15309) @vuule
- Work around a cuFile error when running CSV tests with memcheck (#15293) @vuule
- Fix Doxygen upload directory (#15291) @KyleFromNVIDIA
- Fix Doxygen check (#15289) @KyleFromNVIDIA
- Reintroduce PANDAS_GE_220 import (#15287) @wence-
- Fix mean computation for the geometric distribution in the data generator (#15282) @vuule
- Fix Parquet decimal64 stats (#15281) @etseidl
- Make linking of nvtx3-cpp BUILD_LOCAL_INTERFACE (#15271) @KyleFromNVIDIA
- Workaround compute-sanitizer memcheck bug (#15259) @davidwendt
- Cleanup `hostdevice_vector` and add more APIs (#15252) @ttnghia
- Fix number of rows in randomly generated lists columns (#15248) @vuule
- Fix wrong output for `collect_list`/`collect_set` of lists column (#15243) @ttnghia
- Fix testchunkedPackTwoPasses to copy from the bounce buffer (#15220) @abellina
- Fix accessing `.columns` by an external API (#15212) @galipremsagar
- [JNI] Disable testChunkedPackTwoPasses for now (#15210) @abellina
- Update labeler and codeowner configs for CMake files (#15208) @PointKernel
- Avoid dict normalization in `__dask_tokenize__` (#15187) @rjzamora
- Fix memcheck error in distinct inner join (#15164) @PointKernel
- Remove unneeded script parameters in test_cpp_memcheck.sh (#15158) @davidwendt
- Fix `ListColumn.to_pandas()` to retain `list` type (#15155) @galipremsagar
- Avoid factorization in MultiIndex.to_pandas (#15150) @mroeschke
- Fix GroupBy.get_group and GroupBy.indices (#15143) @wence-
- Remove `const` from `range_window_bounds::_extent`. (#15138) @mythrocks
- DataFrame.columns = ... retains RangeIndex & set dtype (#15129) @mroeschke
- Correctly handle output for `GroupBy.apply` when chunk results are reindexed series (#15109) @brandon-b-miller
- Fix Series.groupby.shift with a MultiIndex (#15098) @mroeschke
- Fix reductions when DataFrame has MultiIndex columns (#15097) @mroeschke
- Fix deprecation warnings for deprecated hash() calls (#15095) @davidwendt
- Add support for arrow `large_string` in `cudf` (#15093) @galipremsagar
- Fix `sort_values` pytest failure with pandas-2.x regression (#15092) @galipremsagar
- Resolve path parsing issues in `get_json_object` (#15082) @SurajAralihalli
- Fix bugs in handling of delta encodings (#15075) @etseidl
- Fix `is_device_write_preferred` in `void_sink` and `user_sink_wrapper` (#15064) @vuule
- Eliminate duplicate allocation of nested string columns (#15061) @vuule
- Raise an error on import for unsupported GPUs. (#15053) @bdice
- Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
- Fix `Index.difference` to handle duplicate values when one of the inputs is empty (#15016) @galipremsagar
- Add `future_stack` to `DataFrame.stack` (#15015) @galipremsagar
- Fix handling of values=None in pylibcudf GroupBy.get_groups (#14998) @shwina
- Fix `DataFrame.sort_index` to respect `ignore_index` on all axis (#14995) @galipremsagar
- Raise for pyarrow array that is tz-aware (#14980) @mroeschke
- Direct `SeriesGroupBy.aggregate` to `SeriesGroupBy.agg` (#14971) @rjzamora
- Respect IntervalDtype and CategoricalDtype objects passed by users (#14961) @mroeschke
- unset `CUDF_SPILL` after a pytest (#14958) @galipremsagar
- Fix Null literals to be not parsed as string when mixed types as string is enabled in JSON reader (#14939) @karthikeyann
- Fix chunked reads of Parquet delta encoded pages (#14921) @etseidl
- Fix reading offset for data stream in ORC reader (#14911) @ttnghia
- Enable sanitizer check for a test case testORCReadAndWriteForDecimal128 (#14897) @res-life
- Fix dask token normalization (#14829) @rjzamora
- Fix 24.04 versions (#14825) @raydouglass
- Ensure slow private attrs are maybe proxies (#14380) @mroeschke
📖 Documentation
- Ignore DLManagedTensor in the docs build (#15392) @davidwendt
- Revert "Temporarily disable docs errors. (#15265)" (#15269) @bdice
- Temporarily disable docs errors. (#15265) @bdice
- Update `developer_guide.md` with new guidance on quoted internal includes (#15238) @harrism
- Fix broken link for developer guide (#15025) @sanjana098
- [DOC] Update typo in docs example of structs_column_wrapper (#14949) @karthikeyann
- Update cudf.pandas FAQ. (#14940) @bdice
- Optimize doc builds (#14856) @vyasr
- Add developer guideline to use east const. (#14836) @bdice
- Document how cuDF is pronounced (#14753) @pentschev
- Notes convert to Pandas-compat (#12641) @Touutae-lab
🚀 New Features
- Address inconsistency in single quote normalization in JSON reader (#15324) @shrshi
- Use JNI pinned pool resource with cuIO (#15255) @abellina
- Add DELTA_BYTE_ARRAY encoder for Parquet (#15239) @etseidl
- Migrate filling operations to pylibcudf (#15225) @brandon-b-miller
- [JNI] rmm based pinned pool (#15219) @abellina
- Implement zero-copy host buffer source instead of using an arrow implementation (#15189) @vuule
- Enable creation of columns from scalar (#15181) @vyasr
- Use NVTX from GitHub. (#15178) @bdice
- Implement `segmented_row_bit_count` for computing row sizes by segments of rows (#15169) @ttnghia
- Implement search using pylibcudf (#15166) @vyasr
- Add distinct left join (#15149) @PointKernel
- Add cardinality control for groupby benchs with flat types (#15134) @PointKernel
- Add ability to request Parquet encodings on a per-column basis (#15081) @etseidl
- Automate include grouping order in .clang-format (#15063) @harrism
- Requesting a clean build directory also clears Jitify cache (#15052) @robertmaynard
- API for JSON unquoted whitespace normalization (#15033) @shrshi
- Implement concatenate, lists.explode, merge, sorting, and stream compaction in pylibcudf (#15011) @vyasr
- Implement replace in pylibcudf (#15005) @vyasr
- Add distinct key inner join (#14990) @PointKernel
- Implement rolling in pylibcudf (#14982) @vyasr
- Implement joins in pylibcudf (#14972) @vyasr
- Implement scans and reductions in pylibcudf (#14970) @vyasr
- Rewrite cudf internals using pylibcudf groupby (#14946) @vyasr
- Implement groupby in pylibcudf (#14945) @vyasr
- Support casting of Map type to string in JSON reader (#14936) @karthikeyann
- POC for whitespace removal in input JSON data using FST (#14931) @shrshi
- Support for LZ4 compression in ORC and Parquet (#14906) @vuule
- Remove supports_streams from cuDF custom memory resources. (#14857) @harrism
- Migrate unary operations to pylibcudf (#14850) @vyasr
- Migrate binary operations to pylibcudf (#14821) @vyasr
- Add row index and stripe size options to Python ORC chunked writer (#14785) @vuule
- Support CUDA 12.2 (#14712) @jameslamb
🛠️ Improvements
- Backport: Relax protobuf lower bound to 3.20. (#15506) (#15610) @bdice
- Use `conda env create --yes` instead of `--force` (#15403) @bdice
- Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
- Change exceptions thrown by copying APIs (#15319) @vyasr
- Enable branch testing for `cudf.pandas` (#15316) @galipremsagar
- Replace black with ruff-format (#15312) @mroeschke
- This fixes an NPE when trying to read empty JSON data by adding a new API for missing information (#15307) @revans2
- Address poor performance of Parquet string decoding (#15304) @etseidl
- Update script input name (#15301) @AyodeAwe
- Make test_read_parquet_partitioned_filtered data deterministic (#15296) @mroeschke
- Add timeout for `cudf.pandas` pandas tests (#15284) @galipremsagar
- Add upper bound to prevent usage of NumPy 2 (#15283) @bdice
- Fix cudf::test::to_host return of host_vector (#15263) @davidwendt
- Implement grouped product scan (#15254) @wence-
- Add CUDA 12.4 to supported PTX versions (#15247) @brandon-b-miller
- Implement DataFrame|Series.squeeze (#15244) @mroeschke
- Roll back ipow changes due to register pressure. (#15242) @pmattione-nvidia
- Remove create_chars_child_column utility (#15241) @davidwendt
- Update dlpack to version 0.8 (#15237) @dantegd
- Improve performance in JSON reader when `mixed_types_as_string` option is enabled (#15236) @shrshi
- Remove row conversion code from libcudf (#15234) @ttnghia
- Use variable substitution for RAPIDS version in Doxyfile (#15231) @KyleFromNVIDIA
- Add ListColumns.to_pandas(arrow_type=) (#15228) @mroeSC...
v24.04.00
🚨 Breaking Changes
- Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
- Change exceptions thrown by copying APIs (#15319) @vyasr
- Change strings_column_view::char_size to return int64 (#15197) @davidwendt
- Upgrade to `arrow-14.0.2` (#15108) @galipremsagar
- Add support for `pandas-2.2` in `cudf` (#15100) @galipremsagar
- Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
- Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
- Raise an error on import for unsupported GPUs. (#15053) @bdice
- Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
- Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
- Add `future_stack` to `DataFrame.stack` (#15015) @galipremsagar
- Deprecate groupby fillna (#15000) @mroeschke
- Deprecate replace with categorical columns (#14988) @mroeschke
- Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
- Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
- Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
- Add `pandas-2.x` support in `cudf` (#14916) @galipremsagar
- Use cuco::static_set in the hash-based groupby (#14813) @PointKernel
🐛 Bug Fixes
- Fix an issue with creating a series from scalar when `dtype='category'` (#15476) @galipremsagar
- Update pre-commit-hooks to v0.0.3 (#15355) @KyleFromNVIDIA
- [BUG][JNI] Trigger MemoryBuffer.onClosed after memory is freed (#15351) @abellina
- Fix an issue with multiple short list rowgroups using the Parquet chunked reader. (#15342) @nvdbaranec
- Avoid importing dask-expr if "query-planning" config is `False` (#15340) @rjzamora
- Fix gtests/ERROR_TEST errors when run in Debug (#15317) @davidwendt
- Fix OOB read in `inflate_kernel` (#15309) @vuule
- Work around a cuFile error when running CSV tests with memcheck (#15293) @vuule
- Fix Doxygen upload directory (#15291) @KyleFromNVIDIA
- Fix Doxygen check (#15289) @KyleFromNVIDIA
- Reintroduce PANDAS_GE_220 import (#15287) @wence-
- Fix mean computation for the geometric distribution in the data generator (#15282) @vuule
- Fix Parquet decimal64 stats (#15281) @etseidl
- Make linking of nvtx3-cpp BUILD_LOCAL_INTERFACE (#15271) @KyleFromNVIDIA
- Workaround compute-sanitizer memcheck bug (#15259) @davidwendt
- Cleanup `hostdevice_vector` and add more APIs (#15252) @ttnghia
- Fix number of rows in randomly generated lists columns (#15248) @vuule
- Fix wrong output for `collect_list`/`collect_set` of lists column (#15243) @ttnghia
- Fix testchunkedPackTwoPasses to copy from the bounce buffer (#15220) @abellina
- Fix accessing `.columns` by an external API (#15212) @galipremsagar
- [JNI] Disable testChunkedPackTwoPasses for now (#15210) @abellina
- Update labeler and codeowner configs for CMake files (#15208) @PointKernel
- Avoid dict normalization in `__dask_tokenize__` (#15187) @rjzamora
- Fix memcheck error in distinct inner join (#15164) @PointKernel
- Remove unneeded script parameters in test_cpp_memcheck.sh (#15158) @davidwendt
- Fix `ListColumn.to_pandas()` to retain `list` type (#15155) @galipremsagar
- Avoid factorization in MultiIndex.to_pandas (#15150) @mroeschke
- Fix GroupBy.get_group and GroupBy.indices (#15143) @wence-
- Remove `const` from `range_window_bounds::_extent`. (#15138) @mythrocks
- DataFrame.columns = ... retains RangeIndex & set dtype (#15129) @mroeschke
- Correctly handle output for `GroupBy.apply` when chunk results are reindexed series (#15109) @brandon-b-miller
- Fix Series.groupby.shift with a MultiIndex (#15098) @mroeschke
- Fix reductions when DataFrame has MultiIndex columns (#15097) @mroeschke
- Fix deprecation warnings for deprecated hash() calls (#15095) @davidwendt
- Add support for arrow `large_string` in `cudf` (#15093) @galipremsagar
- Fix `sort_values` pytest failure with pandas-2.x regression (#15092) @galipremsagar
- Resolve path parsing issues in `get_json_object` (#15082) @SurajAralihalli
- Fix bugs in handling of delta encodings (#15075) @etseidl
- Fix `is_device_write_preferred` in `void_sink` and `user_sink_wrapper` (#15064) @vuule
- Eliminate duplicate allocation of nested string columns (#15061) @vuule
- Raise an error on import for unsupported GPUs. (#15053) @bdice
- Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
- Fix `Index.difference` to handle duplicate values when one of the inputs is empty (#15016) @galipremsagar
- Add `future_stack` to `DataFrame.stack` (#15015) @galipremsagar
- Fix handling of values=None in pylibcudf GroupBy.get_groups (#14998) @shwina
- Fix `DataFrame.sort_index` to respect `ignore_index` on all axis (#14995) @galipremsagar
- Raise for pyarrow array that is tz-aware (#14980) @mroeschke
- Direct `SeriesGroupBy.aggregate` to `SeriesGroupBy.agg` (#14971) @rjzamora
- Respect IntervalDtype and CategoricalDtype objects passed by users (#14961) @mroeschke
- unset `CUDF_SPILL` after a pytest (#14958) @galipremsagar
- Fix Null literals to be not parsed as string when mixed types as string is enabled in JSON reader (#14939) @karthikeyann
- Fix chunked reads of Parquet delta encoded pages (#14921) @etseidl
- Fix reading offset for data stream in ORC reader (#14911) @ttnghia
- Enable sanitizer check for a test case testORCReadAndWriteForDecimal128 (#14897) @res-life
- Fix dask token normalization (#14829) @rjzamora
- Fix 24.04 versions (#14825) @raydouglass
- Ensure slow private attrs are maybe proxies (#14380) @mroeschke
📖 Documentation
- Ignore DLManagedTensor in the docs build (#15392) @davidwendt
- Revert "Temporarily disable docs errors. (#15265)" (#15269) @bdice
- Temporarily disable docs errors. (#15265) @bdice
- Update `developer_guide.md` with new guidance on quoted internal includes (#15238) @harrism
- Fix broken link for developer guide (#15025) @sanjana098
- [DOC] Update typo in docs example of structs_column_wrapper (#14949) @karthikeyann
- Update cudf.pandas FAQ. (#14940) @bdice
- Optimize doc builds (#14856) @vyasr
- Add developer guideline to use east const. (#14836) @bdice
- Document how cuDF is pronounced (#14753) @pentschev
- Notes convert to Pandas-compat (#12641) @Touutae-lab
🚀 New Features
- Address inconsistency in single quote normalization in JSON reader (#15324) @shrshi
- Use JNI pinned pool resource with cuIO (#15255) @abellina
- Add DELTA_BYTE_ARRAY encoder for Parquet (#15239) @etseidl
- Migrate filling operations to pylibcudf (#15225) @brandon-b-miller
- [JNI] rmm based pinned pool (#15219) @abellina
- Implement zero-copy host buffer source instead of using an arrow implementation (#15189) @vuule
- Enable creation of columns from scalar (#15181) @vyasr
- Use NVTX from GitHub. (#15178) @bdice
- Implement `segmented_row_bit_count` for computing row sizes by segments of rows (#15169) @ttnghia
- Implement search using pylibcudf (#15166) @vyasr
- Add distinct left join (#15149) @PointKernel
- Add cardinality control for groupby benchs with flat types (#15134) @PointKernel
- Add ability to request Parquet encodings on a per-column basis (#15081) @etseidl
- Automate include grouping order in .clang-format (#15063) @harrism
- Requesting a clean build directory also clears Jitify cache (#15052) @robertmaynard
- API for JSON unquoted whitespace normalization (#15033) @shrshi
- Implement concatenate, lists.explode, merge, sorting, and stream compaction in pylibcudf (#15011) @vyasr
- Implement replace in pylibcudf (#15005) @vyasr
- Add distinct key inner join (#14990) @PointKernel
- Implement rolling in pylibcudf (#14982) @vyasr
- Implement joins in pylibcudf (#14972) @vyasr
- Implement scans and reductions in pylibcudf (#14970) @vyasr
- Rewrite cudf internals using pylibcudf groupby (#14946) @vyasr
- Implement groupby in pylibcudf (#14945) @vyasr
- Support casting of Map type to string in JSON reader (#14936) @karthikeyann
- POC for whitespace removal in input JSON data using FST (#14931) @shrshi
- Support for LZ4 compression in ORC and Parquet (#14906) @vuule
- Remove supports_streams from cuDF custom memory resources. (#14857) @harrism
- Migrate unary operations to pylibcudf (#14850) @vyasr
- Migrate binary operations to pylibcudf (#14821) @vyasr
- Add row index and stripe size options to Python ORC chunked writer (#14785) @vuule
- Support CUDA 12.2 (#14712) @jameslamb
🛠️ Improvements
- Use `conda env create --yes` instead of `--force` (#15403) @bdice
- Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
- Change exceptions thrown by copying APIs (#15319) @vyasr
- Enable branch testing for `cudf.pandas` (#15316) @galipremsagar
- Replace black with ruff-format (#15312) @mroeschke
- This fixes an NPE when trying to read empty JSON data by adding a new API for missing information (#15307) @revans2
- Address poor performance of Parquet string decoding (#15304) @etseidl
- Update script input name (#15301) @AyodeAwe
- Make test_read_parquet_partitioned_filtered data deterministic (#15296) @mroeschke
- Add timeout for `cudf.pandas` pandas tests (#15284) @galipremsagar
- Add upper bound to prevent usage of NumPy 2 (#15283) @bdice
- Fix cudf::test::to_host return of host_vector (#15263) @davidwendt
- Implement grouped product scan (#15254) @wence-
- Add CUDA 12.4 to supported PTX versions (#15247) @brandon-b-miller
- Implement DataFrame|Series.squeeze (#15244) @mroeschke
- Roll back ipow changes due to register pressure. (#15242) @pmattione-nvidia
- Remove create_chars_child_column utility (#15241) @davidwendt
- Update dlpack to version 0.8 (#15237) @dantegd
- Improve performance in JSON reader when `mixed_types_as_string` option is enabled (#15236) @shrshi
- Remove row conversion code from libcudf (#15234) @ttnghia
- Use variable substitution for RAPIDS version in Doxyfile (#15231) @KyleFromNVIDIA
- Add ListColumns.to_pandas(arrow_type=) (#15228) @mroeschke
- Treat dask-cudf CI artifacts as pure wheels (#15223) @bdice
- Clean...
v24.02.02
🚨 Breaking Changes
- Remove **kwargs from astype (#14765) @mroeschke
- Remove mimesis as a testing dependency (#14723) @mroeschke
- Update to Dask's `shuffle_method` kwarg (#14708) @pentschev
- Drop Pascal GPU support. (#14630) @bdice
- Update to CCCL 2.2.0. (#14576) @bdice
- Expunge as_frame conversions in Column algorithms (#14491) @wence-
- Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
- Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
- Include writer code and writerVersion in ORC files (#14458) @vuule
- Remove null mask for zero nulls in json readers (#14451) @karthikeyann
- REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
- Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
- Move chars column to parent data buffer in strings column (#14202) @karthikeyann
- Switch to scikit-build-core (#13531) @vyasr
🐛 Bug Fixes
- Bump to nvcomp 3.0.6. (#15128) @bdice
- [HOTFIX] Unpin numba<0.58 (#15031) @raydouglass
- Exclude tests from builds (#14981) @vyasr
- Fix the bounce buffer size in ORC writer (#14947) @vuule
- Revert sum/product aggregation to always produce `int64_t` type (#14907) @SurajAralihalli
- Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
- Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
- Fix index difference to follow the pandas format (#14789) @amiralimi
- Fix shared-workflows repo name (#14784) @raydouglass
- Remove unparseable attributes from all nodes (#14780) @vyasr
- Refactor and add validation to IntervalIndex.__init__ (#14778) @mroeschke
- Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
- Fix calls to deprecated strings factory API (#14771) @davidwendt
- Fix ptx file discovery in editable installs (#14767) @vyasr
- Revise `shuffle` deprecation to align with dask/dask (#14762) @rjzamora
- Enable intermediate proxies to be picklable (#14752) @shwina
- Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
- Fix CMake args (#14746) @vyasr
- Fix logic bug introduced in #14730 (#14742) @wence-
- [Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
- Fix `Groupby.get_group` (#14728) @rjzamora
- Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
- Split cuda versions for notebook testing (#14722) @raydouglass
- Fix to_numeric not preserving Series index and name (#14718) @mroeschke
- Update dask-cudf wheel name (#14713) @raydouglass
- Fix strings::contains matching end of string target (#14711) @davidwendt
- Update to Dask's `shuffle_method` kwarg (#14708) @pentschev
- Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
- Potential fix for performance regression in #14415 (#14706) @etseidl
- Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
- Skip numba test that fails on ARM (#14702) @brandon-b-miller
- Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
- Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
- Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
- Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
- Add BaseOffset as a final proxy type to pass instancechecks for offsets against `BaseOffset` (#14678) @shwina
- Add row conversion code from spark-rapids-jni (#14664) @ttnghia
- Unconditionally export the CCCL path (#14656) @vyasr
- Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
- Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
- Fix invalid memory access in Parquet reader (#14637) @etseidl
- Use column_empty over as_column([]) (#14632) @mroeschke
- Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
- Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
- Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
- Remove `cuda::proclaim_return_type` from nested lambda (#14607) @ttnghia
- Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
- Address potential race conditions in Parquet reader (#14602) @etseidl
- Fix DataFrame.reindex removing column name (#14601) @mroeschke
- Remove unsanitized input test data from copy gtests (#14600) @davidwendt
- Fix race detected in Parquet writer (#14598) @etseidl
- Correct invalid or missing return types (#14587) @robertmaynard
- Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
- Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
- Fix unsanitized nulls produced by `cudf::clamp` APIs (#14580) @davidwendt
- Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
- Fixes a symbol group lookup table issue (#14561) @elstehle
- Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
- REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
- Improve memory footprint of isin by using contains (#14478) @wence-
- Move creation of env.yaml outside the current directory (#14476) @davidwendt
- Enable `pd.Timestamp` objects to be picklable when `cudf.pandas` is active (#14474) @shwina
- Correct dtype of count aggregations on empty dataframes (#14473) @wence-
- Avoid DataFrame conversion in `MultiIndex.from_pandas` (#14470) @mroeschke
- JSON writer: avoid default stream use in `string_scalar` constructors (#14444) @vuule
- Fix default stream use in the CSV reader (#14443) @vuule
- Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
- Defer PTX file load to runtime (#13690) @brandon-b-miller
📖 Documentation
- Disable parallel build (#14796) @vyasr
- Add pylibcudf to the docs (#14791) @vyasr
- Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
- Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
- More doxygen fixes (#14639) @vyasr
- Enable doxygen XML generation and fix issues (#14477) @vyasr
- Some doxygen improvements (#14469) @vyasr
- Remove warning in dask-cudf docs (#14454) @wence-
- Update README links with redirects. (#14378) @bdice
- Add pip install instructions to README (#13677) @shwina
🚀 New Features
- Add ci check for external kernels (#14768) @robertmaynard
- JSON single quote normalization API (#14729) @shrshi
- Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
- Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
- Don't constrain `numba<0.58` (#14616) @brandon-b-miller
- Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
- JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
- JSON quote normalization (#14545) @shrshi
- Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
- Implement more copying APIs in pylibcudf (#14508) @vyasr
- Include writer code and writerVersion in ORC files (#14458) @vuule
- Parquet sub-rowgroup reading. (#14360) @nvdbaranec
- Move chars column to parent data buffer in strings column (#14202) @karthikeyann
- PARQUET-2261 Size Statistics (#14000) @etseidl
- Improve GroupBy JIT error handling (#13854) @brandon-b-miller
- Generate unified Python/C++ docs (#13846) @vyasr
- Expand JIT groupby test suite (#13813) @brandon-b-miller
🛠️ Improvements
- Pin `pytest<8` (#14920) @galipremsagar
- Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
- Clean up `TimedeltaIndex.__init__` constructor (#14775) @mroeschke
- Clean up `DatetimeIndex.__init__` constructor (#14774) @mroeschke
- Some `frame.py` typing, move seldom used methods in `frame.py` (#14766) @mroeschke
- Remove **kwargs from astype (#14765) @mroeschke
- fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
- Add `pynvjitlink` as a dependency (#14763) @brandon-b-miller
- Resolve degenerate performance in `create_structs_data` (#14761) @SurajAralihalli
- Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
- Pin pytest-cases<3.8.2 (#14756) @mroeschke
- Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
- Consolidate cudf object handling in as_column (#14754) @mroeschke
- Reduce execution time of Parquet C++ tests (#14750) @vuule
- Implement to_datetime(..., utc=True) (#14749) @mroeschke
- Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
- Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
- Implement `cudf.MultiIndex.from_arrays` (#14740) @mroeschke
- Remove unused/single use methods (#14739) @mroeschke
- refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
- Remove unneeded methods in Column (#14730) @mroeschke
- Clean up base column methods (#14725) @mroeschke
- Ensure column.fillna signatures are consistent (#14724) @mroeschke
- Remove mimesis as a testing dependency (#14723) @mroeschke
- Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
- Use offsetalator in gather_chars (#14700) @davidwendt
- Use make_strings_children for fill() specialization logic (#14697) @davidwendt
- Change `io::detail::orc` namespace into `io::orc::detail` (#14696) @ttnghia
- Fix call to deprecated factory function (#14695) @davidwendt
- Use as_column instead of arange for range like inputs (#14689) @mroeschke
- Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
- Split parquet test into multiple files (#14663) @etseidl
- Custom error messages for IO with nonexistent files (#14662) @vuule
- Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
- Basic val...
v24.02.01
🚨 Breaking Changes
- Remove **kwargs from astype (#14765) @mroeschke
- Remove mimesis as a testing dependency (#14723) @mroeschke
- Update to Dask's `shuffle_method` kwarg (#14708) @pentschev
- Drop Pascal GPU support. (#14630) @bdice
- Update to CCCL 2.2.0. (#14576) @bdice
- Expunge as_frame conversions in Column algorithms (#14491) @wence-
- Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
- Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
- Include writer code and writerVersion in ORC files (#14458) @vuule
- Remove null mask for zero nulls in json readers (#14451) @karthikeyann
- REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
- Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
- Move chars column to parent data buffer in strings column (#14202) @karthikeyann
- Switch to scikit-build-core (#13531) @vyasr
🐛 Bug Fixes
- [HOTFIX] Unpin numba<0.58 (#15031) @raydouglass
- Exclude tests from builds (#14981) @vyasr
- Fix the bounce buffer size in ORC writer (#14947) @vuule
- Revert sum/product aggregation to always produce `int64_t` type (#14907) @SurajAralihalli
- Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
- Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
- Fix index difference to follow the pandas format (#14789) @amiralimi
- Fix shared-workflows repo name (#14784) @raydouglass
- Remove unparseable attributes from all nodes (#14780) @vyasr
- Refactor and add validation to IntervalIndex.__init__ (#14778) @mroeschke
- Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
- Fix calls to deprecated strings factory API (#14771) @davidwendt
- Fix ptx file discovery in editable installs (#14767) @vyasr
- Revise `shuffle` deprecation to align with dask/dask (#14762) @rjzamora
- Enable intermediate proxies to be picklable (#14752) @shwina
- Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
- Fix CMake args (#14746) @vyasr
- Fix logic bug introduced in #14730 (#14742) @wence-
- [Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
- Fix `Groupby.get_group` (#14728) @rjzamora
- Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
- Split cuda versions for notebook testing (#14722) @raydouglass
- Fix to_numeric not preserving Series index and name (#14718) @mroeschke
- Update dask-cudf wheel name (#14713) @raydouglass
- Fix strings::contains matching end of string target (#14711) @davidwendt
- Update to Dask's `shuffle_method` kwarg (#14708) @pentschev
- Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
- Potential fix for performance regression in #14415 (#14706) @etseidl
- Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
- Skip numba test that fails on ARM (#14702) @brandon-b-miller
- Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
- Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
- Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
- Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
- Add BaseOffset as a final proxy type to pass instancechecks for offsets against `BaseOffset` (#14678) @shwina
- Add row conversion code from spark-rapids-jni (#14664) @ttnghia
- Unconditionally export the CCCL path (#14656) @vyasr
- Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
- Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
- Fix invalid memory access in Parquet reader (#14637) @etseidl
- Use column_empty over as_column([]) (#14632) @mroeschke
- Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
- Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
- Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
- Remove `cuda::proclaim_return_type` from nested lambda (#14607) @ttnghia
- Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
- Address potential race conditions in Parquet reader (#14602) @etseidl
- Fix DataFrame.reindex removing column name (#14601) @mroeschke
- Remove unsanitized input test data from copy gtests (#14600) @davidwendt
- Fix race detected in Parquet writer (#14598) @etseidl
- Correct invalid or missing return types (#14587) @robertmaynard
- Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
- Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
- Fix unsanitized nulls produced by `cudf::clamp` APIs (#14580) @davidwendt
- Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
- Fixes a symbol group lookup table issue (#14561) @elstehle
- Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
- REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
- Improve memory footprint of isin by using contains (#14478) @wence-
- Move creation of env.yaml outside the current directory (#14476) @davidwendt
- Enable `pd.Timestamp` objects to be picklable when `cudf.pandas` is active (#14474) @shwina
- Correct dtype of count aggregations on empty dataframes (#14473) @wence-
- Avoid DataFrame conversion in `MultiIndex.from_pandas` (#14470) @mroeschke
- JSON writer: avoid default stream use in `string_scalar` constructors (#14444) @vuule
- Fix default stream use in the CSV reader (#14443) @vuule
- Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
- Defer PTX file load to runtime (#13690) @brandon-b-miller
📖 Documentation
- Disable parallel build (#14796) @vyasr
- Add pylibcudf to the docs (#14791) @vyasr
- Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
- Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
- More doxygen fixes (#14639) @vyasr
- Enable doxygen XML generation and fix issues (#14477) @vyasr
- Some doxygen improvements (#14469) @vyasr
- Remove warning in dask-cudf docs (#14454) @wence-
- Update README links with redirects. (#14378) @bdice
- Add pip install instructions to README (#13677) @shwina
🚀 New Features
- Add ci check for external kernels (#14768) @robertmaynard
- JSON single quote normalization API (#14729) @shrshi
- Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
- Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
- Don't constrain `numba<0.58` (#14616) @brandon-b-miller
- Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
- JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
- JSON quote normalization (#14545) @shrshi
- Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
- Implement more copying APIs in pylibcudf (#14508) @vyasr
- Include writer code and writerVersion in ORC files (#14458) @vuule
- Parquet sub-rowgroup reading. (#14360) @nvdbaranec
- Move chars column to parent data buffer in strings column (#14202) @karthikeyann
- PARQUET-2261 Size Statistics (#14000) @etseidl
- Improve GroupBy JIT error handling (#13854) @brandon-b-miller
- Generate unified Python/C++ docs (#13846) @vyasr
- Expand JIT groupby test suite (#13813) @brandon-b-miller
🛠️ Improvements
- Pin `pytest<8` (#14920) @galipremsagar
- Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
- Clean up `TimedeltaIndex.__init__` constructor (#14775) @mroeschke
- Clean up `DatetimeIndex.__init__` constructor (#14774) @mroeschke
- Some `frame.py` typing, move seldom used methods in `frame.py` (#14766) @mroeschke
- Remove **kwargs from astype (#14765) @mroeschke
- fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
- Add `pynvjitlink` as a dependency (#14763) @brandon-b-miller
- Resolve degenerate performance in `create_structs_data` (#14761) @SurajAralihalli
- Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
- Pin pytest-cases<3.8.2 (#14756) @mroeschke
- Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
- Consolidate cudf object handling in as_column (#14754) @mroeschke
- Reduce execution time of Parquet C++ tests (#14750) @vuule
- Implement to_datetime(..., utc=True) (#14749) @mroeschke
- Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
- Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
- Implement `cudf.MultiIndex.from_arrays` (#14740) @mroeschke
- Remove unused/single use methods (#14739) @mroeschke
- refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
- Remove unneeded methods in Column (#14730) @mroeschke
- Clean up base column methods (#14725) @mroeschke
- Ensure column.fillna signatures are consistent (#14724) @mroeschke
- Remove mimesis as a testing dependency (#14723) @mroeschke
- Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
- Use offsetalator in gather_chars (#14700) @davidwendt
- Use make_strings_children for fill() specialization logic (#14697) @davidwendt
- Change `io::detail::orc` namespace into `io::orc::detail` (#14696) @ttnghia
- Fix call to deprecated factory function (#14695) @davidwendt
- Use as_column instead of arange for range like inputs (#14689) @mroeschke
- Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
- Split parquet test into multiple files (#14663) @etseidl
- Custom error messages for IO with nonexistent files (#14662) @vuule
- Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
- Basic validation in reader benchmarks (#14647) @v...
v24.02.00
🚨 Breaking Changes
- Remove **kwargs from astype (#14765) @mroeschke
- Remove mimesis as a testing dependency (#14723) @mroeschke
- Update to Dask's `shuffle_method` kwarg (#14708) @pentschev
- Drop Pascal GPU support. (#14630) @bdice
- Update to CCCL 2.2.0. (#14576) @bdice
- Expunge as_frame conversions in Column algorithms (#14491) @wence-
- Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
- Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
- Include writer code and writerVersion in ORC files (#14458) @vuule
- Remove null mask for zero nulls in json readers (#14451) @karthikeyann
- REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
- Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
- Move chars column to parent data buffer in strings column (#14202) @karthikeyann
- Switch to scikit-build-core (#13531) @vyasr
🐛 Bug Fixes
- Exclude tests from builds (#14981) @vyasr
- Fix the bounce buffer size in ORC writer (#14947) @vuule
- Revert sum/product aggregation to always produce `int64_t` type (#14907) @SurajAralihalli
- Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
- Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
- Fix index difference to follow the pandas format (#14789) @amiralimi
- Fix shared-workflows repo name (#14784) @raydouglass
- Remove unparseable attributes from all nodes (#14780) @vyasr
- Refactor and add validation to IntervalIndex.init (#14778) @mroeschke
- Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
- Fix calls to deprecated strings factory API (#14771) @davidwendt
- Fix ptx file discovery in editable installs (#14767) @vyasr
- Revise `shuffle` deprecation to align with dask/dask (#14762) @rjzamora
- Enable intermediate proxies to be picklable (#14752) @shwina
- Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
- Fix CMake args (#14746) @vyasr
- Fix logic bug introduced in #14730 (#14742) @wence-
- [Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
- Fix `Groupby.get_group` (#14728) @rjzamora
- Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
- Split cuda versions for notebook testing (#14722) @raydouglass
- Fix to_numeric not preserving Series index and name (#14718) @mroeschke
- Update dask-cudf wheel name (#14713) @raydouglass
- Fix strings::contains matching end of string target (#14711) @davidwendt
- Update to Dask's `shuffle_method` kwarg (#14708) @pentschev
- Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
- Potential fix for performance regression in #14415 (#14706) @etseidl
- Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
- Skip numba test that fails on ARM (#14702) @brandon-b-miller
- Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
- Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
- Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
- Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
- Add BaseOffset as a final proxy type to pass instancechecks for offsets against `BaseOffset` (#14678) @shwina
- Add row conversion code from spark-rapids-jni (#14664) @ttnghia
- Unconditionally export the CCCL path (#14656) @vyasr
- Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
- Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
- Fix invalid memory access in Parquet reader (#14637) @etseidl
- Use column_empty over as_column([]) (#14632) @mroeschke
- Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
- Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
- Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
- Remove `cuda::proclaim_return_type` from nested lambda (#14607) @ttnghia
- Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
- Address potential race conditions in Parquet reader (#14602) @etseidl
- Fix DataFrame.reindex removing column name (#14601) @mroeschke
- Remove unsanitized input test data from copy gtests (#14600) @davidwendt
- Fix race detected in Parquet writer (#14598) @etseidl
- Correct invalid or missing return types (#14587) @robertmaynard
- Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
- Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
- Fix unsanitized nulls produced by `cudf::clamp` APIs (#14580) @davidwendt
- Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
- Fixes a symbol group lookup table issue (#14561) @elstehle
- Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
- REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
- Improve memory footprint of isin by using contains (#14478) @wence-
- Move creation of env.yaml outside the current directory (#14476) @davidwendt
- Enable `pd.Timestamp` objects to be picklable when `cudf.pandas` is active (#14474) @shwina
- Correct dtype of count aggregations on empty dataframes (#14473) @wence-
- Avoid DataFrame conversion in `MultiIndex.from_pandas` (#14470) @mroeschke
- JSON writer: avoid default stream use in `string_scalar` constructors (#14444) @vuule
- Fix default stream use in the CSV reader (#14443) @vuule
- Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
- Defer PTX file load to runtime (#13690) @brandon-b-miller
📖 Documentation
- Disable parallel build (#14796) @vyasr
- Add pylibcudf to the docs (#14791) @vyasr
- Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
- Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
- More doxygen fixes (#14639) @vyasr
- Enable doxygen XML generation and fix issues (#14477) @vyasr
- Some doxygen improvements (#14469) @vyasr
- Remove warning in dask-cudf docs (#14454) @wence-
- Update README links with redirects. (#14378) @bdice
- Add pip install instructions to README (#13677) @shwina
🚀 New Features
- Add ci check for external kernels (#14768) @robertmaynard
- JSON single quote normalization API (#14729) @shrshi
- Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
- Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
- Don't constrain `numba<0.58` (#14616) @brandon-b-miller
- Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
- JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
- JSON quote normalization (#14545) @shrshi
- Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
- Implement more copying APIs in pylibcudf (#14508) @vyasr
- Include writer code and writerVersion in ORC files (#14458) @vuule
- Parquet sub-rowgroup reading. (#14360) @nvdbaranec
- Move chars column to parent data buffer in strings column (#14202) @karthikeyann
- PARQUET-2261 Size Statistics (#14000) @etseidl
- Improve GroupBy JIT error handling (#13854) @brandon-b-miller
- Generate unified Python/C++ docs (#13846) @vyasr
- Expand JIT groupby test suite (#13813) @brandon-b-miller
🛠️ Improvements
- Pin `pytest<8` (#14920) @galipremsagar
- Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
- Clean up `TimedeltaIndex.__init__` constructor (#14775) @mroeschke
- Clean up `DatetimeIndex.__init__` constructor (#14774) @mroeschke
- Some `frame.py` typing, move seldom used methods in `frame.py` (#14766) @mroeschke
- Remove **kwargs from astype (#14765) @mroeschke
- fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
- Add `pynvjitlink` as a dependency (#14763) @brandon-b-miller
- Resolve degenerate performance in `create_structs_data` (#14761) @SurajAralihalli
- Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
- Pin pytest-cases<3.8.2 (#14756) @mroeschke
- Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
- Consolidate cudf object handling in as_column (#14754) @mroeschke
- Reduce execution time of Parquet C++ tests (#14750) @vuule
- Implement to_datetime(..., utc=True) (#14749) @mroeschke
- Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
- Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
- Implement `cudf.MultiIndex.from_arrays` (#14740) @mroeschke
- Remove unused/single use methods (#14739) @mroeschke
- refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
- Remove unneeded methods in Column (#14730) @mroeschke
- Clean up base column methods (#14725) @mroeschke
- Ensure column.fillna signatures are consistent (#14724) @mroeschke
- Remove mimesis as a testing dependency (#14723) @mroeschke
- Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
- Use offsetalator in gather_chars (#14700) @davidwendt
- Use make_strings_children for fill() specialization logic (#14697) @davidwendt
- Change `io::detail::orc` namespace into `io::orc::detail` (#14696) @ttnghia
- Fix call to deprecated factory function (#14695) @davidwendt
- Use as_column instead of arange for range like inputs (#14689) @mroeschke
- Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
- Split parquet test into multiple files (#14663) @etseidl
- Custom error messages for IO with nonexistent files (#14662) @vuule
- Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
- Basic validation in reader benchmarks (#14647) @vuule
- Update dependencies.yaml to support CUDA 12.*....
v23.12.01
🚨 Breaking Changes
- Raise error in `reindex` when `index` is not unique (#14400) @galipremsagar
- Expose stream parameter to get_json_object API (#14297) @davidwendt
- Refactor cudf_kafka to use skbuild (#14292) @jdye64
- Expose stream parameter in public strings convert APIs (#14255) @davidwendt
- Upgrade to nvCOMP 3.0.4 (#13815) @vuule
🐛 Bug Fixes
- Fix synchronization issue when writing string columns with dictionary to ORC (#14595) @vuule
- Update actions/labeler to v4 (#14562) @raydouglass
- Fix data corruption when skipping rows (#14557) @etseidl
- Fix function name typo in `cudf.pandas` profiler (#14514) @galipremsagar
- Fix intermediate type checking in expression parsing (#14445) @vyasr
- Forward merge `branch-23.10` into `branch-23.12` (#14435) @raydouglass
- Remove needs: wheel-build-cudf. (#14427) @bdice
- Fix dask dependency in custreamz (#14420) @vyasr
- Ensure nvbench initializes nvml context when built statically (#14411) @robertmaynard
- Support java AST String literal with desired encoding (#14402) @winningsix
- Raise error in `reindex` when `index` is not unique (#14400) @galipremsagar
- Always build nvbench statically so we don't need to package it (#14399) @robertmaynard
- Fix token-count logic in nvtext::tokenize_with_vocabulary (#14393) @davidwendt
- Fix as_column(pd.Timestamp/Timedelta, length=) not respecting length (#14390) @mroeschke
- cudf.pandas: cuDF subpath checking in module `__getattr__` (#14388) @shwina
- Fix and disable encoding for nanosecond statistics in ORC writer (#14367) @vuule
- Add the new manylinux builds to the build job (#14351) @vyasr
- cudf jit parser now supports .pragma instructions with quotes (#14348) @robertmaynard
- Fix overflow check in `cudf::merge` (#14345) @divyegala
- Add cramjam (#14344) @vyasr
- Enable `dask_cudf/io` pytests in CI (#14338) @galipremsagar
- Temporarily avoid the current build of pydata-sphinx-theme (#14332) @vyasr
- Fix host buffer access from device function in the Parquet reader (#14328) @vuule
- Run IO tests for Dask-cuDF (#14327) @rjzamora
- Fix logical type issues in the Parquet writer (#14322) @vuule
- Remove aws-sdk-pinning and revert to arrow 12.0.1 (#14319) @vyasr
- test is_valid before reading column data (#14318) @etseidl
- Fix gtest validity setting for TextTokenizeTest.Vocabulary (#14312) @davidwendt
- Fixes stack context for json lines format that recovers from invalid JSON lines (#14309) @elstehle
- Downgrade to Arrow 12.0.0 for aws-sdk-cpp and fix cudf_kafka builds for new CI containers (#14296) @vyasr
- fixing thread index overflow issue (#14290) @hyperbolic2346
- Fix memset error in nvtext::edit_distance_matrix (#14283) @davidwendt
- Changes JSON reader's recovery option's behaviour to ignore all characters after a valid JSON record (#14279) @elstehle
- Handle empty string correctly in Parquet statistics (#14257) @etseidl
- Fixes behaviour for incomplete lines when `recover_with_nulls` is enabled (#14252) @elstehle
- cudf::detail::pinned_allocator doesn't throw from `deallocate` (#14251) @robertmaynard
- Fix strings replace for adjacent, identical multi-byte UTF-8 character targets (#14235) @davidwendt
- Fix the precision when converting a decimal128 column to an arrow array (#14230) @jihoonson
- Fixing parquet list of struct interpretation (#13715) @hyperbolic2346
📖 Documentation
- Fix io reference in docs. (#14452) @bdice
- Update README (#14374) @shwina
- Example code for blog on new row comparators (#13795) @divyegala
🚀 New Features
- Expose streams in public unary APIs (#14342) @vyasr
- Add python tests for Parquet DELTA_BINARY_PACKED encoder (#14316) @etseidl
- Update rapids-cmake functions to non-deprecated signatures (#14265) @robertmaynard
- Expose streams in public null mask APIs (#14263) @vyasr
- Expose streams in binaryop APIs (#14187) @vyasr
- Add pylibcudf.Scalar that interoperates with Arrow scalars (#14133) @vyasr
- Add decoder for DELTA_BYTE_ARRAY to Parquet reader (#14101) @etseidl
- Add DELTA_BINARY_PACKED encoder for Parquet writer (#14100) @etseidl
- Add BytePairEncoder class to cuDF (#13891) @davidwendt
- Upgrade to nvCOMP 3.0.4 (#13815) @vuule
- Use `pynvjitlink` for CUDA 12+ MVC (#13650) @brandon-b-miller
🛠️ Improvements
- Build concurrency for nightly and merge triggers (#14441) @bdice
- Cleanup remaining usages of dask dependencies (#14407) @galipremsagar
- Update to Arrow 14.0.1. (#14387) @bdice
- Remove Cython libcpp wrappers (#14382) @vyasr
- Forward-merge branch-23.10 to branch-23.12 (#14372) @bdice
- Upgrade to arrow 14 (#14371) @galipremsagar
- Fix a pytest typo in `test_kurt_skew_error` (#14368) @galipremsagar
- Use new rapids-dask-dependency metapackage for managing dask versions (#14364) @vyasr
- Change `nullable()` to `has_nulls()` in `cudf::detail::gather` (#14363) @divyegala
- Split up scan_inclusive.cu to improve its compile time (#14358) @davidwendt
- Implement user_datasource_wrapper is_empty() and is_device_read_preferred(). (#14357) @tpn
- Added streams to CSV reader and writer api (#14340) @shrshi
- Upgrade wheels to use arrow 13 (#14339) @vyasr
- Rework nvtext::byte_pair_encoding API (#14337) @davidwendt
- Improve performance of nvtext::tokenize_with_vocabulary for long strings (#14336) @davidwendt
- Upgrade `arrow` to `13` (#14330) @galipremsagar
- Expose stream parameter in public nvtext replace APIs (#14329) @davidwendt
- Drop `pyorc` dependency and use `pandas`/`pyarrow` instead (#14323) @galipremsagar
- Avoid `pyarrow.fs` import for local storage (#14321) @rjzamora
- Unpin `dask` and `distributed` for `23.12` development (#14320) @galipremsagar
- Expose stream parameter in public nvtext tokenize APIs (#14317) @davidwendt
- Added streams to JSON reader and writer api (#14313) @shrshi
- Minor improvements in `source_info` (#14308) @vuule
- Forward-merge branch-23.10 to branch-23.12 (#14307) @bdice
- Add stream parameter to Set Operations (Public List APIs) (#14305) @SurajAralihalli
- Expose stream parameter to get_json_object API (#14297) @davidwendt
- Sort dictionary data alphabetically in the ORC writer (#14295) @vuule
- Expose stream parameter in public strings filter APIs (#14293) @davidwendt
- Refactor cudf_kafka to use skbuild (#14292) @jdye64
- Update `shared-action-workflows` references (#14289) @AyodeAwe
- Register `partd` encode dispatch in `dask_cudf` (#14287) @rjzamora
- Update versioning strategy (#14285) @vyasr
- Move and rename byte-pair-encoding source files (#14284) @davidwendt
- Expose stream parameter in public strings combine APIs (#14281) @davidwendt
- Expose stream parameter in public strings contains APIs (#14280) @davidwendt
- Add stream parameter to List Sort and Filter APIs (#14272) @SurajAralihalli
- Use branch-23.12 workflows. (#14271) @bdice
- Refactor LogicalType for Parquet (#14264) @etseidl
- Centralize chunked reading code in the parquet reader to reader_impl_chunking.cu (#14262) @nvdbaranec
- Expose stream parameter in public strings replace APIs (#14261) @davidwendt
- Expose stream parameter in public strings APIs (#14260) @davidwendt
- Cleanup of namespaces in parquet code. (#14259) @nvdbaranec
- Make parquet schema index type consistent (#14256) @hyperbolic2346
- Expose stream parameter in public strings convert APIs (#14255) @davidwendt
- Add in java bindings for DataSource (#14254) @revans2
- Reimplement `cudf::merge` for nested types without using comparators (#14250) @divyegala
- Add stream parameter to List Manipulation and Operations APIs (#14248) @SurajAralihalli
- Expose stream parameter in public strings split/partition APIs (#14247) @davidwendt
- Improve `contains_column` by invoking `contains_table` (#14238) @PointKernel
- Detect and report errors in Parquet header parsing (#14237) @etseidl
- Normalizing offsets iterator (#14234) @davidwendt
- Forward merge `23.10` into `23.12` (#14231) @galipremsagar
- Return error if BOOL8 column-type is used with integers-to-hex (#14208) @davidwendt
- Enable indexalator for device code (#14206) @davidwendt
- Marginally reduce memory footprint of joins (#14197) @wence-
- Add nvtx annotations to spilling-based data movement (#14196) @wence-
- Optimize ORC writer for decimal columns (#14190) @vuule
- Remove the use of volatile in ORC (#14175) @vuule
- Add `bytes_per_second` to distinct_count of stream_compaction nvbench. (#14172) @Blonck
- Add `bytes_per_second` to transpose benchmark (#14170) @Blonck
- cuDF: Build CUDA 12.0 ARM conda packages. (#14112) @bdice
- Add `bytes_per_second` to shift benchmark (#13950) @Blonck
- Extract `debug_utilities.hpp/cu` from `column_utilities.hpp/cu` (#13720) @ttnghia
v23.12.00
🚨 Breaking Changes
- Raise error in `reindex` when `index` is not unique (#14400) @galipremsagar
- Expose stream parameter to get_json_object API (#14297) @davidwendt
- Refactor cudf_kafka to use skbuild (#14292) @jdye64
- Expose stream parameter in public strings convert APIs (#14255) @davidwendt
- Upgrade to nvCOMP 3.0.4 (#13815) @vuule
🐛 Bug Fixes
- Update actions/labeler to v4 (#14562) @raydouglass
- Fix data corruption when skipping rows (#14557) @etseidl
- Fix function name typo in `cudf.pandas` profiler (#14514) @galipremsagar
- Fix intermediate type checking in expression parsing (#14445) @vyasr
- Forward merge `branch-23.10` into `branch-23.12` (#14435) @raydouglass
- Remove needs: wheel-build-cudf. (#14427) @bdice
- Fix dask dependency in custreamz (#14420) @vyasr
- Ensure nvbench initializes nvml context when built statically (#14411) @robertmaynard
- Support java AST String literal with desired encoding (#14402) @winningsix
- Raise error in `reindex` when `index` is not unique (#14400) @galipremsagar
- Always build nvbench statically so we don't need to package it (#14399) @robertmaynard
- Fix token-count logic in nvtext::tokenize_with_vocabulary (#14393) @davidwendt
- Fix as_column(pd.Timestamp/Timedelta, length=) not respecting length (#14390) @mroeschke
- cudf.pandas: cuDF subpath checking in module `__getattr__` (#14388) @shwina
- Fix and disable encoding for nanosecond statistics in ORC writer (#14367) @vuule
- Add the new manylinux builds to the build job (#14351) @vyasr
- cudf jit parser now supports .pragma instructions with quotes (#14348) @robertmaynard
- Fix overflow check in `cudf::merge` (#14345) @divyegala
- Add cramjam (#14344) @vyasr
- Enable `dask_cudf/io` pytests in CI (#14338) @galipremsagar
- Temporarily avoid the current build of pydata-sphinx-theme (#14332) @vyasr
- Fix host buffer access from device function in the Parquet reader (#14328) @vuule
- Run IO tests for Dask-cuDF (#14327) @rjzamora
- Fix logical type issues in the Parquet writer (#14322) @vuule
- Remove aws-sdk-pinning and revert to arrow 12.0.1 (#14319) @vyasr
- test is_valid before reading column data (#14318) @etseidl
- Fix gtest validity setting for TextTokenizeTest.Vocabulary (#14312) @davidwendt
- Fixes stack context for json lines format that recovers from invalid JSON lines (#14309) @elstehle
- Downgrade to Arrow 12.0.0 for aws-sdk-cpp and fix cudf_kafka builds for new CI containers (#14296) @vyasr
- fixing thread index overflow issue (#14290) @hyperbolic2346
- Fix memset error in nvtext::edit_distance_matrix (#14283) @davidwendt
- Changes JSON reader's recovery option's behaviour to ignore all characters after a valid JSON record (#14279) @elstehle
- Handle empty string correctly in Parquet statistics (#14257) @etseidl
- Fixes behaviour for incomplete lines when `recover_with_nulls` is enabled (#14252) @elstehle
- cudf::detail::pinned_allocator doesn't throw from `deallocate` (#14251) @robertmaynard
- Fix strings replace for adjacent, identical multi-byte UTF-8 character targets (#14235) @davidwendt
- Fix the precision when converting a decimal128 column to an arrow array (#14230) @jihoonson
- Fixing parquet list of struct interpretation (#13715) @hyperbolic2346
📖 Documentation
- Fix io reference in docs. (#14452) @bdice
- Update README (#14374) @shwina
- Example code for blog on new row comparators (#13795) @divyegala
🚀 New Features
- Expose streams in public unary APIs (#14342) @vyasr
- Add python tests for Parquet DELTA_BINARY_PACKED encoder (#14316) @etseidl
- Update rapids-cmake functions to non-deprecated signatures (#14265) @robertmaynard
- Expose streams in public null mask APIs (#14263) @vyasr
- Expose streams in binaryop APIs (#14187) @vyasr
- Add pylibcudf.Scalar that interoperates with Arrow scalars (#14133) @vyasr
- Add decoder for DELTA_BYTE_ARRAY to Parquet reader (#14101) @etseidl
- Add DELTA_BINARY_PACKED encoder for Parquet writer (#14100) @etseidl
- Add BytePairEncoder class to cuDF (#13891) @davidwendt
- Upgrade to nvCOMP 3.0.4 (#13815) @vuule
- Use `pynvjitlink` for CUDA 12+ MVC (#13650) @brandon-b-miller
🛠️ Improvements
- Build concurrency for nightly and merge triggers (#14441) @bdice
- Cleanup remaining usages of dask dependencies (#14407) @galipremsagar
- Update to Arrow 14.0.1. (#14387) @bdice
- Remove Cython libcpp wrappers (#14382) @vyasr
- Forward-merge branch-23.10 to branch-23.12 (#14372) @bdice
- Upgrade to arrow 14 (#14371) @galipremsagar
- Fix a pytest typo in `test_kurt_skew_error` (#14368) @galipremsagar
- Use new rapids-dask-dependency metapackage for managing dask versions (#14364) @vyasr
- Change `nullable()` to `has_nulls()` in `cudf::detail::gather` (#14363) @divyegala
- Split up scan_inclusive.cu to improve its compile time (#14358) @davidwendt
- Implement user_datasource_wrapper is_empty() and is_device_read_preferred(). (#14357) @tpn
- Added streams to CSV reader and writer api (#14340) @shrshi
- Upgrade wheels to use arrow 13 (#14339) @vyasr
- Rework nvtext::byte_pair_encoding API (#14337) @davidwendt
- Improve performance of nvtext::tokenize_with_vocabulary for long strings (#14336) @davidwendt
- Upgrade `arrow` to `13` (#14330) @galipremsagar
- Expose stream parameter in public nvtext replace APIs (#14329) @davidwendt
- Drop `pyorc` dependency and use `pandas`/`pyarrow` instead (#14323) @galipremsagar
- Avoid `pyarrow.fs` import for local storage (#14321) @rjzamora
- Unpin `dask` and `distributed` for `23.12` development (#14320) @galipremsagar
- Expose stream parameter in public nvtext tokenize APIs (#14317) @davidwendt
- Added streams to JSON reader and writer api (#14313) @shrshi
- Minor improvements in `source_info` (#14308) @vuule
- Forward-merge branch-23.10 to branch-23.12 (#14307) @bdice
- Add stream parameter to Set Operations (Public List APIs) (#14305) @SurajAralihalli
- Expose stream parameter to get_json_object API (#14297) @davidwendt
- Sort dictionary data alphabetically in the ORC writer (#14295) @vuule
- Expose stream parameter in public strings filter APIs (#14293) @davidwendt
- Refactor cudf_kafka to use skbuild (#14292) @jdye64
- Update `shared-action-workflows` references (#14289) @AyodeAwe
- Register `partd` encode dispatch in `dask_cudf` (#14287) @rjzamora
- Update versioning strategy (#14285) @vyasr
- Move and rename byte-pair-encoding source files (#14284) @davidwendt
- Expose stream parameter in public strings combine APIs (#14281) @davidwendt
- Expose stream parameter in public strings contains APIs (#14280) @davidwendt
- Add stream parameter to List Sort and Filter APIs (#14272) @SurajAralihalli
- Use branch-23.12 workflows. (#14271) @bdice
- Refactor LogicalType for Parquet (#14264) @etseidl
- Centralize chunked reading code in the parquet reader to reader_impl_chunking.cu (#14262) @nvdbaranec
- Expose stream parameter in public strings replace APIs (#14261) @davidwendt
- Expose stream parameter in public strings APIs (#14260) @davidwendt
- Cleanup of namespaces in parquet code. (#14259) @nvdbaranec
- Make parquet schema index type consistent (#14256) @hyperbolic2346
- Expose stream parameter in public strings convert APIs (#14255) @davidwendt
- Add in java bindings for DataSource (#14254) @revans2
- Reimplement `cudf::merge` for nested types without using comparators (#14250) @divyegala
- Add stream parameter to List Manipulation and Operations APIs (#14248) @SurajAralihalli
- Expose stream parameter in public strings split/partition APIs (#14247) @davidwendt
- Improve `contains_column` by invoking `contains_table` (#14238) @PointKernel
- Detect and report errors in Parquet header parsing (#14237) @etseidl
- Normalizing offsets iterator (#14234) @davidwendt
- Forward merge `23.10` into `23.12` (#14231) @galipremsagar
- Return error if BOOL8 column-type is used with integers-to-hex (#14208) @davidwendt
- Enable indexalator for device code (#14206) @davidwendt
- Marginally reduce memory footprint of joins (#14197) @wence-
- Add nvtx annotations to spilling-based data movement (#14196) @wence-
- Optimize ORC writer for decimal columns (#14190) @vuule
- Remove the use of volatile in ORC (#14175) @vuule
- Add `bytes_per_second` to distinct_count of stream_compaction nvbench. (#14172) @Blonck
- Add `bytes_per_second` to transpose benchmark (#14170) @Blonck
- cuDF: Build CUDA 12.0 ARM conda packages. (#14112) @bdice
- Add `bytes_per_second` to shift benchmark (#13950) @Blonck
- Extract `debug_utilities.hpp/cu` from `column_utilities.hpp/cu` (#13720) @ttnghia
v23.10.02
🚨 Breaking Changes
- Raise error in `reindex` when `index` is not unique (#14429) @galipremsagar
- Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
- Raise `MixedTypeError` when a column of mixed-dtype is being constructed (#14050) @galipremsagar
- Raise `NotImplementedError` for `MultiIndex.to_series` (#14049) @galipremsagar
- Create table_input_metadata from a table_metadata (#13920) @etseidl
- Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
- Change `NA` to `NaT` for `datetime` and `timedelta` types (#13868) @galipremsagar
- Fix `any`, `all` reduction behavior for `axis=None` and warn for other reductions (#13831) @galipremsagar
- Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
- Remove the libcudf cudf::offset_type type (#13788) @davidwendt
- Raise error when trying to join `datetime` and `timedelta` types with other types (#13786) @galipremsagar
- Update to Cython 3.0.0 (#13777) @vyasr
- Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
- Enforce deprecations in `23.10` (#13732) @galipremsagar
- Upgrade to arrow 12 (#13728) @galipremsagar
- Remove Arrow dependency from the `datasource.hpp` public header (#13698) @vuule
🐛 Bug Fixes
- Raise error in `reindex` when `index` is not unique (#14429) @galipremsagar
- Fix inaccurate ceil/floor and inaccurate rescaling casts of fixed-point values. (#14242) @bdice
- Fix inaccuracy in decimal128 rounding. (#14233) @bdice
- Workaround for illegal instruction error in sm90 for warp instrinsics with mask (#14201) @karthikeyann
- Fix pytorch related pytest (#14198) @galipremsagar
- Pin to `aws-sdk-cpp<1.11` (#14173) @pentschev
- Fix assert failure for range window functions (#14168) @mythrocks
- Fix Memcheck error found in JSON_TEST JsonReaderTest.ErrorStrings (#14164) @karthikeyann
- Fix calls to copy_bitmask to pass stream parameter (#14158) @davidwendt
- Fix DataFrame from Series with different CategoricalIndexes (#14157) @mroeschke
- Pin to numpy<1.25 and numba<0.58 to avoid errors and deprecation warnings-as-errors. (#14156) @bdice
- Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
- Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
- Fix DataFrame.values with no columns but index (#14134) @mroeschke
- Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
- Add support for nested dict in `DataFrame` constructor (#14119) @galipremsagar
- Restrict iterables of `DataFrame`'s as input to `DataFrame` constructor (#14118) @galipremsagar
- Allow `numeric_only=True` for reduction operations on numeric types (#14111) @galipremsagar
- Preserve name of the column while initializing a `DataFrame` (#14110) @galipremsagar
- Correct numerous 20054-D: dynamic initialization errors found on arm+12.2 (#14108) @robertmaynard
- Drop `kwargs` from `Series.count` (#14106) @galipremsagar
- Fix naming issues with `Index.to_frame` and `MultiIndex.to_frame` APIs (#14105) @galipremsagar
- Only use memory resources that haven't been freed (#14103) @robertmaynard
- Add support for `__round__` in `Series` and `DataFrame` (#14099) @galipremsagar
- Validate ignore_index type in drop_duplicates (#14098) @mroeschke
- Fix renaming `Series` and `Index` (#14080) @galipremsagar
- Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
- Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
- Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
- Use `conda mambabuild` rather than `mamba mambabuild` (#14067) @wence-
- Raise NotImplementedError in to_datetime with dayfirst without infer_format (#14058) @mroeschke
- Fix various issues in `Index.intersection` (#14054) @galipremsagar
- Fix `Index.difference` to match with pandas (#14053) @galipremsagar
- Fix empty string column construction (#14052) @galipremsagar
- Fix `IntervalIndex.union` to preserve type-metadata (#14051) @galipremsagar
- Raise `MixedTypeError` when a column of mixed-dtype is being constructed (#14050) @galipremsagar
- Raise `NotImplementedError` for `MultiIndex.to_series` (#14049) @galipremsagar
- Ignore compile_commands.json (#14048) @harrism
- Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
- Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
- Implement `sort_remaining` for `sort_index` (#14033) @wence-
- Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
- Temporary fix Parquet metadata with empty value string being ignored from writing (#14026) @ttnghia
- Preserve types of scalar being returned when possible in `quantile` (#14014) @galipremsagar
- Fix return type of `MultiIndex.difference` (#14009) @galipremsagar
- Raise an error when timezone subtypes are encountered in `pd.IntervalDtype` (#14006) @galipremsagar
- Fix map column can not be non-nullable for java (#14003) @res-life
- Fix `name` selection in `Index.difference` and `Index.intersection` (#13986) @galipremsagar
- Restore column type metadata with `dropna` to fix `factorize` API (#13980) @galipremsagar
- Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
- Fix `MultiIndex.to_numpy` to return numpy array with tuples (#13966) @galipremsagar
- Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
- Fix an issue with `IntervalIndex.repr` when null values are present (#13958) @galipremsagar
- Fix type metadata issue preservation with `Column.unique` (#13957) @galipremsagar
- Handle `Interval` scalars when passed in list-like inputs to `cudf.Index` (#13956) @galipremsagar
- Fix setting of categories order when `dtype` is passed to a `CategoricalColumn` (#13955) @galipremsagar
- Handle `as_index` in `GroupBy.apply` (#13951) @brandon-b-miller
- Raise error for string types in `nsmallest` and `nlargest` (#13946) @galipremsagar
- Fix `index` of `Groupby.apply` results when it is performed on empty objects (#13944) @galipremsagar
- Fix integer overflow in shim `device_sum` functions (#13943) @brandon-b-miller
- Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
- Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
- Fix construction of `Grouping` objects (#13932) @galipremsagar
- Fix an issue with `loc` when column names is `MultiIndex` (#13929) @galipremsagar
- Fix handling of typecasting in `searchsorted` (#13925) @galipremsagar
- Preserve index `name` in `reindex` (#13917) @galipremsagar
- Use `cudf::thread_index_type` in cuIO to prevent overflow in row indexing (#13910) @vuule
- Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
- Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
- Use cudf::thread_index_type in replace.cu. (#13905) @bdice
- Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
- Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
- Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
- Use `thread_index_type` to avoid index overflow in grid-stride loops (#13895) @PointKernel
- Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
- Raise error when trying to construct a `DataFrame` with mixed types (#13889) @galipremsagar
- Return `nan` when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
- Correctly detect the BOM mark in `read_csv` with compressed input (#13881) @vuule
- Check for the presence of all values in `MultiIndex.isin` (#13879) @galipremsagar
- Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
- Fix return type of `MultiIndex.levels` (#13870) @galipremsagar
- Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
- Disable construction of Index when `freq` is set in pandas-compatibility mode (#13857) @galipremsagar
- Fix an issue with fetching `NA` from a `TimedeltaColumn` (#13853) @galipremsagar
- Simplify implementation of interval_range() and fix behaviour for floating `freq` (#13844) @shwina
- Fix binary operations between `Series` and `Index` (#13842) @galipremsagar
- Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
- Fix read out of bounds in string concatenate (#13838) @pentschev
- Raise error for more cases when `timezone-aware` data is passed to `as_column` (#13835) @galipremsagar
- Fix `any`, `all` reduction behavior for `axis=None` and warn for other reductions (#13831) @galipremsagar
- Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
- Fix cuFile I/O factories (#13829) @vuule
- DataFrame with namedtuples uses ._field as column names (#13824) @mroeschke
- Branch 23.10 merge 23.08 (#13822) @vyasr
- Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
- No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
- Raise error when mixed types are being constructed (#13816) @galipremsagar
- Fix unbounded sequence issue in `DataFrame` constructor (#13811) @galipremsagar
- Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
- Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
- Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
- Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
- Update get_arrow to arrows 12 CMake target name of arrow::xsimd (#13790) @robertmaynard
- Raise error when trying to join `datetime` and `timedelta` types with other types (#13786) @galipremsagar
- Fix negative unary operation for boolean type (#13780) @galipremsagar
- Fix contains(`in`) method for `Series` (#13779) @gal...
v23.10.00
🚨 Breaking Changes
- Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
- Raise `MixedTypeError` when a column of mixed-dtype is being constructed (#14050) @galipremsagar
- Raise `NotImplementedError` for `MultiIndex.to_series` (#14049) @galipremsagar
- Create table_input_metadata from a table_metadata (#13920) @etseidl
- Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
- Change `NA` to `NaT` for `datetime` and `timedelta` types (#13868) @galipremsagar
- Fix `any`, `all` reduction behavior for `axis=None` and warn for other reductions (#13831) @galipremsagar
- Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
- Remove the libcudf cudf::offset_type type (#13788) @davidwendt
- Raise error when trying to join `datetime` and `timedelta` types with other types (#13786) @galipremsagar
- Update to Cython 3.0.0 (#13777) @vyasr
- Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
- Enforce deprecations in `23.10` (#13732) @galipremsagar
- Upgrade to arrow 12 (#13728) @galipremsagar
- Remove Arrow dependency from the `datasource.hpp` public header (#13698) @vuule
🐛 Bug Fixes
- Fix inaccurate ceil/floor and inaccurate rescaling casts of fixed-point values. (#14242) @bdice
- Fix inaccuracy in decimal128 rounding. (#14233) @bdice
- Workaround for illegal instruction error in sm90 for warp instrinsics with mask (#14201) @karthikeyann
- Fix pytorch related pytest (#14198) @galipremsagar
- Pin to `aws-sdk-cpp<1.11` (#14173) @pentschev
- Fix assert failure for range window functions (#14168) @mythrocks
- Fix Memcheck error found in JSON_TEST JsonReaderTest.ErrorStrings (#14164) @karthikeyann
- Fix calls to copy_bitmask to pass stream parameter (#14158) @davidwendt
- Fix DataFrame from Series with different CategoricalIndexes (#14157) @mroeschke
- Pin to numpy<1.25 and numba<0.58 to avoid errors and deprecation warnings-as-errors. (#14156) @bdice
- Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
- Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
- Fix DataFrame.values with no columns but index (#14134) @mroeschke
- Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
- Add support for nested dict in `DataFrame` constructor (#14119) @galipremsagar
- Restrict iterables of `DataFrame`'s as input to `DataFrame` constructor (#14118) @galipremsagar
- Allow `numeric_only=True` for reduction operations on numeric types (#14111) @galipremsagar
- Preserve name of the column while initializing a `DataFrame` (#14110) @galipremsagar
- Correct numerous 20054-D: dynamic initialization errors found on arm+12.2 (#14108) @robertmaynard
- Drop `kwargs` from `Series.count` (#14106) @galipremsagar
- Fix naming issues with `Index.to_frame` and `MultiIndex.to_frame` APIs (#14105) @galipremsagar
- Only use memory resources that haven't been freed (#14103) @robertmaynard
- Add support for `__round__` in `Series` and `DataFrame` (#14099) @galipremsagar
- Validate ignore_index type in drop_duplicates (#14098) @mroeschke
- Fix renaming `Series` and `Index` (#14080) @galipremsagar
- Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
- Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
- Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
- Use `conda mambabuild` rather than `mamba mambabuild` (#14067) @wence-
- Raise NotImplementedError in to_datetime with dayfirst without infer_format (#14058) @mroeschke
- Fix various issues in `Index.intersection` (#14054) @galipremsagar
- Fix `Index.difference` to match with pandas (#14053) @galipremsagar
- Fix empty string column construction (#14052) @galipremsagar
- Fix `IntervalIndex.union` to preserve type-metadata (#14051) @galipremsagar
- Raise `MixedTypeError` when a column of mixed-dtype is being constructed (#14050) @galipremsagar
- Raise `NotImplementedError` for `MultiIndex.to_series` (#14049) @galipremsagar
- Ignore compile_commands.json (#14048) @harrism
- Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
- Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
- Implement `sort_remaining` for `sort_index` (#14033) @wence-
- Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
- Temporary fix Parquet metadata with empty value string being ignored from writing (#14026) @ttnghia
- Preserve types of scalar being returned when possible in `quantile` (#14014) @galipremsagar
- Fix return type of `MultiIndex.difference` (#14009) @galipremsagar
- Raise an error when timezone subtypes are encountered in `pd.IntervalDtype` (#14006) @galipremsagar
- Fix map column can not be non-nullable for java (#14003) @res-life
- Fix `name` selection in `Index.difference` and `Index.intersection` (#13986) @galipremsagar
- Restore column type metadata with `dropna` to fix `factorize` API (#13980) @galipremsagar
- Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
- Fix `MultiIndex.to_numpy` to return numpy array with tuples (#13966) @galipremsagar
- Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
- Fix an issue with `IntervalIndex.repr` when null values are present (#13958) @galipremsagar
- Fix type metadata issue preservation with `Column.unique` (#13957) @galipremsagar
- Handle `Interval` scalars when passed in list-like inputs to `cudf.Index` (#13956) @galipremsagar
- Fix setting of categories order when `dtype` is passed to a `CategoricalColumn` (#13955) @galipremsagar
- Handle `as_index` in `GroupBy.apply` (#13951) @brandon-b-miller
- Raise error for string types in `nsmallest` and `nlargest` (#13946) @galipremsagar
- Fix `index` of `Groupby.apply` results when it is performed on empty objects (#13944) @galipremsagar
- Fix integer overflow in shim `device_sum` functions (#13943) @brandon-b-miller
- Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
- Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
- Fix construction of `Grouping` objects (#13932) @galipremsagar
- Fix an issue with `loc` when column names is `MultiIndex` (#13929) @galipremsagar
- Fix handling of typecasting in `searchsorted` (#13925) @galipremsagar
- Preserve index `name` in `reindex` (#13917) @galipremsagar
- Use `cudf::thread_index_type` in cuIO to prevent overflow in row indexing (#13910) @vuule
- Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
- Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
- Use cudf::thread_index_type in replace.cu. (#13905) @bdice
- Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
- Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
- Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
- Use `thread_index_type` to avoid index overflow in grid-stride loops (#13895) @PointKernel
- Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
- Raise error when trying to construct a `DataFrame` with mixed types (#13889) @galipremsagar
- Return `nan` when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
- Correctly detect the BOM mark in `read_csv` with compressed input (#13881) @vuule
- Check for the presence of all values in `MultiIndex.isin` (#13879) @galipremsagar
- Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
- Fix return type of `MultiIndex.levels` (#13870) @galipremsagar
- Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
- Disable construction of Index when `freq` is set in pandas-compatibility mode (#13857) @galipremsagar
- Fix an issue with fetching `NA` from a `TimedeltaColumn` (#13853) @galipremsagar
- Simplify implementation of interval_range() and fix behaviour for floating `freq` (#13844) @shwina
- Fix binary operations between `Series` and `Index` (#13842) @galipremsagar
- Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
- Fix read out of bounds in string concatenate (#13838) @pentschev
- Raise error for more cases when `timezone-aware` data is passed to `as_column` (#13835) @galipremsagar
- Fix `any`, `all` reduction behavior for `axis=None` and warn for other reductions (#13831) @galipremsagar
- Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
- Fix cuFile I/O factories (#13829) @vuule
- DataFrame with namedtuples uses ._field as column names (#13824) @mroeschke
- Branch 23.10 merge 23.08 (#13822) @vyasr
- Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
- No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
- Raise error when mixed types are being constructed (#13816) @galipremsagar
- Fix unbounded sequence issue in `DataFrame` constructor (#13811) @galipremsagar
- Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
- Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
- Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
- Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
- Update get_arrow to arrows 12 CMake target name of arrow::xsimd (#13790) @robertmaynard
- Raise error when trying to join `datetime` and `timedelta` types with other types (#13786) @galipremsagar
- Fix negative unary operation for boolean type (#13780) @galipremsagar
- Fix contains(`in`) method for `Series` (#13779) @galipremsagar
- Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
- Cast only time of day to nanos to avoid an overflow in...
v23.08.00
🚨 Breaking Changes
- Enforce deprecations and add clarifications around existing deprecations (#13710) @galipremsagar
- Separate MurmurHash32 from hash_functions.cuh (#13681) @davidwendt
- Avoid storing metadata in pointers in ORC and Parquet writers (#13648) @vuule
- Expose streams in all public copying APIs (#13629) @vyasr
- Remove deprecated cudf::strings::slice_strings (by delimiter) functions (#13628) @davidwendt
- Remove deprecated cudf.set_allocator. (#13591) @bdice
- Change build.sh to use pip install instead of setup.py (#13507) @vyasr
- Remove unused max_rows_tensor parameter from subword tokenizer (#13463) @davidwendt
- Fix decimal scale reductions in `_get_decimal_type` (#13224) @charlesbluca
🐛 Bug Fixes
- Add CUDA version to cudf_kafka and libcudf-example build strings. (#13769) @bdice
- Fix typo in wheels-test.yaml. (#13763) @bdice
- Don't test strings shorter than the requested ngram size (#13758) @vyasr
- Add CUDA version to custreamz build string. (#13754) @bdice
- Fix writing of ORC files with empty child string columns (#13745) @vuule
- Remove the erroneous "empty level" short-circuit from ORC reader (#13722) @vuule
- Fix character counting when writing sliced tables into ORC (#13721) @vuule
- Parquet uses row group row count if missing from header (#13712) @hyperbolic2346
- Fix reading of RLE encoded boolean data from parquet files with V2 page headers (#13707) @etseidl
- Fix a corner case of list lexicographic comparator (#13701) @ttnghia
- Fix combined filtering and column projection in `dask_cudf.read_parquet` (#13697) @rjzamora
- Revert fetch-rapids changes (#13696) @vyasr
- Data generator - include offsets in the size estimate of list elments (#13688) @vuule
- Add `cuda-nvcc-impl` to `cudf` for `numba` CUDA 12 (#13673) @jakirkham
- Fix combined filtering and column projection in `read_parquet` (#13666) @rjzamora
- Use `thrust::identity` as hash functions for byte pair encoding (#13665) @PointKernel
- Fix loc-getitem ordering when index contains duplicate labels (#13659) @wence-
- [REVIEW] Introduce parity with pandas for `MultiIndex.loc` ordering & fix a bug in `Groupby` with `as_index` (#13657) @galipremsagar
- Fix memcheck error found in nvtext tokenize functions (#13649) @davidwendt
- Fix `has_nonempty_nulls` ignoring column offset (#13647) @ttnghia
- [Java] Avoid double-free corruption in case of an Exception while creating a ColumnView (#13645) @razajafri
- Fix memcheck error in ORC reader call to cudf::io::copy_uncompressed_kernel (#13643) @davidwendt
- Fix CUDA 12 conda environment to remove cubinlinker and ptxcompiler. (#13636) @bdice
- Fix inf/NaN comparisons for FLOAT orderby in window functions (#13635) @mythrocks
- Refactor `Index` search to simplify code and increase correctness (#13625) @wence-
- Fix compile warning for unused variable in split_re.cu (#13621) @davidwendt
- Fix tz_localize for dask_cudf Series (#13610) @shwina
- Fix issue with no decompressed data in ORC reader (#13609) @vuule
- Fix floating point window range extents. (#13606) @mythrocks
- Fix `localize(None)` for timezone-naive columns (#13603) @shwina
- Fixed a memory leak caused by Exception thrown while constructing a ColumnView (#13597) @razajafri
- Handle nullptr return value from bitmask_or in distinct_count (#13590) @wence-
- Bring parity with pandas in Index.join (#13589) @galipremsagar
- Fix cudf.melt when there are more than 255 columns (#13588) @hcho3
- Fix memory issues in cuIO due to removal of memory padding (#13586) @ttnghia
- Fix Parquet multi-file reading (#13584) @etseidl
- Fix memcheck error found in LISTS_TEST (#13579) @davidwendt
- Fix memcheck error found in STRINGS_TEST (#13578) @davidwendt
- Fix memcheck error found in INTEROP_TEST (#13577) @davidwendt
- Fix memcheck errors found in REDUCTION_TEST (#13574) @davidwendt
- Preemptive fix for hive-partitioning change in dask (#13564) @rjzamora
- Fix an issue with `dask_cudf.read_csv` when lines are needed to be skipped (#13555) @galipremsagar
- Fix out-of-bounds memory write in cudf::dictionary::detail::concatenate (#13554) @davidwendt
- Fix the null mask size in json reader (#13537) @karthikeyann
- Fix cudf::strings::strip for all-empty input column (#13533) @davidwendt
- Make sure to build without isolation or installing dependencies (#13524) @vyasr
- Remove preload lib from CMake for now (#13519) @vyasr
- Fix missing separator after null values in JSON writer (#13503) @karthikeyann
- Ensure `single_lane_block_sum_reduce` is safe to call in a loop (#13488) @wence-
- Update all versions in pyproject.toml files. (#13486) @bdice
- Remove applying nvbench that doesn't exist in 23.08 (#13484) @robertmaynard
- Fix chunked Parquet reader benchmark (#13482) @vuule
- Update JNI JSON reader column compatability for Spark (#13477) @revans2
- Fix unsanitized output of scan with strings (#13455) @davidwendt
- Reject functions without bytecode from `_can_be_jitted` in GroupBy Apply (#13429) @brandon-b-miller
- Fix decimal scale reductions in `_get_decimal_type` (#13224) @charlesbluca
📖 Documentation
- Fix doxygen groups for io data sources and sinks (#13718) @davidwendt
- Add pandas compatibility note to DataFrame.query docstring (#13693) @beckernick
- Add pylibcudf to developer guide (#13639) @vyasr
- Fix repeated words in doxygen text (#13598) @karthikeyann
- Update docs for top-level API. (#13592) @bdice
- Fix the the doxygen text for cudf::concatenate and other places (#13561) @davidwendt
- Document stream validation approach used in testing (#13556) @vyasr
- Cleanup doc repetitions in libcudf (#13470) @karthikeyann
🚀 New Features
- Support `min` and `max` aggregations for list type in groupby and reduction (#13676) @ttnghia
- Add nvtext::jaccard_index API for strings columns (#13669) @davidwendt
- Add read_parquet_metadata libcudf API (#13663) @karthikeyann
- Expose streams in all public copying APIs (#13629) @vyasr
- Add XXHash_64 hash function to cudf (#13612) @davidwendt
- Java support: Floating point order-by columns for RANGE window functions (#13595) @mythrocks
- Use `cuco::static_map` to build string dictionaries in ORC writer (#13580) @vuule
- Add pylibcudf subpackage with gather implementation (#13562) @vyasr
- Add JNI for `lists::concatenate_list_elements` (#13547) @ttnghia
- Enable nested types for `lists::concatenate_list_elements` (#13545) @ttnghia
- Add unicode encoding for string columns in JSON writer (#13539) @karthikeyann
- Remove numba kernels from `find_index_of_val` (#13517) @brandon-b-miller
- Floating point order-by columns for RANGE window functions (#13512) @mythrocks
- Parse column chunk metadata statistics in parquet reader (#13472) @karthikeyann
- Add `abs` function to apply (#13408) @brandon-b-miller
- [FEA] AST filtering in parquet reader (#13348) @karthikeyann
- [FEA] Adds option to recover from invalid JSON lines in JSON tokenizer (#13344) @elstehle
- Ensure cccl packages don't clash with upstream version (#13235) @robertmaynard
- Update `struct_minmax_util` to experimental row comparator (#13069) @divyegala
- Add stream parameter to hashing APIs (#12090) @vyasr
🛠️ Improvements
- Pin `dask` and `distributed` for `23.08` release (#13802) @galipremsagar
- Relax protobuf pinnings. (#13770) @bdice
- Switch fully unbounded window functions to use aggregations (#13727) @mythrocks
- Switch to new wheel building pipeline (#13723) @vyasr
- Revert CUDA 12.0 CI workflows to branch-23.08. (#13719) @bdice
- Adding identify minimum version requirement (#13713) @hyperbolic2346
- Enforce deprecations and add clarifications around existing deprecations (#13710) @galipremsagar
- Optimize ORC reader performance for list data (#13708) @vyasr
- fix limit overflow message in a docstring (#13703) @ahmet-uyar
- Alleviates JSON parser's need for multi-file sources to end with a newline (#13702) @elstehle
- Update cython-lint and replace flake8 with ruff (#13699) @vyasr
- Add `__dask_tokenize__` definitions to cudf classes (#13695) @rjzamora
- Convert libcudf hashing benchmarks to nvbench (#13694) @davidwendt
- Separate MurmurHash32 from hash_functions.cuh (#13681) @davidwendt
- Improve performance of cudf::strings::split on whitespace (#13680) @davidwendt
- Allow ORC and Parquet writers to write nullable columns without nulls as non-nullable (#13675) @vuule
- Raise a NotImplementedError in to_datetime when utc is passed (#13670) @shwina
- Add rmm_mode parameter to nvbench base fixture (#13668) @davidwendt
- Fix multiindex loc ordering in pandas-compat mode (#13660) @wence-
- Add nvtext hash_character_ngrams function (#13654) @davidwendt
- Avoid storing metadata in pointers in ORC and Parquet writers (#13648) @vuule
- Acquire spill lock in to/from_arrow (#13646) @shwina
- Expose stable versions of libcudf sort routines (#13634) @wence-
- Separate out hash_test.cpp source for each hash API (#13633) @davidwendt
- Remove deprecated cudf::strings::slice_strings (by delimiter) functions (#13628) @davidwendt
- Create separate libcudf hash APIs for each supported hash function (#13626) @davidwendt
- Add convert_dtypes API (#13623) @shwina
- Clean up cupy in dependencies.yaml. (#13617) @bdice
- Use cuda-version to constrain cudatoolkit. (#13615) @bdice
- Add murmurhash3_x64_128 function to libcudf (#13604) @davidwendt
- Performance improvement for cudf::strings::like (#13594) @davidwendt
- Remove deprecated cudf.set_allocator. (#13591) @bdice
- Clean up cudf device atomic with `cuda::atomic_ref` (#13583) @PointKernel
- Add java bindings for distinct count (#13573) @revans2
- Use nvcomp conda package. (#13566) @bdice
- Add exception to string_scalar if input string exceeds size_type (#13560) @davidwendt
- Add dispatch for `cudf.Dataframe` to/from `pyarrow.Table` conversion (#13558) @rjzamora
- Get rid of `cuco::pair_type` aliases (#13553) @PointKernel
- Introduce parity with pandas when `sort=False` in `Groupby` (#13551) @galipremsagar
- Update CMake in docker to 3.26.4 (#13550) @NvTimLi...