Releases: rapidsai/cudf
v23.06.01
🚨 Breaking Changes
- Fix batch processing for parquet writer (#13438) @ttnghia
- Use <NA> instead of null to match pandas. (#13415) @bdice
- Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
- Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
- Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
- Remove null mask and null count from column_view constructors (#13311) @vyasr
- Change default value of the `observed=` argument in groupby to `True` to reflect the actual behaviour (#13296) @shwina
- Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
- Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
- Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
- Update minimum Python version to Python 3.9 (#13196) @shwina
- Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
- Cleanup Parquet chunked writer (#13094) @ttnghia
- Cleanup ORC chunked writer (#13091) @ttnghia
- Raise `NotImplementedError` when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
- Remove deprecated regex functions from libcudf (#13067) @davidwendt
- [REVIEW] Upgrade to `arrow-11` (#12757) @galipremsagar
- Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller
🐛 Bug Fixes
- Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
- Fix writing of ORC files with empty rowgroups (#13466) @vuule
- Fix cudf::repeat logic when count is zero (#13459) @davidwendt
- Fix batch processing for parquet writer (#13438) @ttnghia
- Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
- Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
- Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
- Use <NA> instead of null to match pandas. (#13415) @bdice
- Fix tokenize with non-space delimiter (#13403) @shwina
- Fix groupby head/tail for empty dataframe (#13398) @shwina
- Default to closed="right" in `IntervalIndex` constructor (#13394) @shwina
- Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
- Fix unused argument errors in nvcc 11.5 (#13387) @abellina
- Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
- Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
- Fix page size estimation in Parquet writer (#13364) @etseidl
- Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
- Support gcc 12 as the C++ compiler (#13316) @robertmaynard
- Correctly set bitmask size in `from_column_view` (#13315) @wence-
- Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
- Fix parquet schema interpretation issue (#13277) @hyperbolic2346
- Fix 64bit shift bug in avro reader (#13276) @karthikeyann
- Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
- Clean up buffers in case AssertionError (#13262) @razajafri
- Allow empty input table in ast `compute_column` (#13245) @wence-
- Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
- Fix the row index stream order in ORC reader (#13242) @vuule
- Make `is_decompression_disabled` and `is_compression_disabled` thread-safe (#13240) @vuule
- Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
- Fix race in ORC string dictionary creation (#13214) @revans2
- Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
- Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
- Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
- Fix `hostdevice_vector::subspan` (#13187) @ttnghia
- Use custom nvbench entry point to ensure `cudf::nvbench_base_fixture` usage (#13183) @robertmaynard
- Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
- Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
- Fix a few clang-format style check errors (#13146) @davidwendt
- [REVIEW] Fix `Series` and `DataFrame` constructors to validate index lengths (#13122) @galipremsagar
- Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
- Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
- Adds checks to make sure json reader won't overflow (#13115) @elstehle
- Fix `null_count` of columns returned by `chunked_parquet_reader` (#13111) @vuule
- Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
- [REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
- Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
- Raise `NotImplementedError` when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
- Fix column selection `read_parquet` benchmarks (#13082) @vuule
- Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
- Add algorithm include in data_sink.hpp (#13068) @ahendriksen
- Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
- Prevent overflow with `skip_rows` in ORC and Parquet readers (#13063) @vuule
- Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
- [REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
- Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
- Fix read_avro() skip_rows and num_rows. (#12912) @tpn
- Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
- Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina
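As an illustration of the `slice_strings` fix (#13178) listed above: the entry says a slice whose stop index precedes its start index now yields an empty string. The sketch below uses plain Python strings, not cuDF, to show that convention. `slice_or_empty` is a hypothetical helper, and treating the libcudf result as matching Python's slicing for non-negative indices is an assumption, not something the release notes state.

```python
def slice_or_empty(text: str, start: int, stop: int) -> str:
    """Return text[start:stop], or "" when stop < start (hypothetical helper)."""
    if stop < start:
        # Inverted range: nothing can be selected, so the result is empty.
        return ""
    return text[start:stop]


words = ["hello", "world", "cudf"]

# stop (1) < start (3): every row becomes an empty string.
inverted = [slice_or_empty(w, 3, 1) for w in words]

# Ordinary range for comparison.
regular = [slice_or_empty(w, 1, 3) for w in words]

print(inverted)
print(regular)
```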
🚀 New Features
- Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
- Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
- Use _compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
- cuDF numba cuda 12 updates (#13337) @brandon-b-miller
- Add tz_convert method to convert between timestamps (#13328) @shwina
- Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
- Support the case=False argument to str.contains (#13290) @shwina
- Add an event handler for ColumnVector.close (#13279) @abellina
- JNI api for cudf::chunked_pack (#13278) @abellina
- Implement a chunked_pack API (#13260) @abellina
- Update cudf recipes to use GTest version to >=1.13 (#13207) @robertmaynard
- JNI changes for range-extents in window functions. (#13199) @mythrocks
- Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
- Add IS_NULL operator to AST (#13145) @karthikeyann
- STRING order-by column for RANGE window functions (#13143) @mythrocks
- Update `contains_table` to experimental row hasher and equality comparator (#13119) @divyegala
- Automatically select `GroupBy.apply` algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
- Refactor Parquet chunked writer (#13076) @ttnghia
- Add Python bindings for string literal support in AST (#13073) @karthikeyann
- Add Java bindings for string literal support in AST (#13072) @karthikeyann
- Add string scalar support in AST (#13061) @karthikeyann
- Log cuIO warnings using the libcudf logger (#13043) @vuule
- Update `mixed_join` to use experimental row hasher and comparator (#13028) @divyegala
- Support structs of lists in row lexicographic comparator (#13005) @ttnghia
- Adding `hostdevice_span` that is a span createable from `hostdevice_vector` (#12981) @hyperbolic2346
- Add nvtext::minhash function (#12961) @davidwendt
- Support lists of structs in row lexicographic comparator (#12953) @ttnghia
- Update `join` to use experimental row hasher and comparator (#12787) @divyegala
- Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller
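The `tz_localize` (#13163) and `tz_convert` (#13328) entries above name two distinct operations: attaching a time zone to naive timestamps, and re-expressing zone-aware timestamps in another zone. A minimal sketch of that distinction, using only Python's standard library rather than cuDF; the fixed UTC offsets and variable names are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset zones chosen for illustration (no daylight-saving handling).
utc_minus_5 = timezone(timedelta(hours=-5))
utc_plus_1 = timezone(timedelta(hours=1))

naive = datetime(2023, 6, 1, 12, 0)

# "Localize": attach a zone to a naive timestamp; the wall-clock time is kept.
localized = naive.replace(tzinfo=utc_minus_5)

# "Convert": express the same instant in another zone; the wall-clock time changes.
converted = localized.astimezone(utc_plus_1)

print(localized.isoformat())
print(converted.isoformat())
```

Both values refer to the same instant, so they compare equal even though their wall-clock hours differ.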
🛠️ Improvements
- Bump typing_extensions minimum version to 4.0.0 (#13618) @shwina
- Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
- Handle some corner-cases in indexing with boolean masks (#13402) @wence-
- Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
- [JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
- Fix JNI method with mismatched parameter list (#13384) @ttnghia
- Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
- Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
- Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
- Move some nvtext benchmarks to nvbench (#13368) @davidwendt
- run docs nightly too (#13366) @AyodeAwe
- Add warning for default `dtype` parameter in `get_dummies` (#13365) @galipremsagar
- Add log messages about kvikIO compatibility mode (#13363) @vuule
- Switch back to using primary shared-action-workflows branch (#13362) @vyasr
- Deprecate `StringIndex` and use `Index` instead (#13361) @galipremsagar
- Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
- Expunge most uses of `TypeVar(bound="Foo")` (#13346) @wence-
- Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
- Improve `distinct_count` with `cuco::static_set` (#13343) @PointKernel
- Fix `contiguous_split` performance (#13342) @ttnghia
- Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
- Update mypy to 1.3 (#13340) @wence-
- [Java] Purge non-empty nulls when setting validity (#13335) @razajafri
- Add row-wise filtering step to `read_parquet` (#13334) @rjzamora
- Performance improvement for nvtext::minhash (#13333) @davidwendt
- Fix some libcudf functions to set the null count on returning columns (#13331) @davidwendt
- Change cudf::detail::concatenate_masks to return null-count (#13330) @davidwendt
- Move `meta` calculation in `dask_cu...
v23.06.00
🚨 Breaking Changes
- Fix batch processing for parquet writer (#13438) @ttnghia
- Use <NA> instead of null to match pandas. (#13415) @bdice
- Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
- Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
- Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
- Remove null mask and null count from column_view constructors (#13311) @vyasr
- Change default value of the `observed=` argument in groupby to `True` to reflect the actual behaviour (#13296) @shwina
- Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
- Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
- Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
- Update minimum Python version to Python 3.9 (#13196) @shwina
- Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
- Cleanup Parquet chunked writer (#13094) @ttnghia
- Cleanup ORC chunked writer (#13091) @ttnghia
- Raise `NotImplementedError` when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
- Remove deprecated regex functions from libcudf (#13067) @davidwendt
- [REVIEW] Upgrade to `arrow-11` (#12757) @galipremsagar
- Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller
🐛 Bug Fixes
- Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
- Fix writing of ORC files with empty rowgroups (#13466) @vuule
- Fix cudf::repeat logic when count is zero (#13459) @davidwendt
- Fix batch processing for parquet writer (#13438) @ttnghia
- Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
- Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
- Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
- Use <NA> instead of null to match pandas. (#13415) @bdice
- Fix tokenize with non-space delimiter (#13403) @shwina
- Fix groupby head/tail for empty dataframe (#13398) @shwina
- Default to closed="right" in `IntervalIndex` constructor (#13394) @shwina
- Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
- Fix unused argument errors in nvcc 11.5 (#13387) @abellina
- Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
- Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
- Fix page size estimation in Parquet writer (#13364) @etseidl
- Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
- Support gcc 12 as the C++ compiler (#13316) @robertmaynard
- Correctly set bitmask size in `from_column_view` (#13315) @wence-
- Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
- Fix parquet schema interpretation issue (#13277) @hyperbolic2346
- Fix 64bit shift bug in avro reader (#13276) @karthikeyann
- Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
- Clean up buffers in case AssertionError (#13262) @razajafri
- Allow empty input table in ast `compute_column` (#13245) @wence-
- Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
- Fix the row index stream order in ORC reader (#13242) @vuule
- Make `is_decompression_disabled` and `is_compression_disabled` thread-safe (#13240) @vuule
- Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
- Fix race in ORC string dictionary creation (#13214) @revans2
- Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
- Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
- Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
- Fix `hostdevice_vector::subspan` (#13187) @ttnghia
- Use custom nvbench entry point to ensure `cudf::nvbench_base_fixture` usage (#13183) @robertmaynard
- Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
- Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
- Fix a few clang-format style check errors (#13146) @davidwendt
- [REVIEW] Fix `Series` and `DataFrame` constructors to validate index lengths (#13122) @galipremsagar
- Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
- Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
- Adds checks to make sure json reader won't overflow (#13115) @elstehle
- Fix `null_count` of columns returned by `chunked_parquet_reader` (#13111) @vuule
- Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
- [REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
- Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
- Raise `NotImplementedError` when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
- Fix column selection `read_parquet` benchmarks (#13082) @vuule
- Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
- Add algorithm include in data_sink.hpp (#13068) @ahendriksen
- Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
- Prevent overflow with `skip_rows` in ORC and Parquet readers (#13063) @vuule
- Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
- [REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
- Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
- Fix read_avro() skip_rows and num_rows. (#12912) @tpn
- Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
- Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina
🚀 New Features
- Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
- Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
- Use _compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
- cuDF numba cuda 12 updates (#13337) @brandon-b-miller
- Add tz_convert method to convert between timestamps (#13328) @shwina
- Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
- Support the case=False argument to str.contains (#13290) @shwina
- Add an event handler for ColumnVector.close (#13279) @abellina
- JNI api for cudf::chunked_pack (#13278) @abellina
- Implement a chunked_pack API (#13260) @abellina
- Update cudf recipes to use GTest version to >=1.13 (#13207) @robertmaynard
- JNI changes for range-extents in window functions. (#13199) @mythrocks
- Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
- Add IS_NULL operator to AST (#13145) @karthikeyann
- STRING order-by column for RANGE window functions (#13143) @mythrocks
- Update `contains_table` to experimental row hasher and equality comparator (#13119) @divyegala
- Automatically select `GroupBy.apply` algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
- Refactor Parquet chunked writer (#13076) @ttnghia
- Add Python bindings for string literal support in AST (#13073) @karthikeyann
- Add Java bindings for string literal support in AST (#13072) @karthikeyann
- Add string scalar support in AST (#13061) @karthikeyann
- Log cuIO warnings using the libcudf logger (#13043) @vuule
- Update `mixed_join` to use experimental row hasher and comparator (#13028) @divyegala
- Support structs of lists in row lexicographic comparator (#13005) @ttnghia
- Adding `hostdevice_span` that is a span createable from `hostdevice_vector` (#12981) @hyperbolic2346
- Add nvtext::minhash function (#12961) @davidwendt
- Support lists of structs in row lexicographic comparator (#12953) @ttnghia
- Update `join` to use experimental row hasher and comparator (#12787) @divyegala
- Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller
🛠️ Improvements
- Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
- Handle some corner-cases in indexing with boolean masks (#13402) @wence-
- Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
- [JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
- Fix JNI method with mismatched parameter list (#13384) @ttnghia
- Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
- Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
- Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
- Move some nvtext benchmarks to nvbench (#13368) @davidwendt
- run docs nightly too (#13366) @AyodeAwe
- Add warning for default `dtype` parameter in `get_dummies` (#13365) @galipremsagar
- Add log messages about kvikIO compatibility mode (#13363) @vuule
- Switch back to using primary shared-action-workflows branch (#13362) @vyasr
- Deprecate `StringIndex` and use `Index` instead (#13361) @galipremsagar
- Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
- Expunge most uses of `TypeVar(bound="Foo")` (#13346) @wence-
- Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
- Improve `distinct_count` with `cuco::static_set` (#13343) @PointKernel
- Fix `contiguous_split` performance (#13342) @ttnghia
- Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
- Update mypy to 1.3 (#13340) @wence-
- [Java] Purge non-empty nulls when setting validity (#13335) @razajafri
- Add row-wise filtering step to `read_parquet` (#13334) @rjzamora
- Performance improvement for nvtext::minhash (#13333) @davidwendt
- Fix some libcudf functions to set the null count on returning columns (#13331) @davidwendt
- Change cudf::detail::concatenate_masks to return null-count (#13330) @davidwendt
- Move `meta` calculation in `dask_cudf.read_parquet` (#13327) @rjzamora
- Changes to support Numpy >...
v23.04.01
🚨 Breaking Changes
- Pin `dask` and `distributed` for release (#13070) @galipremsagar
- Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
- Update minimum `pandas` and `numpy` pinnings (#12887) @galipremsagar
- Deprecate `names` & `dtype` in `Index.copy` (#12825) @galipremsagar
- Deprecate `Index.is_*` methods (#12820) @galipremsagar
- Deprecate `datetime_is_numeric` from `describe` (#12818) @galipremsagar
- Deprecate `na_sentinel` in `factorize` (#12817) @galipremsagar
- Make string methods return a Series with a useful Index (#12814) @shwina
- Produce useful guidance on overflow error in `to_csv` (#12705) @wence-
- Move `strings_udf` code into cuDF (#12669) @brandon-b-miller
- Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
- Replace message parsing with throwing more specific exceptions (#12426) @vyasr
🐛 Bug Fixes
- Pin curand version (#13127) @vyasr
- Fix memcheck script to execute only _TEST files found in bin/gtests/libcudf (#13006) @davidwendt
- Fix `DataFrame` constructor to broadcast scalar inputs properly (#12997) @galipremsagar
- Drop `force_nullable_schema` from chunked parquet writer (#12996) @galipremsagar
- Fix gtest column utility comparator diff reporting (#12995) @davidwendt
- Handle index names while performing `groupby` (#12992) @galipremsagar
- Fix `__setitem__` on string columns when the scalar value ends in a null byte (#12991) @wence-
- Fix `sort_values` when column is all empty strings (#12988) @eriknw
- Remove unused variable and fix memory issue in ORC writer (#12984) @ttnghia
- Pre-emptive fix for upstream `dask.dataframe.read_parquet` changes (#12983) @rjzamora
- Remove MANIFEST.in use auto-generated one for sdists and package_data for wheels (#12960) @vyasr
- Update to use rapids-export(COMPONENTS) feature. (#12959) @robertmaynard
- cudftestutil supports static gtest dependencies (#12957) @robertmaynard
- Include gtest in build environment. (#12956) @vyasr
- Correctly handle scalar indices in `Index.__getitem__` (#12955) @wence-
- Avoid building cython twice (#12945) @galipremsagar
- Fix set index error for Series rolling window operations (#12942) @galipremsagar
- Fix calculation of null counts for Parquet statistics (#12938) @etseidl
- Preserve integer dtype of hive-partitioned column containing nulls (#12930) @rjzamora
- Use get_current_device_resource for intermediate allocations in COLLECT_LIST window code (#12927) @karthikeyann
- Mark dlpack tensor deleter as noexcept to match PyCapsule_Destructor signature. (#12921) @bdice
- Fix conda recipe post-link.sh typo (#12916) @pentschev
- min_rows and num_rows are swapped in ComputePageSizes declaration in Parquet reader (#12886) @etseidl
- Expect cupy to now support bool arrays for dlpack. (#12883) @vyasr
- Use python -m pytest for nightly wheel tests (#12871) @bdice
- Parquet writer column_size() should return a size_t (#12870) @etseidl
- Fix cudf::hash_partition kernel launch error with decimal128 types (#12863) @davidwendt
- Fix an issue with parquet chunked reader undercounting string lengths. (#12859) @nvdbaranec
- Remove tokenizers pre-install pinning. (#12854) @vyasr
- Fix parquet `RangeIndex` bug (#12838) @rjzamora
- Remove KAFKA_HOST_TEST from compute-sanitizer check (#12831) @davidwendt
- Make string methods return a Series with a useful Index (#12814) @shwina
- Tell cudf_kafka to use header-only fmt (#12796) @vyasr
- Add `GroupBy.dtypes` (#12783) @galipremsagar
- Fix a leak in a test and clarify some test names (#12781) @revans2
- Fix bug in all-null list due to join_list_elements special handling (#12767) @karthikeyann
- Add try/except for expected null-schema error in read_parquet (#12756) @rjzamora
- Throw an exception if an unsupported page encoding is detected in Parquet reader (#12754) @etseidl
- Fix a bug with `num_keys` in `_scatter_by_slice` (#12749) @thomcom
- Bump pinned rapids wheel deps to 23.4 (#12735) @sevagh
- Rework logic in cudf::strings::split_record to improve performance (#12729) @davidwendt
- Add `always_nullable` flag to Dremel encoding (#12727) @divyegala
- Fix memcheck read error in compound segmented reduce (#12722) @davidwendt
- Fix faulty conditional logic in JIT `GroupBy.apply` (#12706) @brandon-b-miller
- Produce useful guidance on overflow error in `to_csv` (#12705) @wence-
- Handle parquet list data corner case (#12698) @nvdbaranec
- Fix missing trailing comma in json writer (#12688) @karthikeyann
- Remove child from newCudaAsyncMemoryResource (#12681) @abellina
- Handle bool types in `round` API (#12670) @galipremsagar
- Ensure all of device bitmask is initialized in from_arrow (#12668) @wence-
- Fix `from_arrow` to load a sliced arrow table (#12665) @galipremsagar
- Fix dask-cudf read_parquet bug for multi-file aggregation (#12663) @rjzamora
- Fix AllocateLikeTest gtests reading uninitialized null-mask (#12643) @davidwendt
- Fix `find_common_dtype` and `values` to handle complex dtypes (#12537) @galipremsagar
- Fix fetching of MultiIndex values when a label is passed (#12521) @galipremsagar
- Fix `Series` comparison vs scalars (#12519) @brandon-b-miller
- Allow casting from `UDFString` back to `StringView` to call methods in `strings_udf` (#12363) @brandon-b-miller
📖 Documentation
- Fix `GroupBy.apply` doc examples rendering (#12994) @brandon-b-miller
- add sphinx building and s3 uploading for dask-cudf docs (#12982) @quasiben
- Add developer documentation forbidding default parameters in detail APIs (#12978) @vyasr
- Add README symlink for dask-cudf. (#12946) @bdice
- Remove return type from @return doxygen tags (#12908) @davidwendt
- Fix docs build to be `pydata-sphinx-theme=0.13.0` compatible (#12874) @galipremsagar
- Add skeleton API and prose documentation for dask-cudf (#12725) @wence-
- Enable doctests for GroupBy methods (#12658) @brandon-b-miller
- Add comment about CUB patch for SegmentedSortInt.Bool gtest (#12611) @davidwendt
🚀 New Features
- Add JNI method for strings::replace multi variety (#12979) @NVnavkumar
- Add nunique aggregation support for cudf::segmented_reduce (#12972) @davidwendt
- Refactor orc chunked writer (#12949) @ttnghia
- Make Parquet writer `nullable` option application to single table writes (#12933) @vuule
- Refactor `io::orc::ProtobufWriter` (#12877) @ttnghia
- Make timezone table independent from ORC (#12805) @vuule
- Cache JIT `GroupBy.apply` functions (#12802) @brandon-b-miller
- Implement initial support for avro logical types (#6482) (#12788) @tpn
- Update `tests/column_utilities` to use `experimental::equality` row comparator (#12777) @divyegala
- Update `distinct/unique_count` to `experimental::row` hasher/comparator (#12776) @divyegala
- Update `hash_partition` to use `experimental::row::row_hasher` (#12761) @divyegala
- Update `is_sorted` to use `experimental::row::lexicographic` (#12752) @divyegala
- Update default data source in cuio reader benchmarks (#12740) @PointKernel
- Reenable stream identification library in CI (#12714) @vyasr
- Add `regex_program` strings splitting java APIs and tests (#12713) @cindyyuanjiang
- Add `regex_program` strings replacing java APIs and tests (#12701) @cindyyuanjiang
- Add `regex_program` strings extract java APIs and tests (#12699) @cindyyuanjiang
- Variable fragment sizes for Parquet writer (#12685) @etseidl
- Add segmented reduction support for fixed-point types (#12680) @davidwendt
- Move `strings_udf` code into cuDF (#12669) @brandon-b-miller
- Add `regex_program` searching APIs and related java classes (#12666) @cindyyuanjiang
- Add logging to libcudf (#12637) @vuule
- Add compound aggregations to cudf::segmented_reduce (#12573) @davidwendt
- Convert `rank` to use to experimental row comparators (#12481) @divyegala
- Use rapids-cmake parallel testing feature (#12451) @robertmaynard
- Enable detection of undesired stream usage (#12089) @vyasr
🛠️ Improvements
- Pin `dask` and `distributed` for release (#13070) @galipremsagar
- Pin cupy in wheel tests to supported versions (#13041) @vyasr
- Pin numba version (#13001) @vyasr
- Rework gtests SequenceTest to remove using namespace cudf (#12985) @davidwendt
- Stop setting package version attribute in wheels (#12977) @vyasr
- Move detail reduction functions to cudf::reduction::detail namespace (#12971) @davidwendt
- Remove default detail mrs: part7 (#12970) @vyasr
- Remove default detail mrs: part6 (#12969) @vyasr
- Remove default detail mrs: part5 (#12968) @vyasr
- Remove default detail mrs: part4 (#12967) @vyasr
- Remove default detail mrs: part3 (#12966) @vyasr
- Remove default detail mrs: part2 (#12965) @vyasr
- Remove default detail mrs: part1 (#12964) @vyasr
- Add `force_nullable_schema` parameter to Parquet writer. (#12952) @galipremsagar
- Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
- Remove remaining default stream parameters (#12943) @vyasr
- Fix cudf::segmented_reduce gtest for ANY aggregation (#12940) @davidwendt
- Implement `groupby.head` and `groupby.tail` (#12939) @wence-
- Fix libcudf gtests to pass null-count=0 for empty validity masks (#12923) @davidwendt
- Migrate parquet encoding to use experimental row operators (#12918) @PointKernel
- Fix benchmarks coded in namespace cudf and using namespace cudf (#12915) @karthikeyann
- Fix io/text gtests coded in namespace cudf::test (#12914) @karthikeyann
- Pass `SCCACHE_S3_USE_SSL` to conda builds (#12910) @ajschmidt8
- Fix FST, JSON gtests & benchmarks coded in namespace cudf::test (#12907) @karthikeyann
- Generate pyproject dependencies using dfg (#12906) @vyasr
- Update libcudf counting functions to specify cudf::size_type (#12904) @davidwendt
- Fix `moto` env vars & pass `AWS_SESSION_TOKEN` to conda builds (#12902) @ajschmidt8
- Rewrite CSV wri...
v23.04.00
🚨 Breaking Changes
- Pin `dask` and `distributed` for release (#13070) @galipremsagar
- Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
- Update minimum `pandas` and `numpy` pinnings (#12887) @galipremsagar
- Deprecate `names` & `dtype` in `Index.copy` (#12825) @galipremsagar
- Deprecate `Index.is_*` methods (#12820) @galipremsagar
- Deprecate `datetime_is_numeric` from `describe` (#12818) @galipremsagar
- Deprecate `na_sentinel` in `factorize` (#12817) @galipremsagar
- Make string methods return a Series with a useful Index (#12814) @shwina
- Produce useful guidance on overflow error in `to_csv` (#12705) @wence-
- Move `strings_udf` code into cuDF (#12669) @brandon-b-miller
- Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
- Replace message parsing with throwing more specific exceptions (#12426) @vyasr
🐛 Bug Fixes
- Fix memcheck script to execute only _TEST files found in bin/gtests/libcudf (#13006) @davidwendt
- Fix `DataFrame` constructor to broadcast scalar inputs properly (#12997) @galipremsagar
- Drop `force_nullable_schema` from chunked parquet writer (#12996) @galipremsagar
- Fix gtest column utility comparator diff reporting (#12995) @davidwendt
- Handle index names while performing `groupby` (#12992) @galipremsagar
- Fix `__setitem__` on string columns when the scalar value ends in a null byte (#12991) @wence-
- Fix `sort_values` when column is all empty strings (#12988) @eriknw
- Remove unused variable and fix memory issue in ORC writer (#12984) @ttnghia
- Pre-emptive fix for upstream `dask.dataframe.read_parquet` changes (#12983) @rjzamora
- Remove MANIFEST.in use auto-generated one for sdists and package_data for wheels (#12960) @vyasr
- Update to use rapids-export(COMPONENTS) feature. (#12959) @robertmaynard
- cudftestutil supports static gtest dependencies (#12957) @robertmaynard
- Include gtest in build environment. (#12956) @vyasr
- Correctly handle scalar indices in `Index.__getitem__` (#12955) @wence-
- Avoid building cython twice (#12945) @galipremsagar
- Fix set index error for Series rolling window operations (#12942) @galipremsagar
- Fix calculation of null counts for Parquet statistics (#12938) @etseidl
- Preserve integer dtype of hive-partitioned column containing nulls (#12930) @rjzamora
- Use get_current_device_resource for intermediate allocations in COLLECT_LIST window code (#12927) @karthikeyann
- Mark dlpack tensor deleter as noexcept to match PyCapsule_Destructor signature. (#12921) @bdice
- Fix conda recipe post-link.sh typo (#12916) @pentschev
- min_rows and num_rows are swapped in ComputePageSizes declaration in Parquet reader (#12886) @etseidl
- Expect cupy to now support bool arrays for dlpack. (#12883) @vyasr
- Use python -m pytest for nightly wheel tests (#12871) @bdice
- Parquet writer column_size() should return a size_t (#12870) @etseidl
- Fix cudf::hash_partition kernel launch error with decimal128 types (#12863) @davidwendt
- Fix an issue with parquet chunked reader undercounting string lengths. (#12859) @nvdbaranec
- Remove tokenizers pre-install pinning. (#12854) @vyasr
- Fix parquet `RangeIndex` bug (#12838) @rjzamora
- Remove KAFKA_HOST_TEST from compute-sanitizer check (#12831) @davidwendt
- Make string methods return a Series with a useful Index (#12814) @shwina
- Tell cudf_kafka to use header-only fmt (#12796) @vyasr
- Add `GroupBy.dtypes` (#12783) @galipremsagar
- Fix a leak in a test and clarify some test names (#12781) @revans2
- Fix bug in all-null list due to join_list_elements special handling (#12767) @karthikeyann
- Add try/except for expected null-schema error in read_parquet (#12756) @rjzamora
- Throw an exception if an unsupported page encoding is detected in Parquet reader (#12754) @etseidl
- Fix a bug with `num_keys` in `_scatter_by_slice` (#12749) @thomcom
- Bump pinned rapids wheel deps to 23.4 (#12735) @sevagh
- Rework logic in cudf::strings::split_record to improve performance (#12729) @davidwendt
- Add `always_nullable` flag to Dremel encoding (#12727) @divyegala
- Fix memcheck read error in compound segmented reduce (#12722) @davidwendt
- Fix faulty conditional logic in JIT `GroupBy.apply` (#12706) @brandon-b-miller
- Produce useful guidance on overflow error in `to_csv` (#12705) @wence-
- Handle parquet list data corner case (#12698) @nvdbaranec
- Fix missing trailing comma in json writer (#12688) @karthikeyann
- Remove child from newCudaAsyncMemoryResource (#12681) @abellina
- Handle bool types in `round` API (#12670) @galipremsagar
- Ensure all of device bitmask is initialized in from_arrow (#12668) @wence-
- Fix `from_arrow` to load a sliced arrow table (#12665) @galipremsagar
- Fix dask-cudf read_parquet bug for multi-file aggregation (#12663) @rjzamora
- Fix AllocateLikeTest gtests reading uninitialized null-mask (#12643) @davidwendt
- Fix `find_common_dtype` and `values` to handle complex dtypes (#12537) @galipremsagar
- Fix fetching of MultiIndex values when a label is passed (#12521) @galipremsagar
- Fix `Series` comparison vs scalars (#12519) @brandon-b-miller
- Allow casting from `UDFString` back to `StringView` to call methods in `strings_udf` (#12363) @brandon-b-miller
📖 Documentation
- Fix `GroupBy.apply` doc examples rendering (#12994) @brandon-b-miller
- add sphinx building and s3 uploading for dask-cudf docs (#12982) @quasiben
- Add developer documentation forbidding default parameters in detail APIs (#12978) @vyasr
- Add README symlink for dask-cudf. (#12946) @bdice
- Remove return type from @return doxygen tags (#12908) @davidwendt
- Fix docs build to be `pydata-sphinx-theme=0.13.0` compatible (#12874) @galipremsagar
- Add skeleton API and prose documentation for dask-cudf (#12725) @wence-
- Enable doctests for GroupBy methods (#12658) @brandon-b-miller
- Add comment about CUB patch for SegmentedSortInt.Bool gtest (#12611) @davidwendt
🚀 New Features
- Add JNI method for strings::replace multi variety (#12979) @NVnavkumar
- Add nunique aggregation support for cudf::segmented_reduce (#12972) @davidwendt
- Refactor orc chunked writer (#12949) @ttnghia
- Make Parquet writer `nullable` option application to single table writes (#12933) @vuule
- Refactor `io::orc::ProtobufWriter` (#12877) @ttnghia
- Make timezone table independent from ORC (#12805) @vuule
- Cache JIT `GroupBy.apply` functions (#12802) @brandon-b-miller
- Implement initial support for avro logical types (#6482) (#12788) @tpn
- Update `tests/column_utilities` to use `experimental::equality` row comparator (#12777) @divyegala
- Update `distinct/unique_count` to `experimental::row` hasher/comparator (#12776) @divyegala
- Update `hash_partition` to use `experimental::row::row_hasher` (#12761) @divyegala
- Update `is_sorted` to use `experimental::row::lexicographic` (#12752) @divyegala
- Update default data source in cuio reader benchmarks (#12740) @PointKernel
- Reenable stream identification library in CI (#12714) @vyasr
- Add `regex_program` strings splitting java APIs and tests (#12713) @cindyyuanjiang
- Add `regex_program` strings replacing java APIs and tests (#12701) @cindyyuanjiang
- Add `regex_program` strings extract java APIs and tests (#12699) @cindyyuanjiang
- Variable fragment sizes for Parquet writer (#12685) @etseidl
- Add segmented reduction support for fixed-point types (#12680) @davidwendt
- Move `strings_udf` code into cuDF (#12669) @brandon-b-miller
- Add `regex_program` searching APIs and related java classes (#12666) @cindyyuanjiang
- Add logging to libcudf (#12637) @vuule
- Add compound aggregations to cudf::segmented_reduce (#12573) @davidwendt
- Convert `rank` to use experimental row comparators (#12481) @divyegala
- Use rapids-cmake parallel testing feature (#12451) @robertmaynard
- Enable detection of undesired stream usage (#12089) @vyasr
🛠️ Improvements
- Pin `dask` and `distributed` for release (#13070) @galipremsagar
- Pin cupy in wheel tests to supported versions (#13041) @vyasr
- Pin numba version (#13001) @vyasr
- Rework gtests SequenceTest to remove using namespace cudf (#12985) @davidwendt
- Stop setting package version attribute in wheels (#12977) @vyasr
- Move detail reduction functions to cudf::reduction::detail namespace (#12971) @davidwendt
- Remove default detail mrs: part7 (#12970) @vyasr
- Remove default detail mrs: part6 (#12969) @vyasr
- Remove default detail mrs: part5 (#12968) @vyasr
- Remove default detail mrs: part4 (#12967) @vyasr
- Remove default detail mrs: part3 (#12966) @vyasr
- Remove default detail mrs: part2 (#12965) @vyasr
- Remove default detail mrs: part1 (#12964) @vyasr
- Add `force_nullable_schema` parameter to Parquet writer. (#12952) @galipremsagar
- Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
- Remove remaining default stream parameters (#12943) @vyasr
- Fix cudf::segmented_reduce gtest for ANY aggregation (#12940) @davidwendt
- Implement `groupby.head` and `groupby.tail` (#12939) @wence-
- Fix libcudf gtests to pass null-count=0 for empty validity masks (#12923) @davidwendt
- Migrate parquet encoding to use experimental row operators (#12918) @PointKernel
- Fix benchmarks coded in namespace cudf and using namespace cudf (#12915) @karthikeyann
- Fix io/text gtests coded in namespace cudf::test (#12914) @karthikeyann
- Pass `SCCACHE_S3_USE_SSL` to conda builds (#12910) @ajschmidt8
- Fix FST, JSON gtests & benchmarks coded in namespace cudf::test (#12907) @karthikeyann
- Generate pyproject dependencies using dfg (#12906) @vyasr
- Update libcudf counting functions to specify cudf::size_type (#12904) @davidwendt
- Fix `moto` env vars & pass `AWS_SESSION_TOKEN` to conda builds (#12902) @ajschmidt8
- Rewrite CSV writer benchmark with nvbench (#12901) @PointKernel
- Rework some code logic to reduce iterator and comparator inlining to improve compile time (#12900) @davidwendt
- Deprecate `line_te...
[NIGHTLY] v23.06.00
🚨 Breaking Changes
- Fix batch processing for parquet writer (#13438) @ttnghia
- Use <NA> instead of null to match pandas. (#13415) @bdice
- Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
- Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
- Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
- Remove null mask and null count from column_view constructors (#13311) @vyasr
- Change default value of the `observed=` argument in groupby to `True` to reflect the actual behaviour (#13296) @shwina
- Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
- Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
- Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
- Update minimum Python version to Python 3.9 (#13196) @shwina
- Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
- Cleanup Parquet chunked writer (#13094) @ttnghia
- Cleanup ORC chunked writer (#13091) @ttnghia
- Raise `NotImplementedError` when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
- Remove deprecated regex functions from libcudf (#13067) @davidwendt
- [REVIEW] Upgrade to `arrow-11` (#12757) @galipremsagar
- Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller
🐛 Bug Fixes
- Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
- Fix writing of ORC files with empty rowgroups (#13466) @vuule
- Fix cudf::repeat logic when count is zero (#13459) @davidwendt
- Fix batch processing for parquet writer (#13438) @ttnghia
- Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
- Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
- Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
- Use <NA> instead of null to match pandas. (#13415) @bdice
- Fix tokenize with non-space delimiter (#13403) @shwina
- Fix groupby head/tail for empty dataframe (#13398) @shwina
- Default to closed="right" in `IntervalIndex` constructor (#13394) @shwina
- Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
- Fix unused argument errors in nvcc 11.5 (#13387) @abellina
- Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
- Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
- Fix page size estimation in Parquet writer (#13364) @etseidl
- Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
- Support gcc 12 as the C++ compiler (#13316) @robertmaynard
- Correctly set bitmask size in `from_column_view` (#13315) @wence-
- Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
- Fix parquet schema interpretation issue (#13277) @hyperbolic2346
- Fix 64bit shift bug in avro reader (#13276) @karthikeyann
- Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
- Clean up buffers in case AssertionError (#13262) @razajafri
- Allow empty input table in ast `compute_column` (#13245) @wence-
- Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
- Fix the row index stream order in ORC reader (#13242) @vuule
- Make `is_decompression_disabled` and `is_compression_disabled` thread-safe (#13240) @vuule
- Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
- Fix race in ORC string dictionary creation (#13214) @revans2
- Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
- Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
- Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
- Fix `hostdevice_vector::subspan` (#13187) @ttnghia
- Use custom nvbench entry point to ensure `cudf::nvbench_base_fixture` usage (#13183) @robertmaynard
- Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
- Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
- Fix a few clang-format style check errors (#13146) @davidwendt
- [REVIEW] Fix `Series` and `DataFrame` constructors to validate index lengths (#13122) @galipremsagar
- Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
- Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
- Adds checks to make sure json reader won't overflow (#13115) @elstehle
- Fix `null_count` of columns returned by `chunked_parquet_reader` (#13111) @vuule
- Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
- [REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
- Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
- Raise `NotImplementedError` when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
- Fix column selection `read_parquet` benchmarks (#13082) @vuule
- Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
- Add algorithm include in data_sink.hpp (#13068) @ahendriksen
- Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
- Prevent overflow with `skip_rows` in ORC and Parquet readers (#13063) @vuule
- Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
- [REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
- Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
- Fix read_avro() skip_rows and num_rows. (#12912) @tpn
- Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
- Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina
🚀 New Features
- Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
- Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
- Use _compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
- cuDF numba cuda 12 updates (#13337) @brandon-b-miller
- Add tz_convert method to convert between timestamps (#13328) @shwina
- Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
- Support the case=False argument to str.contains (#13290) @shwina
- Add an event handler for ColumnVector.close (#13279) @abellina
- JNI api for cudf::chunked_pack (#13278) @abellina
- Implement a chunked_pack API (#13260) @abellina
- Update cudf recipes to use GTest version to >=1.13 (#13207) @robertmaynard
- JNI changes for range-extents in window functions. (#13199) @mythrocks
- Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
- Add IS_NULL operator to AST (#13145) @karthikeyann
- STRING order-by column for RANGE window functions (#13143) @mythrocks
- Update `contains_table` to experimental row hasher and equality comparator (#13119) @divyegala
- Automatically select `GroupBy.apply` algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
- Refactor Parquet chunked writer (#13076) @ttnghia
- Add Python bindings for string literal support in AST (#13073) @karthikeyann
- Add Java bindings for string literal support in AST (#13072) @karthikeyann
- Add string scalar support in AST (#13061) @karthikeyann
- Log cuIO warnings using the libcudf logger (#13043) @vuule
- Update `mixed_join` to use experimental row hasher and comparator (#13028) @divyegala
- Support structs of lists in row lexicographic comparator (#13005) @ttnghia
- Adding `hostdevice_span` that is a span createable from `hostdevice_vector` (#12981) @hyperbolic2346
- Add nvtext::minhash function (#12961) @davidwendt
- Support lists of structs in row lexicographic comparator (#12953) @ttnghia
- Update `join` to use experimental row hasher and comparator (#12787) @divyegala
- Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller
🛠️ Improvements
- Bump typing_extensions minimum version to 4.0.0 (#13618) @shwina
- Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
- Handle some corner-cases in indexing with boolean masks (#13402) @wence-
- Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
- [JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
- Fix JNI method with mismatched parameter list (#13384) @ttnghia
- Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
- Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
- Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
- Move some nvtext benchmarks to nvbench (#13368) @davidwendt
- run docs nightly too (#13366) @AyodeAwe
- Add warning for default `dtype` parameter in `get_dummies` (#13365) @galipremsagar
- Add log messages about kvikIO compatibility mode (#13363) @vuule
- Switch back to using primary shared-action-workflows branch (#13362) @vyasr
- Deprecate `StringIndex` and use `Index` instead (#13361) @galipremsagar
- Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
- Expunge most uses of `TypeVar(bound="Foo")` (#13346) @wence-
- Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
- Improve `distinct_count` with `cuco::static_set` (#13343) @PointKernel
- Fix `contiguous_split` performance (#13342) @ttnghia
- Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
- Update mypy to 1.3 (#13340) @wence-
- [Java] Purge non-empty nulls when setting validity (#13335) @razajafri
- Add row-wise filtering step to `read_parquet` (#13334) @rjzamora
- Performance improvement for nvtext::minhash (#13333) @davidwendt
- Fix some libcudf functions to ...
v23.02.00
🚨 Breaking Changes
- Pin `dask` and `distributed` for release (#12695) @galipremsagar
- Change ways to access `ptr` in `Buffer` (#12587) @galipremsagar
- Remove column names (#12578) @vuule
- Default `cudf::io::read_json` to nested JSON parser (#12544) @vuule
- Switch `engine=cudf` to the new `JSON` reader (#12509) @galipremsagar
- Add trailing comma support for nested JSON reader (#12448) @karthikeyann
- Upgrade to `arrow-10.0.1` (#12327) @galipremsagar
- Fail loudly to avoid data corruption with unsupported input in `read_orc` (#12325) @vuule
- CSV, JSON reader to infer integer column with nulls as int64 instead of float64 (#12309) @karthikeyann
- Remove deprecated code for 23.02 (#12281) @vyasr
- Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
- Purge non-empty nulls for `superimpose_nulls` and `push_down_nulls` (#12239) @ttnghia
- Rename `cudf::structs::detail::superimpose_parent_nulls` APIs (#12230) @ttnghia
- Remove JIT type names, refactor id_to_type. (#12158) @bdice
- Floor division uses integer division for integral arguments (#12131) @wence-
🐛 Bug Fixes
- Fix a mask data corruption in UDF (#12647) @galipremsagar
- pre-commit: Update isort version to 5.12.0 (#12645) @wence-
- tests: Skip cuInit tests if cuda-gdb is not found or not working (#12644) @wence-
- Revert regex program java APIs and tests (#12639) @cindyyuanjiang
- Fix leaks in ColumnVectorTest (#12625) @jlowe
- Handle when spillable buffers own each other (#12607) @madsbk
- Fix incorrect null counts for sliced columns in JCudfSerialization (#12589) @jlowe
- lists: Transfer dtypes correctly through list.get (#12586) @wence-
- timedelta: Don't go via float intermediates for floordiv (#12585) @wence-
- Fixing BUG, `get_next_chunk()` should use the blocking function `device_read()` (#12584) @madsbk
- Make JNI QuoteStyle accessible outside ai.rapids.cudf (#12572) @mythrocks
- `partition_by_hash()`: support index (#12554) @madsbk
- Mixed Join benchmark bug due to wrong conditional column (#12553) @divyegala
- Update List Lexicographical Comparator (#12538) @divyegala
- Dynamically read PTX version (#12534) @brandon-b-miller
- build.sh switch to use `RAPIDS` magic value (#12525) @robertmaynard
- Loosen runtime arrow pinning (#12522) @vyasr
- Enable metadata transfer for complex types in transpose (#12491) @galipremsagar
- Fix issues with parquet chunked reader (#12488) @nvdbaranec
- Fix missing metadata transfer in concat for `ListColumn` (#12487) @galipremsagar
- Rename libcudf substring source files to slice (#12484) @davidwendt
- Fix compile issue with arrow 10 (#12465) @ttnghia
- Fix List offsets bug in mixed type list column in nested JSON reader (#12447) @karthikeyann
- Fix xfail incompatibilities (#12423) @vyasr
- Fix bug in Parquet column index encoding (#12404) @etseidl
- When building Arrow shared look for a shared OpenSSL (#12396) @robertmaynard
- Fix get_json_object to return empty column on empty input (#12384) @davidwendt
- Pin arrow 9 in testing dependencies to prevent conda solve issues (#12377) @vyasr
- Fix reductions any/all return value for empty input (#12374) @davidwendt
- Fix debug compile errors in parquet.hpp (#12372) @davidwendt
- Purge non-empty nulls in `cudf::make_lists_column` (#12370) @ttnghia
- Use correct memory resource in io::make_column (#12364) @vyasr
- Add code to detect possible malformed page data in parquet files. (#12360) @nvdbaranec
- Fail loudly to avoid data corruption with unsupported input in `read_orc` (#12325) @vuule
- Fix NumericPairIteratorTest for float values (#12306) @davidwendt
- Fixes memory allocation in nested JSON tokenizer (#12300) @elstehle
- Reconstruct dtypes correctly for list aggs of struct columns (#12290) @wence-
- Fix regex \A and \Z to strictly match string begin/end (#12282) @davidwendt
- Fix compile issue in `json_chunked_reader.cpp` (#12280) @ttnghia
- Change reductions any/all to return valid values for empty input (#12279) @davidwendt
- Only exclude join keys that are indices from key columns (#12271) @wence-
- Fix spill to device limit (#12252) @madsbk
- Correct behaviour of sort in `concat` for singleton concatenations (#12247) @wence-
- Purge non-empty nulls for `superimpose_nulls` and `push_down_nulls` (#12239) @ttnghia
- Patch CUB DeviceSegmentedSort and remove workaround (#12234) @davidwendt
- Fix memory leak in udf_string::assign(&&) function (#12206) @davidwendt
- Workaround thrust-copy-if limit in json get_tree_representation (#12190) @davidwendt
- Fix page size calculation in Parquet writer (#12182) @etseidl
- Add cudf::detail::sizes_to_offsets_iterator to allow checking overflow in offsets (#12180) @davidwendt
- Workaround thrust-copy-if limit in wordpiece-tokenizer (#12168) @davidwendt
- Floor division uses integer division for integral arguments (#12131) @wence-
📖 Documentation
- Fix link to NVTX (#12598) @sameerz
- Include missing groupby functions in documentation (#12580) @quasiben
- Fix documentation author (#12527) @bdice
- Update libcudf reduction docs for casting output types (#12526) @davidwendt
- Add JSON reader page in user guide (#12499) @GregoryKimball
- Link unsupported iteration API docstrings (#12482) @galipremsagar
- `strings_udf` doc update (#12469) @brandon-b-miller
- Update cudf_assert docs with correct NDEBUG behavior (#12464) @robertmaynard
- Update pre-commit hooks guide (#12395) @bdice
- Update test docs to not use detail comparison utilities (#12332) @PointKernel
- Fix doxygen description for regex_program::compute_working_memory_size (#12329) @davidwendt
- Add eval to docs. (#12322) @vyasr
- Turn on xfail_strict=true (#12244) @wence-
- Update 10 minutes to cuDF (#12114) @wence-
🚀 New Features
- Use kvikIO as the default IO backend (#12574) @vuule
- Use `has_nonempty_nulls` instead of `may_contain_non_empty_nulls` in `superimpose_nulls` and `push_down_nulls` (#12560) @ttnghia
- Add strings methods removeprefix and removesuffix (#12557) @davidwendt
- Add `regex_program` java APIs and unit tests (#12548) @cindyyuanjiang
- Default `cudf::io::read_json` to nested JSON parser (#12544) @vuule
- Make string quoting optional on CSV write (#12539) @mythrocks
- Use new nvCOMP API to optimize the compression temp memory size (#12533) @vuule
- Support "values" orient (array of arrays) in Nested JSON reader (#12498) @karthikeyann
- `one_hot_encode` to use experimental row comparators (#12478) @divyegala
- Support %W and %w format specifiers in cudf::strings::to_timestamps (#12475) @davidwendt
- Add JSON Writer (#12474) @karthikeyann
- Refactor `thrust_copy_if` into `cudf::detail::copy_if_safe` (#12455) @ttnghia
- Add trailing comma support for nested JSON reader (#12448) @karthikeyann
- Extract `tokenize_json.hpp` detail header from `src/io/json/nested_json.hpp` (#12432) @ttnghia
- JNI bindings to write CSV (#12425) @mythrocks
- Nested JSON depth benchmark (#12371) @karthikeyann
- Implement `lists::reverse` (#12336) @ttnghia
- Use `device_read` in experimental `read_json` (#12314) @vuule
- Implement JNI for `strings::reverse` (#12283) @ttnghia
- Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
- Add cudf::strings::like function with multiple patterns (#12269) @davidwendt
- Add environment variable to control host memory allocation in `hostdevice_vector` (#12251) @vuule
- Add cudf::strings::reverse function (#12227) @davidwendt
- Selectively use dictionary encoding in Parquet writer (#12211) @etseidl
- Support `replace` in `strings_udf` (#12207) @brandon-b-miller
- Add support to read binary encoded decimals in parquet (#12205) @PointKernel
- Support regex EOL where the string ends with a new-line character (#12181) @davidwendt
- Updating `stream_compaction/unique` to use new row comparators (#12159) @divyegala
- Add device buffer datasource (#12024) @PointKernel
- Implement groupby apply with JIT (#11452) @bwyogatama
🛠️ Improvements
- Update shared workflow branches (#12696) @ajschmidt8
- Pin `dask` and `distributed` for release (#12695) @galipremsagar
- Don't upload `libcudf-example` to Anaconda.org (#12671) @ajschmidt8
- Pin wheel dependencies to same RAPIDS release (#12659) @sevagh
- Use CTK 118/cp310 branch of wheel workflows (#12602) @sevagh
- Change ways to access `ptr` in `Buffer` (#12587) @galipremsagar
- Version a parquet writer xfail (#12579) @galipremsagar
- Remove column names (#12578) @vuule
- Parquet reader optimization to address V100 regression. (#12577) @nvdbaranec
- Add support for `category` dtypes in CSV reader (#12571) @galipremsagar
- Remove `spill_lock` parameter from `SpillableBuffer.get_ptr()` (#12564) @madsbk
- Optimize `cudf::make_lists_column` (#12547) @ttnghia
- Remove `cudf::strings::repeat_strings_output_sizes` from Java and JNI (#12546) @ttnghia
- Test that cuInit is not called when RAPIDS_NO_INITIALIZE is set (#12545) @wence-
- Rework repeat_strings to use sizes-to-offsets utility (#12543) @davidwendt
- Replace exclusive_scan with sizes_to_offsets in cudf::lists::sequences (#12541) @davidwendt
- Rework nvtext::ngrams_tokenize to use sizes-to-offsets utility (#12540) @davidwendt
- Fix binary-ops gtests coded in namespace cudf::test (#12536) @davidwendt
- More `@acquire_spill_lock()` and `as_buffer(..., exposed=False)` (#12535) @madsbk
- Guard CUDA runtime APIs with error checking (#12531) @PointKernel
- Update TODOs from issue 10432. (#12528) @bdice
- Update rapids-cmake definitions version in GitHub Actions style checks. (#12511) @bdice
- Switch `engine=cudf` to the new `JSON` reader (#12509) @galipremsagar
- Fix SUM/MEAN aggregation type support. (#12503) @bdice
- Stop using pandas._testing (#12492) @vyasr
- Fix ROLLING_TEST gtests coded in namespace cudf::test (#12490) @davidwendt
- Fix erroneously skipped ORC ZSTD test (#12486) @vuule
- Rework nvtext::generate_character_ngrams to use make_strings_children (#12480) @davidwendt
- Raise warnings as errors in the test suite (#12468) @v...
v22.12.01
🚨 Breaking Changes
- Add JNI for `substring` without 'end' parameter. (#12113) @firestarman
- Refactor `purge_nonempty_nulls` (#12111) @ttnghia
- Create an `int8` column in `read_csv` when all elements are missing (#12110) @vuule
- Throw an error when libcudf is built without cuFile and `LIBCUDF_CUFILE_POLICY` is set to `"ALWAYS"` (#12080) @vuule
- Fix type promotion edge cases in numerical binops (#12074) @wence-
- Reduce/Remove reliance on `**kwargs` and `*args` in `IO` readers & writers (#12025) @galipremsagar
- Rollback of `DeviceBufferLike` (#12009) @madsbk
- Remove unused `managed_allocator` (#12005) @vyasr
- Pass column names to `write_csv` instead of `table_metadata` pointer (#11972) @vuule
- Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
- Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
- Remove validation that requires introspection (#11938) @vyasr
- Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
- Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
- Support nested types as groupby keys in libcudf (#11792) @PointKernel
- Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
- Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
- part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source
🐛 Bug Fixes
- strings_udf: use libcudf caching of character tables (#12343) @wence-
- Fix include line for IO Cython modules (#12250) @vyasr
- Make dask pinning looser (#12231) @vyasr
- Workaround for CUB segmented-sort bug with boolean keys (#12217) @davidwendt
- Fix `from_dict` backend dispatch to match upstream `dask` (#12203) @galipremsagar
- Merge branch-22.10 into branch-22.12 (#12198) @davidwendt
- Fix compression in ORC writer (#12194) @vuule
- Don't use CMake 3.25.0 as it has a show stopping FindCUDAToolkit bug (#12188) @robertmaynard
- Fix data corruption when reading ORC files with empty stripes (#12160) @vuule
- Fix decimal binary operations (#12142) @galipremsagar
- Ensure dlpack include is provided to cudf interop lib (#12139) @robertmaynard
- Safely allocate `udf_string` pointers in `strings_udf` (#12138) @brandon-b-miller
- Fix/disable jitify lto (#12122) @robertmaynard
- Fix conditional_full_join benchmark (#12121) @GregoryKimball
- Fix regex working-memory-size refactor error (#12119) @davidwendt
- Add in negative size checks for columns (#12118) @revans2
- Add JNI for `substring` without 'end' parameter. (#12113) @firestarman
- Fix reading of CSV files with blank second row (#12098) @vuule
- Fix an error in IO with `GzipFile` type (#12085) @galipremsagar
- Workaround groupby aggregate thrust::copy_if overflow (#12079) @davidwendt
- Fix alignment of compressed blocks in ORC writer (#12077) @vuule
- Fix singleton-range `__setitem__` edge case (#12075) @wence-
- Fix type promotion edge cases in numerical binops (#12074) @wence-
- Force using old fmt in nvbench. (#12067) @vyasr
- Fixes List offset bug in Nested JSON reader (#12060) @karthikeyann
- Allow falling back to `shim_60.ptx` by default in `strings_udf` (#12056) @brandon-b-miller
- Force black exclusions for pre-commit. (#12036) @bdice
- Add `memory_usage` & `items` implementation for `Struct` column & dtype (#12033) @galipremsagar
- Reduce/Remove reliance on `**kwargs` and `*args` in `IO` readers & writers (#12025) @galipremsagar
- Fixes bug in csv_reader_options construction in cython (#12021) @karthikeyann
- Fix issues when both `usecols` and `names` options are used in `read_csv` (#12018) @vuule
- Port thrust's pinned_allocator to cudf, since Thrust 1.17 removes the type (#12004) @robertmaynard
- Revert "Replace most of preprocessor usage in nvcomp adapter with `constexpr`" (#11999) @vuule
- Fix bug where `df.loc` resulting in single row could give wrong index (#11998) @eriknw
- Switch to DISABLE_DEPRECATION_WARNINGS to match other RAPIDS projects (#11989) @robertmaynard
- Fix maximum page size estimate in Parquet writer (#11962) @vuule
- Fix local offset handling in bgzip reader (#11918) @upsj
- Fix an issue reading struct-of-list types in Parquet. (#11910) @nvdbaranec
- Fix memcheck error in TypeInference.Timestamp gtest (#11905) @davidwendt
- Fix type casting in Series.setitem (#11904) @wence-
- Fix memcheck error in get_dremel_data (#11903) @davidwendt
- Fixes Unsupported column type error due to empty list columns in Nested JSON reader (#11897) @karthikeyann
- Fix segmented-sort to ignore indices outside the offsets (#11888) @davidwendt
- Fix cudf::stable_sorted_order for NaN and -NaN in FLOAT64 columns (#11874) @davidwendt
- Fix writing of Parquet files with many fragments (#11869) @etseidl
- Fix RangeIndex unary operators. (#11868) @vyasr
- JNI Avoid NPE for reading host binary data (#11865) @revans2
- Fix decimal benchmark input data generation (#11863) @karthikeyann
- Fix pre-commit copyright check (#11860) @galipremsagar
- Fix Parquet support for seconds and milliseconds duration types (#11854) @vuule
- Ensure better compiler cache results between cudf cal-ver branches (#11835) @robertmaynard
- Fix make_column_from_scalar for all-null strings column (#11807) @davidwendt
- Tell jitify_preprocess where to search for libnvrtc (#11787) @robertmaynard
- add V2 page header support to parquet reader (#11778) @etseidl
- Parquet reader: bug fix for a num_rows/skip_rows corner case, w/optimization for nested preprocessing (#11752) @nvdbaranec
- Determine if Arrow has S3 support at runtime in unit test. (#11560) @bdice
📖 Documentation
- Use rapidsai CODE_OF_CONDUCT.md (#12166) @bdice
- Add symlinks to notebooks. (#12128) @bdice
- Add `truncate` API to python doc pages (#12109) @galipremsagar
- Update Numba docs links. (#12107) @bdice
- Remove "Multi-GPU with Dask-cuDF" notebook. (#12095) @bdice
- Fix link to c++ developer guide from `CONTRIBUTING.md` (#12084) @brandon-b-miller
- Add pivot_table and crosstab to docs. (#12014) @bdice
- Fix doxygen text for cudf::dictionary::encode (#11991) @davidwendt
- Replace default_stream_value with get_default_stream in docs. (#11985) @vyasr
- Add dtype docs pages and docstrings for `cudf` specific dtypes (#11974) @galipremsagar
- Update Unit Testing in libcudf guidelines to code tests outside the cudf::test namespace (#11959) @davidwendt
- Rename libcudf++ to libcudf. (#11953) @bdice
- Fix documentation referring to removed as_gpu_matrix method. (#11937) @bdice
- Remove "experimental" warning for struct columns in ORC reader and writer (#11880) @vuule
- Initial draft of policies and guidelines for libcudf usage. (#11853) @vyasr
- Add clear indication of non-GPU accelerated parameters in read_json docstring (#11825) @GregoryKimball
- Add developer docs for writing tests (#11199) @vyasr
🚀 New Features
- Adds an EventHandler to Java MemoryBuffer to be invoked on close (#12125) @abellina
- Support `+` in `strings_udf` (#12117) @brandon-b-miller
- Support `upper` and `lower` in `strings_udf` (#12099) @brandon-b-miller
- Add wheel builds (#12096) @vyasr
- Allow setting malloc heap size in string udfs (#12094) @brandon-b-miller
- Support `strip`, `lstrip`, and `rstrip` in `strings_udf` (#12091) @brandon-b-miller
- Mark nvcomp zstd compression stable (#12059) @jbrennan333
- Add debug-only onAllocated/onDeallocated to RmmEventHandler (#12054) @abellina
- Enable building against the libarrow contained in pyarrow (#12034) @vyasr
- Add strings `like` jni and native method (#12032) @cindyyuanjiang
- Cleanup common parsing code in JSON, CSV reader (#12022) @karthikeyann
- byte_range support for JSON Lines format (#12017) @karthikeyann
- Minor cleanup of root CMakeLists.txt for better organization (#11988) @robertmaynard
- Add inplace arithmetic operators to `MaskedType` (#11987) @brandon-b-miller
- Implement JNI for chunked Parquet reader (#11961) @ttnghia
- Add method argument to DataFrame.quantile (#11957) @rjzamora
- Add gpu memory watermark apis to JNI (#11950) @abellina
- Adds retryCount to RmmEventHandler.onAllocFailure (#11940) @abellina
- Enable returning string data from UDFs used through `apply` (#11933) @brandon-b-miller
- Switch over to rapids-cmake patches for thrust (#11921) @robertmaynard
- Add strings udf C++ classes and functions for phase II (#11912) @davidwendt
- Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
- Enable CEC for `strings_udf` (#11884) @brandon-b-miller
- ArrowIPCTableWriter writes an empty batch in the case of an empty table. (#11883) @firestarman
- Implement chunked Parquet reader (#11867) @ttnghia
- Add `read_orc_metadata` to libcudf (#11815) @vuule
- Support nested types as groupby keys in libcudf (#11792) @PointKernel
- Adding feature Truncate to DataFrame and Series (#11435) @VamsiTallam95
🛠️ Improvements
- Reduce number of tests marked `spilling` (#12197) @madsbk
- Pin `dask` and `distributed` for release (#12165) @galipremsagar
- Don't rely on GNU find in headers_test.sh (#12164) @wence-
- Update cp.clip call (#12148) @quasiben
- Enable automatic column projection in groupby().agg (#12124) @rjzamora
- Refactor `purge_nonempty_nulls` (#12111) @ttnghia
- Create an `int8` column in `read_csv` when all elements are missing (#12110) @vuule
- Spilling to host memory (#12106) @madsbk
- First pass of `pd.read_orc` changes in tests (#12103) @galipremsagar
- Expose engine argument in dask_cudf.read_json (#12101) @rjzamora
- Remove CUDA 10 compatibility code. (#12088) @bdice
- Move and update `dask` nightly install in CI (#12082) @galipremsagar
- Throw an error when libcudf is built without cuFile and `LIBCUDF_CUFILE_POLICY` is set to `"ALWAYS"` (#12080) @vuule
- Remove macros that inspect the contents of exceptions (#12076) @vyasr
- Fix ingest_raw_data performance issue in Nested JSON reader due to RVO (#12070) @karthikeyann
- Remove overflow err...
v22.12.00
🚨 Breaking Changes
- Add JNI for `substring` without 'end' parameter. (#12113) @firestarman
- Refactor `purge_nonempty_nulls` (#12111) @ttnghia
- Create an `int8` column in `read_csv` when all elements are missing (#12110) @vuule
- Throw an error when libcudf is built without cuFile and `LIBCUDF_CUFILE_POLICY` is set to `"ALWAYS"` (#12080) @vuule
- Fix type promotion edge cases in numerical binops (#12074) @wence-
- Reduce/Remove reliance on `**kwargs` and `*args` in `IO` readers & writers (#12025) @galipremsagar
- Rollback of `DeviceBufferLike` (#12009) @madsbk
- Remove unused `managed_allocator` (#12005) @vyasr
- Pass column names to `write_csv` instead of `table_metadata` pointer (#11972) @vuule
- Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
- Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
- Remove validation that requires introspection (#11938) @vyasr
- Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
- Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
- Support nested types as groupby keys in libcudf (#11792) @PointKernel
- Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
- Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
- part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source
🐛 Bug Fixes
- Fix include line for IO Cython modules (#12250) @vyasr
- Make dask pinning looser (#12231) @vyasr
- Workaround for CUB segmented-sort bug with boolean keys (#12217) @davidwendt
- Fix `from_dict` backend dispatch to match upstream `dask` (#12203) @galipremsagar
- Merge branch-22.10 into branch-22.12 (#12198) @davidwendt
- Fix compression in ORC writer (#12194) @vuule
- Don't use CMake 3.25.0 as it has a show stopping FindCUDAToolkit bug (#12188) @robertmaynard
- Fix data corruption when reading ORC files with empty stripes (#12160) @vuule
- Fix decimal binary operations (#12142) @galipremsagar
- Ensure dlpack include is provided to cudf interop lib (#12139) @robertmaynard
- Safely allocate `udf_string` pointers in `strings_udf` (#12138) @brandon-b-miller
- Fix/disable jitify lto (#12122) @robertmaynard
- Fix conditional_full_join benchmark (#12121) @GregoryKimball
- Fix regex working-memory-size refactor error (#12119) @davidwendt
- Add in negative size checks for columns (#12118) @revans2
- Add JNI for `substring` without 'end' parameter. (#12113) @firestarman
- Fix reading of CSV files with blank second row (#12098) @vuule
- Fix an error in IO with `GzipFile` type (#12085) @galipremsagar
- Workaround groupby aggregate thrust::copy_if overflow (#12079) @davidwendt
- Fix alignment of compressed blocks in ORC writer (#12077) @vuule
- Fix singleton-range `__setitem__` edge case (#12075) @wence-
- Fix type promotion edge cases in numerical binops (#12074) @wence-
- Force using old fmt in nvbench. (#12067) @vyasr
- Fixes List offset bug in Nested JSON reader (#12060) @karthikeyann
- Allow falling back to `shim_60.ptx` by default in `strings_udf` (#12056) @brandon-b-miller
- Force black exclusions for pre-commit. (#12036) @bdice
- Add `memory_usage` & `items` implementation for `Struct` column & dtype (#12033) @galipremsagar
- Reduce/Remove reliance on `**kwargs` and `*args` in `IO` readers & writers (#12025) @galipremsagar
- Fixes bug in csv_reader_options construction in cython (#12021) @karthikeyann
- Fix issues when both `usecols` and `names` options are used in `read_csv` (#12018) @vuule
- Port thrust's pinned_allocator to cudf, since Thrust 1.17 removes the type (#12004) @robertmaynard
- Revert "Replace most of preprocessor usage in nvcomp adapter with `constexpr`" (#11999) @vuule
- Fix bug where `df.loc` resulting in single row could give wrong index (#11998) @eriknw
- Switch to DISABLE_DEPRECATION_WARNINGS to match other RAPIDS projects (#11989) @robertmaynard
- Fix maximum page size estimate in Parquet writer (#11962) @vuule
- Fix local offset handling in bgzip reader (#11918) @upsj
- Fix an issue reading struct-of-list types in Parquet. (#11910) @nvdbaranec
- Fix memcheck error in TypeInference.Timestamp gtest (#11905) @davidwendt
- Fix type casting in Series.setitem (#11904) @wence-
- Fix memcheck error in get_dremel_data (#11903) @davidwendt
- Fixes Unsupported column type error due to empty list columns in Nested JSON reader (#11897) @karthikeyann
- Fix segmented-sort to ignore indices outside the offsets (#11888) @davidwendt
- Fix cudf::stable_sorted_order for NaN and -NaN in FLOAT64 columns (#11874) @davidwendt
- Fix writing of Parquet files with many fragments (#11869) @etseidl
- Fix RangeIndex unary operators. (#11868) @vyasr
- JNI Avoid NPE for reading host binary data (#11865) @revans2
- Fix decimal benchmark input data generation (#11863) @karthikeyann
- Fix pre-commit copyright check (#11860) @galipremsagar
- Fix Parquet support for seconds and milliseconds duration types (#11854) @vuule
- Ensure better compiler cache results between cudf cal-ver branches (#11835) @robertmaynard
- Fix make_column_from_scalar for all-null strings column (#11807) @davidwendt
- Tell jitify_preprocess where to search for libnvrtc (#11787) @robertmaynard
- add V2 page header support to parquet reader (#11778) @etseidl
- Parquet reader: bug fix for a num_rows/skip_rows corner case, w/optimization for nested preprocessing (#11752) @nvdbaranec
- Determine if Arrow has S3 support at runtime in unit test. (#11560) @bdice
📖 Documentation
- Use rapidsai CODE_OF_CONDUCT.md (#12166) @bdice
- Add symlinks to notebooks. (#12128) @bdice
- Add `truncate` API to python doc pages (#12109) @galipremsagar
- Update Numba docs links. (#12107) @bdice
- Remove "Multi-GPU with Dask-cuDF" notebook. (#12095) @bdice
- Fix link to c++ developer guide from `CONTRIBUTING.md` (#12084) @brandon-b-miller
- Add pivot_table and crosstab to docs. (#12014) @bdice
- Fix doxygen text for cudf::dictionary::encode (#11991) @davidwendt
- Replace default_stream_value with get_default_stream in docs. (#11985) @vyasr
- Add dtype docs pages and docstrings for `cudf` specific dtypes (#11974) @galipremsagar
- Update Unit Testing in libcudf guidelines to code tests outside the cudf::test namespace (#11959) @davidwendt
- Rename libcudf++ to libcudf. (#11953) @bdice
- Fix documentation referring to removed as_gpu_matrix method. (#11937) @bdice
- Remove "experimental" warning for struct columns in ORC reader and writer (#11880) @vuule
- Initial draft of policies and guidelines for libcudf usage. (#11853) @vyasr
- Add clear indication of non-GPU accelerated parameters in read_json docstring (#11825) @GregoryKimball
- Add developer docs for writing tests (#11199) @vyasr
🚀 New Features
- Adds an EventHandler to Java MemoryBuffer to be invoked on close (#12125) @abellina
- Support `+` in `strings_udf` (#12117) @brandon-b-miller
- Support `upper` and `lower` in `strings_udf` (#12099) @brandon-b-miller
- Add wheel builds (#12096) @vyasr
- Allow setting malloc heap size in string udfs (#12094) @brandon-b-miller
- Support `strip`, `lstrip`, and `rstrip` in `strings_udf` (#12091) @brandon-b-miller
- Mark nvcomp zstd compression stable (#12059) @jbrennan333
- Add debug-only onAllocated/onDeallocated to RmmEventHandler (#12054) @abellina
- Enable building against the libarrow contained in pyarrow (#12034) @vyasr
- Add strings `like` jni and native method (#12032) @cindyyuanjiang
- Cleanup common parsing code in JSON, CSV reader (#12022) @karthikeyann
- byte_range support for JSON Lines format (#12017) @karthikeyann
- Minor cleanup of root CMakeLists.txt for better organization (#11988) @robertmaynard
- Add inplace arithmetic operators to `MaskedType` (#11987) @brandon-b-miller
- Implement JNI for chunked Parquet reader (#11961) @ttnghia
- Add method argument to DataFrame.quantile (#11957) @rjzamora
- Add gpu memory watermark apis to JNI (#11950) @abellina
- Adds retryCount to RmmEventHandler.onAllocFailure (#11940) @abellina
- Enable returning string data from UDFs used through `apply` (#11933) @brandon-b-miller
- Switch over to rapids-cmake patches for thrust (#11921) @robertmaynard
- Add strings udf C++ classes and functions for phase II (#11912) @davidwendt
- Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
- Enable CEC for `strings_udf` (#11884) @brandon-b-miller
- ArrowIPCTableWriter writes an empty batch in the case of an empty table. (#11883) @firestarman
- Implement chunked Parquet reader (#11867) @ttnghia
- Add `read_orc_metadata` to libcudf (#11815) @vuule
- Support nested types as groupby keys in libcudf (#11792) @PointKernel
- Adding feature Truncate to DataFrame and Series (#11435) @VamsiTallam95
🛠️ Improvements
- Reduce number of tests marked `spilling` (#12197) @madsbk
- Pin `dask` and `distributed` for release (#12165) @galipremsagar
- Don't rely on GNU find in headers_test.sh (#12164) @wence-
- Update cp.clip call (#12148) @quasiben
- Enable automatic column projection in groupby().agg (#12124) @rjzamora
- Refactor `purge_nonempty_nulls` (#12111) @ttnghia
- Create an `int8` column in `read_csv` when all elements are missing (#12110) @vuule
- Spilling to host memory (#12106) @madsbk
- First pass of `pd.read_orc` changes in tests (#12103) @galipremsagar
- Expose engine argument in dask_cudf.read_json (#12101) @rjzamora
- Remove CUDA 10 compatibility code. (#12088) @bdice
- Move and update `dask` nightly install in CI (#12082) @galipremsagar
- Throw an error when libcudf is built without cuFile and `LIBCUDF_CUFILE_POLICY` is set to `"ALWAYS"` (#12080) @vuule
- Remove macros that inspect the contents of exceptions (#12076) @vyasr
- Fix ingest_raw_data performance issue in Nested JSON reader due to RVO (#12070) @karthikeyann
- Remove overflow error during decimal binops (#12063) @galipremsagar
- Change cudf::detail::...
[NIGHTLY] v23.02.00
🚨 Breaking Changes
- Pin `dask` and `distributed` for release (#12695) @galipremsagar
- Change ways to access `ptr` in `Buffer` (#12587) @galipremsagar
- Remove column names (#12578) @vuule
- Default `cudf::io::read_json` to nested JSON parser (#12544) @vuule
- Switch `engine=cudf` to the new `JSON` reader (#12509) @galipremsagar
- Add trailing comma support for nested JSON reader (#12448) @karthikeyann
- Upgrade to `arrow-10.0.1` (#12327) @galipremsagar
- Fail loudly to avoid data corruption with unsupported input in `read_orc` (#12325) @vuule
- CSV, JSON reader to infer integer column with nulls as int64 instead of float64 (#12309) @karthikeyann
- Remove deprecated code for 23.02 (#12281) @vyasr
- Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
- Purge non-empty nulls for `superimpose_nulls` and `push_down_nulls` (#12239) @ttnghia
- Rename `cudf::structs::detail::superimpose_parent_nulls` APIs (#12230) @ttnghia
- Remove JIT type names, refactor id_to_type. (#12158) @bdice
- Floor division uses integer division for integral arguments (#12131) @wence-
🐛 Bug Fixes
- Fix update-version.sh (#12745) @raydouglass
- Fix a mask data corruption in UDF (#12647) @galipremsagar
- pre-commit: Update isort version to 5.12.0 (#12645) @wence-
- tests: Skip cuInit tests if cuda-gdb is not found or not working (#12644) @wence-
- Revert regex program java APIs and tests (#12639) @cindyyuanjiang
- Fix leaks in ColumnVectorTest (#12625) @jlowe
- Handle when spillable buffers own each other (#12607) @madsbk
- Fix incorrect null counts for sliced columns in JCudfSerialization (#12589) @jlowe
- lists: Transfer dtypes correctly through list.get (#12586) @wence-
- timedelta: Don't go via float intermediates for floordiv (#12585) @wence-
- Fixing BUG, `get_next_chunk()` should use the blocking function `device_read()` (#12584) @madsbk
- Make JNI QuoteStyle accessible outside ai.rapids.cudf (#12572) @mythrocks
- `partition_by_hash()`: support index (#12554) @madsbk
- Mixed Join benchmark bug due to wrong conditional column (#12553) @divyegala
- Update List Lexicographical Comparator (#12538) @divyegala
- Dynamically read PTX version (#12534) @brandon-b-miller
- build.sh switch to use `RAPIDS` magic value (#12525) @robertmaynard
- Loosen runtime arrow pinning (#12522) @vyasr
- Enable metadata transfer for complex types in transpose (#12491) @galipremsagar
- Fix issues with parquet chunked reader (#12488) @nvdbaranec
- Fix missing metadata transfer in concat for `ListColumn` (#12487) @galipremsagar
- Rename libcudf substring source files to slice (#12484) @davidwendt
- Fix compile issue with arrow 10 (#12465) @ttnghia
- Fix List offsets bug in mixed type list column in nested JSON reader (#12447) @karthikeyann
- Fix xfail incompatibilities (#12423) @vyasr
- Fix bug in Parquet column index encoding (#12404) @etseidl
- When building Arrow shared look for a shared OpenSSL (#12396) @robertmaynard
- Fix get_json_object to return empty column on empty input (#12384) @davidwendt
- Pin arrow 9 in testing dependencies to prevent conda solve issues (#12377) @vyasr
- Fix reductions any/all return value for empty input (#12374) @davidwendt
- Fix debug compile errors in parquet.hpp (#12372) @davidwendt
- Purge non-empty nulls in `cudf::make_lists_column` (#12370) @ttnghia
- Use correct memory resource in io::make_column (#12364) @vyasr
- Add code to detect possible malformed page data in parquet files. (#12360) @nvdbaranec
- Fail loudly to avoid data corruption with unsupported input in `read_orc` (#12325) @vuule
- Fix NumericPairIteratorTest for float values (#12306) @davidwendt
- Fixes memory allocation in nested JSON tokenizer (#12300) @elstehle
- Reconstruct dtypes correctly for list aggs of struct columns (#12290) @wence-
- Fix regex \A and \Z to strictly match string begin/end (#12282) @davidwendt
- Fix compile issue in `json_chunked_reader.cpp` (#12280) @ttnghia
- Change reductions any/all to return valid values for empty input (#12279) @davidwendt
- Only exclude join keys that are indices from key columns (#12271) @wence-
- Fix spill to device limit (#12252) @madsbk
- Correct behaviour of sort in `concat` for singleton concatenations (#12247) @wence-
- Purge non-empty nulls for `superimpose_nulls` and `push_down_nulls` (#12239) @ttnghia
- Patch CUB DeviceSegmentedSort and remove workaround (#12234) @davidwendt
- Fix memory leak in udf_string::assign(&&) function (#12206) @davidwendt
- Workaround thrust-copy-if limit in json get_tree_representation (#12190) @davidwendt
- Fix page size calculation in Parquet writer (#12182) @etseidl
- Add cudf::detail::sizes_to_offsets_iterator to allow checking overflow in offsets (#12180) @davidwendt
- Workaround thrust-copy-if limit in wordpiece-tokenizer (#12168) @davidwendt
- Floor division uses integer division for integral arguments (#12131) @wence-
📖 Documentation
- Fix link to NVTX (#12598) @sameerz
- Include missing groupby functions in documentation (#12580) @quasiben
- Fix documentation author (#12527) @bdice
- Update libcudf reduction docs for casting output types (#12526) @davidwendt
- Add JSON reader page in user guide (#12499) @GregoryKimball
- Link unsupported iteration API docstrings (#12482) @galipremsagar
- `strings_udf` doc update (#12469) @brandon-b-miller
- Update cudf_assert docs with correct NDEBUG behavior (#12464) @robertmaynard
- Update pre-commit hooks guide (#12395) @bdice
- Update test docs to not use detail comparison utilities (#12332) @PointKernel
- Fix doxygen description for regex_program::compute_working_memory_size (#12329) @davidwendt
- Add eval to docs. (#12322) @vyasr
- Turn on xfail_strict=true (#12244) @wence-
- Update 10 minutes to cuDF (#12114) @wence-
🚀 New Features
- Use kvikIO as the default IO backend (#12574) @vuule
- Use `has_nonempty_nulls` instead of `may_contain_non_empty_nulls` in `superimpose_nulls` and `push_down_nulls` (#12560) @ttnghia
- Add strings methods removeprefix and removesuffix (#12557) @davidwendt
- Add `regex_program` java APIs and unit tests (#12548) @cindyyuanjiang
- Default `cudf::io::read_json` to nested JSON parser (#12544) @vuule
- Make string quoting optional on CSV write (#12539) @mythrocks
- Use new nvCOMP API to optimize the compression temp memory size (#12533) @vuule
- Support "values" orient (array of arrays) in Nested JSON reader (#12498) @karthikeyann
- `one_hot_encode` to use experimental row comparators (#12478) @divyegala
- Support %W and %w format specifiers in cudf::strings::to_timestamps (#12475) @davidwendt
- Add JSON Writer (#12474) @karthikeyann
- Refactor `thrust_copy_if` into `cudf::detail::copy_if_safe` (#12455) @ttnghia
- Add trailing comma support for nested JSON reader (#12448) @karthikeyann
- Extract `tokenize_json.hpp` detail header from `src/io/json/nested_json.hpp` (#12432) @ttnghia
- JNI bindings to write CSV (#12425) @mythrocks
- Nested JSON depth benchmark (#12371) @karthikeyann
- Implement `lists::reverse` (#12336) @ttnghia
- Use `device_read` in experimental `read_json` (#12314) @vuule
- Implement JNI for `strings::reverse` (#12283) @ttnghia
- Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
- Add cudf::strings::like function with multiple patterns (#12269) @davidwendt
- Add environment variable to control host memory allocation in `hostdevice_vector` (#12251) @vuule
- Add cudf::strings::reverse function (#12227) @davidwendt
- Selectively use dictionary encoding in Parquet writer (#12211) @etseidl
- Support `replace` in `strings_udf` (#12207) @brandon-b-miller
- Add support to read binary encoded decimals in parquet (#12205) @PointKernel
- Support regex EOL where the string ends with a new-line character (#12181) @davidwendt
- Updating `stream_compaction/unique` to use new row comparators (#12159) @divyegala
- Add device buffer datasource (#12024) @PointKernel
- Implement groupby apply with JIT (#11452) @bwyogatama
🛠️ Improvements
- Update shared workflow branches (#12696) @ajschmidt8
- Pin `dask` and `distributed` for release (#12695) @galipremsagar
- Don't upload `libcudf-example` to Anaconda.org (#12671) @ajschmidt8
- Pin wheel dependencies to same RAPIDS release (#12659) @sevagh
- Use CTK 118/cp310 branch of wheel workflows (#12602) @sevagh
- Change ways to access `ptr` in `Buffer` (#12587) @galipremsagar
- Version a parquet writer xfail (#12579) @galipremsagar
- Remove column names (#12578) @vuule
- Parquet reader optimization to address V100 regression. (#12577) @nvdbaranec
- Add support for `category` dtypes in CSV reader (#12571) @galipremsagar
- Remove `spill_lock` parameter from `SpillableBuffer.get_ptr()` (#12564) @madsbk
- Optimize `cudf::make_lists_column` (#12547) @ttnghia
- Remove `cudf::strings::repeat_strings_output_sizes` from Java and JNI (#12546) @ttnghia
- Test that cuInit is not called when RAPIDS_NO_INITIALIZE is set (#12545) @wence-
- Rework repeat_strings to use sizes-to-offsets utility (#12543) @davidwendt
- Replace exclusive_scan with sizes_to_offsets in cudf::lists::sequences (#12541) @davidwendt
- Rework nvtext::ngrams_tokenize to use sizes-to-offsets utility (#12540) @davidwendt
- Fix binary-ops gtests coded in namespace cudf::test (#12536) @davidwendt
- More `@acquire_spill_lock()` and `as_buffer(..., exposed=False)` (#12535) @madsbk
- Guard CUDA runtime APIs with error checking (#12531) @PointKernel
- Update TODOs from issue 10432. (#12528) @bdice
- Update rapids-cmake definitions version in GitHub Actions style checks. (#12511) @bdice
- Switch `engine=cudf` to the new `JSON` reader (#12509) @galipremsagar
- Fix SUM/MEAN aggregation type support. (#12503) @bdice
- Stop using pandas._testing (#12492) @vyasr
- Fix ROLLING_TEST gtests coded in namespace cudf::test...
v22.10.01
🚨 Breaking Changes
- Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
- Disable nvCOMP DEFLATE integration (#11811) @vuule
- Fix return type of `Index.isna` & `Index.notna` (#11769) @galipremsagar
- Remove `kwargs` in `read_csv` & `to_csv` (#11762) @galipremsagar
- Fix `cudf::partition*` APIs that do not return offsets for empty output table (#11709) @ttnghia
- Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
- Update zfill to match Python output (#11634) @davidwendt
- Upgrade `pandas` to `1.5` (#11617) @galipremsagar
- Change default value of `ordered` to `False` in `CategoricalDtype` (#11604) @galipremsagar
- Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
- Adding optional parquet reader schema (#11524) @hyperbolic2346
- Deprecate `skiprows` and `num_rows` in `read_orc` (#11522) @galipremsagar
- Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
- Drop support for `skiprows` and `num_rows` in `cudf.read_parquet` (#11480) @galipremsagar
- Disable Arrow S3 support by default. (#11470) @bdice
- Convert thrust::optional usages to std::optional (#11455) @robertmaynard
- Remove unused is_struct trait. (#11450) @bdice
- Refactor the `Buffer` class (#11447) @madsbk
- Return empty dataframe when reading an ORC file using empty `columns` option (#11446) @vuule
- Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
- Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
- Use the new JSON parser when the experimental reader is selected (#11364) @vuule
- Remove deprecated Series.applymap. (#11031) @bdice
- Remove deprecated expand parameter from str.findall. (#11030) @bdice
🐛 Bug Fixes
- Update cuda-python dependency to 11.7.1 (#11994) @shwina
- Fixes bug in temporary decompression space estimation before calling nvcomp (#11879) @abellina
- Handle `ptx` file paths during `strings_udf` import (#11862) @galipremsagar
- Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
- Reset `strings_udf` CEC and solve several related issues (#11846) @brandon-b-miller
- Fix bug in new shuffle-based groupby implementation (#11836) @rjzamora
- Fix `is_valid` checks in `Scalar._binaryop` (#11818) @wence-
- Fix operator `NotImplemented` issue with `numpy` (#11816) @galipremsagar
- Disable nvCOMP DEFLATE integration (#11811) @vuule
- Build `strings_udf` package with other python packages in nightlies (#11808) @brandon-b-miller
- Revert problematic shuffle=explicit-comms changes (#11803) @rjzamora
- Fix regex out-of-bounds write in strided rows logic (#11797) @davidwendt
- Build `cudf` locally before building `strings_udf` conda packages in CI (#11785) @brandon-b-miller
- Fix an issue in cudf::row_bit_count involving structs and lists at multiple levels. (#11779) @nvdbaranec
- Fix return type of `Index.isna` & `Index.notna` (#11769) @galipremsagar
- Fix issue with set-item in case of `list` and `struct` types (#11760) @galipremsagar
- Ensure all libcudf APIs run on cudf's default stream (#11759) @vyasr
- Resolve dask_cudf failures caused by upstream groupby changes (#11755) @rjzamora
- Fix ORC string sum statistics (#11740) @vuule
- Add `strings_udf` package for python 3.9 (#11730) @brandon-b-miller
- Ensure that all tests launch kernels on cudf's default stream (#11726) @vyasr
- Don't assume stream is a compile-time constant expression (#11725) @vyasr
- Fix get_thrust.cmake format at patch command (#11715) @davidwendt
- Fix `cudf::partition*` APIs that do not return offsets for empty output table (#11709) @ttnghia
- Fix cudf::lists::sort_lists for NaN and Infinity values (#11703) @davidwendt
- Modify ORC reader timestamp parsing to match the apache reader behavior (#11699) @vuule
- Fix `DataFrame.from_arrow` to preserve type metadata (#11698) @galipremsagar
- Fix compile error due to missing header (#11697) @ttnghia
- Default to Snappy compression in `to_orc` when using cuDF or Dask (#11690) @vuule
- Fix an issue related to `MultiIndex` when `group_keys=True` (#11689) @galipremsagar
- Transfer correct dtype to exploded column (#11687) @wence-
- Ignore protobuf generated files in `mypy` checks (#11685) @galipremsagar
- Maintain the index name after `.loc` (#11677) @shwina
- Fix issue with extracting nested column data & dtype preservation (#11671) @galipremsagar
- Ensure that all cudf tests and benchmarks are conda env aware (#11666) @robertmaynard
- Update to Thrust 1.17.2 to fix cub ODR issues (#11665) @robertmaynard
- Fix multi-file remote datasource bug (#11655) @rjzamora
- Fix invalid regex quantifier check to not include alternation (#11654) @davidwendt
- Fix bug in `device_write()`: it uses an incorrect size (#11651) @madsbk
- fixes overflows in benchmarks (#11649) @elstehle
- Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
- Fix compile error in benchmark nested_json.cpp (#11637) @davidwendt
- Update zfill to match Python output (#11634) @davidwendt
- Removed converted type for INT32 and INT64 since they do not convert (#11627) @hyperbolic2346
- Fix host scalars construction of nested types (#11612) @galipremsagar
- Fix compile warning in nested_json_gpu.cu (#11607) @davidwendt
- Change default value of `ordered` to `False` in `CategoricalDtype` (#11604) @galipremsagar
- Preserve order if necessary when deduping categoricals internally (#11597) @brandon-b-miller
- Add is_timestamp test for leap second (60) (#11594) @davidwendt
- Fix an issue with `to_arrow` when column name type is not a string (#11590) @galipremsagar
- Fix exception in segmented-reduce benchmark (#11588) @davidwendt
- Fix encode/decode of negative timestamps in ORC reader/writer (#11586) @vuule
- Correct distribution data type in `quantiles` benchmark (#11584) @vuule
- Fix multibyte_split benchmark for host buffers (#11583) @upsj
- xfail custreamz display test for now (#11567) @shwina
- Fix JNI for TableWithMeta to use schema_info instead of column_names (#11566) @jlowe
- Reduce code duplication for `dask` & `distributed` nightly/stable installs (#11565) @galipremsagar
- Fix groupby failures in dask_cudf CI (#11561) @rjzamora
- Fix for pivot: error when 'values' is a multicharacter string (#11538) @shaswat-indian
- find_package(cudf) + arrow9 usable with cudf build directory (#11535) @robertmaynard
- Fixing crash when writing binary nested data in parquet (#11526) @hyperbolic2346
- Fix for: error when assigning a value to an empty series (#11523) @shaswat-indian
- Fix invalid results from conditional-left-anti-join in debug build (#11517) @davidwendt
- Fix cmake error after upgrading to Arrow 9 (#11513) @ttnghia
- Fix reverse binary operators acting on a host value and cudf.Scalar (#11512) @bdice
- Update parquet fuzz tests to drop support for `skiprows` & `num_rows` (#11505) @galipremsagar
- Use rapids-cmake 22.10 best practice for RAPIDS.cmake location (#11493) @robertmaynard
- Handle some zero-sized corner cases in dlpack interop (#11449) @wence-
- Return empty dataframe when reading an ORC file using empty `columns` option (#11446) @vuule
- libcudf c++ example updated to CPM version 0.35.3 (#11417) @robertmaynard
- Fix regex quantifier check to include capture groups (#11373) @davidwendt
- Fix read_text when byte_range is aligned with field (#11371) @upsj
- Fix to_timestamps truncated subsecond calculation (#11367) @davidwendt
- column: calculate null_count before release()ing the cudf::column (#11365) @wence-
📖 Documentation
- Update `guide-to-udfs` notebook (#11861) @brandon-b-miller
- Update docstring for cudf.read_text (#11799) @GregoryKimball
- Add doc section for `list` & `struct` handling (#11770) @galipremsagar
- Document that minimum required CMake version is now 3.23.1 (#11751) @robertmaynard
- Update libcudf documentation build command in DOCUMENTATION.md (#11735) @davidwendt
- Add docs for use of string data to `DataFrame.apply` and `Series.apply` and update guide to UDFs notebook (#11733) @brandon-b-miller
- Enable more Pydocstyle rules (#11582) @bdice
- Remove unused cpp/img folder (#11554) @davidwendt
- Publish C++ developer docs (#11475) @vyasr
- Fix a misalignment in `cudf.get_dummies` docstring (#11443) @galipremsagar
- Update contributing doc to include links to the developer guides (#11390) @davidwendt
- Fix table_view_base doxygen format (#11340) @davidwendt
- Create main developer guide for Python (#11235) @vyasr
- Add developer documentation for benchmarking (#11122) @vyasr
- cuDF error handling document (#7917) @isVoid
🚀 New Features
- Add hasNull statistic reading ability to ORC (#11747) @devavret
- Add `istitle` to string UDFs (#11738) @brandon-b-miller
- JSON Column creation in GPU (#11714) @karthikeyann
- Adds option to take explicit nested schema for nested JSON reader (#11682) @elstehle
- Add BGZIP `data_chunk_reader` (#11652) @upsj
- Support DECIMAL order-by for RANGE window functions (#11645) @mythrocks
- changing version of cmake to 3.23.3 (#11619) @hyperbolic2346
- Generate unique keys table in java JNI `contiguousSplitGroups` (#11614) @res-life
- Generic type casting to support the new nested JSON reader (#11613) @elstehle
- JSON tree traversal (#11610) @karthikeyann
- Add casting operators to masked UDFs (#11578) @brandon-b-miller
- Adds type inference and type conversion for leaf-columns to the nested JSON parser (#11574) @elstehle
- Add strings 'like' function (#11558) @davidwendt
- Handle hyphen as literal for regex cclass when incomplete range (#11557) @davidwendt
- Enable ZSTD compression in ORC and Parquet writers (#11551) @vuule
- Adds support for json lines format to the nested JSON reader (#11534) @elstehle
- Adding optional parquet reader schema (#11524) @hyperbolic2346
- Adds GPU implementation of JSON-token-stream to JSON-tree (#11518) @karthikeyann
- Add `gdb` pretty-printers for simple types (#...