v22.10.00
🚨 Breaking Changes
- Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
- Disable nvCOMP DEFLATE integration (#11811) @vuule
- Fix return type of `Index.isna` & `Index.notna` (#11769) @galipremsagar
- Remove `kwargs` in `read_csv` & `to_csv` (#11762) @galipremsagar
- Fix `cudf::partition*` APIs that do not return offsets for empty output table (#11709) @ttnghia
- Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
- Update zfill to match Python output (#11634) @davidwendt
- Upgrade `pandas` to `1.5` (#11617) @galipremsagar
- Change default value of `ordered` to `False` in `CategoricalDtype` (#11604) @galipremsagar
- Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
- Adding optional parquet reader schema (#11524) @hyperbolic2346
- Deprecate `skiprows` and `num_rows` in `read_orc` (#11522) @galipremsagar
- Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
- Drop support for `skiprows` and `num_rows` in `cudf.read_parquet` (#11480) @galipremsagar
- Disable Arrow S3 support by default. (#11470) @bdice
- Convert thrust::optional usages to std::optional (#11455) @robertmaynard
- Remove unused is_struct trait. (#11450) @bdice
- Refactor the `Buffer` class (#11447) @madsbk
- Return empty dataframe when reading an ORC file using empty `columns` option (#11446) @vuule
- Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
- Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
- Use the new JSON parser when the experimental reader is selected (#11364) @vuule
- Remove deprecated Series.applymap. (#11031) @bdice
- Remove deprecated expand parameter from str.findall. (#11030) @bdice
🐛 Bug Fixes
- Fixes bug in temporary decompression space estimation before calling nvcomp (#11879) @abellina
- Handle `ptx` file paths during `strings_udf` import (#11862) @galipremsagar
- Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
- Reset `strings_udf` CEC and solve several related issues (#11846) @brandon-b-miller
- Fix bug in new shuffle-based groupby implementation (#11836) @rjzamora
- Fix `is_valid` checks in `Scalar._binaryop` (#11818) @wence-
- Fix operator `NotImplemented` issue with `numpy` (#11816) @galipremsagar
- Disable nvCOMP DEFLATE integration (#11811) @vuule
- Build `strings_udf` package with other python packages in nightlies (#11808) @brandon-b-miller
- Revert problematic shuffle=explicit-comms changes (#11803) @rjzamora
- Fix regex out-of-bounds write in strided rows logic (#11797) @davidwendt
- Build `cudf` locally before building `strings_udf` conda packages in CI (#11785) @brandon-b-miller
- Fix an issue in cudf::row_bit_count involving structs and lists at multiple levels. (#11779) @nvdbaranec
- Fix return type of `Index.isna` & `Index.notna` (#11769) @galipremsagar
- Fix issue with set-item in case of `list` and `struct` types (#11760) @galipremsagar
- Ensure all libcudf APIs run on cudf's default stream (#11759) @vyasr
- Resolve dask_cudf failures caused by upstream groupby changes (#11755) @rjzamora
- Fix ORC string sum statistics (#11740) @vuule
- Add `strings_udf` package for python 3.9 (#11730) @brandon-b-miller
- Ensure that all tests launch kernels on cudf's default stream (#11726) @vyasr
- Don't assume stream is a compile-time constant expression (#11725) @vyasr
- Fix get_thrust.cmake format at patch command (#11715) @davidwendt
- Fix `cudf::partition*` APIs that do not return offsets for empty output table (#11709) @ttnghia
- Fix cudf::lists::sort_lists for NaN and Infinity values (#11703) @davidwendt
- Modify ORC reader timestamp parsing to match the apache reader behavior (#11699) @vuule
- Fix `DataFrame.from_arrow` to preserve type metadata (#11698) @galipremsagar
- Fix compile error due to missing header (#11697) @ttnghia
- Default to Snappy compression in `to_orc` when using cuDF or Dask (#11690) @vuule
- Fix an issue related to `MultiIndex` when `group_keys=True` (#11689) @galipremsagar
- Transfer correct dtype to exploded column (#11687) @wence-
- Ignore protobuf generated files in `mypy` checks (#11685) @galipremsagar
- Maintain the index name after `.loc` (#11677) @shwina
- Fix issue with extracting nested column data & dtype preservation (#11671) @galipremsagar
- Ensure that all cudf tests and benchmarks are conda env aware (#11666) @robertmaynard
- Update to Thrust 1.17.2 to fix cub ODR issues (#11665) @robertmaynard
- Fix multi-file remote datasource bug (#11655) @rjzamora
- Fix invalid regex quantifier check to not include alternation (#11654) @davidwendt
- Fix bug in `device_write()`: it uses an incorrect size (#11651) @madsbk
- Fixes overflows in benchmarks (#11649) @elstehle
- Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
- Fix compile error in benchmark nested_json.cpp (#11637) @davidwendt
- Update zfill to match Python output (#11634) @davidwendt
- Removed converted type for INT32 and INT64 since they do not convert (#11627) @hyperbolic2346
- Fix host scalars construction of nested types (#11612) @galipremsagar
- Fix compile warning in nested_json_gpu.cu (#11607) @davidwendt
- Change default value of `ordered` to `False` in `CategoricalDtype` (#11604) @galipremsagar
- Preserve order if necessary when deduping categoricals internally (#11597) @brandon-b-miller
- Add is_timestamp test for leap second (60) (#11594) @davidwendt
- Fix an issue with `to_arrow` when column name type is not a string (#11590) @galipremsagar
- Fix exception in segmented-reduce benchmark (#11588) @davidwendt
- Fix encode/decode of negative timestamps in ORC reader/writer (#11586) @vuule
- Correct distribution data type in `quantiles` benchmark (#11584) @vuule
- Fix multibyte_split benchmark for host buffers (#11583) @upsj
- xfail custreamz display test for now (#11567) @shwina
- Fix JNI for TableWithMeta to use schema_info instead of column_names (#11566) @jlowe
- Reduce code duplication for `dask` & `distributed` nightly/stable installs (#11565) @galipremsagar
- Fix groupby failures in dask_cudf CI (#11561) @rjzamora
- Fix for pivot: error when 'values' is a multicharacter string (#11538) @shaswat-indian
- find_package(cudf) + arrow9 usable with cudf build directory (#11535) @robertmaynard
- Fixing crash when writing binary nested data in parquet (#11526) @hyperbolic2346
- Fix for: error when assigning a value to an empty series (#11523) @shaswat-indian
- Fix invalid results from conditional-left-anti-join in debug build (#11517) @davidwendt
- Fix cmake error after upgrading to Arrow 9 (#11513) @ttnghia
- Fix reverse binary operators acting on a host value and cudf.Scalar (#11512) @bdice
- Update parquet fuzz tests to drop support for `skiprows` & `num_rows` (#11505) @galipremsagar
- Use rapids-cmake 22.10 best practice for RAPIDS.cmake location (#11493) @robertmaynard
- Handle some zero-sized corner cases in dlpack interop (#11449) @wence-
- Return empty dataframe when reading an ORC file using empty `columns` option (#11446) @vuule
- libcudf c++ example updated to CPM version 0.35.3 (#11417) @robertmaynard
- Fix regex quantifier check to include capture groups (#11373) @davidwendt
- Fix read_text when byte_range is aligned with field (#11371) @upsj
- Fix to_timestamps truncated subsecond calculation (#11367) @davidwendt
- column: calculate null_count before release()ing the cudf::column (#11365) @wence-
📖 Documentation
- Update `guide-to-udfs` notebook (#11861) @brandon-b-miller
- Update docstring for cudf.read_text (#11799) @GregoryKimball
- Add doc section for `list` & `struct` handling (#11770) @galipremsagar
- Document that minimum required CMake version is now 3.23.1 (#11751) @robertmaynard
- Update libcudf documentation build command in DOCUMENTATION.md (#11735) @davidwendt
- Add docs for use of string data to `DataFrame.apply` and `Series.apply` and update guide to UDFs notebook (#11733) @brandon-b-miller
- Enable more Pydocstyle rules (#11582) @bdice
- Remove unused cpp/img folder (#11554) @davidwendt
- Publish C++ developer docs (#11475) @vyasr
- Fix a misalignment in `cudf.get_dummies` docstring (#11443) @galipremsagar
- Update contributing doc to include links to the developer guides (#11390) @davidwendt
- Fix table_view_base doxygen format (#11340) @davidwendt
- Create main developer guide for Python (#11235) @vyasr
- Add developer documentation for benchmarking (#11122) @vyasr
- cuDF error handling document (#7917) @isVoid
🚀 New Features
- Add hasNull statistic reading ability to ORC (#11747) @devavret
- Add `istitle` to string UDFs (#11738) @brandon-b-miller
- JSON Column creation in GPU (#11714) @karthikeyann
- Adds option to take explicit nested schema for nested JSON reader (#11682) @elstehle
- Add BGZIP `data_chunk_reader` (#11652) @upsj
- Support DECIMAL order-by for RANGE window functions (#11645) @mythrocks
- changing version of cmake to 3.23.3 (#11619) @hyperbolic2346
- Generate unique keys table in java JNI `contiguousSplitGroups` (#11614) @res-life
- Generic type casting to support the new nested JSON reader (#11613) @elstehle
- JSON tree traversal (#11610) @karthikeyann
- Add casting operators to masked UDFs (#11578) @brandon-b-miller
- Adds type inference and type conversion for leaf-columns to the nested JSON parser (#11574) @elstehle
- Add strings 'like' function (#11558) @davidwendt
- Handle hyphen as literal for regex cclass when incomplete range (#11557) @davidwendt
- Enable ZSTD compression in ORC and Parquet writers (#11551) @vuule
- Adds support for json lines format to the nested JSON reader (#11534) @elstehle
- Adding optional parquet reader schema (#11524) @hyperbolic2346
- Adds GPU implementation of JSON-token-stream to JSON-tree (#11518) @karthikeyann
- Add `gdb` pretty-printers for simple types (#11499) @upsj
- Add `create_random_column` function to the data generator (#11490) @vuule
- Add fluent API builder to `data_profile` (#11479) @vuule
- Adds Nested Json benchmark (#11466) @karthikeyann
- Convert thrust::optional usages to std::optional (#11455) @robertmaynard
- Python API for the future experimental JSON reader (#11426) @vuule
- Return schema info from JSON reader (#11419) @vuule
- Add regex ASCII flag support for matching builtin character classes (#11404) @davidwendt
- Truncate parquet column indexes (#11403) @etseidl
- Adds the end-to-end JSON parser implementation (#11388) @elstehle
- Use the new JSON parser when the experimental reader is selected (#11364) @vuule
- Add placeholder for the experimental JSON reader (#11334) @vuule
- Add read-only functions on string dtypes to `DataFrame.apply` and `Series.apply` (#11319) @brandon-b-miller
- Added 'crosstab' and 'pivot_table' features (#11314) @shaswat-indian
- Quickly error out when trying to build with unsupported nvcc versions (#11297) @robertmaynard
- Adds JSON tokenizer (#11264) @elstehle
- List lexicographic comparator (#11129) @devavret
- Add generic type inference for cuIO (#11121) @PointKernel
- Fully support nested types in `cudf::contains` (#10656) @ttnghia
- Support nested types in `lists::contains` (#10548) @ttnghia
🛠️ Improvements
- Pin `dask` and `distributed` for release (#11822) @galipremsagar
- Add examples for Nested JSON reader (#11814) @GregoryKimball
- Support shuffle-based groupby aggregations in dask_cudf (#11800) @rjzamora
- Update strings udf version updater script (#11772) @galipremsagar
- Remove `kwargs` in `read_csv` & `to_csv` (#11762) @galipremsagar
- Pass `dtype` param to avoid `pd.Series` warnings (#11761) @galipremsagar
- Enable `schema_element` & `keep_quotes` support in json reader (#11746) @galipremsagar
- Add ability to construct `ListColumn` when size is `None` (#11745) @galipremsagar
- Reduces memory requirements in JSON parser and adds bytes/s and peak memory usage to benchmarks (#11732) @elstehle
- Add missing copyright headers. (#11712) @bdice
- Fix copyright check issues in pre-commit (#11711) @bdice
- Include decimal in supported types for range window order-by columns (#11710) @mythrocks
- Disable very large column gtest for contiguous-split (#11706) @davidwendt
- Drop split_out=None test from groupby.agg (#11704) @wence-
- Use CubinLinker for CUDA Minor Version Compatibility (#11701) @gmarkall
- Add regex capture-group parameter to auto convert to non-capture groups (#11695) @davidwendt
- Add a `__dataframe__` method to the protocol dataframe object (#11692) @rgommers
- Special-case multibyte_split for single-byte delimiter (#11681) @upsj
- Remove isort exclusions (#11680) @bdice
- Refactor CSV reader benchmarks with nvbench (#11678) @PointKernel
- Check conda recipe headers with pre-commit (#11669) @bdice
- Remove redundant style check for clang-format. (#11668) @bdice
- Add support for `group_keys` in `groupby` (#11659) @galipremsagar
- Fix pandoc pinning. (#11658) @bdice
- Revert removal of skip_rows / num_rows options from the Parquet reader. (#11657) @nvdbaranec
- Update git metadata (#11647) @bdice
- Call set_null_count on a returning column if null-count is known (#11646) @davidwendt
- Fix some libcudf detail calls not passing the stream variable (#11642) @davidwendt
- Update to mypy 0.971 (#11640) @wence-
- Refactor strings strip functor to details header (#11635) @davidwendt
- Fix incorrect `nullCount` in `get_json_object` (#11633) @trxcllnt
- Simplify `hostdevice_vector` (#11631) @upsj
- Refactor parquet writer benchmarks with nvbench (#11623) @PointKernel
- Rework contains_scalar to check nulls at runtime (#11622) @davidwendt
- Fix incorrect memory resource used in rolling temp columns (#11618) @mythrocks
- Upgrade `pandas` to `1.5` (#11617) @galipremsagar
- Move type-dispatcher calls from traits.hpp to traits.cpp (#11616) @davidwendt
- Refactor parquet reader benchmarks with nvbench (#11611) @PointKernel
- Forward-merge branch-22.08 to branch-22.10 (#11608) @bdice
- Use stream in Java API. (#11601) @bdice
- Refactors of public/detail APIs, CUDF_FUNC_RANGE, stream handling. (#11600) @bdice
- Improve ORC writer benchmark with nvbench (#11598) @PointKernel
- Tune multibyte_split kernel (#11587) @upsj
- Move split_utils.cuh to strings/detail (#11585) @davidwendt
- Fix warnings due to compiler regression with `if constexpr` (#11581) @ttnghia
- Add full 24-bit dictionary support to Parquet writer (#11580) @etseidl
- Expose "explicit-comms" option in shuffle-based dask_cudf functions (#11576) @rjzamora
- Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
- Refactor dask_cudf groupby to use apply_concat_apply (#11571) @rjzamora
- Add ability to write `list(struct)` columns as `map` type in orc writer (#11568) @galipremsagar
- Add byte_range to multibyte_split benchmark + NVBench refactor (#11562) @upsj
- JNI support for writing binary columns in parquet (#11556) @revans2
- Support additional dictionary bit widths in Parquet writer (#11547) @etseidl
- Refactor string/numeric conversion utilities (#11545) @davidwendt
- Removing unnecessary asserts in parquet tests (#11544) @hyperbolic2346
- Clean up ORC reader benchmarks with NVBench (#11543) @PointKernel
- Reuse MurmurHash3_32 in Parquet page data. (#11528) @bdice
- Add hexadecimal value separators (#11527) @bdice
- Deprecate `skiprows` and `num_rows` in `read_orc` (#11522) @galipremsagar
- Struct support for `NULL_EQUALS` binary operation (#11520) @rwlee
- Bump hadoop-common from 3.2.3 to 3.2.4 in /java (#11516) @dependabot[bot]
- Fix Feather test warning. (#11511) @bdice
- copy_range ballot_syncs to have no execution dependency (#11508) @robertmaynard
- Upgrade to `arrow-9.x` (#11507) @galipremsagar
- Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
- Single-pass `multibyte_split` (#11500) @upsj
- Sanitize percentile_approx() output for empty input (#11498) @SrikarVanavasam
- Unpin `dask` and `distributed` for development (#11492) @galipremsagar
- Move SparkMurmurHash3_32 functor. (#11489) @bdice
- Refactor group_nunique.cu to use nullate::DYNAMIC for reduce-by-key functor (#11482) @davidwendt
- Drop support for `skiprows` and `num_rows` in `cudf.read_parquet` (#11480) @galipremsagar
- Add reduction `distinct_count` benchmark (#11473) @ttnghia
- Add groupby `nunique` aggregation benchmark (#11472) @ttnghia
- Disable Arrow S3 support by default. (#11470) @bdice
- Add groupby `max` aggregation benchmark (#11464) @ttnghia
- Extract Dremel encoding code from Parquet (#11461) @vyasr
- Add missing Thrust #includes. (#11457) @bdice
- Make CMake hooks verbose (#11456) @vyasr
- Control Parquet page size through Python API (#11454) @etseidl
- Add control of Parquet column index creation to python (#11453) @etseidl
- Remove unused is_struct trait. (#11450) @bdice
- Refactor the `Buffer` class (#11447) @madsbk
- Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
- Update to Thrust 1.17.0 (#11437) @bdice
- Add in JNI for parsing JSON data and getting the metadata back too. (#11431) @revans2
- Convert byte_array_view to use std::byte (#11424) @hyperbolic2346
- Deprecate unflatten_nested_columns (#11421) @SrikarVanavasam
- Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
- Add Spark list hashing Java tests (#11379) @bdice
- Move cmake to the build section. (#11376) @vyasr
- Remove use of CUDA driver API calls from libcudf (#11370) @shwina
- Add column constructor from device_uvector&& (#11356) @SrikarVanavasam
- Remove unused custreamz thirdparty directory (#11343) @vyasr
- Update jni version to 22.10.0-SNAPSHOT (#11338) @pxLi
- Enable using upstream jitify2 (#11287) @shwina
- Cache cudf.Scalar (#11246) @shwina
- Remove deprecated Series.applymap. (#11031) @bdice
- Remove deprecated expand parameter from str.findall. (#11030) @bdice