Release v22.10.00 · rapidsai/cudf

🚨 Breaking Changes

Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
Disable nvCOMP DEFLATE integration (#11811) @vuule
Fix return type of Index.isna & Index.notna (#11769) @galipremsagar
Remove kwargs in read_csv & to_csv (#11762) @galipremsagar
Fix cudf::partition* APIs that do not return offsets for empty output table (#11709) @ttnghia
Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
Update zfill to match Python output (#11634) @davidwendt
Upgrade pandas to 1.5 (#11617) @galipremsagar
Change default value of ordered to False in CategoricalDtype (#11604) @galipremsagar
Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
Adding optional parquet reader schema (#11524) @hyperbolic2346
Deprecate skiprows and num_rows in read_orc (#11522) @galipremsagar
Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
Drop support for skiprows and num_rows in cudf.read_parquet (#11480) @galipremsagar
Disable Arrow S3 support by default. (#11470) @bdice
Convert thrust::optional usages to std::optional (#11455) @robertmaynard
Remove unused is_struct trait. (#11450) @bdice
Refactor the Buffer class (#11447) @madsbk
Return empty dataframe when reading an ORC file using empty columns option (#11446) @vuule
Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
Use the new JSON parser when the experimental reader is selected (#11364) @vuule
Remove deprecated Series.applymap. (#11031) @bdice
Remove deprecated expand parameter from str.findall. (#11030) @bdice
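
For readers checking the zfill entry above (#11634), the following is a minimal sketch of the reference behaviour it names, using only Python's built-in `str.zfill`. It does not call cuDF and is not taken from the PR. The sample strings are illustrative assumptions, and which specific inputs changed in cuDF is not stated in these notes.

```python
# Reference behaviour of Python's built-in str.zfill, the target named by
# "Update zfill to match Python output (#11634)".
# The sample inputs are illustrative; they do not come from the PR.
samples = ["42", "-42", "+42", "abc", ""]

# Pad each string on the left with zeros to a width of 5.
# A leading '+' or '-' sign stays in front of the inserted zeros.
padded = [s.zfill(5) for s in samples]

for original, result in zip(samples, padded):
    print(repr(original), "->", repr(result))
```

The sign-handling case is the one most likely to differ between implementations, so it is worth comparing against existing results when upgrading.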

🐛 Bug Fixes

Fixes bug in temporary decompression space estimation before calling nvcomp (#11879) @abellina
Handle ptx file paths during strings_udf import (#11862) @galipremsagar
Disable Zstandard decompression on nvCOMP 2.4 and Pascal GPUs (#11856) @vuule
Reset strings_udf CEC and solve several related issues (#11846) @brandon-b-miller
Fix bug in new shuffle-based groupby implementation (#11836) @rjzamora
Fix is_valid checks in Scalar._binaryop (#11818) @wence-
Fix operator NotImplemented issue with numpy (#11816) @galipremsagar
Disable nvCOMP DEFLATE integration (#11811) @vuule
Build strings_udf package with other python packages in nightlies (#11808) @brandon-b-miller
Revert problematic shuffle=explicit-comms changes (#11803) @rjzamora
Fix regex out-of-bounds write in strided rows logic (#11797) @davidwendt
Build cudf locally before building strings_udf conda packages in CI (#11785) @brandon-b-miller
Fix an issue in cudf::row_bit_count involving structs and lists at multiple levels. (#11779) @nvdbaranec
Fix return type of Index.isna & Index.notna (#11769) @galipremsagar
Fix issue with set-item in case of list and struct types (#11760) @galipremsagar
Ensure all libcudf APIs run on cudf's default stream (#11759) @vyasr
Resolve dask_cudf failures caused by upstream groupby changes (#11755) @rjzamora
Fix ORC string sum statistics (#11740) @vuule
Add strings_udf package for python 3.9 (#11730) @brandon-b-miller
Ensure that all tests launch kernels on cudf's default stream (#11726) @vyasr
Don't assume stream is a compile-time constant expression (#11725) @vyasr
Fix get_thrust.cmake format at patch command (#11715) @davidwendt
Fix cudf::partition* APIs that do not return offsets for empty output table (#11709) @ttnghia
Fix cudf::lists::sort_lists for NaN and Infinity values (#11703) @davidwendt
Modify ORC reader timestamp parsing to match the apache reader behavior (#11699) @vuule
Fix DataFrame.from_arrow to preserve type metadata (#11698) @galipremsagar
Fix compile error due to missing header (#11697) @ttnghia
Default to Snappy compression in to_orc when using cuDF or Dask (#11690) @vuule
Fix an issue related to MultiIndex when group_keys=True (#11689) @galipremsagar
Transfer correct dtype to exploded column (#11687) @wence-
Ignore protobuf generated files in mypy checks (#11685) @galipremsagar
Maintain the index name after .loc (#11677) @shwina
Fix issue with extracting nested column data & dtype preservation (#11671) @galipremsagar
Ensure that all cudf tests and benchmarks are conda env aware (#11666) @robertmaynard
Update to Thrust 1.17.2 to fix cub ODR issues (#11665) @robertmaynard
Fix multi-file remote datasource bug (#11655) @rjzamora
Fix invalid regex quantifier check to not include alternation (#11654) @davidwendt
Fix bug in device_write(): it uses an incorrect size (#11651) @madsbk
Fix overflows in benchmarks (#11649) @elstehle
Fix regex negated classes to not automatically include new-lines (#11644) @davidwendt
Fix compile error in benchmark nested_json.cpp (#11637) @davidwendt
Update zfill to match Python output (#11634) @davidwendt
Removed converted type for INT32 and INT64 since they do not convert (#11627) @hyperbolic2346
Fix host scalars construction of nested types (#11612) @galipremsagar
Fix compile warning in nested_json_gpu.cu (#11607) @davidwendt
Change default value of ordered to False in CategoricalDtype (#11604) @galipremsagar
Preserve order if necessary when deduping categoricals internally (#11597) @brandon-b-miller
Add is_timestamp test for leap second (60) (#11594) @davidwendt
Fix an issue with to_arrow when column name type is not a string (#11590) @galipremsagar
Fix exception in segmented-reduce benchmark (#11588) @davidwendt
Fix encode/decode of negative timestamps in ORC reader/writer (#11586) @vuule
Correct distribution data type in quantiles benchmark (#11584) @vuule
Fix multibyte_split benchmark for host buffers (#11583) @upsj
xfail custreamz display test for now (#11567) @shwina
Fix JNI for TableWithMeta to use schema_info instead of column_names (#11566) @jlowe
Reduce code duplication for dask & distributed nightly/stable installs (#11565) @galipremsagar
Fix groupby failures in dask_cudf CI (#11561) @rjzamora
Fix for pivot: error when 'values' is a multicharacter string (#11538) @shaswat-indian
find_package(cudf) + arrow9 usable with cudf build directory (#11535) @robertmaynard
Fixing crash when writing binary nested data in parquet (#11526) @hyperbolic2346
Fix for: error when assigning a value to an empty series (#11523) @shaswat-indian
Fix invalid results from conditional-left-anti-join in debug build (#11517) @davidwendt
Fix cmake error after upgrading to Arrow 9 (#11513) @ttnghia
Fix reverse binary operators acting on a host value and cudf.Scalar (#11512) @bdice
Update parquet fuzz tests to drop support for skiprows & num_rows (#11505) @galipremsagar
Use rapids-cmake 22.10 best practice for RAPIDS.cmake location (#11493) @robertmaynard
Handle some zero-sized corner cases in dlpack interop (#11449) @wence-
Return empty dataframe when reading an ORC file using empty columns option (#11446) @vuule
libcudf C++ example updated to CPM version 0.35.3 (#11417) @robertmaynard
Fix regex quantifier check to include capture groups (#11373) @davidwendt
Fix read_text when byte_range is aligned with field (#11371) @upsj
Fix to_timestamps truncated subsecond calculation (#11367) @davidwendt
column: calculate null_count before release()ing the cudf::column (#11365) @wence-

📖 Documentation

Update guide-to-udfs notebook (#11861) @brandon-b-miller
Update docstring for cudf.read_text (#11799) @GregoryKimball
Add doc section for list & struct handling (#11770) @galipremsagar
Document that minimum required CMake version is now 3.23.1 (#11751) @robertmaynard
Update libcudf documentation build command in DOCUMENTATION.md (#11735) @davidwendt
Add docs for use of string data to DataFrame.apply and Series.apply and update guide to UDFs notebook (#11733) @brandon-b-miller
Enable more Pydocstyle rules (#11582) @bdice
Remove unused cpp/img folder (#11554) @davidwendt
Publish C++ developer docs (#11475) @vyasr
Fix a misalignment in cudf.get_dummies docstring (#11443) @galipremsagar
Update contributing doc to include links to the developer guides (#11390) @davidwendt
Fix table_view_base doxygen format (#11340) @davidwendt
Create main developer guide for Python (#11235) @vyasr
Add developer documentation for benchmarking (#11122) @vyasr
cuDF error handling document (#7917) @isVoid

🚀 New Features

Add hasNull statistic reading ability to ORC (#11747) @devavret
Add istitle to string UDFs (#11738) @brandon-b-miller
JSON Column creation in GPU (#11714) @karthikeyann
Adds option to take explicit nested schema for nested JSON reader (#11682) @elstehle
Add BGZIP data_chunk_reader (#11652) @upsj
Support DECIMAL order-by for RANGE window functions (#11645) @mythrocks
changing version of cmake to 3.23.3 (#11619) @hyperbolic2346
Generate unique keys table in java JNI contiguousSplitGroups (#11614) @res-life
Generic type casting to support the new nested JSON reader (#11613) @elstehle
JSON tree traversal (#11610) @karthikeyann
Add casting operators to masked UDFs (#11578) @brandon-b-miller
Adds type inference and type conversion for leaf-columns to the nested JSON parser (#11574) @elstehle
Add strings 'like' function (#11558) @davidwendt
Handle hyphen as literal for regex cclass when incomplete range (#11557) @davidwendt
Enable ZSTD compression in ORC and Parquet writers (#11551) @vuule
Adds support for json lines format to the nested JSON reader (#11534) @elstehle
Adding optional parquet reader schema (#11524) @hyperbolic2346
Adds GPU implementation of JSON-token-stream to JSON-tree (#11518) @karthikeyann
Add gdb pretty-printers for simple types (#11499) @upsj
Add create_random_column function to the data generator (#11490) @vuule
Add fluent API builder to data_profile (#11479) @vuule
Adds Nested Json benchmark (#11466) @karthikeyann
Convert thrust::optional usages to std::optional (#11455) @robertmaynard
Python API for the future experimental JSON reader (#11426) @vuule
Return schema info from JSON reader (#11419) @vuule
Add regex ASCII flag support for matching builtin character classes (#11404) @davidwendt
Truncate parquet column indexes (#11403) @etseidl
Adds the end-to-end JSON parser implementation (#11388) @elstehle
Use the new JSON parser when the experimental reader is selected (#11364) @vuule
Add placeholder for the experimental JSON reader (#11334) @vuule
Add read-only functions on string dtypes to DataFrame.apply and Series.apply (#11319) @brandon-b-miller
Added 'crosstab' and 'pivot_table' features (#11314) @shaswat-indian
Quickly error out when trying to build with unsupported nvcc versions (#11297) @robertmaynard
Adds JSON tokenizer (#11264) @elstehle
List lexicographic comparator (#11129) @devavret
Add generic type inference for cuIO (#11121) @PointKernel
Fully support nested types in cudf::contains (#10656) @ttnghia
Support nested types in lists::contains (#10548) @ttnghia

🛠️ Improvements

Pin dask and distributed for release (#11822) @galipremsagar
Add examples for Nested JSON reader (#11814) @GregoryKimball
Support shuffle-based groupby aggregations in dask_cudf (#11800) @rjzamora
Update strings udf version updater script (#11772) @galipremsagar
Remove kwargs in read_csv & to_csv (#11762) @galipremsagar
Pass dtype param to avoid pd.Series warnings (#11761) @galipremsagar
Enable schema_element & keep_quotes support in json reader (#11746) @galipremsagar
Add ability to construct ListColumn when size is None (#11745) @galipremsagar
Reduces memory requirements in JSON parser and adds bytes/s and peak memory usage to benchmarks (#11732) @elstehle
Add missing copyright headers. (#11712) @bdice
Fix copyright check issues in pre-commit (#11711) @bdice
Include decimal in supported types for range window order-by columns (#11710) @mythrocks
Disable very large column gtest for contiguous-split (#11706) @davidwendt
Drop split_out=None test from groupby.agg (#11704) @wence-
Use CubinLinker for CUDA Minor Version Compatibility (#11701) @gmarkall
Add regex capture-group parameter to auto convert to non-capture groups (#11695) @davidwendt
Add a __dataframe__ method to the protocol dataframe object (#11692) @rgommers
Special-case multibyte_split for single-byte delimiter (#11681) @upsj
Remove isort exclusions (#11680) @bdice
Refactor CSV reader benchmarks with nvbench (#11678) @PointKernel
Check conda recipe headers with pre-commit (#11669) @bdice
Remove redundant style check for clang-format. (#11668) @bdice
Add support for group_keys in groupby (#11659) @galipremsagar
Fix pandoc pinning. (#11658) @bdice
Revert removal of skip_rows / num_rows options from the Parquet reader. (#11657) @nvdbaranec
Update git metadata (#11647) @bdice
Call set_null_count on a returning column if null-count is known (#11646) @davidwendt
Fix some libcudf detail calls not passing the stream variable (#11642) @davidwendt
Update to mypy 0.971 (#11640) @wence-
Refactor strings strip functor to details header (#11635) @davidwendt
Fix incorrect nullCount in get_json_object (#11633) @trxcllnt
Simplify hostdevice_vector (#11631) @upsj
Refactor parquet writer benchmarks with nvbench (#11623) @PointKernel
Rework contains_scalar to check nulls at runtime (#11622) @davidwendt
Fix incorrect memory resource used in rolling temp columns (#11618) @mythrocks
Upgrade pandas to 1.5 (#11617) @galipremsagar
Move type-dispatcher calls from traits.hpp to traits.cpp (#11616) @davidwendt
Refactor parquet reader benchmarks with nvbench (#11611) @PointKernel
Forward-merge branch-22.08 to branch-22.10 (#11608) @bdice
Use stream in Java API. (#11601) @bdice
Refactors of public/detail APIs, CUDF_FUNC_RANGE, stream handling. (#11600) @bdice
Improve ORC writer benchmark with nvbench (#11598) @PointKernel
Tune multibyte_split kernel (#11587) @upsj
Move split_utils.cuh to strings/detail (#11585) @davidwendt
Fix warnings due to compiler regression with if constexpr (#11581) @ttnghia
Add full 24-bit dictionary support to Parquet writer (#11580) @etseidl
Expose "explicit-comms" option in shuffle-based dask_cudf functions (#11576) @rjzamora
Move cudf::strings::findall_record to cudf::strings::findall (#11575) @davidwendt
Refactor dask_cudf groupby to use apply_concat_apply (#11571) @rjzamora
Add ability to write list(struct) columns as map type in orc writer (#11568) @galipremsagar
Add byte_range to multibyte_split benchmark + NVBench refactor (#11562) @upsj
JNI support for writing binary columns in parquet (#11556) @revans2
Support additional dictionary bit widths in Parquet writer (#11547) @etseidl
Refactor string/numeric conversion utilities (#11545) @davidwendt
Removing unnecessary asserts in parquet tests (#11544) @hyperbolic2346
Clean up ORC reader benchmarks with NVBench (#11543) @PointKernel
Reuse MurmurHash3_32 in Parquet page data. (#11528) @bdice
Add hexadecimal value separators (#11527) @bdice
Deprecate skiprows and num_rows in read_orc (#11522) @galipremsagar
Struct support for NULL_EQUALS binary operation (#11520) @rwlee
Bump hadoop-common from 3.2.3 to 3.2.4 in /java (#11516) @dependabot[bot]
Fix Feather test warning. (#11511) @bdice
copy_range ballot_syncs to have no execution dependency (#11508) @robertmaynard
Upgrade to arrow-9.x (#11507) @galipremsagar
Remove support for skip_rows / num_rows options in the parquet reader. (#11503) @nvdbaranec
Single-pass multibyte_split (#11500) @upsj
Sanitize percentile_approx() output for empty input (#11498) @SrikarVanavasam
Unpin dask and distributed for development (#11492) @galipremsagar
Move SparkMurmurHash3_32 functor. (#11489) @bdice
Refactor group_nunique.cu to use nullate::DYNAMIC for reduce-by-key functor (#11482) @davidwendt
Drop support for skiprows and num_rows in cudf.read_parquet (#11480) @galipremsagar
Add reduction distinct_count benchmark (#11473) @ttnghia
Add groupby nunique aggregation benchmark (#11472) @ttnghia
Disable Arrow S3 support by default. (#11470) @bdice
Add groupby max aggregation benchmark (#11464) @ttnghia
Extract Dremel encoding code from Parquet (#11461) @vyasr
Add missing Thrust #includes. (#11457) @bdice
Make CMake hooks verbose (#11456) @vyasr
Control Parquet page size through Python API (#11454) @etseidl
Add control of Parquet column index creation to python (#11453) @etseidl
Remove unused is_struct trait. (#11450) @bdice
Refactor the Buffer class (#11447) @madsbk
Refactor pad_side and strip_type enums into side_type enum (#11438) @davidwendt
Update to Thrust 1.17.0 (#11437) @bdice
Add in JNI for parsing JSON data and getting the metadata back too. (#11431) @revans2
Convert byte_array_view to use std::byte (#11424) @hyperbolic2346
Deprecate unflatten_nested_columns (#11421) @SrikarVanavasam
Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 (#11383) @bdice
Add Spark list hashing Java tests (#11379) @bdice
Move cmake to the build section. (#11376) @vyasr
Remove use of CUDA driver API calls from libcudf (#11370) @shwina
Add column constructor from device_uvector&& (#11356) @SrikarVanavasam
Remove unused custreamz thirdparty directory (#11343) @vyasr
Update jni version to 22.10.0-SNAPSHOT (#11338) @pxLi
Enable using upstream jitify2 (#11287) @shwina
Cache cudf.Scalar (#11246) @shwina
Remove deprecated Series.applymap. (#11031) @bdice
Remove deprecated expand parameter from str.findall. (#11030) @bdice
