[RELEASE] cudf v24.10 #16943
Merged
…6454) `cudf.Series` is a public constructor that happens to accept a private `ColumnBase` object. Many ops return Columns, and it is natural to want to reconstruct a `Series` from them. This PR adds a `SingleColumnFrame._from_column` classmethod for instances where we need to wrap a new column in an `Index` or `Series`. This constructor also bypasses some unneeded validation in `ColumnAccessor` and `Series`. Authors: - Matthew Roeschke (https://github.com/mroeschke) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #16454
Forward-merge branch-24.08 into branch-24.10
Add `stream` param to a bunch of stream compaction APIs. Authors: - Jayjeet Chakraborty (https://github.com/JayjeetAtGithub) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Nghia Truong (https://github.com/ttnghia) - Mark Harris (https://github.com/harrism) - Karthikeyan (https://github.com/karthikeyann) - Mike Wilson (https://github.com/hyperbolic2346) URL: #16295
…rsion (#16503) Contributes to rapidsai/build-planning#58. `scikit-build-core==0.10.0` was released today (https://github.com/scikit-build/scikit-build-core/releases/tag/v0.10.0), and wheel-building configurations across RAPIDS are incompatible with it. This proposes upgrading to that version and fixing configuration here in a way that: * is compatible with that new `scikit-build-core` version * takes advantage of the forward-compatibility mechanism (`minimum-version`) that `scikit-build-core` provides, to reduce the risk of needing to do this again in the future Authors: - James Lamb (https://github.com/jameslamb) Approvers: - https://github.com/jakirkham URL: #16503
Exposes the `stream` param in transform APIs Authors: - Jayjeet Chakraborty (https://github.com/JayjeetAtGithub) Approvers: - Bradley Dice (https://github.com/bdice) - Karthikeyan (https://github.com/karthikeyann) URL: #16452
…16498) Demonstrates the conversion from an `arrow::StringViewArray` to a `cudf::column`. Authors: - Jayjeet Chakraborty (https://github.com/JayjeetAtGithub) Approvers: - Nghia Truong (https://github.com/ttnghia) URL: #16498
Changes the integer type for `cudf::strings::ipv4_to_integers` and `cudf::strings::integers_to_ipv4` to use UINT32 types instead of INT64. The INT64 type was originally chosen because libcudf did not support unsigned types at the time. This is a breaking change since the basic input/output type is changed. Closes #16324 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Matthew Roeschke (https://github.com/mroeschke) - https://github.com/brandon-b-miller - Karthikeyan (https://github.com/karthikeyann) URL: #16489
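To illustrate why UINT32 is the natural type for the IPv4 conversion described above, here is a minimal pure-Python sketch. It is not the libcudf implementation; the helper names `ipv4_to_uint32` and `uint32_to_ipv4` are hypothetical, and the sketch does no input validation.

```python
UINT32_MAX = 2**32 - 1
INT32_MAX = 2**31 - 1


def ipv4_to_uint32(address: str) -> int:
    """Pack a dotted-quad IPv4 string into a single integer (hypothetical helper)."""
    value = 0
    for octet in address.split("."):
        # Each octet occupies 8 bits; shift the accumulator and append it.
        value = (value << 8) | int(octet)
    return value


def uint32_to_ipv4(value: int) -> str:
    """Unpack an integer back into dotted-quad form (hypothetical helper)."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))


largest = ipv4_to_uint32("255.255.255.255")
# The largest address fits exactly in an unsigned 32-bit integer,
# but overflows a signed 32-bit one.
print(largest == UINT32_MAX, largest > INT32_MAX)
```

Every IPv4 address maps to a value in the UINT32 range, so a wider signed type is only needed when unsigned types are unavailable.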
A few small tweaks to `update-version.sh` for alignment across RAPIDS. The `UCX_PY` curl call is unused. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - James Lamb (https://github.com/jameslamb) URL: #16506
This PR updates pre-commit hooks to the latest versions that are supported without causing style check errors. Authors: - Kyle Edwards (https://github.com/KyleFromNVIDIA) Approvers: - James Lamb (https://github.com/jameslamb) URL: #16510
This PR adopts some work from @srinivasyadav18 with additional modifications. This is meant to complement #16484. Authors: - Bradley Dice (https://github.com/bdice) - Srinivas Yadav (https://github.com/srinivasyadav18) Approvers: - Yunsong Wang (https://github.com/PointKernel) - Srinivas Yadav (https://github.com/srinivasyadav18) URL: #16497
closes #15278 In the JSON reader, this PR allows list-typed values to also be forced to strings when mixed types as string is enabled and a user-supplied schema specifies the column as string. Authors: - Karthikeyan (https://github.com/karthikeyann) - Nghia Truong (https://github.com/ttnghia) Approvers: - Nghia Truong (https://github.com/ttnghia) - Shruti Shivakumar (https://github.com/shrshi) URL: #16472
Removes the overloaded `cudf::io::text::multibyte_split` API, which was deprecated in 24.08 and is no longer needed. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Bradley Dice (https://github.com/bdice) URL: #16501
Authors: - Jayjeet Chakraborty (https://github.com/JayjeetAtGithub) Approvers: - Karthikeyan (https://github.com/karthikeyann) URL: #16423
This change updates the JSON normalization calls (quote and whitespace normalization) to take an owning `device_buffer` as input rather than a `device_uvector`. This makes it easy to hand over a string_column's char buffer to the normalization calls. Authors: - Karthikeyan (https://github.com/karthikeyann) Approvers: - David Wendt (https://github.com/davidwendt) - Shruti Shivakumar (https://github.com/shrshi) URL: #16520
closes #14794 Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Matthew Murray (https://github.com/Matt711) URL: #16519
#16516) xref #16507 `date_range` generates its dates via `range`, and the end of this range was calculated via `math.ceil((end - start) / freq)`. If `(end - start) / freq` did not produce a remainder, `math.ceil` would leave the value unchanged instead of incrementing it by `1`, so the last date was not captured. Instead, this PR uses `math.floor((end - start) / freq) + 1` to ensure the last date is always captured. Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Bradley Dice (https://github.com/bdice) URL: #16516
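The off-by-one described above can be reproduced with plain integers. This is a sketch of the arithmetic only, not the cuDF code; `periods_with_ceil`, `periods_with_floor`, and `generate` are hypothetical helper names.

```python
import math


def periods_with_ceil(start, end, freq):
    # Old calculation: correct only when (end - start) / freq has a remainder.
    return math.ceil((end - start) / freq)


def periods_with_floor(start, end, freq):
    # New calculation: always includes the final point when it lands on `end`.
    return math.floor((end - start) / freq) + 1


def generate(start, freq, periods):
    # Stand-in for building the dates from a `range` of step indices.
    return [start + i * freq for i in range(periods)]


# (end - start) divides evenly by freq: ceil drops the last value.
print(generate(0, 2, periods_with_ceil(0, 10, 2)))   # [0, 2, 4, 6, 8]
print(generate(0, 2, periods_with_floor(0, 10, 2)))  # [0, 2, 4, 6, 8, 10]
```

When the division leaves a remainder (for example `end = 11`), both formulas give the same count, which is why the bug only appeared for evenly divisible ranges.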
xref #16507 Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Matthew Murray (https://github.com/Matt711) URL: #16515
xref #16507 I would say this was a bug before because we would silently return a new DataFrame with just `len(set(column_labels))` columns when selecting by column. Now this operation raises, since duplicate column labels are generally not supported. Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - https://github.com/brandon-b-miller URL: #16514
Removing some more deprecated public libcudf APIs. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Karthikeyan (https://github.com/karthikeyann) URL: #16524
The JSON reader set the batch size to `INT_MAX` bytes since the motivation for implementing a batched JSON reader was to parse source files whose total size is larger than `INT_MAX` (#16138, #16162). However, we can use a much smaller batch size to evaluate the correctness of the reader and speed up tests significantly. This PR focuses on reducing runtime of the batched reader test by setting the batch size to be used by the reader as an environment variable. The runtime of `JsonLargeReaderTest.MultiBatch` in `LARGE_STRINGS_TEST` gtest drops from ~52s to ~3s. Authors: - Shruti Shivakumar (https://github.com/shrshi) Approvers: - Nghia Truong (https://github.com/ttnghia) - David Wendt (https://github.com/davidwendt) - Bradley Dice (https://github.com/bdice) URL: #16502
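A minimal sketch of the idea above: the batch size defaults to `INT_MAX` bytes but can be overridden through an environment variable so that tests exercise the multi-batch path on small inputs. The variable name `EXAMPLE_JSON_BATCH_SIZE` and both helper functions are assumptions for illustration, not the names used in libcudf, and the sketch splits on fixed byte offsets rather than at record boundaries.

```python
import os

DEFAULT_BATCH_SIZE = 2**31 - 1  # INT_MAX bytes
ENV_VAR = "EXAMPLE_JSON_BATCH_SIZE"  # hypothetical variable name


def get_batch_size(environ=os.environ):
    """Return the batch size in bytes, honoring an optional override."""
    raw = environ.get(ENV_VAR)
    return int(raw) if raw else DEFAULT_BATCH_SIZE


def split_into_batches(total_bytes, batch_size):
    """Return (offset, size) pairs that cover total_bytes in order."""
    return [
        (offset, min(batch_size, total_bytes - offset))
        for offset in range(0, total_bytes, batch_size)
    ]


# With a tiny override, even a 10-byte source is read in several batches.
print(split_into_batches(10, get_batch_size({ENV_VAR: "4"})))
```

Shrinking the batch size this way keeps the batching logic under test without needing a source file larger than `INT_MAX` bytes.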
…rings (#16536) Recently some JSON parsing was updated so lists could be returned as strings. This updates the Java code so that, when cleaning up the results to match the desired schema, it can properly handle corner cases associated with lists and structs. Tests are covered in the Spark plugin, but I am happy to add some here if we really want to validate that part of this. Authors: - Robert (Bobby) Evans (https://github.com/revans2) Approvers: - Nghia Truong (https://github.com/ttnghia) URL: #16536
Adds `const` declarations to appropriate member functions in class `cudf::io::text::byte_range_info` and moves the ctor implementation to .cpp file. This helps with using the `byte_range_info` objects in `const` variables and inside of `const` functions. Found while working on #15983 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Muhammad Haseeb (https://github.com/mhaseeb123) - Bradley Dice (https://github.com/bdice) URL: #16518
Fixes the specialized behavior for an all-empty input column in the strings split APIs. Verifying behavior with Pandas `str.split( pat, expand, regex )` `pat=None -- whitespace` `expand=False -- record APIs` `regex=True -- re APIs` - [x] `split` - [x] `split` - whitespace - [x] `rsplit` - [x] `rsplit` - whitespace - [x] `split_record` - [x] `split_record` - whitespace - [x] `rsplit_record` - [x] `rsplit_record` - whitespace - [x] `split_re` - [x] `rsplit_re` - [x] `split_record_re` - [x] `rsplit_record_re` Closes #16453 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Mark Harris (https://github.com/harrism) - Bradley Dice (https://github.com/bdice) - Mike Wilson (https://github.com/hyperbolic2346) URL: #16466
Removes the pair-iterator benchmark logic. The remaining benchmarks use the null-replacement-iterator which uses the libcudf pair-iterator internally. There is no need for benchmarking this unique iterator pattern that is not used by libcudf. The `cpp/benchmarks/iterator/iterator.cu` failed to compile with gcc 12 because the sum-reduce function cannot resolve adding `thrust::pair` objects together likely due to some recent changes in CCCL. Regardless, adding `thrust::pair` objects is not something we need to benchmark. The existing benchmark benchmarks libcudf's usage of the internal pair-iterator correctly. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Yunsong Wang (https://github.com/PointKernel) - Bradley Dice (https://github.com/bdice) URL: #16511
This PR removes hardcoded Python versions from CI workflows. It is a prerequisite for dropping Python 3.9. See rapidsai/build-planning#88. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - James Lamb (https://github.com/jameslamb) URL: #16540
After dask/dask-expr#1114, Dask cuDF must register specific `read_parquet` and `read_csv` functions to be used when query-planning is enabled (the default). **This PR is required for CI to pass with dask>2024.8.0** **NOTE**: It probably doesn't make sense to add specific tests for this change. Once the 2024.7.1 dask pin is removed, all `dask_cudf` tests using `read_parquet` and `read_csv` will fail without this change... Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - Mads R. B. Kristensen (https://github.com/madsbk) - Benjamin Zaitlen (https://github.com/quasiben) URL: #16535
) When Python integers are compared to a series of integers, the result can always be correctly defined no matter the value of the Python integer. This was always a very mild issue. But with NumPy 2 no longer upcasting the computation result type based on the value, even things like: ``` cudf.Series([1, 2, 3], dtype="int8") < 1000 ``` would fail. (Similar paths could be taken for other integer scalars, but these would mostly be nice for performance.) N.B. NumPy/pandas also support exact comparisons when mixing e.g. uint64 and int64. This is another rare exception that cudf currently does not support. Closes gh-16282 Authors: - Sebastian Berg (https://github.com/seberg) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: #16532
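The claim that such a comparison is always well defined can be sketched in plain Python: if the scalar lies outside the dtype's range, the answer is the same for every element and no cast of the scalar is needed. This is an illustration of the reasoning only, not the cuDF implementation; `less_than_scalar` is a hypothetical helper and int8 bounds are assumed.

```python
INT8_MIN, INT8_MAX = -128, 127


def less_than_scalar(values, scalar, dtype_min=INT8_MIN, dtype_max=INT8_MAX):
    """Evaluate `values < scalar` without casting an out-of-range scalar."""
    if scalar > dtype_max:
        # Every representable value is smaller than the scalar.
        return [True] * len(values)
    if scalar <= dtype_min:
        # No representable value is smaller than the scalar.
        return [False] * len(values)
    # The scalar fits in the dtype, so an elementwise comparison is safe.
    return [v < scalar for v in values]


print(less_than_scalar([1, 2, 3], 1000))  # [True, True, True]
```

The same range check works for the other comparison operators, with the constant result adjusted for each operator.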
…mns (#16529) Fixes `cudf::empty_like` to only create empty child columns for nested types. The empty child columns are needed to store the types for consistency with `cudf::make_empty_column`. Closes #16490 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Mike Wilson (https://github.com/hyperbolic2346) - Mark Harris (https://github.com/harrism) URL: #16529
…lity (#16531) Removes `output_size` parameter from `cudf::strings::detail::count_matches` utility since the output size should equal the input size from the first parameter. This also removes an unnecessary `assert()` call. The parameter became unnecessary as part of the large strings work. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Shruti Shivakumar (https://github.com/shrshi) URL: #16531
…16559) Python 3.9 support was recently dropped in RAPIDS, so this changes the Python version to 3.10. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Bradley Dice (https://github.com/bdice) URL: #16559
Contributes to #15162 Authors: - Matthew Roeschke (https://github.com/mroeschke) - Vyas Ramasubramani (https://github.com/vyasr) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #16771
Contributes to #15162 Authors: - Matthew Roeschke (https://github.com/mroeschke) - Matthew Murray (https://github.com/Matt711) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Matthew Murray (https://github.com/Matt711) - Vyas Ramasubramani (https://github.com/vyasr) URL: #16781
More follow-up fixes to the recent Dask-cuDF documentation additions. Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Vyas Ramasubramani (https://github.com/vyasr) URL: #16929
raydouglass requested review from KyleFromNVIDIA, wence-, Matt711 and mythrocks on September 27, 2024 14:36
github-actions bot added labels on Sep 27, 2024:
- libcudf: Affects libcudf (C++/CUDA) code.
- Python: Affects Python cuDF API.
- CMake: CMake build issue
- Java: Affects Java cuDF API.
- cudf.pandas: Issues specific to cudf.pandas
- cudf.polars: Issues specific to cudf.polars
- pylibcudf: Issues specific to the pylibcudf package
ttnghia approved these changes on Sep 27, 2024
mythrocks approved these changes on Oct 1, 2024
👍
Add the license file symlink to the `pylibcudf` wheels
❄️ Code freeze for `branch-24.10` and v24.10 release
What does this mean?
Only critical/hotfix level issues should be merged into `branch-24.10` until release (merging of this PR).
What is the purpose of this PR?
`branch-24.10` into `main` for the release