[FEA] Reduce arrow library dependencies in cudf #15193

Closed
GregoryKimball opened this issue Feb 29, 2024 · 6 comments · Fixed by #16640
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.

Comments


GregoryKimball commented Feb 29, 2024

Is your feature request related to a problem? Please describe.
Arrow is in general a difficult dependency to work with, increasing build system complexity and fragility on its own while simultaneously expanding the full dependency tree, which particularly complicates use cases like conda where it leads to meaningful constraints on core system packages like protobuf, abseil, or the AWS SDK. This often hinders developer velocity when builds or CI are broken, but can also have far-reaching impacts when it creates problems with installation or running in specific environments. To prevent this, we would like to reduce or remove our dependence on Arrow libraries entirely.

Currently cudf makes use of Arrow in various ways at different levels of the stack. The primary uses of Arrow boil down to interop with host Arrow data and I/O with specific types of files. This involves interaction at the Python layer via pyarrow, at the Cython layer (also via pyarrow), and in C++. The Cython and C++ interactions are particularly problematic because they happen at the C level, which imposes ABI constraints that are significantly tighter than we would like while also complicating builds (CMake, Python builds) and packaging (narrow Arrow version support ranges limit which versions of other packages in the dependency tree we can support). Python interactions are generally less difficult to work around, especially since Python code can be written to dynamically adapt to the pyarrow version.

Describe the solution you'd like

We should look to remove the Arrow dependencies from the various layers of cudf (Java, Python, Cython, C++) to the greatest extent possible, ideally entirely.

For Arrow Array interop code, this can be accomplished by using the Arrow C Data Interface (see #5097), which provides an ABI-stable way to interchange Arrow data without directly using Arrow libraries. To make this even easier, the nanoarrow library was created to support clients that wish to produce or interpret Arrow C Data and Arrow C Stream structures without having to include a dependency on libarrow. We can make use of that (see also #13678, which discusses this in depth). For Python interaction we can use Arrow's PyCapsule interface, which provides a standard way to interchange this data from Python. We can write Cython code leveraging this interface to get Arrow C Data from pyarrow objects without relying directly on pyarrow's Cython, which also allows us to remove this dependency from the Cython layer.
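
To make the ABI-stability point concrete, here is a minimal sketch that mirrors the two structs from the Arrow C Data Interface specification using only Python's `ctypes`, with no Arrow library involved. The helper names `make_int32_array` and `read_int32` are illustrative assumptions, not cudf, pyarrow, or nanoarrow APIs. The `release` callback is deliberately left unset, so this is not a spec-compliant producer; it only shows that a consumer needs nothing beyond the struct layout to find the data.

```python
import ctypes


class ArrowSchema(ctypes.Structure):
    """Mirror of `struct ArrowSchema` from the Arrow C Data Interface spec."""


class ArrowArray(ctypes.Structure):
    """Mirror of `struct ArrowArray` from the Arrow C Data Interface spec."""


# Fields are assigned after class creation because both structs refer to
# pointers to themselves.
ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_char_p),
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowSchema))),
    ("private_data", ctypes.c_void_p),
]

ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))),
    ("private_data", ctypes.c_void_p),
]


def make_int32_array(values):
    """Describe a non-null int32 column with an ArrowArray (hypothetical helper)."""
    data = (ctypes.c_int32 * len(values))(*values)
    # Primitive layout: buffers[0] is the validity bitmap (NULL when there
    # are no nulls) and buffers[1] is the data buffer.
    buffers = (ctypes.c_void_p * 2)(None, ctypes.addressof(data))
    array = ArrowArray()
    array.length = len(values)
    array.null_count = 0
    array.offset = 0
    array.n_buffers = 2
    array.n_children = 0
    array.buffers = ctypes.cast(buffers, ctypes.POINTER(ctypes.c_void_p))
    # Return the owning objects too so Python keeps the memory alive.
    return array, (data, buffers)


def read_int32(array):
    """Read the values back through the struct alone, as a consumer would."""
    data = ctypes.cast(array.buffers[1], ctypes.POINTER(ctypes.c_int32))
    return [data[array.offset + i] for i in range(array.length)]


array, _owners = make_int32_array([10, 20, 30])
print(read_int32(array))
```

Because the layout is fixed by the specification rather than by a particular libarrow release, the same struct can be filled by one library and read by another without either linking against Arrow.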

For I/O, the question is a bit trickier. We currently have limited usage of libarrow headers in our C++, and those features largely exist only for Python support for reading Arrow's NativeFiles. We could in principle remove those from the C++ entirely, which would in turn allow us to remove libarrow as a dependency of libcudf. However, libcudf tests would still need libarrow (removing that dependency would require significant additional work). Moreover, those features would still be used by cudf Cython, so we would just be limiting the dependency. However, this could at least allow us to remove Arrow as a build-time dependency for both libcudf and the low-level pylibcudf Python API (#13921) that we are currently developing, which would still be a significant improvement since it would avoid imposing the Arrow dependency on low-level consumers of our APIs at the Python level. Then we could come back to working on replacing the cudf Cython usage.

Based on the above, the current plan is the following:

  1. Remove libarrow as a dependency of libcudf/pylibcudf:
    a. Remove the compiled parts of arrow_io_source.cpp and make arrow_io_source.hpp a standalone header not compiled by anything in libcudf.
    b. Rewrite cudf Cython to use the arrow headers directly.
    c. Add new interop code that uses the Arrow C Data interface (see Add to_arrow_device function to cudf interop using nanoarrow #15047)
    d. Rewrite Python interop code to call through to the new interfaces
    e. Remove the old Cython bindings for interop
  2. Remove pyarrow Cython linkages from cudf Cython
    a. This will require some exploration as to how we can maintain performant file reading. We may have to implement our own minimal version of something like Arrow's NativeFile reader interface.
    b. Once the above is done, we'll need to rewrite cuIO C++ to consume this interface and remove the current functions.
  3. Rewrite libcudf tests to remove libarrow dependence.
    a. This will require further investigation into how tests could be rewritten without Arrow. One possibility would be rewriting these tests as pylibcudf tests (see [FEA] Add tests of pylibcudf #15133) that use pyarrow instead (only the Python API, no Cython). That would give us access to the same functionality without tying us to linking against the libarrow library.
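
As a rough illustration of step 2a, a "minimal version of something like Arrow's NativeFile reader interface" could be as small as two methods. The sketch below is hypothetical: `RandomAccessSource`, `FileObjectSource`, and `read_footer` are names invented for this example and do not exist in cudf or Arrow. It only shows the shape such an interface might take on the Python side.

```python
import io
from typing import Protocol, runtime_checkable


@runtime_checkable
class RandomAccessSource(Protocol):
    """Hypothetical minimal reader interface (not an existing cudf or Arrow API)."""

    def size(self) -> int:
        ...

    def read_at(self, offset: int, nbytes: int) -> bytes:
        ...


class FileObjectSource:
    """Adapts any seekable binary file object to RandomAccessSource."""

    def __init__(self, fileobj):
        self._f = fileobj

    def size(self) -> int:
        # Seek to the end to learn the size, then restore the position.
        pos = self._f.tell()
        end = self._f.seek(0, io.SEEK_END)
        self._f.seek(pos)
        return end

    def read_at(self, offset: int, nbytes: int) -> bytes:
        if offset < 0 or nbytes < 0:
            raise ValueError("offset and nbytes must be non-negative")
        self._f.seek(offset)
        return self._f.read(nbytes)


def read_footer(source: RandomAccessSource, footer_len: int) -> bytes:
    """Read the trailing bytes of a source, as footer-based formats require."""
    total = source.size()
    start = max(0, total - footer_len)
    return source.read_at(start, total - start)


source = FileObjectSource(io.BytesIO(b"header....payload....FOOT"))
print(read_footer(source, 4))
```

Keeping the interface to `size` and `read_at` means any object with those two methods could be handed to a reader, whether it wraps a local file, an in-memory buffer, or a remote object store client.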

Additional context

Code pointers where libarrow is used in 24.04

| Source file | Arrow include | Notes |
| --- | --- | --- |
| detail/interop.hpp | api.h | to_arrow_array uses many array classes: arrow::*Array, arrow::TimeUnit::*, arrow::*Type; also arrow::MemoryPool, arrow::Scalar, arrow::Table. I believe all of these are covered by nanoarrow |
| include/cudf/interop.hpp | api.h | uses arrow::Table, arrow::MemoryPool, arrow::default_memory_pool, arrow::Scalar. I believe all of these are covered by nanoarrow |
| include/cudf/io/arrow_io_source.hpp | filesystem/filesystem.h, io/interfaces.h | uses arrow::io::RandomAccessFile, arrow::fs::FileSystem. See #13698 for the work to refactor arrow_io_source out of datasource |
| include/cudf/io/arrow_io_source.cpp | buffer.h, filesystem/filesystem.h, result.h | uses arrow::Buffer, arrow::fs::FileSystemFromUri |
| src/io/utilities/datasource.cpp | io/memory.h | to be solved by #15189 |

| Test file | Arrow include | Notes |
| --- | --- | --- |
| tests/interop/arrow_utils.hpp | util/bitmap_builders.h | for arrow::internal::BytesToBits. Also uses many arrow types such as: arrow::Array, arrow::DictionaryArray, arrow::dictionary, arrow::Table, arrow::Decimal128Builder, arrow::decimal, arrow::default_memory_pool, arrow::ListArray, arrow::list, arrow::Buffer, arrow::StringBuilder, arrow::StringArray, arrow::BooleanArray, arrow::BooleanBuilder. Needs research: can all of these references be migrated to nanoarrow? |
| tests/io/arrow_io_source_test.cpp | io/api.h, filesystem/filesystem.h, filesystem/s3fs.h, util/config.h | uses arrow::fs::FileSystemFromUri, arrow::fs::EnsureS3Finalized |
| tests/io/json_test.cpp | io/api.h | uses arrow::io::ReadableFile as part of a test for reading from an ArrowFileSource |
| tests/io/csv_test.cpp | io/api.h | uses arrow::io::ReadableFile |
| tests/quantiles/percentile_approx_test.cpp | util/tdigest.h | uses arrow::internal::TDigest. Presumably we could replace this with our own limited implementation |
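
Some of the test-only helpers in the table are small enough to reimplement locally. As one example, the sketch below packs a sequence of 0/1 flags into an Arrow-style validity bitmap, in the spirit of what the tests use `arrow::internal::BytesToBits` for. The function names are hypothetical and this is not a drop-in replacement for the Arrow helper; it only relies on the fact that Arrow bitmaps use least-significant-bit ordering within each byte.

```python
def bytes_to_bits(flags):
    """Pack truthy/falsy values into an LSB-first bitmap (hypothetical helper)."""
    bitmap = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            # Element i lives in byte i // 8, at bit position i % 8.
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def bit_is_set(bitmap, i):
    """Return True if element i is marked valid in the bitmap."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))


validity = bytes_to_bits([1, 0, 1, 1, 0, 0, 0, 0, 1])
print([bit_is_set(validity, i) for i in range(9)])
```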
@GregoryKimball GregoryKimball added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. labels Feb 29, 2024

vyasr commented Mar 2, 2024

One important correction to the table above: nanoarrow does not support scalars, and neither does the Arrow C Data Interface, which focuses only on arrays. Previously this was a significant concern for Python interoperability, but it looks like pyarrow 13 patched in some critical support for scalar->array conversion that ought to make this concern moot for us now. See #15213
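
A version gate for this could look roughly like the sketch below. It is an illustration only: the function names are made up, and the threshold of 13 is taken from the comment above rather than checked against the pyarrow changelog. The version string would normally come from `pyarrow.__version__`; it is passed in as an argument here so the sketch runs without pyarrow installed.

```python
import re


def parse_major_minor(version):
    """Extract (major, minor) from a string such as '13.0.0' or '14.0.0.dev5'."""
    match = re.match(r"(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2) or 0)


def has_scalar_to_array_support(pyarrow_version):
    """Hypothetical gate for the scalar->array conversion path."""
    # Assumption: the needed support landed in pyarrow 13, per the comment above.
    return parse_major_minor(pyarrow_version) >= (13, 0)


for candidate in ("12.0.1", "13.0.0", "14.0.0.dev5"):
    print(candidate, has_scalar_to_array_support(candidate))
```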

@vyasr vyasr added the Python Affects Python cuDF API. label Mar 8, 2024
@vyasr vyasr changed the title [FEA] Replace libarrow dependency with nanoarrow in libcudf [FEA] Reduce arrow library dependencies in cudf Mar 8, 2024

vyasr commented Mar 8, 2024

I've substantially updated this issue with a more holistic discussion of how we can stage the removal of Arrow at different layers of our library (C++, Cython, Python).


zeroshade commented Mar 12, 2024

Just wanted to point out that a big step towards this is the work started in #15047 and discussed in #14926. Once the to_arrow_device and from_arrow_device functionality is crafted, it should be pretty simple to re-implement several of the interoperability areas in terms of the Arrow C Data and Device interfaces, making strides towards eliminating the need for libarrow as a dependency in favor of nanoarrow

EDIT: Just re-read and saw this was already mentioned in the OP.... oops 😄


vyasr commented Mar 13, 2024

Yup! Your PR was top of mind while writing up this issue!

rapids-bot bot pushed a commit that referenced this issue Mar 18, 2024
Resolves #15310. Contributes to #15193

In addition, this PR adds pylibcudf.Column<-->pyarrow.Array interconversion as a benefit

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #15325
rapids-bot bot pushed a commit that referenced this issue Jul 20, 2024
Contributes to #15193

Authors:
  - Thomas Li (https://github.com/lithomas1)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Richard (Rick) Zamora (https://github.com/rjzamora)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #16132

jakirkham commented Jul 20, 2024

Assuming this lets us drop NumPy from cuDF's build dependencies, would make sure to drop these lines (with comments) when that happens

cudf/dependencies.yaml

Lines 395 to 399 in 508bdea

- output_types: pyproject
  packages:
    # Hard pin the version used during the build.
    # Sync with conda build constraint & wheel run constraint.
    - numpy==2.0.*

Edit - Reflects current state since NumPy 2 changes in PR: #16300

lithomas1 added a commit to lithomas1/cudf that referenced this issue Jul 31, 2024
commit 606d15e7260b553cbdb69f9ecd935c12ba94e430
Author: Thomas Li <[email protected]>
Date:   Wed Jul 31 14:30:48 2024 +0000

    put back mistakenly removed CMakeLists.txt

commit feac68de39be09c1751d0ccc2bb5f93b1075ac8f
Author: Thomas Li <[email protected]>
Date:   Wed Jul 31 13:59:50 2024 +0000

    rpath was the problem?

commit b2b68e14b9faa1dac0f2516667f65ecb5693a744
Author: Thomas Li <[email protected]>
Date:   Tue Jul 30 22:29:14 2024 +0000

    maybe fix?

commit 5243eac8a90114e4fdf794760cb6b6029d9ba1a1
Author: Thomas Li <[email protected]>
Date:   Tue Jul 30 21:11:03 2024 +0000

    fix cuda suffixing

commit acb31227d3ffb07e4a35be5d1c0ec6cbadbfe53d
Author: Thomas Li <[email protected]>
Date:   Tue Jul 30 20:29:52 2024 +0000

    fixes

commit b2306df549ac5db08dc0d1b09df270137dacfe9d
Author: Thomas Li <[email protected]>
Date:   Tue Jul 30 20:08:13 2024 +0000

    fixes

commit d6d91df1510a70d79fefacf8b57ca1caf027edf8
Merge: b7a2782f1a 7b3e73a7e3
Author: Thomas Li <[email protected]>
Date:   Tue Jul 30 19:32:18 2024 +0000

    Merge branch 'branch-24.10' of github.com:rapidsai/cudf into setup-pylibcudf-package

commit 7b3e73a7e38b671db1387879cfa963fe61060c36
Merge: ce259fff66 dbf4bd02a8
Author: gpuCI <[email protected]>
Date:   Tue Jul 30 13:14:19 2024 -0400

    Merge pull request #16435 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit dbf4bd02a8fdccd1891edbc2d049c3ddddb234b3
Author: GALI PREM SAGAR <[email protected]>
Date:   Tue Jul 30 12:14:14 2024 -0500

    Add about rmm modes in `cudf.pandas` docs (#16404)

    This PR adds user facing docs for rmm memory modes and prefetching.

    ---------

    Co-authored-by: Mark Harris <[email protected]>
    Co-authored-by: Bradley Dice <[email protected]>

commit ce259fff6641dd847883d535645c7c17c36fb7ec
Merge: b8bfe2c912 0f07b0bb5e
Author: gpuCI <[email protected]>
Date:   Tue Jul 30 09:02:26 2024 -0400

    Merge pull request #16433 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 0f07b0bb5e2cc89ca66e9d9639ff6ac961ec0471
Author: GALI PREM SAGAR <[email protected]>
Date:   Tue Jul 30 08:02:21 2024 -0500

    Enable prefetching before `runpy` (#16427)

    This PR enables prefetching before we execute the `runpy` module and
    script code.

commit b8bfe2c91234032cbe9b2549e46a08109e238c8a
Merge: d1be0b6dc0 5feeaf3827
Author: gpuCI <[email protected]>
Date:   Tue Jul 30 09:02:06 2024 -0400

    Merge pull request #16432 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 5feeaf3827bfd20755cdd0516ef0c6ba484a600c
Author: Richard (Rick) Zamora <[email protected]>
Date:   Tue Jul 30 08:02:01 2024 -0500

    [Bug] Remove loud `NativeFile` deprecation noise for `read_parquet` from S3 (#16415)

    Important follow-up to https://github.com/rapidsai/cudf/pull/16132

    Without this PR, using `dask_cudf.read_parquet("s3://...", ...)` will
    result in loud deprecation warnings after `compute`/`persist` is called.
    This is because dask will always pass `NativeFile` objects down to cudf.

    My fault for missing this earlier!

commit d1be0b6dc06fddd0b69fb69731281b16894cb132
Author: Matthew Roeschke <[email protected]>
Date:   Mon Jul 29 15:12:38 2024 -1000

    Align CategoricalIndex APIs with pandas 2.x (#16369)

    Mostly exposing methods that were available on the CategoricalColumn

    Authors:
      - Matthew Roeschke (https://github.com/mroeschke)
      - GALI PREM SAGAR (https://github.com/galipremsagar)

    Approvers:
      - GALI PREM SAGAR (https://github.com/galipremsagar)

    URL: https://github.com/rapidsai/cudf/pull/16369

commit 368a34ca9fd7db1b6cfb6e7817978e3e4fcfb00b
Author: Bradley Dice <[email protected]>
Date:   Mon Jul 29 20:05:17 2024 -0500

    Use RMM adaptor constructors instead of factories. (#16414)

    This PR uses RMM memory resource adaptor constructors instead of factory functions. With CTAD, we do not need the factory and can use the constructor directly. The factory will be deprecated in https://github.com/rapidsai/rmm/pull/1626.

    Authors:
      - Bradley Dice (https://github.com/bdice)

    Approvers:
      - Nghia Truong (https://github.com/ttnghia)
      - Jayjeet Chakraborty (https://github.com/JayjeetAtGithub)

    URL: https://github.com/rapidsai/cudf/pull/16414

commit e8048f7f3d66433203651a6a603d4de1360ca5ca
Merge: f8eb63e499 bd302d773c
Author: gpuCI <[email protected]>
Date:   Mon Jul 29 20:07:38 2024 -0400

    Merge pull request #16431 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit bd302d773c50552531bc7f11f782f8ed876e8fab
Author: Nghia Truong <[email protected]>
Date:   Mon Jul 29 17:07:33 2024 -0700

    Support thread-safe for `prefetch_config::get` and `prefetch_config::set` (#16425)

    This adds multi-thread support for `prefetch_config` getter and setter
    functions. This avoids the issue of the config map being corrupted in
    multi-threaded environments.

    Closes https://github.com/rapidsai/cudf/issues/16426.

    ---------

    Signed-off-by: Nghia Truong <[email protected]>

commit f8eb63e499f94d583d715f5c1f5e6f234589be57
Author: Matthew Roeschke <[email protected]>
Date:   Mon Jul 29 12:39:19 2024 -1000

    Align Index APIs with pandas 2.x (#16361)

    Similar to https://github.com/rapidsai/cudf/pull/16310, the following APIs have been modified to adjust/add parameters

    * `to_flat_index`
    * `isin`
    * `unique`
    * `transpose`

    Authors:
      - Matthew Roeschke (https://github.com/mroeschke)
      - GALI PREM SAGAR (https://github.com/galipremsagar)

    Approvers:
      - GALI PREM SAGAR (https://github.com/galipremsagar)

    URL: https://github.com/rapidsai/cudf/pull/16361

commit 743e16426c564d0ed0d7e3d9be5f67e4605c4f32
Author: James Lamb <[email protected]>
Date:   Mon Jul 29 14:19:43 2024 -0500

    update some branch references in GitHub Actions configs (#16397)

    Fixes some lingering references to `branch-24.08` in the `pr_issue_status_automation` CI workflow.

    This was missed when new branches were cut because that file ends in `.yml` and `update-version.sh` was only modifying files ending in `.yaml`. The corresponding `update-version.sh` changes were made in #16183 and are already on 24.10 thanks to forward mergers.

    https://github.com/rapidsai/cudf/blob/dc05a01f3fc0742c5fbbddd86a0f2007bfdc2050/ci/release/update-version.sh#L78

    ## Notes for Reviewers

    I checked like this, and don't see any other missed references:

    ```shell
    git grep -E '24\.8|24\.08|0\.39'
    ```

    Authors:
      - James Lamb (https://github.com/jameslamb)

    Approvers:
      - Kyle Edwards (https://github.com/KyleFromNVIDIA)

    URL: https://github.com/rapidsai/cudf/pull/16397

commit 35796057b64e258713d4d89ba368837d30a1a9c5
Author: Matthew Roeschke <[email protected]>
Date:   Mon Jul 29 08:33:23 2024 -1000

    Align misc DataFrame and MultiIndex methods with pandas 2.x (#16402)

    The API changes in this PR are mostly adding implementations or adding missing keyword arguments (although they might not be implemented). The APIs affected are:

    * `DataFrame.insert`
    * `DataFrame.melt`
    * `DataFrame.merge`
    * `DataFrame.quantile`
    * `DataFrame.cov`
    * `DataFrame.corr`
    * `DataFrame.median`
    * `DataFrame.rolling`
    * `DataFrame.resample`
    * `DataFrame.dropna`
    * `MultiIndex.from_tuple`
    * `MultiIndex.from_frame`
    * `MultiIndex.from_product`

    Authors:
      - Matthew Roeschke (https://github.com/mroeschke)
      - GALI PREM SAGAR (https://github.com/galipremsagar)

    Approvers:
      - GALI PREM SAGAR (https://github.com/galipremsagar)

    URL: https://github.com/rapidsai/cudf/pull/16402

commit 6e7624d6b31c93b0547590929ac63ed8e3a48d24
Author: David Wendt <[email protected]>
Date:   Mon Jul 29 14:06:51 2024 -0400

    Add stream parameter to reshape APIs (#16410)

    Adds `stream` parameter to reshape APIs:
    - `cudf::interleave_columns`
    - `cudf::tile`
    - `cudf::byte_cast`

    Found while working #15983

    Authors:
      - David Wendt (https://github.com/davidwendt)

    Approvers:
      - Bradley Dice (https://github.com/bdice)
      - Nghia Truong (https://github.com/ttnghia)

    URL: https://github.com/rapidsai/cudf/pull/16410

commit 58f47242fe04b1e25fd42e1e45e8c15417140777
Author: Matthew Roeschke <[email protected]>
Date:   Mon Jul 29 06:09:21 2024 -1000

    Align groupby APIs with pandas 2.x (#16403)

    The following breaking APIs are affected:

    * `apply`
    * `transform`
    * `describe`

    The rest of the APIs are non-breaking and generally will raise a `NotImplementedError`

    Authors:
      - Matthew Roeschke (https://github.com/mroeschke)

    Approvers:
      - GALI PREM SAGAR (https://github.com/galipremsagar)

    URL: https://github.com/rapidsai/cudf/pull/16403

commit 18c1465b597284d8b558964cc0ca48de7da60a17
Author: Matthew Roeschke <[email protected]>
Date:   Mon Jul 29 06:06:07 2024 -1000

    Align ewm APIs with pandas 2.x (#16413)

    These all currently are not implemented and raise a `NotImplementedError`

    Authors:
      - Matthew Roeschke (https://github.com/mroeschke)

    Approvers:
      - GALI PREM SAGAR (https://github.com/galipremsagar)

    URL: https://github.com/rapidsai/cudf/pull/16413

commit eed0b1f36c84aa4a4bf17a3b99f931940cb6ddd9
Merge: 24997fda19 a51964ed8b
Author: gpuCI <[email protected]>
Date:   Mon Jul 29 09:42:33 2024 -0400

    Merge pull request #16419 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit a51964ed8b00c3c88d463e329af7ec8378642343
Author: GALI PREM SAGAR <[email protected]>
Date:   Mon Jul 29 08:42:27 2024 -0500

    Fix a `pandas-2.0` missing attribute error (#16416)

    `NumpyEADtype` is a 2.1.0+ change, this PR handles the missing attribute
    error in pandas-2.0

commit 24997fda194d5b8af34048a8bf275830cabbff8c
Author: Muhammad Haseeb <[email protected]>
Date:   Fri Jul 26 18:37:30 2024 -0700

    Deduplicate decimal32/decimal64 to decimal128 conversion function (#16236)

    Closes #16194

    This PR deduplicates the `convert_data_to_decimal128` function from `to_arrow.cu`, `writer_impl.cu` and `to_arrow_device.cu` to a common location.

    Authors:
      - Muhammad Haseeb (https://github.com/mhaseeb123)
      - Vyas Ramasubramani (https://github.com/vyasr)

    Approvers:
      - Vukasin Milovanovic (https://github.com/vuule)
      - Nghia Truong (https://github.com/ttnghia)
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: https://github.com/rapidsai/cudf/pull/16236

commit 473dec55abd1a3d9d540c541443f831d18ebb532
Author: Jayjeet Chakraborty <[email protected]>
Date:   Fri Jul 26 14:45:12 2024 -0700

    Add query 10 to the TPC-H suite (#16392)

    Adds Q10 to the TPC-H benchmark suite

    Authors:
      - Jayjeet Chakraborty (https://github.com/JayjeetAtGithub)

    Approvers:
      - Mike Wilson (https://github.com/hyperbolic2346)
      - Yunsong Wang (https://github.com/PointKernel)

    URL: https://github.com/rapidsai/cudf/pull/16392

commit 46ff702144a2477d06ffabd3d92d38967c10b1ff
Merge: 73158f06e2 5dd3efba5b
Author: gpuCI <[email protected]>
Date:   Fri Jul 26 16:47:54 2024 -0400

    Merge pull request #16411 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 5dd3efba5b7e0c22dce87cf20aecb1b198677d2e
Author: David Wendt <[email protected]>
Date:   Fri Jul 26 16:47:49 2024 -0400

    Fix nightly memcheck error for empty STREAM_INTEROP_TEST (#16406)

    ## Description
    The `STREAM_INTEROP_TEST` code was commented out in #16379 so the
    `compute-sanitizer` returns an error for this test in the nightly
    cpp-memcheck tests.
    https://github.com/rapidsai/cudf/actions/runs/10107041505/job/27950193878#step:9:62177

    This PR comments out the empty test so it is not built. The test will be
    re-enabled in a future release when the deprecated functions are
    replaced.

    ## Checklist
    - [x] I am familiar with the [Contributing
    Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
    - [x] New or existing tests cover these changes.
    - [x] The documentation is up to date with these changes.

commit 73158f06e2b816d88e4a2b71f236812ab997391f
Merge: dc05a01f3f f88a242832
Author: Jake Awe <[email protected]>
Date:   Fri Jul 26 13:14:22 2024 -0500

    Merge pull request #16409 from vyasr/branch-24.10-merge-branch-24.08

    Branch 24.10 merge branch 24.08

commit f88a242832a1c991c615961631f02c9875ab871f
Merge: dc05a01f3f cd762b4eb1
Author: Vyas Ramasubramani <[email protected]>
Date:   Fri Jul 26 18:10:32 2024 +0000

    Merge branch 'branch-24.08' into branch-24.10-merge-branch-24.08

commit cd762b4eb1fd55a0bc5079ed69bfc04426f10e60
Author: Matthew Roeschke <[email protected]>
Date:   Fri Jul 26 08:08:01 2024 -1000

    Gate ArrowStringArrayNumpySemantics cudf.pandas proxy behind version check (#16401)

    ## Description
    `ArrowStringArrayNumpySemantics` was newly added in 2.1:
    https://github.com/pandas-dev/pandas/blob/2.1.x/pandas/core/arrays/string_arrow.py#L488,
    so putting the proxy wrapper behind a version check for pandas 2.0.x
    compat

    ```ipython
    In [1]: %load_ext cudf.pandas

    In [2]: import pandas as pd

    In [3]: pd.__version__
    Out[3]: '2.0.0'
    ```

    ## Checklist
    - [ ] I am familiar with the [Contributing
    Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
    - [ ] New or existing tests cover these changes.
    - [ ] The documentation is up to date with these changes.

commit 1cea1eaf6c1e87e65729897dd9bbedc4bdc5e7ab
Author: Kyle Edwards <[email protected]>
Date:   Thu Jul 25 16:26:34 2024 -0400

    Don't export bs_thread_pool (#16398)

    ## Description
    cudf does not currently export any headers that depend on
    bs_thread_pool, and having it as a dependency is currently causing
    problems for consumers. Avoid exporting it since it's not needed.

    ## Checklist
    - [ ] I am familiar with the [Contributing
    Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
    - [ ] New or existing tests cover these changes.
    - [ ] The documentation is up to date with these changes.

commit dc05a01f3fc0742c5fbbddd86a0f2007bfdc2050
Merge: fb2021fe82 e553295cfa
Author: gpuCI <[email protected]>
Date:   Thu Jul 25 12:14:52 2024 -0400

    Merge pull request #16396 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit e553295cfaf2f5bd1f539ee78d9a3a064e00e5f0
Author: brandon-b-miller <[email protected]>
Date:   Thu Jul 25 11:14:47 2024 -0500

    Require fixed width types for casting in `cudf-polars` (#16381)

    Fixes a bug where numeric <-> string casts are not being properly rejected at the cudf-polars level.

    Authors:
      - https://github.com/brandon-b-miller

    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: https://github.com/rapidsai/cudf/pull/16381

commit fb2021fe82724746ae1c58345ed37f7e7a0207ed
Merge: 673b96f6d1 f756e01a3c
Author: Ray Douglass <[email protected]>
Date:   Thu Jul 25 11:06:30 2024 -0400

    Merge pull request #16391 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit f756e01a3c5ff83421b1afb44460d9e5147a410e
Author: Thomas Li <[email protected]>
Date:   Thu Jul 25 07:04:47 2024 -0700

    Implement support for scan_ndjson in cudf-polars (#16263)

    Implement support for scan_ndjson in cudf-polars.

    Authors:
      - Thomas Li (https://github.com/lithomas1)
      - Vyas Ramasubramani (https://github.com/vyasr)

    Approvers:
      - Lawrence Mitchell (https://github.com/wence-)

    URL: https://github.com/rapidsai/cudf/pull/16263

commit 673b96f6d15dbd5d8bcb22d612d3c324aa899e26
Merge: 5a3399bec8 4cc37896a5
Author: Jake Awe <[email protected]>
Date:   Thu Jul 25 08:27:15 2024 -0500

    Merge pull request #16393 from jameslamb/branch-24.10-merge-branch-24.08

    Merge branch-24.08 into branch-24.10

commit d953676e9281125a5b8bd9be739c997611471771
Author: Robert Maynard <[email protected]>
Date:   Thu Jul 25 04:49:12 2024 -0400

    Hide visibility of non public symbols (#15982)

    Converts cudf over to a system of explicit markup of what symbols should be used by consumers. This is done by compiling with `-fvisibility=hidden` and explicit markup via `CUDF_EXPORT` of components we want usable.

    Due to issues with tests a portion of `include/` detail functions had to be marked as public API.

    More concerning is that the tests leverage functions from `cpp/` that are never part of the installed headers. That set of files can be found at https://github.com/rapidsai/cudf/commit/16b365635ab0f86bb1cc6db5f036564e8290f3b1 and we should discuss how we should restructure cudf to remove these.

    Authors:
      - Robert Maynard (https://github.com/robertmaynard)
      - Bradley Dice (https://github.com/bdice)

    Approvers:
      - Bradley Dice (https://github.com/bdice)
      - Nghia Truong (https://github.com/ttnghia)

    URL: https://github.com/rapidsai/cudf/pull/15982

commit 4aefcc7b2988346166b9a757fc837e93f6f0a3bb
Author: GALI PREM SAGAR <[email protected]>
Date:   Wed Jul 24 22:30:35 2024 -0500

    Add ability to prefetch in `cudf.pandas` and change default to managed pool (#16296)

    This PR adds ability to prefetch in `cudf.pandas` based off of: https://github.com/rapidsai/rmm/pull/1608/

    Authors:
      - GALI PREM SAGAR (https://github.com/galipremsagar)
      - Bradley Dice (https://github.com/bdice)

    Approvers:
      - Bradley Dice (https://github.com/bdice)
      - Muhammad Haseeb (https://github.com/mhaseeb123)
      - Vyas Ramasubramani (https://github.com/vyasr)
      - Mark Harris (https://github.com/harrism)

    URL: https://github.com/rapidsai/cudf/pull/16296

commit 6486bb928dfb0e1817b0604572e2f5789d05c596
Author: Matthew Murray <[email protected]>
Date:   Wed Jul 24 22:24:46 2024 -0400

    Migrate lists/filtering to pylibcudf (#16184)

    Part of #15162

    Authors:
      - Matthew Murray (https://github.com/Matt711)

    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: https://github.com/rapidsai/cudf/pull/16184

commit a33f520b370d048a22de031294311c241ab23858
Author: David Gardner <[email protected]>
Date:   Wed Jul 24 18:42:16 2024 -0700

    Fix inconsistent usage of 'results' and 'records' in read-json.md (#15766)

    * Fix inconsistent usage of 'results' and 'records' in `docs/cudf/source/user_guide/io/read-json.md`

    Authors:
      - David Gardner (https://github.com/dagardner-nv)
      - Vyas Ramasubramani (https://github.com/vyasr)

    Approvers:
      - Bradley Dice (https://github.com/bdice)
      - Nghia Truong (https://github.com/ttnghia)

    URL: https://github.com/rapidsai/cudf/pull/15766

commit 5a3399bec868f44d13c003f172c665919096d8e8
Author: James Lamb <[email protected]>
Date:   Wed Jul 24 19:26:12 2024 -0500

    fix [tool.setuptools] reference in custreamz config (#16365)

    Noticed this warning in logs from #16183

    > _/python3.10/site-packages/setuptools/config/pyprojecttoml.py:70: _ToolsTypoInMetadata: Ignoring [tools.setuptools] in pyproject.toml, did you mean [tool.setuptools]?_

    This fixes that.

    ## Notes for Reviewers

    Intentionally targeting this at 24.10.

    This misconfiguration has been in `custreamz` since the 23.04 release ([git blame link](https://github.com/rapidsai/cudf/blame/e6d412cba7c23df7ee500c28257ed9281cea49b9/python/custreamz/pyproject.toml#L60)).

    I think the only effect might be that some test files are included in wheels when we don't want to.

    I don't think the fix for it needs to be rushed into 24.08.

    I searched across RAPIDS in case this was copied from somewhere else... don't see any other instances of this typo that need to be fixed.

    Authors:
      - James Lamb (https://github.com/jameslamb)

    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: https://github.com/rapidsai/cudf/pull/16365

commit 4cc37896a5dff1e019f0dff8101f3a84a05fd5d8
Merge: 29ce5c529e a36dacb663
Author: James Lamb <[email protected]>
Date:   Wed Jul 24 18:54:56 2024 -0500

    Merge branch-24.08 into branch-24.10

commit a36dacb66325e03d3264482d35a5cf7e0b6c7a37
Author: Lawrence Mitchell <[email protected]>
Date:   Thu Jul 25 00:31:40 2024 +0100

    Make C++ compilation warning free after #16297 (#16379)

    In https://github.com/rapidsai/cudf/pull/16297, we deprecated the use of `to_arrow` in favour of `to_arrow_host` and `to_arrow_device`. However, the scalar detail overload of `to_arrow` used the public table overload. So we get a warning when compiling internal libcudf code. Fix this by using the detail API, and fix a bug along the way where we were not passing through the arrow memory resource.

    Authors:
      - Lawrence Mitchell (https://github.com/wence-)

    Approvers:
      - David Wendt (https://github.com/davidwendt)
      - Michael Schellenberger Costa (https://github.com/miscco)
      - Vyas Ramasubramani (https://github.com/vyasr)
      - Karthikeyan (https://github.com/karthikeyann)

    URL: https://github.com/rapidsai/cudf/pull/16379

commit ae4c7e3ce4fe100eb919ca00fa34461e44078ba9
Author: James Lamb <[email protected]>
Date:   Wed Jul 24 18:30:53 2024 -0500

    split up CUDA-suffixed dependencies in dependencies.yaml (#16183)

    Contributes to https://github.com/rapidsai/build-planning/issues/31

    Follow-up to #15245

    RAPIDS DLFW builds prefer to build all RAPIDS packages together without CUDA suffixes, leading to the following set of requirements for `cudf` wheels built there:

    * project name must be `cudf` (not `cudf-cu12`)
    * all dependencies must be unsuffixed (e.g. `rmm` not `rmm-cu12`)
    * the correct set of dependencies based on CUDA version must be expressed in the wheel metadata (e.g. `cubinlinker` and `ptxcompiler` on CUDA 11, `pynvjitlink` on CUDA 12)

    To meet all 3 of those, this proposes decomposing CUDA-suffixed dependencies in `dependencies.yaml` into two lists... `cuda_suffixed="true"` and `cuda_suffixed="false"`.

    That'd allow DLFW builds to do the following to meet its requirements:

    ```shell
    pip wheel \
      -C rapidsai.disable-cuda=true \
      -C rapidsai.matrix-entry="cuda=12.5;cuda_suffixed=false" \
      .
    ```

    Authors:
      - James Lamb (https://github.com/jameslamb)

    Approvers:
      - Bradley Dice (https://github.com/bdice)
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: https://github.com/rapidsai/cudf/pull/16183
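
As a rough illustration of the split described in the commit above, the sketch below parses a matrix-entry string of the form passed to `pip wheel` and chooses between suffixed and unsuffixed requirement names. The function names, the default for `cuda_suffixed`, and the way the suffix is applied are assumptions made for illustration; they are not taken from `dependencies.yaml` or from the RAPIDS build tooling. Only the `cuda` / `cuda_suffixed` keys and the package names come from the commit message.

```python
def parse_matrix_entry(entry: str) -> dict[str, str]:
    """Parse a 'key=value;key=value' string into a dict (hypothetical helper)."""
    pairs = (item.split("=", 1) for item in entry.split(";") if item)
    return {key.strip(): value.strip() for key, value in pairs}


def select_requirements(entry: str) -> list[str]:
    """Pick requirement names for a matrix entry (illustrative only)."""
    matrix = parse_matrix_entry(entry)
    cuda_major = matrix["cuda"].split(".")[0]
    # Assumption: suffixed names are the default when the key is absent.
    suffixed = matrix.get("cuda_suffixed", "true") == "true"
    suffix = f"-cu{cuda_major}" if suffixed else ""

    requirements = [f"rmm{suffix}"]
    if cuda_major == "11":
        # CUDA 11 needs cubinlinker and ptxcompiler.
        requirements += [f"cubinlinker{suffix}", f"ptxcompiler{suffix}"]
    else:
        # CUDA 12 needs pynvjitlink instead.
        requirements.append(f"pynvjitlink{suffix}")
    return requirements


print(select_requirements("cuda=12.5;cuda_suffixed=false"))
print(select_requirements("cuda=11.8;cuda_suffixed=true"))
```

The point of the two-list layout is that the CUDA version still selects which packages are needed, while `cuda_suffixed` independently controls how they are named.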

commit 29ce5c529ea9ea18edc32ab905f1ef076f266008
Author: Michael Schellenberger Costa <[email protected]>
Date:   Thu Jul 25 01:29:41 2024 +0200

    Fix some issues with deprecated / removed cccl facilities (#16377)

    `cub::If` has been deprecated and should not be used. There is a better alternative in `cuda::std::conditional_t`

    `thrust::{binary, unary}_function` has been deprecated and, like the removed `std::{binary, unary}_function`, does not serve a purpose

    Rather than relying on the type aliases one should use the `std::invoke` machinery

    Authors:
      - Michael Schellenberger Costa (https://github.com/miscco)

    Approvers:
      - Bradley Dice (https://github.com/bdice)
      - Nghia Truong (https://github.com/ttnghia)
      - Bernhard Manfred Gruber (https://github.com/bernhardmgruber)

    URL: https://github.com/rapidsai/cudf/pull/16377

commit a6b1cf1fa96d622626a9e4d99a5c71d33fb1bd49
Merge: 2eabe0de58 59f65843b8
Author: gpuCI <[email protected]>
Date:   Wed Jul 24 19:10:33 2024 -0400

    Merge pull request #16389 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 59f65843b80d967f743841aee8489b6ae63b269a
Author: Muhammad Haseeb <[email protected]>
Date:   Wed Jul 24 16:10:28 2024 -0700

    Gracefully CUDF_FAIL when `skip_rows > 0` in Chunked Parquet reader (#16385)

    This PR must merge in cudf 24.08 to avoid unhandled exceptions.

    Gracefully `CUDF_FAIL` in the chunked parquet reader when `skip_rows > 0`, which may otherwise result in runtime errors like segfaults or an infinite loop. See #16186 for more information.

    Authors:
      - Muhammad Haseeb (https://github.com/mhaseeb123)

    Approvers:
      - David Wendt (https://github.com/davidwendt)
      - Vyas Ramasubramani (https://github.com/vyasr)
      - Bradley Dice (https://github.com/bdice)
      - Karthikeyan (https://github.com/karthikeyann)
      - Nghia Truong (https://github.com/ttnghia)

    URL: https://github.com/rapidsai/cudf/pull/16385

commit 2eabe0de584ff8c8ae6e82b1845309d5b01c4a98
Merge: 4624edf586 8bba6dfad2
Author: gpuCI <[email protected]>
Date:   Wed Jul 24 18:16:08 2024 -0400

    Merge pull request #16388 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 8bba6dfad239b4fd69a82acbc5dd7707ba576cce
Author: Matthew Murray <[email protected]>
Date:   Wed Jul 24 18:16:03 2024 -0400

    Migrate lists/set_operations to pylibcudf (#16190)

    Part of #15162

    Authors:
      - Matthew Murray (https://github.com/Matt711)

    Approvers:
      - Thomas Li (https://github.com/lithomas1)

    URL: https://github.com/rapidsai/cudf/pull/16190

commit 4624edf58683391529cd9d7b76ca2e45438655bf
Merge: 077457ee89 73937fbaba
Author: gpuCI <[email protected]>
Date:   Wed Jul 24 16:42:06 2024 -0400

    Merge pull request #16387 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 73937fbabaeea76665663ed23688b1cac61b7ee9
Author: Matthew Murray <[email protected]>
Date:   Wed Jul 24 16:42:00 2024 -0400

    Migrate lists/filling to pylibcudf (#16189)

    Part of #15162

    Authors:
      - Matthew Murray (https://github.com/Matt711)
      - Vyas Ramasubramani (https://github.com/vyasr)

    Approvers:
      - Thomas Li (https://github.com/lithomas1)
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: https://github.com/rapidsai/cudf/pull/16189

commit 077457ee89140e98c9e25849511b14410370f684
Merge: 17c1afbd93 8fcf72a787
Author: gpuCI <[email protected]>
Date:   Wed Jul 24 13:06:35 2024 -0400

    Merge pull request #16382 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 8fcf72a787acb0168c97d11b8ab9130146e9b37e
Author: Alessandro Bellina <[email protected]>
Date:   Wed Jul 24 12:06:29 2024 -0500

    [JNI] Add setKernelPinnedCopyThreshold and setPinnedAllocationThreshold (#16288)

    In 24.08 two new cuDF methods are being added, and the second method is still in flight (see: https://github.com/rapidsai/cudf/pull/16206):

    ```
    cudf::set_kernel_pinned_copy_threshold
    cudf::set_allocate_host_as_pinned_threshold
    ```

    We'd like to expose these methods in our JNI layer. I created a Cudf.java with the two static methods, and put the definitions in CudfJni.cpp.

    Marked as draft since I need https://github.com/rapidsai/cudf/pull/16206 to merge, and we are still testing it.

    Authors:
      - Alessandro Bellina (https://github.com/abellina)
      - Nghia Truong (https://github.com/ttnghia)

    Approvers:
      - Robert (Bobby) Evans (https://github.com/revans2)
      - Jason Lowe (https://github.com/jlowe)

    URL: https://github.com/rapidsai/cudf/pull/16288

commit 17c1afbd936989bdcdcdb5654c1cbc4dbe57cc7d
Merge: a0c58c766e 7191b74ce2
Author: gpuCI <[email protected]>
Date:   Wed Jul 24 09:55:53 2024 -0400

    Merge pull request #16380 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 7191b74ce244518f17ef65e701f5a262f1c5cf8a
Author: Matthew Roeschke <[email protected]>
Date:   Wed Jul 24 03:55:48 2024 -1000

    Align Index __init__ APIs with pandas 2.x (#16362)

    * It would be nice for `Index`'s constructor not to go through `IndexMeta.__call__`, but I think that would be a separate effort
    * A couple of `verify_integrity` keyword arguments were added that don't raise a `NotImplementedError` even though they are not supported. I don't think it's worth making this case fall back in `cudf.pandas`, as it's just a validation and won't affect further behavior with the object

    Authors:
      - Matthew Roeschke (https://github.com/mroeschke)
      - GALI PREM SAGAR (https://github.com/galipremsagar)

    Approvers:
      - GALI PREM SAGAR (https://github.com/galipremsagar)

    URL: https://github.com/rapidsai/cudf/pull/16362

commit a0c58c766e41525059e5a4e37ac5fce3a638468e
Merge: b66281c4fa 743264f6ac
Author: gpuCI <[email protected]>
Date:   Wed Jul 24 06:32:36 2024 -0400

    Merge pull request #16378 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 743264f6ac924fdbec58fad666f989b14b901a98
Author: brandon-b-miller <[email protected]>
Date:   Wed Jul 24 05:32:31 2024 -0500

    Warn on cuDF failure when `POLARS_VERBOSE` is true (#16308)

    Just something quick to get us started here

    Closes https://github.com/rapidsai/cudf/issues/16256

    Authors:
      - https://github.com/brandon-b-miller
      - Lawrence Mitchell (https://github.com/wence-)

    Approvers:
      - Lawrence Mitchell (https://github.com/wence-)

    URL: https://github.com/rapidsai/cudf/pull/16308

commit b66281c4fa811431dec0cdc0d8222fba9e8e4088
Merge: f20205b2dc 62625f1bfc
Author: gpuCI <[email protected]>
Date:   Wed Jul 24 03:42:08 2024 -0400

    Merge pull request #16376 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 62625f1bfcdb980186a1afbec41e420fdb4a7075
Author: Matt Topol <[email protected]>
Date:   Wed Jul 24 03:42:03 2024 -0400

    Host implementation of `to_arrow` using nanoarrow (#16297)

    Adds the corresponding `to_arrow_host` functions for interop using `ArrowDeviceArray`. This includes updating the version of nanoarrow in use to pick up some bug fixes and features.

    Authors:
      - Matt Topol (https://github.com/zeroshade)
      - Muhammad Haseeb (https://github.com/mhaseeb123)
      - Vyas Ramasubramani (https://github.com/vyasr)

    Approvers:
      - Muhammad Haseeb (https://github.com/mhaseeb123)
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: https://github.com/rapidsai/cudf/pull/16297

commit f20205b2dc7a5e830b72386df378934c53da5043
Merge: bc748d67b5 8c1749b40e
Author: gpuCI <[email protected]>
Date:   Wed Jul 24 01:19:15 2024 -0400

    Merge pull request #16375 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 8c1749b40eaa983966ed3bece6bdd29a4316d18a
Author: Kyle Edwards <[email protected]>
Date:   Wed Jul 24 01:19:10 2024 -0400

    Use rapids_cpm_bs_thread_pool() (#16360)

    Authors:
      - Kyle Edwards (https://github.com/KyleFromNVIDIA)

    Approvers:
      - Bradley Dice (https://github.com/bdice)

    URL: https://github.com/rapidsai/cudf/pull/16360

commit bc748d67b52de4cf1c876f9701644fdbf1d839e5
Merge: 6d9aff4b7d 75289c58f3
Author: gpuCI <[email protected]>
Date:   Wed Jul 24 00:46:03 2024 -0400

    Merge pull request #16374 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 75289c58f3d9ca11a51396e4adadfbd5f51856f5
Author: Bradley Dice <[email protected]>
Date:   Tue Jul 23 23:45:59 2024 -0500

    Rename PrefetchConfig to prefetch_config. (#16358)

    This PR addresses a comment requesting a rename of `PrefetchConfig` to `prefetch_config`.

    See: https://github.com/rapidsai/cudf/pull/16020#discussion_r1686284151

    Authors:
      - Bradley Dice (https://github.com/bdice)

    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)
      - Shruti Shivakumar (https://github.com/shrshi)
      - Nghia Truong (https://github.com/ttnghia)

    URL: https://github.com/rapidsai/cudf/pull/16358

commit 6d9aff4b7dfd23db43d294dacdeaf6c52af2fc4b
Merge: dcf791c83e f0efc8b36a
Author: gpuCI <[email protected]>
Date:   Tue Jul 23 20:17:10 2024 -0400

    Merge pull request #16373 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit f0efc8b36a8f43cfa027966265dcea052bb5c45d
Author: Vukasin Milovanovic <[email protected]>
Date:   Tue Jul 23 17:17:05 2024 -0700

    Modify `make_host_vector` and `make_device_uvector` factories to optionally use pinned memory and kernel copy (#16206)

    Issue #15616

    Modified `make_host_vector` functions to return `cudf::detail::host_vector`, which can use a pinned or a pageable memory resource. When pinned memory is used, the D2H copy is potentially done using a CUDA kernel.

    Also added factories to create `host_vector`s without device data. These are useful to replace uses of `std::vector` and `thrust::host_vector` when the data eventually gets copied to the GPU.

    Added `is_device_accessible` to `host_span`. With this, `make_device_uvector` can optionally use the kernel for the H2D copy.

    Modified `cudf::detail::host_vector` to be derived from `thrust::host_vector`, to avoid issues with implicit conversion from `std::vector`.

    Used `cudf::detail::host_vector` and its new factory functions wherever data ends up copied to the GPU.

    Stopped using `thrust::copy_n` for the kernel copy path in `cuda_memcpy` because of an optimization that allows it to fall back to `cudaMemcpyAsync`. We now call a simple local kernel.

    Authors:
      - Vukasin Milovanovic (https://github.com/vuule)

    Approvers:
      - Robert Maynard (https://github.com/robertmaynard)
      - Yunsong Wang (https://github.com/PointKernel)
      - Nghia Truong (https://github.com/ttnghia)
      - Alessandro Bellina (https://github.com/abellina)

    URL: https://github.com/rapidsai/cudf/pull/16206

commit dcf791c83e3ab87d57d94017ee7413d96f9e99a5
Merge: 7a09f809dc 39f256c339
Author: gpuCI <[email protected]>
Date:   Tue Jul 23 20:03:22 2024 -0400

    Merge pull request #16372 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 39f256c3397afc9c495cb819636abddb23f81dc0
Author: brandon-b-miller <[email protected]>
Date:   Tue Jul 23 19:03:16 2024 -0500

    Fall back to CPU for unsupported libcudf binaryops in cudf-polars (#16188)

    This PR adds logic that should trigger CPU fallback for unsupported binary ops.

    Authors:
      - https://github.com/brandon-b-miller
      - Lawrence Mitchell (https://github.com/wence-)

    Approvers:
      - Lawrence Mitchell (https://github.com/wence-)

    URL: https://github.com/rapidsai/cudf/pull/16188

commit 7a09f809dc5c8cf8d2663fae186e4d249893c888
Merge: a3aacd8915 cd711913d2
Author: gpuCI <[email protected]>
Date:   Tue Jul 23 18:24:24 2024 -0400

    Merge pull request #16370 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit cd711913d2312ba158e34f5c03784a7b07f1583a
Author: Elias Stehle <[email protected]>
Date:   Wed Jul 24 00:24:19 2024 +0200

    Adds write-coalescing code path optimization to FST (#16143)

    This PR adds an optimized code path to the finite-state transducer (FST) that will use a shared memory-backed write buffer for the translated output and translated output indexes, if the write buffer does not require allocating excessive amounts of shared memory (i.e., the current heuristic is 24 KB/CTA). Writes are first buffered in shared memory and then collaboratively written out using coalesced writes to global memory.

    ## Benchmark results

    Numbers are for libcudf's FST_NVBENCH for a 1.073 GB input. FST outputs one token per input symbol. Benchmarks run on V100 with 900 GB/s theoretical peak BW.
    We compare the current FST implementation (old) to an FST implementation that uses write-coalescing to gmem (new).

    |                  | OLD throughput  (GB/s) | NEW throughput  (GB/s) | relative performance |   | 1st kernel, per byte: bytes read/written | 2nd kernel, per byte: bytes read/written | expected SOL (GB/s) | achieved SOL (old) | achieved SOL (new) |
    |------------------|------------------------|------------------------|----------------------|---|------------------------------------------|------------------------------------------|---------------------|--------------------|--------------------|
    | full             |                   15.7 |                  74.74 |                 476% |   |                                        1 |                                        6 |              102.86 |             15.26% |             72.66% |
    | no out-indexes   |                 39.123 |                  105.8 |                 270% |   |                                        1 |                                        2 |              240.00 |             16.30% |             44.08% |
    | no-output        |                 229.27 |                 178.92 |                  78% |   |                                        1 |                                        1 |              360.00 |             63.69% |             49.70% |
    | out-indexes-only |                  24.95 |                   85.2 |                 341% |   |                                        1 |                                        5 |              120.00 |             20.79% |             71.00% |

    Authors:
      - Elias Stehle (https://github.com/elstehle)

    Approvers:
      - Shruti Shivakumar (https://github.com/shrshi)
      - Vukasin Milovanovic (https://github.com/vuule)

    URL: https://github.com/rapidsai/cudf/pull/16143
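
The derived columns of the benchmark table in the commit above can be recomputed from the raw throughputs. The sketch below does that arithmetic. One caveat: the commit message does not say how "expected SOL" was obtained. The values in the table are consistent with 720 GB/s (80% of the 900 GB/s theoretical peak) divided by the total bytes read/written per input byte across both kernels, so that figure is used here as an inferred assumption, not as the author's stated method.

```python
# Rows copied from the benchmark table in the commit message:
# (name, old GB/s, new GB/s, bytes per input byte in kernel 1, in kernel 2)
ROWS = [
    ("full", 15.7, 74.74, 1, 6),
    ("no out-indexes", 39.123, 105.8, 1, 2),
    ("no-output", 229.27, 178.92, 1, 1),
    ("out-indexes-only", 24.95, 85.2, 1, 5),
]

# Assumption inferred from the table, not stated in the source:
# 80% of the 900 GB/s theoretical peak bandwidth.
ASSUMED_ACHIEVABLE_BW = 720.0


def derive(old, new, k1_bytes, k2_bytes):
    """Recompute the derived columns for one table row."""
    expected_sol = ASSUMED_ACHIEVABLE_BW / (k1_bytes + k2_bytes)
    return {
        "relative": new / old,                # "relative performance"
        "expected_sol": expected_sol,         # "expected SOL (GB/s)"
        "achieved_old": old / expected_sol,   # "achieved SOL (old)"
        "achieved_new": new / expected_sol,   # "achieved SOL (new)"
    }


for name, old, new, k1, k2 in ROWS:
    d = derive(old, new, k1, k2)
    print(
        f"{name:17s} {d['relative']:.0%} {d['expected_sol']:.2f} "
        f"{d['achieved_old']:.2%} {d['achieved_new']:.2%}"
    )
```

Read this way, the `no-output` row is the one case where the new path is slower (about 78% of the old throughput), which fits the description: with no output to write, there is nothing for write-coalescing to improve.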

commit a3aacd8915fa503ea4be8e1d7797a080e0427923
Merge: 2de9fa7bd8 ff30c02111
Author: gpuCI <[email protected]>
Date:   Tue Jul 23 15:04:01 2024 -0400

    Merge pull request #16366 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit ff30c0211109e14b1f6918fcc6c2e2b98f863a1f
Author: Nghia Truong <[email protected]>
Date:   Tue Jul 23 12:03:55 2024 -0700

    Fix compile warnings with `jni_utils.hpp` (#16336)

    This fixes the compiler warnings with `jni_utils.hpp`, removing some `const` qualifiers that are redundant.

    Closes https://github.com/rapidsai/cudf/issues/16335.

    Authors:
      - Nghia Truong (https://github.com/ttnghia)

    Approvers:
      - Jason Lowe (https://github.com/jlowe)

    URL: https://github.com/rapidsai/cudf/pull/16336

commit 2de9fa7bd821c7b1653340dfca4e6a1e9e216cc5
Merge: bc609fb648 e6d412cba7
Author: gpuCI <[email protected]>
Date:   Tue Jul 23 07:03:33 2024 -0400

    Merge pull request #16364 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit e6d412cba7c23df7ee500c28257ed9281cea49b9
Author: brandon-b-miller <[email protected]>
Date:   Tue Jul 23 06:03:28 2024 -0500

    Fall back when casting a timestamp to numeric in cudf-polars (#16232)

    This PR adds logic that falls back to CPU when a cudf-polars query would cast a timestamp column to a numeric type, an unsupported operation in libcudf, which should fix a few polars tests. It could be cleaned up a bit with some of the utilities that will be added in https://github.com/rapidsai/cudf/pull/16150.

    Authors:
      - https://github.com/brandon-b-miller

    Approvers:
      - Lawrence Mitchell (https://github.com/wence-)

    URL: https://github.com/rapidsai/cudf/pull/16232

commit bc609fb6482e32152d64f3e9d34aaa4cb9b87cec
Merge: 023dba6fab c7b28ceeb4
Author: gpuCI <[email protected]>
Date:   Tue Jul 23 06:28:20 2024 -0400

    Merge pull request #16363 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit c7b28ceeb46d2b921e30f081a9ed97745c91ff9e
Author: brandon-b-miller <[email protected]>
Date:   Tue Jul 23 05:28:13 2024 -0500

    Add `drop_nulls` in `cudf-polars` (#16290)

    Closes https://github.com/rapidsai/cudf/issues/16219

    Authors:
      - https://github.com/brandon-b-miller

    Approvers:
      - Lawrence Mitchell (https://github.com/wence-)

    URL: https://github.com/rapidsai/cudf/pull/16290

commit 023dba6fab1c00116b11ff10fc7536d4f9e78fcd
Merge: 4a0813b681 0cac2a9d68
Author: gpuCI <[email protected]>
Date:   Mon Jul 22 17:18:26 2024 -0400

    Merge pull request #16359 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 0cac2a9d68341a38721be16132ead14cf4a0d70b
Author: Shruti Shivakumar <[email protected]>
Date:   Mon Jul 22 14:18:21 2024 -0700

    Remove size constraints on source files in batched JSON reading (#16162)

    Addresses https://github.com/rapidsai/cudf/issues/16138
    The batched multi-source JSON reader fails when the size of any of the input source buffers exceeds `INT_MAX` bytes.
    The goal of this PR is to remove this constraint by modifying the batching behavior of the reader. Instead of constructing batches that include entire source files, the batches are now constructed at the granularity of byte ranges of size at most `INT_MAX` bytes.

    Authors:
      - Shruti Shivakumar (https://github.com/shrshi)

    Approvers:
      - Vukasin Milovanovic (https://github.com/vuule)
      - Karthikeyan (https://github.com/karthikeyann)

    URL: https://github.com/rapidsai/cudf/pull/16162
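
The batching change in the commit above can be pictured with a small sketch: instead of assigning whole sources to a batch, each batch is filled with byte ranges until it reaches the size cap, so a single oversized source is simply split across batches. This is a simplified model, not the libcudf implementation. The function name and the `(source_index, offset, size)` tuple layout are made up for illustration, and the sketch ignores the question of JSON records that straddle a range boundary.

```python
INT_MAX = 2**31 - 1


def byte_range_batches(source_sizes, max_batch=INT_MAX):
    """Yield batches of (source_index, offset, size) byte ranges.

    The sizes within each batch sum to at most `max_batch` bytes.
    Hypothetical helper for illustration only.
    """
    if max_batch < 1:
        raise ValueError("max_batch must be positive")
    batch, used = [], 0
    for index, size in enumerate(source_sizes):
        offset = 0
        while offset < size:
            # Take as much of this source as still fits in the batch.
            take = min(size - offset, max_batch - used)
            batch.append((index, offset, take))
            offset += take
            used += take
            if used == max_batch:
                yield batch
                batch, used = [], 0
    if batch:
        yield batch


# Small cap so the splitting is easy to see.
for b in byte_range_batches([5, 3, 10], max_batch=6):
    print(b)
```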

commit 4a0813b68158474b00d3e7c692310b62b48fe2fc
Merge: a4acaa7177 81e65ee312
Author: gpuCI <[email protected]>
Date:   Mon Jul 22 16:18:45 2024 -0400

    Merge pull request #16357 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 81e65ee312af5133ca2b98d52efaeb29c274a825
Author: GALI PREM SAGAR <[email protected]>
Date:   Mon Jul 22 15:18:40 2024 -0500

    Fix docstring of `DataFrame.apply` (#16351)

    This PR fixes docstring of `DataFrame.apply`

    Authors:
      - GALI PREM SAGAR (https://github.com/galipremsagar)

    Approvers:
      - Matthew Roeschke (https://github.com/mroeschke)

    URL: https://github.com/rapidsai/cudf/pull/16351

commit a4acaa717798a3a09a57ab333965c00666d9d808
Merge: 0868314b1d 996cb8d870
Author: gpuCI <[email protected]>
Date:   Mon Jul 22 16:15:22 2024 -0400

    Merge pull request #16356 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 996cb8d870b7b6153802bde670435e8cd3b8775d
Author: Matthew Murray <[email protected]>
Date:   Mon Jul 22 16:15:16 2024 -0400

    Migrate lists/sorting to pylibcudf (#16179)

    Part of #15162

    Authors:
      - Matthew Murray (https://github.com/Matt711)

    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: https://github.com/rapidsai/cudf/pull/16179

commit 0868314b1d5f2ca31eb56f4fee5f75de42b22fbe
Merge: a3ebf3badd c14c8bf59f
Author: gpuCI <[email protected]>
Date:   Mon Jul 22 15:04:01 2024 -0400

    Merge pull request #16355 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit c14c8bf59fd1e97fe94c8dfd2db6df7f9a6c65ad
Author: Thomas Li <[email protected]>
Date:   Mon Jul 22 12:03:56 2024 -0700

    Implement parquet reading using pylibcudf in cudf-polars (#16346)

    Replace cudf-classic with pylibcudf for parquet reading in cudf-polars.

    Authors:
      - Thomas Li (https://github.com/lithomas1)

    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: https://github.com/rapidsai/cudf/pull/16346

commit a3ebf3badd0c7375b3f24dd466d4db8fa127000e
Merge: edbb1bcd9c e0a00c1fcb
Author: gpuCI <[email protected]>
Date:   Mon Jul 22 15:03:29 2024 -0400

    Merge pull request #16354 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit e0a00c1fcb4b72b7abd29debe5b2f6b38081d39a
Author: Jayjeet Chakraborty <[email protected]>
Date:   Mon Jul 22 12:03:24 2024 -0700

    Add `stream` param to list explode APIs (#16317)

    Add `stream` param to list `explode*` APIs. Partially fixes https://github.com/rapidsai/cudf/issues/13744

    Authors:
      - Jayjeet Chakraborty (https://github.com/JayjeetAtGithub)

    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: https://github.com/rapidsai/cudf/pull/16317

commit edbb1bcd9c363876b79039caf7176270ee3eba03
Merge: b52ec0f436 e54b82c9f3
Author: gpuCI <[email protected]>
Date:   Mon Jul 22 15:03:09 2024 -0400

    Merge pull request #16353 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit e54b82c9f3499b35e7e789d41d2042a5d5a80810
Author: Mark Harris <[email protected]>
Date:   Tue Jul 23 05:03:04 2024 +1000

    Use resource_ref for upstream in stream_checking_resource_adaptor (#16187)

    As we move toward replacing all `device_memory_resource` pointers with `resource_ref`s, there are some places that changes can be made ahead of RMM to simplify required changes as RMM is refactored.

    In this PR I eliminate the unnecessary `Upstream` template parameter from `cudf_test::stream_checking_resource_adaptor`, and use a `device_async_resource` for the upstream resource.   A similar change will be made to all RMM resource adaptors, but this one can be done without deprecations since it is just a test utility.

    Authors:
      - Mark Harris (https://github.com/harrism)
      - Vyas Ramasubramani (https://github.com/vyasr)

    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: https://github.com/rapidsai/cudf/pull/16187

commit b52ec0f436c549b79daf6d9379ad2851b8833dbe
Merge: 0135e46880 3053f42351
Author: gpuCI <[email protected]>
Date:   Mon Jul 22 13:56:45 2024 -0400

    Merge pull request #16352 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 3053f42351b04e22d873f78f5bc49f8b20ff17ac
Author: Jayjeet Chakraborty <[email protected]>
Date:   Mon Jul 22 10:56:39 2024 -0700

    Add missing `stream` param to dictionary factory APIs (#16319)

    Add `stream` param to dictionary column factory functions. Partially solves #13744

    Authors:
      - Jayjeet Chakraborty (https://github.com/JayjeetAtGithub)

    Approvers:
      - Mark Harris (https://github.com/harrism)
      - Yunsong Wang (https://github.com/PointKernel)

    URL: https://github.com/rapidsai/cudf/pull/16319

commit 0135e468808ccf7e8471e654bcd723eafb9c48c5
Merge: c53f9c54ac 135c99512e
Author: gpuCI <[email protected]>
Date:   Mon Jul 22 10:13:37 2024 -0400

    Merge pull request #16344 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 135c99512e5f7a2d38f6a870ad6883ccb39a3cce
Author: Matthew Roeschke <[email protected]>
Date:   Mon Jul 22 04:13:32 2024 -1000

    Align Series APIs with pandas 2.x (#16333)

    Similar to https://github.com/rapidsai/cudf/pull/16310, the following APIs have been modified to adjust/add parameters:

    * `reindex`
    * `reset_index`
    * `add_suffix`
    * `searchsorted`
    * `clip`
    * `mask`
    * `shift`
    * `dropna`
    * `rename`
    * `cov`
    * `apply`
    * `replace`

    Authors:
      - Matthew Roeschke (https://github.com/mroeschke)

    Approvers:
      - GALI PREM SAGAR (https://github.com/galipremsagar)

    URL: https://github.com/rapidsai/cudf/pull/16333

commit c53f9c54ac9e4d25350f04ffcb41ceb5bca9bdb2
Merge: c636778de3 852b151002
Author: gpuCI <[email protected]>
Date:   Mon Jul 22 09:48:23 2024 -0400

    Merge pull request #16343 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 852b151002dc76e9f09d3529c80e4b589f1df9fc
Author: Lawrence Mitchell <[email protected]>
Date:   Mon Jul 22 14:48:18 2024 +0100

    Fix issue in horizontal concat implementation in cudf-polars (#16271)

    Shorter tables must be extended to the same length as the longest table.

    Authors:
      - Lawrence Mitchell (https://github.com/wence-)

    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: https://github.com/rapidsai/cudf/pull/16271
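
A plain-Python picture of the rule stated in the commit above: when tables of different heights are concatenated horizontally, every column is extended to the height of the tallest table. The sketch assumes the padding value is null (`None` here), which is how Polars documents its horizontal concat. The dict-of-lists table representation and the `hconcat` name are illustrative and unrelated to the cudf-polars code.

```python
def hconcat(tables):
    """Horizontally concatenate dict-of-lists tables, padding with None.

    Illustrative helper: each table maps column name -> list of values.
    """
    height = max(
        (len(column) for table in tables for column in table.values()),
        default=0,
    )
    result = {}
    for table in tables:
        for name, column in table.items():
            if name in result:
                raise ValueError(f"duplicate column name: {name!r}")
            # Extend shorter columns to the height of the longest table.
            result[name] = list(column) + [None] * (height - len(column))
    return result


print(hconcat([{"a": [1, 2, 3]}, {"b": [4]}]))
```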

commit c636778de39491e24ace55d99dcfb29c574a20d2
Merge: dacc6c0baa e6537de747
Author: gpuCI <[email protected]>
Date:   Fri Jul 19 23:10:44 2024 -0400

    Merge pull request #16342 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit e6537de7474c91b4153542e6611c8a4e33a58caa
Author: Vyas Ramasubramani <[email protected]>
Date:   Fri Jul 19 20:10:40 2024 -0700

    Experimental support for configurable prefetching (#16020)

    This PR adds experimental support for prefetching managed memory at a select few points in libcudf. A new configuration object is introduced for handling whether prefetching is enabled or disabled, and whether to print debug information about pointers being prefetched. Prefetching control is managed on a per-API basis to enable profiling of the effects of prefetching different classes of data in different contexts. Prefetching in this PR always occurs on the default stream, so it will trigger synchronization with any blocking streams that the user has created. Turning on prefetching and then passing non-blocking streams to any libcudf APIs will trigger undefined behavior.

    Authors:
      - Vyas Ramasubramani (https://github.com/vyasr)

    Approvers:
      - David Wendt (https://github.com/davidwendt)
      - Kyle Edwards (https://github.com/KyleFromNVIDIA)
      - Thomas Li (https://github.com/lithomas1)
      - Muhammad Haseeb (https://github.com/mhaseeb123)

    URL: https://github.com/rapidsai/cudf/pull/16020

commit dacc6c0baa47c89fe8e0d1c3d246bcc94a4b6416
Merge: 1ccdf15dd7 c5b96003ce
Author: gpuCI <[email protected]>
Date:   Fri Jul 19 23:04:24 2024 -0400

    Merge pull request #16341 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit c5b96003cef00b2635923d03edcd48a13821a61e
Author: Thomas Li <[email protected]>
Date:   Fri Jul 19 20:04:19 2024 -0700

    Migrate Parquet reader to pylibcudf (#16078)

    xref #15162

    Migrates the parquet reader (and chunked parquet reader) to pylibcudf.

    (Does not migrate the writers or the metadata reader yet).

    Authors:
      - Thomas Li (https://github.com/lithomas1)
      - Vyas Ramasubramani (https://github.com/vyasr)

    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)
      - Lawrence Mitchell (https://github.com/wence-)

    URL: https://github.com/rapidsai/cudf/pull/16078

commit 1ccdf15dd736a1a08aa8f566a47ca0392ca33cac
Merge: 97e1bab151 26a3799d2f
Author: gpuCI <[email protected]>
Date:   Fri Jul 19 22:49:07 2024 -0400

    Merge pull request #16340 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 26a3799d2ff9ffb2aa72d63bb388b4bee70b3440
Author: Matthew Roeschke <[email protected]>
Date:   Fri Jul 19 16:49:01 2024 -1000

    Make ColumnAccessor strictly require a mapping of columns (#16285)

    `ColumnAccessor` had a default `data=None` argument and initialized an empty dict in the `__init__` if `data` was not passed. This PR now makes `data` a required argument.

    Additionally if `verify=True`, the `__init__` would call `as_column` on each `data.values()` allowing non-`ColumnBase` inputs. This PR now avoids this call and makes the caller responsible for ensuring the inputs are `ColumnBase`s

    Also, adds a few `verify=False` internally where we know we are passing columns from a libcudf op or reconstructing from another `ColumnAccessor`

    Authors:
      - Matthew Roeschke (https://github.com/mroeschke)

    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: https://github.com/rapidsai/cudf/pull/16285

commit 97e1bab151184aa537edf39b7e838c07e07271a9
Merge: 5ad4c877ed 75335f6af5
Author: gpuCI <[email protected]>
Date:   Fri Jul 19 21:21:32 2024 -0400

    Merge pull request #16339 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 75335f6af51bde6be68c1fb0a6caa8030b9eda3e
Author: Muhammad Haseeb <[email protected]>
Date:   Fri Jul 19 18:21:27 2024 -0700

    Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 (#16195)

    Closes #15389
    Closes #16186

    This PR adds the capability to calculate and report the number of rows read from each data source into the table returned by the Parquet reader (both chunked and normal). The returned vector of counts is only valid (non-empty) when row selection (AST filter) is not being used.

    This PR also fixes a segfault in chunked parquet reader when skip_rows > 0 and the number of passes > 1. This segfault was being caused by a couple of arithmetic errors when computing the (start_row, num_row)  for row_group_info, pass, column chunk descriptor structs.

    Both changes were added to this PR because the changes and the gtests from the former work were needed to implement the segfault fix.

    Authors:
      - Muhammad Haseeb (https://github.com/mhaseeb123)

    Approvers:
      - GALI PREM SAGAR (https://github.com/galipremsagar)
      - Vukasin Milovanovic (https://github.com/vuule)

    URL: https://github.com/rapidsai/cudf/pull/16195

commit 5ad4c877ed631094f358f87c003ee9b381e9e270
Merge: ebacf394d9 535db9b26e
Author: gpuCI <[email protected]>
Date:   Fri Jul 19 20:28:20 2024 -0400

    Merge pull request #16338 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 535db9b26ed1a57e4275f4a6f11b04ebeee21248
Author: Thomas Li <[email protected]>
Date:   Fri Jul 19 17:28:14 2024 -0700

    Deprecate Arrow support in I/O (#16132)

    Contributes to https://github.com/rapidsai/cudf/issues/15193

    Authors:
      - Thomas Li (https://github.com/lithomas1)
      - Vyas Ramasubramani (https://github.com/vyasr)

    Approvers:
      - Richard (Rick) Zamora (https://github.com/rjzamora)
      - Lawrence Mitchell (https://github.com/wence-)

    URL: https://github.com/rapidsai/cudf/pull/16132

commit ebacf394d975fa5a0f65a7337d5587c9e8273902
Merge: b11cdf854d e169e8e427
Author: gpuCI <[email protected]>
Date:   Fri Jul 19 19:36:08 2024 -0400

    Merge pull request #16337 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit e169e8e4273e4d317e3f27c810c5b137dd75adb3
Author: Thomas Li <[email protected]>
Date:   Fri Jul 19 16:36:03 2024 -0700

    Implement read_csv in cudf-polars using pylibcudf (#16307)

    Replace cudf-classic with pylibcudf for CSV reading in cudf-polars

    Authors:
      - Thomas Li (https://github.com/lithomas1)
      - Vyas Ramasubramani (https://github.com/vyasr)

    Approvers:
      - Lawrence Mitchell (https://github.com/wence-)

    URL: https://github.com/rapidsai/cudf/pull/16307

commit b11cdf854d64e248d682ad2d8178f8ae08e34b3e
Merge: d82caec4e0 5dde41d7f7
Author: gpuCI <[email protected]>
Date:   Fri Jul 19 19:08:41 2024 -0400

    Merge pull request #16334 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 5dde41d7f7533180ecd355bac248a7ed18adcc10
Author: Matthew Roeschke <[email protected]>
Date:   Fri Jul 19 13:08:36 2024 -1000

    Replace is_float/integer_dtype checks with .kind checks (#16261)

    These checks were being called where we already had a dtype object, so we can simply check its .kind attribute instead.

    Authors:
      - Matthew Roeschke (https://github.com/mroeschke)

    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: https://github.com/rapidsai/cudf/pull/16261
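
As a rough illustration of the `.kind` approach, here is a minimal sketch using plain NumPy dtypes. The helper names `is_integer_kind` and `is_float_kind` are hypothetical and are not taken from the cudf code base; only `numpy.dtype.kind` and `numpy.issubdtype` are real NumPy APIs.

```python
import numpy as np


def is_integer_kind(dtype: np.dtype) -> bool:
    # NumPy kind codes: "i" = signed integer, "u" = unsigned integer
    return dtype.kind in "iu"


def is_float_kind(dtype: np.dtype) -> bool:
    # NumPy kind code: "f" = floating point
    return dtype.kind == "f"


# For plain NumPy dtypes the kind check agrees with np.issubdtype.
for name in ("int8", "uint32", "float64", "bool", "datetime64[ns]"):
    dt = np.dtype(name)
    assert is_integer_kind(dt) == np.issubdtype(dt, np.integer)
    assert is_float_kind(dt) == np.issubdtype(dt, np.floating)
```

The kind check is a single attribute lookup on an object that already exists, which is why it can stand in for a more general helper once the input is known to be a dtype.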

commit d82caec4e04468b497f2d553221c6314c53f9d10
Merge: 3c3ee56637 4c46628eaf
Author: gpuCI <[email protected]>
Date:   Fri Jul 19 18:51:12 2024 -0400

    Merge pull request #16332 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 4c46628eaf7ba16a2a181ceb3311f315cd4932dc
Author: Matthew Roeschke <[email protected]>
Date:   Fri Jul 19 12:51:07 2024 -1000

    Mark cudf._typing as a typing module in ruff (#16318)

    Additionally breaks up the previous single-line list of enabled `select` rules.

    Authors:
      - Matthew Roeschke (https://github.com/mroeschke)

    Approvers:
      - Thomas Li (https://github.com/lithomas1)
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: https://github.com/rapidsai/cudf/pull/16318

commit 3c3ee56637116e07804f20efab46d4dd3aa7c4cf
Merge: 1cb07e0c29 7d3083254c
Author: gpuCI <[email protected]>
Date:   Fri Jul 19 18:48:43 2024 -0400

    Merge pull request #16331 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 7d3083254c0503b07f82af32188120f42acef860
Author: Matthew Roeschke <[email protected]>
Date:   Fri Jul 19 12:48:39 2024 -1000

    Replace np.isscalar/issubdtype checks with is_scalar/.kind checks (#16275)

    * `is_scalar` also handles cudf.Scalars which should be handled internally
    * `issubdtype` can largely be replaced by checking the `.kind` attribute on the dtype

    Authors:
      - Matthew Roeschke (https://github.com/mroeschke)

    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: https://github.com/rapidsai/cudf/pull/16275

commit 1cb07e0c29c0b6acd1896ecef867afeca27a84c1
Merge: 52657b3375 57ed7fce67
Author: gpuCI <[email protected]>
Date:   Fri Jul 19 18:25:01 2024 -0400

    Merge pull request #16330 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 57ed7fce6742abc96a8fd65216f032bad5937a2f
Author: brandon-b-miller <[email protected]>
Date:   Fri Jul 19 17:24:55 2024 -0500

    Add tests for `pylibcudf` binaryops (#15470)

    This PR implements a more general approach to testing binaryops that originally came up in https://github.com/rapidsai/cudf/pull/15279. This PR can possibly supersede that one.

    Authors:
      - https://github.com/brandon-b-miller

    Approvers:
      - Lawrence Mitchell (https://github.com/wence-)
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: https://github.com/rapidsai/cudf/pull/15470

commit 52657b3375c900a66b6ec5f8d7e1ebe37c38232f
Merge: 6be515506d ecc27a1140
Author: gpuCI <[email protected]>
Date:   Fri Jul 19 17:55:45 2024 -0400

    Merge pull request #16329 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit ecc27a1140c0c287091f6a1291dfaf7ccd82cb19
Author: Matthew Roeschke <[email protected]>
Date:   Fri Jul 19 11:55:40 2024 -1000

    Align more DataFrame APIs with pandas (#16310)

    I have a script that compares signatures between the `pandas.DataFrame` and `cudf.DataFrame` APIs, and it appears some signatures changed between the pandas 1.x and 2.x releases. The API changes in this PR mostly add implementations or add missing keyword arguments (although the keywords might not be implemented). The APIs affected are:

    * `__init__`
    * `__array__`
    * `__arrow_c_stream__`
    * `to_dict`
    * `where`
    * `add_prefix`
    * `join`
    * `apply`
    * `to_records`
    * `from_records`
    * `unstack`
    * `pct_change`
    * `sort_values`

    Marking as breaking as I ensured some added keywords are in the same positions as pandas and therefore might break users who are using purely positional arguments.

    Authors:
      - Matthew Roeschke (https://github.com/mroeschke)
      - GALI PREM SAGAR (https://github.com/galipremsagar)

    Approvers:
      - GALI PREM SAGAR (https://github.com/galipremsagar)

    URL: https://github.com/rapidsai/cudf/pull/16310
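
The breaking-change note above can be illustrated with a small sketch. The two functions below are hypothetical stand-ins and do not reproduce any actual cudf or pandas signature; they only show how inserting keywords ahead of an existing one changes what a purely positional call binds to.

```python
def api_before(other, on=None, how="left", sort=False):
    """Hypothetical signature before new keywords were inserted."""
    return {"on": on, "how": how, "sort": sort}


def api_after(other, on=None, how="left", lsuffix="", rsuffix="", sort=False):
    """Hypothetical signature after inserting keywords ahead of `sort`."""
    return {"on": on, "how": how, "lsuffix": lsuffix, "rsuffix": rsuffix, "sort": sort}


# The same positional call binds its fourth argument differently.
before = api_before("other", None, "left", True)
after = api_after("other", None, "left", True)

# Passing the argument by keyword is unaffected by the reordering.
after_kw = api_after("other", None, "left", sort=True)
```

In the second call the value `True` lands in `lsuffix` rather than `sort`, which is the kind of silent change a purely positional caller would hit.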

commit 6be515506d4a6f833e71ac67f16c2925f7b8576b
Merge: fcaea56166 6e37afc7c9
Author: gpuCI <[email protected]>
Date:   Fri Jul 19 17:52:32 2024 -0400

    Merge pull request #16328 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 6e37afc7c9e177b307c41950e52453bd5906af44
Author: Matthew Roeschke <[email protected]>
Date:   Fri Jul 19 11:52:27 2024 -1000

    Make __bool__ raise for more cudf objects (#16311)

    To match pandas, this PR makes `DataFrame`, `MultiIndex` and `RangeIndex` raise on `__bool__`.

    Authors:
      - Matthew Roeschke (https://github.com/mroeschke)
      - GALI PREM SAGAR (https://github.com/galipremsagar)

    Approvers:
      - GALI PREM SAGAR (https://github.com/galipremsagar)

    URL: https://github.com/rapidsai/cudf/pull/16311
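
A minimal sketch of the behaviour, using a hypothetical `Frame` class rather than any real cudf or pandas type. The error message is illustrative and is not the exact text either library emits.

```python
class Frame:
    """Hypothetical container standing in for a DataFrame-like object."""

    def __init__(self, data):
        self._data = dict(data)

    def __len__(self):
        return len(self._data)

    @property
    def empty(self):
        return len(self._data) == 0

    def __bool__(self):
        # Without this override, Python would fall back to __len__ and
        # silently treat an empty frame as False.
        raise ValueError(
            f"The truth value of a {type(self).__name__} is ambiguous. "
            "Use .empty or len() instead."
        )


def truthiness_raises(obj) -> bool:
    """Return True if evaluating bool(obj) raises ValueError."""
    try:
        bool(obj)
    except ValueError:
        return True
    return False
```

Raising forces callers to state what they mean (`frame.empty`, `len(frame) > 0`, and so on) instead of relying on an implicit length-based truth value.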

commit fcaea56166e2d8f8b1916d702ec8572a9e12b2be
Merge: 051fadd250 910989eb8f
Author: gpuCI <[email protected]>
Date:   Fri Jul 19 17:48:42 2024 -0400

    Merge pull request #16327 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 910989eb8fb87b2e896aa032260705c27cce71e0
Author: Bradley Dice <[email protected]>
Date:   Fri Jul 19 15:48:37 2024 -0600

    Rename gather/scatter benchmarks to clarify coalesced behavior. (#16083)

    The benchmark names `coalesce_x` and `coalesce_o` are not very clear. This PR renames them to `coalesced` and `shuffled`. This was discussed with @GregoryKimball.

    Authors:
      - Bradley Dice (https://github.com/bdice)
      - Vyas Ramasubramani (https://github.com/vyasr)

    Approvers:
      - Karthikeyan (https://github.com/karthikeyann)
      - Mike Wilson (https://github.com/hyperbolic2346)

    URL: https://github.com/rapidsai/cudf/pull/16083
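
To make the two names concrete, here is a host-side sketch of the gather maps they suggest. It assumes "coalesced" means neighbouring outputs read neighbouring inputs and "shuffled" means the same indices in random order; the helpers `make_gather_map` and `gather` are hypothetical and are not the benchmark code.

```python
import random
from typing import List, Sequence


def make_gather_map(n: int, coalesced: bool, seed: int = 0) -> List[int]:
    """Build a gather map of length n.

    coalesced=True  -> indices in ascending order
    coalesced=False -> the same indices in a random ("shuffled") order
    """
    indices = list(range(n))
    if not coalesced:
        random.Random(seed).shuffle(indices)
    return indices


def gather(source: Sequence, gather_map: Sequence[int]) -> list:
    # out[i] = source[gather_map[i]]
    return [source[i] for i in gather_map]
```

Both maps touch every element exactly once, so any difference in benchmark timing between them comes from the memory access pattern rather than the amount of data moved.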

commit 051fadd2500bc20b90b74d662deec918ee27f299
Merge: ece86996ad fa0d89d9b4
Author: gpuCI <[email protected]>
Date:   Fri Jul 19 17:46:33 2024 -0400

    Merge pull request #16326 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit fa0d89d9b4b4152b919999b5f01b1e68407469c5
Author: Matthew Roeschke <[email protected]>
Date:   Fri Jul 19 11:46:28 2024 -1000

    Clean unneeded/redundant dtype utils (#16309)

    * Replace `min_scalar_type` with `min_signed_type` (the former just called the latter)
    * Replace `numeric_normalize_types` with `find_common_dtype` followed by a column `astype`
    * Replace `_NUMPY_SCTYPES` with hardcoded integer/floating types or with `np.integer`/`np.floating`

    Authors:
      - Matthew Roeschke (https://github.com/mroeschke)

    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: https://github.com/rapidsai/cudf/pull/16309

commit ece86996ad69b1631e0da6f4dfb551cda38585a8
Merge: f47c891a2e 18f5fe0010
Author: gpuCI <[email protected]>
Date:   Fri Jul 19 17:41:47 2024 -0400

    Merge pull request #16325 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 18f5fe0010fd42f604a340cd025a9ca9e122c6f5
Author: Thomas Li <[email protected]>
Date:   Fri Jul 19 14:41:39 2024 -0700

    Fix polars for 1.2.1 (#16316)

    I think Polars made a breaking change in a patch release.
    At least the error we're getting looks like the error from
    https://github.com/pola-rs/polars/pull/17606.

    Authors:
      - Thomas Li (https://github.com/lithomas1)

    Approvers:
      - Lawrence Mitchell (https://github.com/wence-)
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: https://github.com/rapidsai/cudf/pull/16316

commit f47c891a2ea3a0de4bb0462d557531e046860fbb
Merge: c61638cbeb 3df4ac2842
Author: gpuCI <[email protected]>
Date:   Fri Jul 19 16:46:23 2024 -0400

    Merge pull request #16323 from rapidsai/branch-24.08

    Forward-merge branch-24.08 into branch-24.10

commit 3df4ac28423b99e4dd88570da8d55e2e5af2e1bc
Author: Matthew Roeschke <[email protected]>
Date:   Fri Jul 19 10…
rapids-bot bot pushed a commit that referenced this issue Aug 16, 2024
…nterface (#16548)

This PR rewrites all remaining parts of the Python interop code that previously used Arrow C++ types to use the C Data Interface instead. With this change, we no longer require pyarrow in that part of the Cython code. There are further improvements we should make to streamline the internals, but I would like to keep this changeset minimal, since merging it unblocks work on multiple fronts that can then proceed in parallel.

Contributes to #15193

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #16548
rapids-bot bot pushed a commit that referenced this issue Aug 22, 2024
Contributes to #15193.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Bradley Dice (https://github.com/bdice)
  - David Wendt (https://github.com/davidwendt)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #16590
@vyasr vyasr mentioned this issue Aug 22, 2024
3 tasks
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.
Projects
Status: Done
Status: Story Issue