
Add support for Dask versions >=2024.3.0 with dask expressions #288

Merged 6 commits into main on Apr 24, 2024

Conversation

@smcguire-cmu (Contributor) commented Apr 23, 2024

Dask's update to use dask expressions for DataFrames introduced a few behavior changes that caused errors. This is mostly Sandro's work to fix the bugs that were introduced. I think there should probably be some more refactoring, but I wanted to get this in since the breakage was causing issues with Tape integration and with python 3.11.9+ as per #285.

  • Empty DataFrames are now handled differently: from_delayed now fails for an empty list of delayed objects, and from_pandas with npartitions=0 generates a DataFrame that errors on compute(). For now, the solution is to make a ddf with a single empty pandas df as a partition. I thought we were generating empty DataFrames properly before, but this is actually what Dask used to do with the empty input cases; it just throws an error now instead. I'll make an issue to investigate how to properly create empty dask dfs. Currently we have a lot of repeated code to create a Catalog from delayed objects, which meant this had to be changed in a few places, but we already have an issue (Refactor common catalog creation logic #143) for refactoring this.
  • from_delayed no longer converts object columns to pyarrow strings, but from_map, which we use for loading from hipscat, does. The unit test that compares from_dataframe to from_hipscat has had its input changed to load the input df with the pyarrow string class. I'm hoping the solution to "Support loading strings with the pyarrow backend to use the string[pyarrow] pandas dtype" #279 will fix this consistently without this workaround being needed.
  • The DataFrame type is now dask_expr._collection.DataFrame instead of dd.core.DataFrame, so the type checks have been updated.


github-actions bot commented Apr 23, 2024

| Before [286fa7a] <v0.2.0> | After [98a9314] | Ratio | Benchmark (Parameter) |
|---------------------------|-----------------|-------|-----------------------|
| 11.2±0.7ms | 11.1±0.1ms | 0.99 | benchmarks.time_polygon_search |
| 33.0±0.8ms | 32.1±0.3ms | 0.97 | benchmarks.time_kdtree_crossmatch |
| 5.66±0.2ms | 5.36±0.2ms | 0.95 | benchmarks.time_box_filter_on_partition |



codecov bot commented Apr 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.97%. Comparing base (286fa7a) to head (064e786).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #288   +/-   ##
=======================================
  Coverage   98.96%   98.97%           
=======================================
  Files          41       41           
  Lines        1260     1264    +4     
=======================================
+ Hits         1247     1251    +4     
  Misses         13       13           


@smcguire-cmu merged commit 12ec915 into main Apr 24, 2024
11 checks passed
@smcguire-cmu deleted the issue/181/dask-expr branch April 24, 2024 12:59