
Add support for Dask versions >=2024.3.0 with dask expressions #288

Merged 6 commits into main on Apr 24, 2024

Conversation

@smcguire-cmu (Contributor) commented Apr 23, 2024

Dask's update to use dask expressions for DataFrames introduced a few behavior changes that caused errors. This is mostly Sandro's work to fix the bugs that were introduced. I think there should probably be some more refactoring, but I wanted to get this in since the breakage was causing issues with Tape integration and with python 3.11.9+ as per #285.

  • Empty DataFrames are now handled differently: from_delayed now fails for an empty list of delayed objects, and from_pandas with npartitions=0 generates a DataFrame that errors on compute(). For now, the solution is to make a ddf with a single empty pandas df as a partition. I thought we were generating empty DataFrames properly before, but this is actually what Dask used to do with the empty input cases; it just throws an error now instead. I'll make an issue to investigate how to properly create empty dask dfs. Currently we have a lot of repeated code to create a Catalog from delayed objects, which meant this had to be changed in a few places, but we already have an issue (Refactor common catalog creation logic #143) for refactoring this.
  • from_delayed no longer converts object columns to pyarrow strings, but from_map, which we use for loading from hipscat, does. The unit test that compares from_dataframe to from_hipscat has had its input changed to load the input df with the pyarrow string class. I'm hoping the solution to "Support loading strings with the pyarrow backend to use the string[pyarrow] pandas dtype" #279 will fix this consistently without this workaround being needed.
  • The DataFrame type is now dask_expr._collection.DataFrame instead of dd.core.DataFrame, so the type checks have been updated.


github-actions bot commented Apr 23, 2024

| Before [286fa7a] <v0.2.0> | After [98a9314] | Ratio | Benchmark (Parameter) |
|---------------------------|-----------------|-------|-----------------------|
| 11.2±0.7ms | 11.1±0.1ms | 0.99 | benchmarks.time_polygon_search |
| 33.0±0.8ms | 32.1±0.3ms | 0.97 | benchmarks.time_kdtree_crossmatch |
| 5.66±0.2ms | 5.36±0.2ms | 0.95 | benchmarks.time_box_filter_on_partition |



codecov bot commented Apr 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.97%. Comparing base (286fa7a) to head (064e786).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #288   +/-   ##
=======================================
  Coverage   98.96%   98.97%           
=======================================
  Files          41       41           
  Lines        1260     1264    +4     
=======================================
+ Hits         1247     1251    +4     
  Misses         13       13           


@smcguire-cmu merged commit 12ec915 into main Apr 24, 2024
11 checks passed
@smcguire-cmu deleted the issue/181/dask-expr branch April 24, 2024 12:59