Add support for Dask versions >=2024.3.0 with dask expressions #288
Dask's update to use dask expressions for dataframes introduced a few behavior changes that caused errors. This is mostly Sandro's work to fix the bugs introduced. I think there should probably be some more refactoring, but I wanted to get this in since the breakage was causing issues with Tape integration and with Python 3.11.9+, as per #285.
- `from_delayed` now fails for an empty list of delayed objects, and `from_pandas` with `npartitions=0` generates a DataFrame that gives an error on `compute()`. For now, the solution is to make a ddf with a single empty pandas df as a partition. I thought that we were generating empty data frames properly before, but this is actually what Dask used to do with the empty input cases; it just throws an error now instead. I'll make an issue to investigate how to actually do empty dask dfs. Currently we have a lot of repeated code to create a Catalog from delayed objects, which meant this had to be changed in a few places, but we already have an issue (Refactor common catalog creation logic #143) for refactoring this.
- `from_delayed` doesn't convert object columns to pyarrow strings any more, but `from_map`, which we use for loading from hipscat, does. The unit test that compares `from_dataframe` to `from_hipscat` has had its input changed to load the input df with the pyarrow string class. I'm hoping the solution to "Support loading strings with the pyarrow backend to use the `string[pyarrow]` pandas dtype" #279 will fix this consistently without this fix being needed.
- The dataframe class is now `dask_expr._collection.DataFrame` instead of `dd.core.DataFrame`, so this has been updated.