
issues with dask interpolate #185

@knaaptime

Description


I've finally had some time to take a closer look at the dask backend, so I'm starting a thread here to triage some thoughts. In this gist I'm looking at timings for some pretty big datasets (source=2.7m, target=137k). tl;dr (a rough sketch of the three setups follows the list):

  • the single-core interpolation takes 4-4.5 minutes
  • the parallelized version takes 3.5 minutes
  • the dask-based version takes 1.5 minutes
    • (that's the generic scheduler without any configuration; if I try to use distributed, everything crashes)
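For context, a minimal sketch of the three variants being timed, assuming tobler's current API; the paths, partition counts, and the `land_use`/`target_id` column names are placeholders rather than the gist's actual code:

```python
import geopandas
import dask_geopandas
from tobler.area_weighted import area_interpolate, area_interpolate_dask

source = geopandas.read_parquet("source.parquet")  # ~2.7m rows (placeholder path)
target = geopandas.read_parquet("target.parquet")  # ~137k rows (placeholder path)

# single core (~4-4.5 min)
est = area_interpolate(source, target, categorical_variables=["land_use"])

# joblib-parallelized (~3.5 min)
est = area_interpolate(source, target, categorical_variables=["land_use"], n_jobs=-1)

# dask-based (~1.5 min on the default scheduler); the inputs are spatially
# partitioned dask-geopandas frames, which is where the 'hilbert_distance'
# index mentioned below comes from
dsource = dask_geopandas.from_geopandas(source, npartitions=16).spatial_shuffle()
dtarget = dask_geopandas.from_geopandas(target, npartitions=16).spatial_shuffle()
est = area_interpolate_dask(dsource, dtarget, id_col="target_id",
                            categorical_variables=["land_use"])
```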

So this is really nice when you've got a large enough dataset, and it will be great to get it fleshed out for the other variable types. Now that I've interacted with the code a little more:

  • I'm not sure what the dtype error is in the currently failing tests, but I'm not seeing it locally
  • why do categoricals need to be categorical dtype? Usually they are just strings, so this line can be a little confusing when you get back `AttributeError: Can only use .cat accessor with a 'category' dtype`. If dask absolutely requires a categorical dtype, we should do a check or a conversion (first sketch after this list)
    • what's the purpose of `category_vars`? The categorical variables should already come back formatted like that, so this block is basically just doing the same thing, but forcing categorical dtype instead of using `unique`
  • if not given, `id_col` should probably default to the target GeoDataFrame's index
    • we probably want to drop the dask index ('hilbert_distance' in the current example) that comes back from `area_interpolate_dask`, and instead set the index to `id_col` (second sketch below)
  • the real work here is still done by the `area_interpolate` function, and dask is acting as a scheduler (basically a drop-in replacement for joblib in the parallelized implementation). So all the actual computation of extensive/intensive/categorical variables happens in the workers, and the results just need to be stuck back together at the end. Shouldn't we have a single aggregation instead of one per variable type? The weighting and summation have already happened by that point, so every column should get the same groupby/apply (last sketch below)
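On the categorical-dtype point: if dask genuinely needs `category` dtype for the `.cat` accessor, a small guard up front would turn the confusing `AttributeError` into a quiet conversion. A minimal sketch (hypothetical helper, not the current code):

```python
import pandas as pd

def ensure_category_dtype(df, categorical_variables):
    """Coerce plain string/object columns to category dtype before dask
    touches them, since the .cat accessor only works on category dtype."""
    for col in categorical_variables:
        if not isinstance(df[col].dtype, pd.CategoricalDtype):
            df[col] = df[col].astype("category")
    return df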
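On the index point, the fix could be as small as re-keying the output, assuming the result keeps the id column around (`target_id` is a placeholder name):

```python
# drop the 'hilbert_distance' partitioning index that spatial_shuffle
# introduced and key the output by the target ids instead
est = est.reset_index(drop=True).set_index("target_id")
```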
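And on the aggregation point: if every per-worker result is already weighted and summed, the final reduction could be one type-agnostic groupby rather than a branch per variable type. A rough sketch, assuming the pieces arrive as a list of frames keyed by the target id (`chunk_results` and `target_id` are hypothetical names):

```python
import pandas as pd

# each chunk is an area_interpolate result for one source/target partition
# pair; since the weighting/summation already happened in the workers, the
# same reduction can apply to extensive, intensive, and categorical columns
combined = pd.concat(chunk_results).groupby("target_id").sum()
```

Whether a plain sum is the right reduction for every column depends on how the per-chunk pieces are normalized, but the shape of the aggregation would be the same either way.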
