WIP Native Dask implementation for area interpolation (Do not merge) #187
This is my first attempt at implementing area interpolation fully in Dask (as opposed to running the single-core logic inside the Dask scheduler). The main motivation is to obtain correct estimates for intensive/extensive variables, which are currently erroneous under the chunk-by-chunk code that does work for categorical variables (as discussed in #185).
A couple of thoughts on what I learned:
The heavy lifting for intensive/extensive variables happens in the table of weights built in `tobler/area_weighted/area_interpolate.py`, lines 316 to 321 at ce6fcb9.
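For context, that construction is essentially a row-wise normalisation of a sparse source-by-target table of shared areas. A minimal sketch of the idea (hypothetical toy data, not the exact snippet referenced above):

```python
import numpy as np
from scipy.sparse import coo_matrix, diags

# Hypothetical toy example: entry (i, j) is the area shared between
# source polygon i and target polygon j; only non-zeros are stored.
rows = np.array([0, 0, 1])            # source indices
cols = np.array([0, 1, 1])            # target indices
shared = np.array([2.0, 3.0, 4.0])    # intersection areas

table = coo_matrix((shared, (rows, cols)), shape=(2, 2)).tocsr()

# For extensive variables, each source value is split across targets
# proportionally to shared area: divide every row by its total.
den = np.asarray(table.sum(axis=1)).ravel()
den[den == 0] = 1.0                   # guard against empty rows
weights = diags(1.0 / den) @ table
```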
This means we need to build that table of weights before we split things to each worker for independent work. Now, the weights can be built in parallel (this is essentially about filling in different parts of a matrix which are not all related to each other). This is what I attempt in this PR with `id_area_table`, which is copied from the single-core implementation, works in chunks, and is added to the existing Dask computation graph. It returns a three-column Dask dataframe with source ID, target ID, and shared area, where only non-zero values are stored (i.e., you never have a row in this table with a value of 0 in the third column). As @knaaptime suggests, this is not a million miles away from the new graph implementation. `id_area_table`, which builds the cross-over table, can be run in parallel with minimal inter-worker communication; it is performant and returns what is expected. A sketch of the per-chunk logic follows below.
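Roughly, each chunk computes something like this (hypothetical names; assuming a recent GeoPandas where `sindex.query` accepts an array of geometries):

```python
import geopandas
import pandas

def id_area_chunk(
    source_chunk: geopandas.GeoDataFrame, target_df: geopandas.GeoDataFrame
) -> pandas.DataFrame:
    # Find intersecting source/target pairs via the targets' spatial index.
    src_pos, tgt_pos = target_df.sindex.query(
        source_chunk.geometry, predicate="intersects"
    )
    # Exact shared area for each intersecting pair.
    shared = (
        source_chunk.geometry.values[src_pos]
        .intersection(target_df.geometry.values[tgt_pos])
        .area
    )
    out = pandas.DataFrame(
        {
            "source_id": source_chunk.index[src_pos],
            "target_id": target_df.index[tgt_pos],
            "area": shared,
        }
    )
    # Keep only pairs that genuinely share area (touching boundaries
    # intersect but have zero-area overlap).
    return out[out["area"] > 0]
```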
The logic of the code (https://github.com/darribas/tobler/blob/5ce79e71a89fc4ede93ec22f6eb84b1acf884e4a/tobler/area_weighted/area_interpolate_dask.py#L346-L360) is, roughly:

1. Build the cross-over table of shared areas in parallel.
2. Normalise it into the table of `weights` (totals per source ID, joined back onto the table).
3. Multiply the weights by the source values.
4. Aggregate the weighted values by target ID.
Now, the issue with the logic above is that, although 1., 3., and potentially 4. are fast and very performant on Dask, 2. involves a significant amount of inter-worker communication and, since it is not done on sorted indices (that I can see), it will be slow (see the sketch below).
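To make the cost of 2. concrete, it boils down to something like this (hypothetical column names), where the `groupby` plus `merge` force a shuffle because partitions are not aligned on `source_id`:

```python
import dask.dataframe as dd

def build_weights(area_table: dd.DataFrame) -> dd.DataFrame:
    # area_table has columns source_id, target_id, area (non-zero rows only).
    # Total shared area per source geometry.
    totals = area_table.groupby("source_id")["area"].sum().to_frame("total")
    # Joining the totals back is the expensive step: rows with the same
    # source_id live on different workers, so Dask must shuffle them.
    weights = area_table.merge(totals, left_on="source_id", right_index=True)
    weights["weight"] = weights["area"] / weights["total"]
    return weights[["source_id", "target_id", "weight"]]
```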
This is where I stopped. A couple of further thoughts:
The same logic could also be implemented on a `pandas.DataFrame` or a `polars.DataFrame`, so if we can live with in-memory computation only, I think it would speed up our current implementation; see the sketch below.
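A sketch of that in-memory version (hypothetical column names), where the normalisation becomes a cheap `groupby`/`transform`:

```python
import pandas

def interpolate_extensive(
    area_table: pandas.DataFrame, source_values: pandas.Series
) -> pandas.Series:
    # area_table has columns source_id, target_id, area (non-zero rows only);
    # source_values is indexed by source_id.
    df = area_table.copy()
    # In memory, normalising by source totals is a single vectorised pass.
    df["weight"] = df["area"] / df.groupby("source_id")["area"].transform("sum")
    # Split each source value across targets and sum per target.
    df["est"] = df["weight"] * df["source_id"].map(source_values)
    return df.groupby("target_id")["est"].sum()
```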
Very happy to explore further options and discuss alternative views!