Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spatial joins #113

Merged
merged 12 commits into from
Apr 23, 2024
Merged

Spatial joins #113

merged 12 commits into from
Apr 23, 2024

Conversation

asinghvi17
Copy link
Member

@asinghvi17 asinghvi17 commented Apr 17, 2024

This PR hooks the spatial DE-9IM predicates up to FlexiJoins.jl's by_pred functionality via a package extension, to enable spatial joins on tables.

An example is also added (but this still needs tests, just to check that the functionality works).

download-6

cc @JuliaAPlavin @aplavin

@asinghvi17
Copy link
Member Author

asinghvi17 commented Apr 17, 2024

Can't merge until I figure out why the benchmark theme is leaking in the docs, but this is essentially complete aside from tests now.


@testset "Basic integration" begin

@test_nowarn joined_df = FlexiJoins.innerjoin((poly_df, points_df), by_pred(:geometry, GO.contains, :geometry))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to wrap this and avoid the :geometry :geometry part

Like this maybe:

joined_df = GO.innerjoin(poly_df, points_df, GO.contains)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you're saying here now - but until we resolve the DataFrames/GI.geometrycolumn/ArchGDAL mess, it's probably best to keep this explicit IMO :D

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it actually a mess? It mostly just works

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case yes - GADM for example only outputs tables with geom columns, as do many other ArchGDAL-based loaders.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArchGDAL tables are fine, GADM returns a NamedTuple. Theres an issue for that

@rafaqz
Copy link
Member

rafaqz commented Apr 17, 2024

Is the idea we would do this on prepared geometries, or at least GI wrappers with the extent calculated?

Nested loops that recalculate extent for every combination are going to be pretty slow.

Probably we should do an extent generating pass before doing the join, but that breaks the Flexijoins abstraction a little...

@asinghvi17
Copy link
Member Author

Good point - in that case I can see the argument for wrapping it. Though this loses the ability to do true SQL-like joins via FlexiJoins (since I don't know whether we are able to define a new predicate which can mutate the geometry...)

We can always say that this is just for convenience, and you should use FlexiJoins for any real work?

@aplavin
Copy link
Contributor

aplavin commented Apr 17, 2024

Interesting direction!

The user-facing API in FlexiJoins is stable, well-tested, and reliable. As for extending with more predicates – this part may evolve somewhat over time, for the sole reason that it's more more rarely used and not all sharp edges have been smoothed out.
As of now, aside from predicates implemented directly in FlexiJoins, I have a few extensions in other packages. One is basically joining an in-memory dataset to an SQL table, just in the context of astronomical data: https://github.com/JuliaAPlavin/VirtualObservatory.jl/blob/master/ext/FlexiJoinsSkyCoordsExt.jl. This can be extended to DBInterface SQL tables in general, just isn't yet.
Another is (not published) package for matching astronomy catalogs taking coordinate uncertainties into account.

The main feature of FlexiJoins is doing composable & performant joins, preferably avoiding n^2 loops at the default. It definitely has an ability for precomputations, they are extensively used for all basic join predicates. Basically, first prepare_for_join is called on one side of the join, and then findmatchix called for each row of the other side.
Is that relevant to the kind of computations needed for "geojoins"?

@asinghvi17
Copy link
Member Author

asinghvi17 commented Apr 17, 2024

Hmm, at least for one side we could use prepare_for_join for sure. The idea would be to prepare several caches (boundingbox etc) so that the boolean operation can be made as fast as possible.

Reading the code, it seems like findmatchix is called once per row of a? If so, then it doesn't look like we can do default caching for the B table. Still, a cache on the a table is better than nothing.

@aplavin
Copy link
Contributor

aplavin commented Apr 17, 2024

Hmm, at least for one side we could use prepare_for_join for sure. The idea would be to prepare several caches (boundingbox etc) so that the boolean operation can be made as fast as possible.

Sounds like exactly what prepare_for_join is intended for :) It's used to create hashmaps (https://github.com/JuliaAPlavin/FlexiJoins.jl/blob/master/src/bykey.jl#L62-L90), to build search trees (https://github.com/JuliaAPlavin/FlexiJoins.jl/blob/master/src/nearestneighbors.jl#L2-L3), etc.

Reading the code, it seems like findmatchix is called once per row of a?

Yeah, that's the current design.

If so, then it doesn't look like we can do default caching for the B table.

I understand this can significantly improve performance sometimes, but such a workflow isn't implemented yet.
That would bring all those complications, on deciding which is faster in each case (just loop over one side vs "prepare" both). And all that in addition to having another interface to implement...

But would be nice to have indeed at some point! Actually, preparing both sides can give a boost in many cases, from sorted-joins to tree NN joins.

@asinghvi17
Copy link
Member Author

Yeah I can see that being a pain for sure. It's probably not too much trouble to ask users to run table.geom = prepare(table.geom) before a join...

@aplavin
Copy link
Contributor

aplavin commented Apr 17, 2024

And still prepare on one side + loop over the other makes sense to implement IMO in general. Is it a reasonable thing for geojoin?

@asinghvi17
Copy link
Member Author

I think so, we can at least start with providing extents. We are creating a prepared geometry interface in #105 which will slot into the position of the extent call later!

@rafaqz
Copy link
Member

rafaqz commented Apr 17, 2024

Calculating extent for the "a" side and re-wrapping geoms in prepare_for_join sounds like it will improve things quite a bit.

But eventually we probably don't want to run those double nested loops on large datasets either, and instead precalculate an STRtree or similar instead that we can just query to return the minimum list. We can use existing packages like https://github.com/JuliaGeo/LibSpatialIndex.jl or native in https://github.com/maxfreu/SortTileRecursiveTree.jl.

This is what e.g. PostGIS does with an R-tree and a few other options.

Maybe FlexiJoins is still good as a small dataset/single use approach.

@asinghvi17
Copy link
Member Author

I think FlexiJoins has a tree based approach as well, I just haven't implemented it yet :D

@rafaqz
Copy link
Member

rafaqz commented Apr 17, 2024

Ok amazing if it can handle trees to. @maxfreu s package is already GI compatible, if we can hook up SortTileRecursiveTree.jl and FlexiJoins.jl then this could really work well.

@aplavin
Copy link
Contributor

aplavin commented Apr 17, 2024

Totally, build tree in prepare_for_join and then query it in findmatchix is how FlexiJoins does eg distance-based joins. See https://github.com/JuliaAPlavin/FlexiJoins.jl/blob/master/src/nearestneighbors.jl#L2-L5, using NearestNeighbors.jl.
This is just a few lines.

Btw, if you know of a better/faster packages for general range search, please tell me! NN.jl unfortunately doesn't have very generic implementation.

@rafaqz
Copy link
Member

rafaqz commented Apr 17, 2024

Ok great that's all built in already. I'm not sure about the general problem! But for geometries NN.jl is less suitable. The standard approach is to match geometries by precalculated extent tiles and then run actual checks only on the short-list of crossing extents. Crossing extents doesn't mean geometries actually cross, but they are quite likely to. So the R-tree gets us most of the way.

The plots here are a good overview:
https://en.wikipedia.org/wiki/R-tree

Co-authored-by: Alexander Plavin <[email protected]>
@asinghvi17
Copy link
Member Author

asinghvi17 commented Apr 17, 2024

Thanks for the review @aplavin! I wasn't sure how to implement some of these but it makes a lot more sense now :)

If you'd like I can write a quick blurb on how to get FlexiJoins working with user-defined predicates, now that we've done it here! Not sure where it would go though.


```julia
my_predicate_function = <(5) ∘ abs ∘ GO.distance
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is an actual distance, it's probably already supported by FlexiJoins :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is supported by Distances.jl, and there are a bunch of other GO functions one might want to use :D - for example, testing whether centroids are close to each other, or something. So I figured a general approach would be best.

Just out of curiosity, is there a reason that NestedLoopFast isn't supported by default?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for example, testing whether centroids are close to each other

Isn't it just by_distance(x -> centroid(x.geometry), Euclidean(), <=(3))? Or whatever other distance you need instead of Euclidean.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is supported by Distances.jl

Wonder why is that the case? Does the function break some distance properties?

Copy link
Contributor

@aplavin aplavin Apr 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason that NestedLoopFast isn't supported by default?

I consider falling back to n^2 join without user explicitly requesting it is a footgun.
NestedLoopFast is really intended for cheap filtering operations on top of the "main" join predicate. Such as NotSame in FlexiJoins itself.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For centroids yes, but if computing the distance between geometries (basically distance between closest linesegments) then that's not a Distances.jl thing. The centroid comparison would be that though, and I should probably add an example of that syntax to the docs as well!

@asinghvi17 asinghvi17 added the enhancement New feature or request label Apr 20, 2024
@asinghvi17 asinghvi17 merged commit 745f0bd into main Apr 23, 2024
3 checks passed
@asinghvi17 asinghvi17 deleted the as/flexijoins branch April 23, 2024 23:30
@asinghvi17 asinghvi17 restored the as/flexijoins branch April 23, 2024 23:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants