
Suggestion: Significant speedup for difference_inner_join #86

Open
@fburic

Description


Hi!

The current implementation of difference_inner_join relies on the more general fuzzy_join and (conceptually) performs a Cartesian join followed by a filter that keeps only the pairs within the given distance. This costs O(N·M) time and memory, where N is the length of the shorter data frame and M the length of the longer.
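For reference, a rough pandas sketch of that baseline (this is not the actual fuzzy_join code; the function name is made up for illustration, and `how="cross"` needs pandas >= 1.2):

```python
import pandas as pd

def difference_inner_join_naive(left, right, on, max_dist):
    """O(N*M) baseline: materialise every pair, then filter by distance."""
    # Cross join: all N*M row pairs are held in memory at once.
    cross = left.merge(right, how="cross", suffixes=(".x", ".y"))
    # Keep only the pairs whose join values are within max_dist of each other.
    keep = (cross[on + ".x"] - cross[on + ".y"]).abs() <= max_dist
    return cross[keep]
```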

A more efficient way is to build an interval tree over the longer join column, with one [value - dist, value + dist] interval per row. Each element of the shorter column can then be queried against this tree in O(log M) time, giving a total cost of O((N + M) log M) including the time to build the tree. With N held constant and M growing large (e.g. a small sample searched against a huge database), the cost becomes O(M log M) instead of O(M^2).
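A minimal sketch of that idea in Python/pandas (this is not the linked Pandas implementation; it assumes the third-party intervaltree package, and the function and parameter names are invented for illustration):

```python
import pandas as pd
from intervaltree import IntervalTree

def difference_inner_join_fast(left, right, on, max_dist):
    """Pair rows of left and right whose `on` values differ by at most max_dist."""
    # Build the tree over the longer table: one interval per row, O(M log M).
    # intervaltree intervals are half-open, so pad the upper bound slightly and
    # do the exact inclusive check at query time.
    tree = IntervalTree()
    for ridx, rval in right[on].items():
        tree.addi(rval - max_dist, rval + max_dist + 1e-9, (ridx, rval))

    # Query each row of the shorter table: O(log M + hits) per lookup.
    left_rows, right_rows = [], []
    for lidx, lval in left[on].items():
        for hit in tree.at(lval):
            ridx, rval = hit.data
            if abs(lval - rval) <= max_dist:  # inclusive, like difference_inner_join
                left_rows.append(lidx)
                right_rows.append(ridx)

    # Assemble the result with fuzzyjoin-style .x / .y suffixes.
    return pd.concat(
        [left.loc[left_rows].reset_index(drop=True).add_suffix(".x"),
         right.loc[right_rows].reset_index(drop=True).add_suffix(".y")],
        axis=1)
```

The same structure should translate to R, e.g. via IRanges overlap queries or data.table's non-equi joins, either of which would avoid ever materialising the full N×M pair set.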

I'm not fluent enough in R and unfortunately have very little time to help out, but should this be of interest, see my implementation for Pandas here. I was motivated to write my own because I needed this functionality for a big-data project, and difference_inner_join unfortunately wasn't feasible at that scale.
