
Suggestion: Significant speedup for difference_inner_join #86

Open
@fburic

Description


Hi!

The current implementation of difference_inner_join relies on the more general fuzzy_join and (conceptually) performs a Cartesian join followed by a filter that keeps only the pairs within the given distance. This costs O(N·M) time and memory, where N is the length of the shorter data frame and M the length of the longer.
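For reference, a rough pandas sketch of that baseline (this is not the actual fuzzy_join code; the function name is made up for illustration, and `how="cross"` needs pandas >= 1.2):

```python
import pandas as pd

def difference_inner_join_naive(left, right, on, max_dist):
    """O(N*M) baseline: materialise every pair, then filter by distance."""
    # Cross join: all N*M row pairs are held in memory at once.
    cross = left.merge(right, how="cross", suffixes=(".x", ".y"))
    # Keep only the pairs whose join values are within max_dist of each other.
    keep = (cross[on + ".x"] - cross[on + ".y"]).abs() <= max_dist
    return cross[keep]
```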

A more efficient way is to build an interval tree over the longer join column, with one [value - dist, value + dist] interval per row. Each element of the shorter column can then be queried against this tree in O(log M) time, giving a total cost of O((N + M) log M) including the time to build the tree. With N held constant and M growing large (e.g. a small sample searched against a huge database), the cost becomes O(M log M) instead of O(M^2).
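A minimal sketch of that idea in Python/pandas (this is not the linked Pandas implementation; it assumes the third-party intervaltree package, and the function and parameter names are invented for illustration):

```python
import pandas as pd
from intervaltree import IntervalTree

def difference_inner_join_fast(left, right, on, max_dist):
    """Pair rows of left and right whose `on` values differ by at most max_dist."""
    # Build the tree over the longer table: one interval per row, O(M log M).
    # intervaltree intervals are half-open, so pad the upper bound slightly and
    # do the exact inclusive check at query time.
    tree = IntervalTree()
    for ridx, rval in right[on].items():
        tree.addi(rval - max_dist, rval + max_dist + 1e-9, (ridx, rval))

    # Query each row of the shorter table: O(log M + hits) per lookup.
    left_rows, right_rows = [], []
    for lidx, lval in left[on].items():
        for hit in tree.at(lval):
            ridx, rval = hit.data
            if abs(lval - rval) <= max_dist:  # inclusive, like difference_inner_join
                left_rows.append(lidx)
                right_rows.append(ridx)

    # Assemble the result with fuzzyjoin-style .x / .y suffixes.
    return pd.concat(
        [left.loc[left_rows].reset_index(drop=True).add_suffix(".x"),
         right.loc[right_rows].reset_index(drop=True).add_suffix(".y")],
        axis=1)
```

The same structure should translate to R, e.g. via IRanges overlap queries or data.table's non-equi joins, either of which would avoid ever materialising the full N×M pair set.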

I'm not fluent enough in R and unfortunately have very little time to help out, but should this be of interest, see my implementation for Pandas here. I was motivated to write my own because I needed this functionality for a big-data project, and difference_inner_join unfortunately wasn't feasible at that scale.
