Feature suggestion: faster methods for stringdist-based joins? #88

Open
@JonDDowns

Hi,

First, I want to extend my appreciation for the great work on the fuzzyjoin package. Our team relies on it extensively, and it has been an invaluable tool in our workflows.

Recently, I was tasked with optimizing performance bottlenecks in one of our pipelines. To address this, I experimented with implementing a fuzzy join in Rust, which significantly improved execution speed. I adapted this approach into a public example, available at https://github.com/JonDDowns/fozziejoin. The benchmarks are not exhaustive, but I consistently observe 4–20x performance improvements across various datasets.
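
To make the idea concrete, here is a minimal sketch of a string-distance join in Rust. It is an illustration only, not the fozziejoin code: the function names `levenshtein` and `fuzzy_inner_join` are hypothetical, Levenshtein is assumed as the distance, and the join is a naive all-pairs comparison with a `max_dist` cutoff.

```rust
/// Levenshtein edit distance between two strings, computed with two
/// rolling rows of the dynamic-programming table (illustrative helper).
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();

    // prev[j] holds the distance between the first i-1 chars of `a`
    // and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr: Vec<usize> = vec![0; b.len() + 1];

    for i in 1..=a.len() {
        curr[0] = i;
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            curr[j] = (prev[j] + 1) // deletion
                .min(curr[j - 1] + 1) // insertion
                .min(prev[j - 1] + cost); // substitution or match
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Naive fuzzy inner join (hypothetical name): compares every left key
/// with every right key and keeps the pairs within `max_dist`.
/// Returns (left index, right index, distance) triples.
fn fuzzy_inner_join(
    left: &[&str],
    right: &[&str],
    max_dist: usize,
) -> Vec<(usize, usize, usize)> {
    let mut matches = Vec::new();
    for (i, l) in left.iter().enumerate() {
        for (j, r) in right.iter().enumerate() {
            let d = levenshtein(l, r);
            if d <= max_dist {
                matches.push((i, j, d));
            }
        }
    }
    matches
}
```

Calling `fuzzy_inner_join(&["apple", "banana"], &["appel", "bananna", "cherry"], 2)` returns the index pairs whose keys are within two edits of each other. The all-pairs loop is quadratic in the number of keys, so a real implementation would likely add parallelism or candidate pruning on top of this.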

Given these results, I wanted to ask if there would be interest in integrating a similar approach within fuzzyjoin. Replacing stringdist with an alternative would indeed be a substantial change, but I believe it could offer considerable performance benefits.

I’d love to hear your thoughts on this and whether there might be an opportunity to collaborate on incorporating these enhancements into the package.

Best,
Jon
