Question: Suggestions for handling . wildcards within query patterns #122

@tfwillems

Hi Andrew,

First and foremost, thanks for creating and maintaining this package - it truly is awesome and incredibly powerful!

I'm currently using the crate to build large automata containing 100k to 1M DNA patterns, where each pattern is shorter than 25 characters. These enable me to search large genomic databases/graphs for many query patterns simultaneously.

The text I'm querying consists solely of the ACGT alphabet, and my patterns can contain ACGT or N, where N is a special character that matches any of the four bases.

Do you have a recommendation for how best to handle this type of wildcard (N = . in regexes) within the automaton? My current naive approach is to enumerate all unambiguous versions of each pattern (e.g. ANG would generate AAG, ACG, AGG and ATG) and store them all in the automaton. This works fairly well when patterns contain only a few wildcards, but the number of expanded patterns grows as 4^k for k wildcards, so the combinatorics become memory-prohibitive in some of the applications I'm exploring.
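For concreteness, here is a minimal sketch of the enumeration approach described above, using only the standard library. The names `BASES`, `expand_wildcards` and `expansion_count` are hypothetical, made up for illustration; this is not my actual code and not part of the crate's API. It assumes patterns contain only A, C, G, T and N.

```rust
/// Bases that the wildcard `N` can stand for (assumed alphabet: A, C, G, T).
const BASES: [char; 4] = ['A', 'C', 'G', 'T'];

/// Expand every `N` in `pattern` into all four bases and return each
/// unambiguous pattern. The result has 4^k entries for k wildcards.
fn expand_wildcards(pattern: &str) -> Vec<String> {
    // Start with a single empty prefix and extend it one character at a time.
    let mut expanded = vec![String::with_capacity(pattern.len())];
    for ch in pattern.chars() {
        if ch == 'N' {
            // A wildcard multiplies the number of prefixes by four.
            let mut next = Vec::with_capacity(expanded.len() * BASES.len());
            for prefix in &expanded {
                for &base in BASES.iter() {
                    let mut s = prefix.clone();
                    s.push(base);
                    next.push(s);
                }
            }
            expanded = next;
        } else {
            // A concrete base is appended to every prefix unchanged.
            for s in expanded.iter_mut() {
                s.push(ch);
            }
        }
    }
    expanded
}

/// Number of patterns `expand_wildcards` would produce, without building them.
/// Fine for patterns under 25 characters; 4^k overflows u64 once k reaches 32.
fn expansion_count(pattern: &str) -> u64 {
    let wildcards = pattern.chars().filter(|&c| c == 'N').count() as u32;
    4u64.pow(wildcards)
}
```

Calling `expand_wildcards("ANG")` yields AAG, ACG, AGG and ATG, and each of those strings is then added to the automaton as an ordinary pattern. `expansion_count` shows where the memory goes: a single pattern with 10 wildcards already expands to 4^10 = 1,048,576 entries.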

Any suggestions for how to better handle these wildcards would be much appreciated!

Thomas