Question: Suggestions for handling . wildcards within query patterns #122

Hi Andrew,

First and foremost, thanks for creating and maintaining this package - it truly is awesome and incredibly powerful!

I'm currently using the crate to build large automata containing 100k to 1M DNA patterns, where each pattern is < 25 characters long. These let me search large genomic databases/graphs for many query patterns simultaneously.

The text I'm querying consists solely of the ACGT alphabet, and my patterns can contain ACGT or N, where N is a special character that matches any of the other four characters.

Do you have a recommendation for how best to handle this type of wildcard (N = . in regexes) within the automaton? My current naive approach is to enumerate all unambiguous versions of each pattern (e.g. ANG would generate AAG, ACG, AGG and ATG) and store them all in the automaton. This works fairly well when patterns contain only a few wildcards, but a pattern with k wildcards expands to 4^k entries, so the memory cost becomes prohibitive in some of the applications I'm exploring.
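For reference, here is a minimal sketch of the enumeration step described above. The helper name `expand_wildcards` is made up for illustration and is not part of the crate; it assumes patterns contain only A, C, G, T and N, and it uses only the standard library.

```rust
/// Expand every `N` in `pattern` into the four concrete bases.
/// NOTE: `expand_wildcards` is a hypothetical helper for illustration only;
/// it is not an API of the aho-corasick crate.
fn expand_wildcards(pattern: &str) -> Vec<String> {
    const BASES: [char; 4] = ['A', 'C', 'G', 'T'];

    // Start from a single empty prefix and extend it one character at a time.
    let mut expanded = vec![String::new()];
    for ch in pattern.chars() {
        // A wildcard offers four choices; any other character offers only itself.
        let choices: &[char] = if ch == 'N' {
            &BASES[..]
        } else {
            std::slice::from_ref(&ch)
        };

        // Every existing prefix is extended by every available choice.
        let mut next = Vec::with_capacity(expanded.len() * choices.len());
        for prefix in &expanded {
            for &c in choices {
                let mut s = prefix.clone();
                s.push(c);
                next.push(s);
            }
        }
        expanded = next;
    }
    expanded
}
```

Calling `expand_wildcards("ANG")` yields AAG, ACG, AGG and ATG, and each additional N multiplies the output by four, which is where the memory blow-up comes from.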

Any suggestions for how to better handle these wildcards would be much appreciated!

Thomas


