Hi Andrew,
First and foremost, thanks for creating and maintaining this package - it truly is awesome and incredibly powerful!
I'm currently using the crate to build large automata containing 100k to 1M DNA patterns, where each pattern is < 25 characters long. These let me search large genomic databases/graphs for many query patterns simultaneously.
The text I'm querying consists solely of the ACGT alphabet, and my patterns can contain ACGT or N, where N is a special character that matches any of the other four characters.
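To pin down the intended semantics, here is a minimal sketch of a reference matcher that checks a single pattern at a single offset. It uses only the standard library; `matches_at` is a hypothetical helper name for illustration and is not part of the crate.

```rust
/// Returns true if `pattern` matches `text` starting at byte offset `start`.
/// `N` in the pattern matches any text byte; every other pattern byte must
/// match exactly. (Hypothetical helper, for illustration only.)
fn matches_at(text: &[u8], pattern: &[u8], start: usize) -> bool {
    // Take the window of text the pattern would cover, if it fits.
    let window = start
        .checked_add(pattern.len())
        .and_then(|end| text.get(start..end));

    match window {
        Some(window) => window
            .iter()
            .zip(pattern)
            .all(|(&t, &p)| p == b'N' || p == t),
        // The pattern runs past the end of the text.
        None => false,
    }
}
```

For example, `matches_at(b"TTACGTT", b"ANG", 2)` is true because the window `ACG` agrees with `ANG` everywhere except at the wildcard position.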
Do you have a recommendation for how best to handle this type of wildcard (N behaves like `.` in a regex) within the automaton? My current naive approach is to enumerate every unambiguous version of each pattern (e.g. ANG would generate AAG, ACG, AGG and ATG) and store them all in the automaton. This works fairly well when patterns contain only a few wildcards, but a pattern with k wildcards expands to 4^k entries, so the combinatorics become memory-prohibitive in some of the applications I'm exploring.
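Roughly, the enumeration approach looks like the sketch below. It uses only the standard library, and `expand_wildcards` and `expansion_count` are hypothetical names chosen for illustration. The expanded strings are what would then be handed to the automaton.

```rust
/// Expands every `N` in `pattern` into A, C, G and T, returning all
/// unambiguous versions of the pattern. (Hypothetical helper.)
fn expand_wildcards(pattern: &str) -> Vec<String> {
    const BASES: [char; 4] = ['A', 'C', 'G', 'T'];

    // Start with one empty prefix and grow it one character at a time.
    let mut expanded = vec![String::with_capacity(pattern.len())];
    for ch in pattern.chars() {
        if ch == 'N' {
            // Each existing prefix branches into four new prefixes.
            let mut next = Vec::with_capacity(expanded.len() * BASES.len());
            for prefix in &expanded {
                for &base in BASES.iter() {
                    let mut s = prefix.clone();
                    s.push(base);
                    next.push(s);
                }
            }
            expanded = next;
        } else {
            // A concrete base is appended to every prefix unchanged.
            for prefix in expanded.iter_mut() {
                prefix.push(ch);
            }
        }
    }
    expanded
}

/// Number of concrete patterns produced for `pattern`: 4^k for k wildcards.
/// Saturates at u64::MAX instead of overflowing.
fn expansion_count(pattern: &str) -> u64 {
    let wildcards = pattern.chars().filter(|&c| c == 'N').count() as u32;
    4u64.saturating_pow(wildcards)
}
```

The growth is what hurts: a single pattern with 10 wildcards already expands to 4^10 = 1,048,576 entries, which is on the order of the whole pattern set before expansion.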
Any suggestions for how to better handle these wildcards would be much appreciated!
Thomas