Description
Is there some way to decide which of a set of dictionaries would be appropriate to use for a given piece of data?
I've been thinking about how I would do it if there's nothing pre-existing to use:
My thinking so far was to convert each dictionary into a Bloom filter of all length-n substrings in that dictionary (maybe n=4 or thereabouts?). Then run a rolling hash of the same length n over the data in question and score each dictionary by the number of hits it yields, with a bonus multiplier for uninterrupted runs of hits, since an uninterrupted run serves as a proxy (albeit with false positives) for a longer string match.
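For concreteness, here's a rough sketch of what I mean in Python. The filter size, hash count, n=4, and the run-bonus weight are all placeholder choices, not tuned values:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over byte strings."""
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions via double hashing from one 128-bit digest.
        digest = hashlib.blake2b(item, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: bytes):
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


def build_filter(dictionary: bytes, n: int = 4) -> BloomFilter:
    """Index every length-n substring of one dictionary."""
    bf = BloomFilter()
    for i in range(len(dictionary) - n + 1):
        bf.add(dictionary[i:i + n])
    return bf


def score(data: bytes, bf: BloomFilter, n: int = 4,
          run_bonus: float = 0.5) -> float:
    """Score data against one dictionary's filter: 1 point per n-gram hit,
    plus a growing bonus for uninterrupted runs of hits (the proxy for
    longer matches)."""
    total, run = 0.0, 0
    for i in range(len(data) - n + 1):
        if data[i:i + n] in bf:
            run += 1
            total += 1 + run_bonus * (run - 1)
        else:
            run = 0
    return total

# Then pick the dictionary whose filter scores highest on the sample:
#   best = max(filters, key=lambda name: score(sample, filters[name]))
```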
I say "rolling hash", but obviously for strings fixed at length 4 or 8 that's just an unaligned load and permute.
I haven't done the arithmetic, but it's just a heuristic, so I feel it's OK to be sloppy about false positives (the Bloom filter's innate error rate, plus the unsound assumption that matching "ABCD*" and "*BCDE" means "ABCDE" is probably in the dictionary).
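For what it's worth, the Bloom-filter half of that arithmetic is standard: with $m$ bits, $k$ hash functions, and $N$ inserted n-grams, the per-query false-positive rate is roughly

$$p \approx \left(1 - e^{-kN/m}\right)^{k},$$

so at 10 bits per n-gram and $k = 4$ you get $p \approx 1.2\%$, which seems small next to the error introduced by the "ABCD*"/"*BCDE" approximation.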
Is there something pre-existing, or a better algorithm?