Description
Is there some way to decide which of a set of dictionaries would be appropriate to use for a given piece of data?
I've been thinking about how I would do it if there's nothing pre-existing to use:
My thinking so far was to convert each dictionary into a Bloom filter of all length-n substrings in that dictionary (maybe n=4 or thereabouts?). Then run a rolling hash of the same length n over the data in question and score each dictionary by the number of hits it yields, with a bonus multiplier for uninterrupted runs of hits, since an uninterrupted run serves as a proxy (albeit with false positives) for a longer string match.
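For concreteness, here's a rough sketch of what I mean in Python. The filter size, hash count, n=4, and the run-bonus weight are all placeholder choices, not tuned values:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over byte strings."""
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions via double hashing from one 128-bit digest.
        digest = hashlib.blake2b(item, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: bytes):
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


def build_filter(dictionary: bytes, n: int = 4) -> BloomFilter:
    """Index every length-n substring of one dictionary."""
    bf = BloomFilter()
    for i in range(len(dictionary) - n + 1):
        bf.add(dictionary[i:i + n])
    return bf


def score(data: bytes, bf: BloomFilter, n: int = 4,
          run_bonus: float = 0.5) -> float:
    """Score data against one dictionary's filter: 1 point per n-gram hit,
    plus a growing bonus for uninterrupted runs of hits (the proxy for
    longer matches)."""
    total, run = 0.0, 0
    for i in range(len(data) - n + 1):
        if data[i:i + n] in bf:
            run += 1
            total += 1 + run_bonus * (run - 1)
        else:
            run = 0
    return total

# Then pick the dictionary whose filter scores highest on the sample:
#   best = max(filters, key=lambda name: score(sample, filters[name]))
```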
I say "rolling hash", but obviously for strings fixed at length 4 or 8 that's just an unaligned load and permute.
I haven't done the arithmetic, but it's just a heuristic, so I feel it's OK to be sloppy about false positives (the Bloom filter's innate error rate, plus the unsound assumption that matching "ABCD*" and "*BCDE" means "ABCDE" is probably in the dictionary).
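For what it's worth, the Bloom-filter half of that arithmetic is standard: with $m$ bits, $k$ hash functions, and $N$ inserted n-grams, the per-query false-positive rate is roughly

$$p \approx \left(1 - e^{-kN/m}\right)^{k},$$

so at 10 bits per n-gram and $k = 4$ you get $p \approx 1.2\%$, which seems small next to the error introduced by the "ABCD*"/"*BCDE" approximation.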
Is there something pre-existing, or a better algorithm?