Skip to content

Given a catalogue of dictionaries, how to programmatically decide which one to use for a file (or for a block)? #4373

Open
@sh1boot

Description

@sh1boot

Is there some way to decide which of a set of dictionaries would be appropriate to use for a given piece of data?

I've been thinking how I would do it if there's nothing else to use:

My thinking so far was to convert each dictionary into a Bloom filter of all length-n substrings in that dictionary (maybe n=4 or thereabouts?). Then run a rolling hash (also length-n) over the data in question and score each dictionary by the number of hits it yields, with a bit of a bonus multiplier for uninterrupted runs of hits; where uninterrupted runs serve as a proxy (albeit with false positives) for longer string matches.

I say "rolling hash", but obviously for strings fixed at length 4 or 8 that's just an unaligned load and permute.

I haven't done the arithmetic but it's just a heuristic so I feel like it's OK to be sloppy with the false positives (Bloom's innate errors, plus the bad assumption that matching "ABCD*" and "*BCDE" means that "ABCDE" is probably in the dictionary).

Is there something pre-existing, or a better algorithm?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions