Replies: 1 comment 1 reply
-
Thanks for this!
Yes, that's correct.
Yes. Broadly, I think the gamma patterns are usually a better measure of the 'type' of edge, but I agree that match key is sometimes relevant here.
I think you're on to something here. We've always had a vague idea that there may be an algorithm that could optimise a set of blocking rules. As you mention, at the moment, order matters. And it's possible that a reordering of blocking rules may make one rule completely redundant. Or, at the very least, a reordering may result in some of the blocking rules creating just a handful of marginal comparisons. At the moment, there's no obvious way of automatically identifying these redundant rules.

But if we followed your suggestion and computed, for each comparison, the corresponding set of satisfied match keys, then it feels like we could unlock this. Until I read your post, it hadn't really occurred to me that computing this set of match keys is computationally cheap. Once you have it, I think you could do some sort of count/group by to get something like:
And I think clever analysis of this table could help recommend an optimal ordering of blocking rules, and identify 'redundant' rules (rules which, with a reordering, are no longer needed because they create no marginal comparisons).
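A rough sketch of the kind of count/group-by I mean (the dataframe, rule indices and counts below are purely illustrative, not real output):

```python
# Purely illustrative: suppose that for every generated comparison we have
# already computed the full set of blocking rules (match keys) it satisfies.
import pandas as pd

comparisons = pd.DataFrame(
    {
        "unique_id_l": [1, 1, 2, 5],
        "unique_id_r": [2, 3, 3, 6],
        # all blocking rule indices satisfied by the pair, not just the first
        "match_keys_satisfied": [(0, 1), (1,), (0, 1), (1,)],
    }
)

# Count comparisons per distinct combination of satisfied blocking rules
summary = (
    comparisons.groupby("match_keys_satisfied")
    .size()
    .reset_index(name="n_comparisons")
)
print(summary)

# A rule whose index never appears without the indices of other rules creates
# no marginal comparisons, making it a candidate for reordering or removal.
```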
-
Hi Splink Superheroes,
I've been mulling over the use of match keys as a way of helping end users work with the edges produced by Splink.
In previous systems within our organisation we had a coding system describing the type of edge that had been produced, based on a cascading set of business logic and deterministic/fuzzy rules. This allowed users to select which edge types they wished to keep, depending on how stringent their usage requirements were.
I notice match_keys and gammas do something similar. The gammas tell us exactly which comparison levels were used, and the match_key details which blocking rule has generated that edge/comparison for consideration. There are some questions I have about the granularity of the match_key though...

Firstly, am I right in thinking that match_key is just a sequential enumeration of the blocking rule within the ordered blocking rule list? See splink/splink/internals/blocking.py, lines 106 to 107 (commit 2ed9f8b).
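In other words, my mental model is something like the following (a paraphrase of the general pattern, not the actual source at those lines):

```python
# Paraphrase only: each blocking rule becomes one SELECT whose list position
# is the match_key, and later rules exclude pairs already captured by earlier
# rules, so the first rule in the list that blocks a pair "wins".
blocking_rules = [
    "l.postcode = r.postcode",
    "left(l.postcode,2) = left(r.postcode,2)",
]

selects = []
for match_key, rule in enumerate(blocking_rules):
    previous = " OR ".join(f"({r})" for r in blocking_rules[:match_key])
    exclude = f" AND NOT ({previous})" if previous else ""
    selects.append(
        "SELECT l.unique_id AS unique_id_l, r.unique_id AS unique_id_r, "
        f"'{match_key}' AS match_key "
        f"FROM df AS l JOIN df AS r "
        f"ON l.unique_id < r.unique_id AND ({rule}){exclude}"
    )

blocking_sql = "\nUNION ALL\n".join(selects)
print(blocking_sql)
```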
If this is the case, match_key would allow me to point to which rule first generated that comparison, but it does mean my blocking rule ordering matters. A match_key may not tell me which is the most strict / informative blocking rule that captures that edge, depending on the ordering of the rules in blocking_rules_to_generate_predictions.

Second, I wonder what might be a nice approach to recording the "best" match_key generating that comparison. Perhaps a match_key referencing the multiple rules? The user could then refer back and determine which of those rules may be the most insightful. We could just look at the gammas and ignore the match key entirely, but I like to keep the blocking and comparison stages distinct given their differing purposes in the pipeline. Plus, explaining a set of numbers to users is more complicated than a single one.
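As a rough illustration of what I'm imagining (plain DuckDB over a pandas frame, not an existing Splink API; all names are made up): record every satisfied rule for each pair, then take the first under a strictness ordering as the "best" match key.

```python
# Rough sketch, not a Splink API: evaluate every blocking rule for each
# candidate pair, record all that are satisfied, and take the first under
# our own strictness ordering as the "best" match key.
import duckdb
import pandas as pd

df = pd.DataFrame(
    {"unique_id": [1, 2, 3], "postcode": ["SW1A 2AA", "SW1A 2AA", "SW1A 2AB"]}
)

# Ordered from most strict to least strict (our choice of ordering)
blocking_rules = [
    "l.postcode = r.postcode",
    "left(l.postcode,2) = left(r.postcode,2)",
]

rule_flags = ", ".join(f"({rule}) AS rule_{i}" for i, rule in enumerate(blocking_rules))
any_rule = " OR ".join(f"({rule})" for rule in blocking_rules)

pairs = duckdb.sql(
    f"""
    SELECT l.unique_id AS unique_id_l, r.unique_id AS unique_id_r, {rule_flags}
    FROM df AS l JOIN df AS r ON l.unique_id < r.unique_id
    WHERE {any_rule}
    """
).df()

rule_cols = [f"rule_{i}" for i in range(len(blocking_rules))]
pairs["match_keys_satisfied"] = pairs[rule_cols].apply(
    lambda row: [i for i, hit in enumerate(row) if hit], axis=1
)
pairs["best_match_key"] = pairs["match_keys_satisfied"].str[0]
print(pairs[["unique_id_l", "unique_id_r", "match_keys_satisfied", "best_match_key"]])
```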
Perhaps I'm just musing to myself, but I'm interested to hear others' thoughts.
A working (Splink 3) example when blocking by postcode:
Say we have two models with the same two blocking rules, just with their order swapped:
Model 1.
[l.postcode = r.postcode, left(l.postcode,2) = left(r.postcode,2)]
Model 2.
[left(l.postcode,2) = left(r.postcode,2), l.postcode = r.postcode]
Model 1 would assign a match_key of 0 to the comparison SW1A 2AA ↔ SW1A 2AA, and a match_key of 1 to SW1A 2AA ↔ SW1A 2AB.
Model 2, however, would assign a match_key of 0 to both, which doesn't help us see at a glance that these are semantically different edges/match types based on the blocking rules we've designed.
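For completeness, a minimal sketch of the sort of Splink 3 code behind this example (DuckDB backend; the toy dataframe is illustrative and import paths may differ slightly between 3.x releases):

```python
import pandas as pd
import splink.duckdb.comparison_library as cl
from splink.duckdb.linker import DuckDBLinker

df = pd.DataFrame(
    {
        "unique_id": [1, 2, 3],
        "postcode": ["SW1A 2AA", "SW1A 2AA", "SW1A 2AB"],
    }
)

def blocked_pairs(blocking_rules):
    settings = {
        "link_type": "dedupe_only",
        "blocking_rules_to_generate_predictions": blocking_rules,
        "comparisons": [cl.exact_match("postcode")],
    }
    linker = DuckDBLinker(df, settings)
    # deterministic_link() generates the blocked comparisons (including the
    # match_key column) without needing any model training
    return linker.deterministic_link().as_pandas_dataframe()

model_1 = blocked_pairs(
    ["l.postcode = r.postcode", "left(l.postcode,2) = left(r.postcode,2)"]
)
model_2 = blocked_pairs(
    ["left(l.postcode,2) = left(r.postcode,2)", "l.postcode = r.postcode"]
)

print(model_1[["unique_id_l", "unique_id_r", "match_key"]])
print(model_2[["unique_id_l", "unique_id_r", "match_key"]])
```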