Replies: 1 comment 1 reply
-
Thanks for this!
Yes, that's correct.
Yes. Broadly, I think the gamma patterns are usually a better measure of the 'type' of edge, but I agree that match key is sometimes relevant here.
I think you're on to something here. We've always had a vague idea that there may be an algorithm that could optimise a set of blocking rules. As you mention, at the moment, order matters. And it's possible that a reordering of blocking rules may make one rule completely redundant. Or, at the very least, a reordering may result in some of the blocking rules creating just a handful of marginal comparisons. At the moment, there's no obvious way of automatically identifying these redundant rules.

But if we followed your suggestion and computed, for each comparison, the corresponding set of satisfied match keys, then it feels like we could unlock this. Until I read your post, it hadn't really occurred to me that computing this set of match keys is computationally cheap. Once you have it, I think you could do some sort of count/group by to get something like:
And I think clever analysis of this table could help recommend an optimal ordering of blocking rules, and identify 'redundant' rules (rules which, with a reordering, are no longer needed because they create no marginal comparisons).
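A rough sketch of the kind of count/group-by I mean (the dataframe, rule indices and counts below are purely illustrative, not real output):

```python
# Purely illustrative: suppose that for every generated comparison we have
# already computed the full set of blocking rules (match keys) it satisfies.
import pandas as pd

comparisons = pd.DataFrame(
    {
        "unique_id_l": [1, 1, 2, 5],
        "unique_id_r": [2, 3, 3, 6],
        # all blocking rule indices satisfied by the pair, not just the first
        "match_keys_satisfied": [(0, 1), (1,), (0, 1), (1,)],
    }
)

# Count comparisons per distinct combination of satisfied blocking rules
summary = (
    comparisons.groupby("match_keys_satisfied")
    .size()
    .reset_index(name="n_comparisons")
)
print(summary)

# A rule whose index never appears without the indices of other rules creates
# no marginal comparisons, making it a candidate for reordering or removal.
```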
-
Hi Splink Superheroes,
I've been mulling over the use of match keys as a way of helping end users work with the edges produced by Splink.
In previous systems within our organisation we had a coding system describing the type of edge that had been produced, based on a cascading set of business logic and deterministic/fuzzy rules. This allowed users to select which edge types they wished to keep, depending on how stringent their usage requirements were.
I notice match_keys and gammas do something similar. The gammas tell us exactly which comparison levels were used, and the match_key details which blocking rule has generated that edge/comparison for consideration. There are some questions I have about the granularity of the match_key though...

Firstly, am I right in thinking that match_key is just a sequential enumeration of the blocking rule within the ordered blocking rule list? See splink/splink/internals/blocking.py, lines 106 to 107 (commit 2ed9f8b).
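In other words, my mental model is something like the following (a paraphrase of the general pattern, not the actual source at those lines):

```python
# Paraphrase only: each blocking rule becomes one SELECT whose list position
# is the match_key, and later rules exclude pairs already captured by earlier
# rules, so the first rule in the list that blocks a pair "wins".
blocking_rules = [
    "l.postcode = r.postcode",
    "left(l.postcode,2) = left(r.postcode,2)",
]

selects = []
for match_key, rule in enumerate(blocking_rules):
    previous = " OR ".join(f"({r})" for r in blocking_rules[:match_key])
    exclude = f" AND NOT ({previous})" if previous else ""
    selects.append(
        "SELECT l.unique_id AS unique_id_l, r.unique_id AS unique_id_r, "
        f"'{match_key}' AS match_key "
        f"FROM df AS l JOIN df AS r "
        f"ON l.unique_id < r.unique_id AND ({rule}){exclude}"
    )

blocking_sql = "\nUNION ALL\n".join(selects)
print(blocking_sql)
```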
If this is the case, match_key would allow me to point to which rule first generated that comparison, but it does mean my blocking rule ordering matters. A match_key may not tell me which is the most strict / informative blocking rule that captures that edge, depending on the ordering of the rules in blocking_rules_to_generate_predictions.

Second, I wonder what might be a nice approach to recording the "best" match_key generating that comparison. Perhaps a match_key referencing the multiple rules? The user could then refer back and determine which of those rules may be the most insightful. We could just look at the gammas and ignore the match key entirely, but I like to keep the blocking and comparison stages distinct given their differing purposes in the pipeline. Plus, explaining a set of numbers to users is more complicated than a single one.
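As a rough illustration of what I'm imagining (plain DuckDB over a pandas frame, not an existing Splink API; all names are made up): record every satisfied rule for each pair, then take the first under a strictness ordering as the "best" match key.

```python
# Rough sketch, not a Splink API: evaluate every blocking rule for each
# candidate pair, record all that are satisfied, and take the first under
# our own strictness ordering as the "best" match key.
import duckdb
import pandas as pd

df = pd.DataFrame(
    {"unique_id": [1, 2, 3], "postcode": ["SW1A 2AA", "SW1A 2AA", "SW1A 2AB"]}
)

# Ordered from most strict to least strict (our choice of ordering)
blocking_rules = [
    "l.postcode = r.postcode",
    "left(l.postcode,2) = left(r.postcode,2)",
]

rule_flags = ", ".join(f"({rule}) AS rule_{i}" for i, rule in enumerate(blocking_rules))
any_rule = " OR ".join(f"({rule})" for rule in blocking_rules)

pairs = duckdb.sql(
    f"""
    SELECT l.unique_id AS unique_id_l, r.unique_id AS unique_id_r, {rule_flags}
    FROM df AS l JOIN df AS r ON l.unique_id < r.unique_id
    WHERE {any_rule}
    """
).df()

rule_cols = [f"rule_{i}" for i in range(len(blocking_rules))]
pairs["match_keys_satisfied"] = pairs[rule_cols].apply(
    lambda row: [i for i, hit in enumerate(row) if hit], axis=1
)
pairs["best_match_key"] = pairs["match_keys_satisfied"].str[0]
print(pairs[["unique_id_l", "unique_id_r", "match_keys_satisfied", "best_match_key"]])
```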
Perhaps I'm just musing to myself, but I'm interested to hear others' thoughts.
A working (Splink 3) example when blocking by postcode:
Say we have two models with the same two blocking rules, just with their order swapped:
Model 1.
[l.postcode = r.postcode, left(l.postcode,2) = left(r.postcode,2)]
Model 2.
[left(l.postcode,2) = left(r.postcode,2), l.postcode = r.postcode]
Model 1 would assign a match_key of 0 to the comparison SW1A 2AA ↔ SW1A 2AA, and a match_key of 1 to SW1A 2AA ↔ SW1A 2AB.
Model 2, however, would assign a match_key of 0 to both, which doesn't help us see at a glance that these are semantically different edges/match types based on the blocking rules we've designed.
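For completeness, a minimal sketch of the sort of Splink 3 code behind this example (DuckDB backend; the toy dataframe is illustrative and import paths may differ slightly between 3.x releases):

```python
import pandas as pd
import splink.duckdb.comparison_library as cl
from splink.duckdb.linker import DuckDBLinker

df = pd.DataFrame(
    {
        "unique_id": [1, 2, 3],
        "postcode": ["SW1A 2AA", "SW1A 2AA", "SW1A 2AB"],
    }
)

def blocked_pairs(blocking_rules):
    settings = {
        "link_type": "dedupe_only",
        "blocking_rules_to_generate_predictions": blocking_rules,
        "comparisons": [cl.exact_match("postcode")],
    }
    linker = DuckDBLinker(df, settings)
    # deterministic_link() generates the blocked comparisons (including the
    # match_key column) without needing any model training
    return linker.deterministic_link().as_pandas_dataframe()

model_1 = blocked_pairs(
    ["l.postcode = r.postcode", "left(l.postcode,2) = left(r.postcode,2)"]
)
model_2 = blocked_pairs(
    ["left(l.postcode,2) = left(r.postcode,2)", "l.postcode = r.postcode"]
)

print(model_1[["unique_id_l", "unique_id_r", "match_key"]])
print(model_2[["unique_id_l", "unique_id_r", "match_key"]])
```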