Replies: 3 comments 4 replies
-
As you say, the fundamental reason is that you need a decent number of examples of matching pairs for EM to work well, and on a large dataset only a very tiny proportion of randomly sampled pairs will be matches. But now you mention it, we haven't actually tested this, and you might be right: it might work.

For example, if you were linking two datasets of a million records each, random sampling may result in only 1 in 1,000,000 sampled pairs being a match. So if you can sample (say) 1bn pairs, you'd expect to find around 1,000 matches.

Now I guess we have a bias-variance tradeoff. As you mentioned, the current approach of using blocking for EM results in potential bias due to data quality issues, but you likely get a much larger sample of matching records. Plausibly, in the above example, you might get 800k of the 1m matches by (say) blocking on surname, as opposed to only around 1,000 matches from the random sampling approach.

So really the outstanding question is whether EM converges quickly when there's an extreme imbalance of non-matches to matches. I don't see any reason in principle why it should matter, in which case random sampling might well give you the best estimator in some situations.
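To put rough numbers on that, here's a back-of-envelope sketch (the 1m true matches and the 80% surname-blocking recall are assumptions purely for illustration):

```python
# Back-of-envelope arithmetic for the example above. The 1m true matches and
# the 80% surname recall are illustrative assumptions, not measured values.
n_left = n_right = 1_000_000
total_pairs = n_left * n_right            # 1e12 candidate pairs
true_matches = 1_000_000                  # assume ~1 match per record

match_rate = true_matches / total_pairs   # 1 in 1,000,000

# Random sampling: unbiased, but very few matches end up in the sample
sample_size = 1_000_000_000               # 1bn randomly sampled pairs
matches_from_sampling = sample_size * match_rate               # ~1,000

# Blocking on surname: potentially biased, but far more matches to learn from
assumed_surname_recall = 0.8
matches_from_blocking = true_matches * assumed_surname_recall  # ~800,000

print(matches_from_sampling, matches_from_blocking)            # 1000.0 800000.0
```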
Overall this feels like something we should test with the synthetic data to see how it compares to the existing approach. It's just finding the time!
-
Wait, another idea: instead of asking the user to estimate the recall, we find the recall as part of the estimation itself.
IDK, perhaps this still suffers from bias? It's a bit too complicated for me to wrap my head around. I think there's something promising in there, though, of updating all 3 params (u, m, and prob_random_match) until convergence, as opposed to setting prob_random_match at the beginning and then only updating u and m.
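Here's roughly the kind of thing I have in mind, as a toy sketch only (definitely not splink's actual training code): a Fellegi-Sunter style EM over binary agreement vectors where the M-step re-estimates prob_random_match (lambda) alongside m and u, instead of fixing it up front. The simulated data at the bottom is just to show the call.

```python
# Toy Fellegi-Sunter EM over binary agreement vectors (illustration only).
# The key difference from fixing prob_random_match up front: lambda is
# re-estimated in every M-step along with m and u.
import numpy as np

def em_all_three_params(gammas, n_iter=100):
    """gammas: (n_pairs, n_cols) array of 0/1 agreement indicators."""
    n_pairs, n_cols = gammas.shape
    lam = 0.1                    # prob_random_match, now a free parameter
    m = np.full(n_cols, 0.9)     # P(agreement | match)
    u = np.full(n_cols, 0.1)     # P(agreement | non-match)
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match
        like_m = lam * np.prod(m ** gammas * (1 - m) ** (1 - gammas), axis=1)
        like_u = (1 - lam) * np.prod(u ** gammas * (1 - u) ** (1 - gammas), axis=1)
        w = like_m / (like_m + like_u)
        # M-step: update all three parameter groups, including lambda
        lam = w.mean()
        m = (w[:, None] * gammas).sum(axis=0) / w.sum()
        u = ((1 - w)[:, None] * gammas).sum(axis=0) / (1 - w).sum()
    return lam, m, u

# e.g. on simulated agreement patterns for three comparison columns, where
# ~1% of pairs are matches; estimates should land near lam=0.01, m=0.95, u=0.05
rng = np.random.default_rng(0)
is_match = rng.random(10_000) < 0.01
gammas = np.where(is_match[:, None],
                  rng.random((10_000, 3)) < 0.95,   # matches mostly agree
                  rng.random((10_000, 3)) < 0.05    # non-matches mostly disagree
                  ).astype(int)
print(em_all_three_params(gammas))
```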
-
As I understand it, your basic idea is "Use EM for matches_blocking_rule", which is a known approach in the literature.

Here is a related idea and strategy, which is less theoretical but more computationally intensive, for improving record linkage: "generate a sequence of blocking and linkage solutions from which a researcher can choose". Often we need to run multiple configurations of blocking and record linkage, and they produce different results. It would be nice to have some simple way to automate these different choices as a "sensitivity analysis".
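To make the "sequence of solutions" idea concrete, here is a rough sketch. The `train_and_evaluate` helper is a hypothetical stub, not a splink function; it stands in for whatever training and prediction code is actually run for each configuration.

```python
# Sketch of a sensitivity analysis over blocking/linkage configurations.
# train_and_evaluate is a hypothetical placeholder: swap in real training
# plus prediction, and return whatever summaries you want to compare.
from itertools import product

def train_and_evaluate(em_blocking_rules, threshold):
    # Placeholder: train the model with these EM blocking rules, score the
    # candidate pairs, and summarise the links above the threshold.
    return {"n_links_above_threshold": None}

em_blocking_rule_sets = [
    ["l.surname = r.surname"],
    ["l.dob = r.dob"],
    ["l.surname = r.surname", "l.dob = r.dob"],
]
thresholds = [0.5, 0.9]

results = []
for rules, threshold in product(em_blocking_rule_sets, thresholds):
    summary = train_and_evaluate(rules, threshold)
    results.append({"em_blocking_rules": rules, "threshold": threshold, **summary})

# The researcher can then compare the solutions side by side
for row in results:
    print(row)
```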
-
I'm wondering why we use blocking rules during the EM step. If I understand correctly, the point of the blocking rules is to reduce the number of comparisons to a tractable level, but wouldn't random sampling also achieve this? Would random sampling result in such a tiny fraction of true matches that EM would fail?
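To be clear about what I mean by random sampling, something like drawing random (left, right) record pairs rather than enumerating the whole cross product. Illustrative only; I don't know how splink samples internally:

```python
# Illustrative only: draw random (left, right) index pairs instead of
# materialising the full 10^12-pair cross product.
import numpy as np

rng = np.random.default_rng(42)
n_left, n_right = 1_000_000, 1_000_000
n_sampled_pairs = 10_000_000          # tractable, unlike n_left * n_right

left_idx = rng.integers(0, n_left, size=n_sampled_pairs)
right_idx = rng.integers(0, n_right, size=n_sampled_pairs)
# Each (left_idx[i], right_idx[i]) is one randomly sampled comparison pair.
# The worry: only a tiny fraction of these will be true matches, so does EM
# still get enough signal to estimate the m probabilities?
```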
I ask because I just stumbled upon Robin's great deep dive on splink (PS: this article and some of the other record linkage ones aren't listed under the record linkage topic; they only appear on the homepage), where it describes a downside of using blocking rules inside EM: