Replies: 3 comments 4 replies
-
As you say, the fundamental reason is that you need a decent number of examples of matching pairs for EM to work well, and on a large dataset only a very tiny proportion of randomly sampled pairs will be matches. But now you mention it, we haven't actually tested this, and you might be right: it might work.

For example, if you were linking two datasets of a million records each, random sampling may result in only 1 in 1,000,000 sampled pairs being a match. So if you can sample (say) 1bn pairs, you'd expect to find around 1,000 matches.

Now I guess we have a bias-variance tradeoff. As you mentioned, the current approach of using blocking for EM results in potential bias due to data quality issues, but you likely get a much larger sample of matching records. Plausibly, in the above example, you might get 800k of the 1m matches by (say) blocking on surname, as opposed to only around 1,000 matches from the random sampling approach.

So really the outstanding question is whether EM converges quickly when there's an extreme imbalance of non-matches to matches. I don't see any reason in principle why it should matter, in which case random sampling might well give you the best estimator in some situations.
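To put rough numbers on that, here's a back-of-envelope sketch (the 1m true matches and the 80% surname-blocking recall are assumptions purely for illustration):

```python
# Back-of-envelope arithmetic for the example above. The 1m true matches and
# the 80% surname recall are illustrative assumptions, not measured values.
n_left = n_right = 1_000_000
total_pairs = n_left * n_right            # 1e12 candidate pairs
true_matches = 1_000_000                  # assume ~1 match per record

match_rate = true_matches / total_pairs   # 1 in 1,000,000

# Random sampling: unbiased, but very few matches end up in the sample
sample_size = 1_000_000_000               # 1bn randomly sampled pairs
matches_from_sampling = sample_size * match_rate               # ~1,000

# Blocking on surname: potentially biased, but far more matches to learn from
assumed_surname_recall = 0.8
matches_from_blocking = true_matches * assumed_surname_recall  # ~800,000

print(matches_from_sampling, matches_from_blocking)            # 1000.0 800000.0
```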
Overall this feels like something we should test with the synthetic data to see how it compares to the existing approach. It's just finding the time!
-
Wait, another idea: instead of asking the user to estimate the recall, we find the recall as part of the estimation itself.
IDK, perhaps this still suffers from bias? It's a bit too complicated for me to wrap my head around. I think there's something promising in there, though, of updating all 3 params (u, m, and prob_random_match) until convergence, as opposed to setting prob_random_match at the beginning and then only updating u and m.
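Here's roughly the kind of thing I have in mind, as a toy sketch only (definitely not splink's actual training code): a Fellegi-Sunter style EM over binary agreement vectors where the M-step re-estimates prob_random_match (lambda) alongside m and u, instead of fixing it up front. The simulated data at the bottom is just to show the call.

```python
# Toy Fellegi-Sunter EM over binary agreement vectors (illustration only).
# The key difference from fixing prob_random_match up front: lambda is
# re-estimated in every M-step along with m and u.
import numpy as np

def em_all_three_params(gammas, n_iter=100):
    """gammas: (n_pairs, n_cols) array of 0/1 agreement indicators."""
    n_pairs, n_cols = gammas.shape
    lam = 0.1                    # prob_random_match, now a free parameter
    m = np.full(n_cols, 0.9)     # P(agreement | match)
    u = np.full(n_cols, 0.1)     # P(agreement | non-match)
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match
        like_m = lam * np.prod(m ** gammas * (1 - m) ** (1 - gammas), axis=1)
        like_u = (1 - lam) * np.prod(u ** gammas * (1 - u) ** (1 - gammas), axis=1)
        w = like_m / (like_m + like_u)
        # M-step: update all three parameter groups, including lambda
        lam = w.mean()
        m = (w[:, None] * gammas).sum(axis=0) / w.sum()
        u = ((1 - w)[:, None] * gammas).sum(axis=0) / (1 - w).sum()
    return lam, m, u

# e.g. on simulated agreement patterns for three comparison columns, where
# ~1% of pairs are matches; estimates should land near lam=0.01, m=0.95, u=0.05
rng = np.random.default_rng(0)
is_match = rng.random(10_000) < 0.01
gammas = np.where(is_match[:, None],
                  rng.random((10_000, 3)) < 0.95,   # matches mostly agree
                  rng.random((10_000, 3)) < 0.05    # non-matches mostly disagree
                  ).astype(int)
print(em_all_three_params(gammas))
```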
-
As I understand it, your basic idea is "Use EM for matches_blocking_rule", which is a known approach in the literature.

Here is a related idea and strategy, which is less theoretical but more computationally intensive, for improving record linkage: "generate a sequence of blocking and linkage solutions from which a researcher can choose". Often we need to run multiple configurations of blocking and record linkage, and they produce different results. It would be nice to have some simple way to automate these different choices as a "sensitivity analysis".
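To make the "sequence of solutions" idea concrete, here is a rough sketch. The `train_and_evaluate` helper is a hypothetical stub, not a splink function; it stands in for whatever training and prediction code is actually run for each configuration.

```python
# Sketch of a sensitivity analysis over blocking/linkage configurations.
# train_and_evaluate is a hypothetical placeholder: swap in real training
# plus prediction, and return whatever summaries you want to compare.
from itertools import product

def train_and_evaluate(em_blocking_rules, threshold):
    # Placeholder: train the model with these EM blocking rules, score the
    # candidate pairs, and summarise the links above the threshold.
    return {"n_links_above_threshold": None}

em_blocking_rule_sets = [
    ["l.surname = r.surname"],
    ["l.dob = r.dob"],
    ["l.surname = r.surname", "l.dob = r.dob"],
]
thresholds = [0.5, 0.9]

results = []
for rules, threshold in product(em_blocking_rule_sets, thresholds):
    summary = train_and_evaluate(rules, threshold)
    results.append({"em_blocking_rules": rules, "threshold": threshold, **summary})

# The researcher can then compare the solutions side by side
for row in results:
    print(row)
```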
-
I'm wondering why we use blocking rules during the EM step. If I understand correctly, the point of the blocking rules is to reduce the number of comparisons to a tractable level, but wouldn't random sampling also achieve this? Would random sampling result in such a tiny fraction of true matches that EM would fail?
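To be clear about what I mean by random sampling, something like drawing random (left, right) record pairs rather than enumerating the whole cross product. Illustrative only; I don't know how splink samples internally:

```python
# Illustrative only: draw random (left, right) index pairs instead of
# materialising the full 10^12-pair cross product.
import numpy as np

rng = np.random.default_rng(42)
n_left, n_right = 1_000_000, 1_000_000
n_sampled_pairs = 10_000_000          # tractable, unlike n_left * n_right

left_idx = rng.integers(0, n_left, size=n_sampled_pairs)
right_idx = rng.integers(0, n_right, size=n_sampled_pairs)
# Each (left_idx[i], right_idx[i]) is one randomly sampled comparison pair.
# The worry: only a tiny fraction of these will be true matches, so does EM
# still get enough signal to estimate the m probabilities?
```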
I ask because I just stumbled upon Robin's great deep dive on splink (PS: this article and some of the other record linkage ones aren't listed under the record linkage topic; they only appear on the homepage), where it describes a downside of using blocking rules inside EM: