Replies: 1 comment
**Summary**

The reason we think the current methodology is 'good enough' is that getting
**Your idea**

Your idea is definitely workable and would probably improve the accuracy of this parameter. The only reason we don't do it is that it adds a little complexity to model training. But there's no reason not to add an additional method to `linker` that implements it. From empirical experience, I would suggest the following steps would work best for training:
Note that this final step is exactly equivalent to running a single EM iteration with a training blocking rule of `(1=1)`. In fact, the only reason we don't estimate

Why did I only say it would 'probably' increase the accuracy of the parameter? Because it assumes the

Finally, note that instead of using "If match score is > .5 it's a match, else it's a non-match", it's more accurate to take the sum of probabilities as the estimate of the number of matches.

**Detail**

Yes, one interpretation is that, when you call

Since you need to choose a threshold match probability (or match weight) to decide which predictions to treat as matches, then in some sense the value of

However, getting the value right is important for two reasons:
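The 'sum of probabilities' point can be illustrated with a toy example (the probability values below are made up for illustration):

```python
# Hypothetical match probabilities for six sampled record pairs
probs = [0.95, 0.60, 0.55, 0.45, 0.10, 0.05]

# Thresholding at 0.5 treats borderline pairs as all-or-nothing
threshold_count = sum(p > 0.5 for p in probs)  # 3 pairs counted as matches

# Summing the probabilities gives the expected number of matches,
# which keeps the information carried by borderline scores
expected_matches = sum(probs)  # 2.7

print(threshold_count, expected_matches)
```

The expected-value estimate (2.7) differs from the thresholded count (3) precisely because of the near-0.5 pairs.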
> What are the implications of me choosing a poor `recall` param? Say the recall is actually 40%, but I said 80%, is `probability_two_random_records_match` going to be off by a factor of 2, or way more?

It doesn't have a huge effect, because the effects of the match weights are usually far, far stronger than a small change in the

Talking about it as a 'factor of 2' is a little tricky, for the following reason: 'if the probability is 99%, and it becomes twice as likely, what's the new probability?' You get this kind of 'bunching up' for probabilities close to 0% and close to 100%. The answer is that 99% turns into about 99.5%. And therein lies the explanation for why the difference between e.g. 1/10k and 1/5k isn't that great: in this example, the probability shifts by only 0.5%. The largest possible effect would be for a prediction that started at 50%: then the probability would move to 66.6%. The calculator under the 'bayes factor' heading here does the maths.

If you want to play around with scenarios, it's easiest to run everything through these functions:

Line 16 in bc76458

i.e. turn the prior probability into a bayes factor, then multiply all the bayes factors, before turning the final bayes factor back into a probability.
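A minimal sketch of the odds arithmetic described above. The function names are chosen to match the description; the actual helpers in the Splink source linked above may differ:

```python
def prob_to_bayes_factor(prob):
    # Convert a probability into odds (a Bayes factor)
    return prob / (1 - prob)

def bayes_factor_to_prob(bf):
    # Convert odds back into a probability
    return bf / (1 + bf)

# Doubling the odds of a 50% prediction moves it to ~66.7%...
print(bayes_factor_to_prob(prob_to_bayes_factor(0.50) * 2))  # 0.666...

# ...but doubling the odds of a 99% prediction barely moves it
print(bayes_factor_to_prob(prob_to_bayes_factor(0.99) * 2))  # ~0.995
```

This is why a factor-of-2 error in the prior matters most for predictions sitting near 50%, and hardly at all for predictions near 0% or 100%.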
I read through #462 to try to understand the methodology behind how we determine `probability_two_random_records_match`, but that issue didn't make much sense to me. So I have two Qs:

**Explanation of current method**
Can you explain the implications of `probability_two_random_records_match`? Does it just set the prior for every comparison? Or will a bad fit have implications elsewhere?

What are the implications of me choosing a poor `recall` param? Say the recall is actually 40%, but I said 80%, is `probability_two_random_records_match` going to be off by a factor of 2, or way more?

**Idea of new method**
EDIT: per #1085, taking the cartesian product and then sampling actually materializes the entire cartesian product and is therefore intractable. So the following doesn't work.
This seems too simple, I feel like I must be missing something here, but can we:

1. Take the cartesian product and sample K pairs from it (I know we need `probability_two_random_records_match` before we can do EM, so we have a chicken-and-egg problem, and is that why we don't use this method?)
2. Run `.predict()` on those K. If match score is > .5 it's a match, else it's a non-match.
3. Now we directly have `probability_two_random_records_match`.
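The sampling idea can be sketched roughly like this. `predict_pair` is a hypothetical stand-in for a per-pair scorer; Splink's real `.predict()` scores blocked pairs in bulk rather than one pair at a time:

```python
import random

def estimate_prob_two_random_records_match(records, k, predict_pair, threshold=0.5):
    """Estimate probability_two_random_records_match by sampling k random
    pairs and taking the fraction whose predicted score exceeds threshold."""
    matches = 0
    for _ in range(k):
        a, b = random.sample(records, 2)  # a random pair of distinct records
        if predict_pair(a, b) > threshold:
            matches += 1
    return matches / k

# Toy usage with a fake scorer: pairs with the same name score 0.9
records = [{"name": "alice"}, {"name": "alice"}, {"name": "bob"}, {"name": "carol"}]
score = lambda a, b: 0.9 if a["name"] == b["name"] else 0.1
print(estimate_prob_two_random_records_match(records, 1000, score))  # ~1/6
```

Per the reply above, summing the predicted probabilities instead of thresholding them would give a lower-variance estimate of the number of matches.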
EDIT: I tried this by subclassing `Linker`. This isn't directly runnable because I am using Ibis and have some custom util methods, but it should be easily tweakable. It seems to work; at least it finds "Estimated probability of random match: 0.000085". IDK if this is principled or not, though.