-
If I understand your idea, it would be to have the option of specifying deterministic rules when using For example, in your case, you may provide the rule

I agree that's a good idea. Worth noting that in many cases it's unlikely to make a substantive difference. I think the cases where it may make a difference are when you're working with fairly small datasets.

Let's take an example. Suppose you're linking 1,000 records vs 1,000, and it's a 1 to 1 link, so there are 1,000 true matches. The first thing That's 1,000,000 comparisons, of which 1,000 are a link. It then asks the question 'what proportion are a match'. The

In some cases I guess being out by 1/1000 could significantly reduce the quality of predictions. By extension, if you had 1m input records, you'd only be out by 1/1m, which is unlikely to make much of a difference.
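To make the arithmetic in that example concrete, here is a minimal Python sketch. `match_fraction` is a hypothetical helper written for this illustration, not a function from the package; it only assumes a one-to-one link in which every pairwise comparison is generated.

```python
def match_fraction(n_left, n_right, n_matches):
    """Share of all pairwise comparisons that are true matches.

    Hypothetical helper for illustration only.
    """
    return n_matches / (n_left * n_right)


# 1,000 vs 1,000 records, one-to-one: 1,000,000 comparisons, 1,000 matches.
small = match_fraction(1_000, 1_000, 1_000)

# 1m vs 1m records, one-to-one: 10^12 comparisons, 1m matches.
large = match_fraction(1_000_000, 1_000_000, 1_000_000)

print(small)  # 0.001
print(large)  # 1e-06
```

In a one-to-one link of n vs n records the fraction is 1/n, so treating every random pair as a non-match is less of an approximation the larger the input.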
-
Congrats on your nice package!
I have a question (and perhaps an idea) about the estimation of the u parameters.
I've read your documentation and my understanding is that, when making random pairs of records, the fraction of true pairs is assumed to be essentially zero, which makes the u parameters easy to estimate.
I have a situation where that is not the case. I have two lists of names (and DOBs) that are not consistently formatted, making it hard to separate them into first, middle and last names. So when I have a good (say, near-exact) match between two full names, I can be reasonably sure it is a true match: a good match between two random full names that are not a true match should be quite rare (probability ~ 10^-5, I estimate). So when making all random record pairs, my back-of-the-envelope expectation is that the good-match category will be significantly polluted by true matches, which would overestimate that u parameter.
However, I could cut out those particular true matches by filtering on another, independent variable, for example by rejecting pairs with close DOB matches, and then get the 'clean' u parameters for name pairs.
(The same goes for exact DOB matches: I can reject the true matches in that bin by rejecting close name pairs, and get the u parameters for the DOB pairs.)
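To illustrate what I mean, here is a rough Python sketch on made-up data. It is not the package's API: the records, the `estimate_u_name` helper and the DOB filter are all assumptions for this illustration, and it uses exact agreement where real data would need a fuzzy comparison.

```python
from itertools import product

# Synthetic, made-up data: 200 people, each name shared by 4 people,
# every DOB unique. Both lists hold the same people, so this is a
# one-to-one link with 200 true matches.
people = [{"id": i, "name": f"name{i % 50}", "dob": i} for i in range(200)]
left, right = people, people


def estimate_u_name(pairs):
    """Share of the given pairs whose names agree exactly (hypothetical helper)."""
    pairs = list(pairs)
    agree = sum(1 for a, b in pairs if a["name"] == b["name"])
    return agree / len(pairs)


all_pairs = list(product(left, right))

# Naive estimate: treat every pair as a non-match.
u_naive = estimate_u_name(all_pairs)

# Filtered estimate: reject pairs that agree on DOB, the independent variable.
u_filtered = estimate_u_name((a, b) for a, b in all_pairs if a["dob"] != b["dob"])

# Ground truth over non-matching pairs, known only because the data is synthetic.
u_true = estimate_u_name((a, b) for a, b in all_pairs if a["id"] != b["id"])

print(u_naive, u_filtered, u_true)
```

In this toy setup the naive estimate counts the 200 true matches as name agreements among non-matches, while the DOB-filtered estimate coincides with the value computed over known non-matches. With real data the filter would be imperfect, since some true matches have differing DOBs and some non-matches share one.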
I browsed through your code for this and didn't see it -- although I may very well have missed it.
I was wondering if this is possible in your current setup? (If it's not possible it might be a good feature.)
Thanks!