Question/idea about estimation of u parameters #1641

mbaak · 2023-10-10T20:59:21Z

Congrats on your nice package!

I have a question (and perhaps an idea) about the estimation of the u parameters.
I've read your documentation and my understanding is that, when making random pairs of records, the fraction of true pairs is assumed to be essentially zero, making the u parameters easy to estimate.

I have a situation where that is not the case. I have two lists of names (and DOBs) that are not consistently formatted, making it hard to separate them into first, middle and last names. So when I have a good (say near exact) match between two names, I can be reasonably sure it is a true match. A good match between the (full) names of two records that are not a true match should be quite rare (probability ~ 10^-5, I estimate). So when making all random record pairs, on the back of an envelope I naively expect the good match category to be significantly polluted by true matches, which would overestimate that u parameter.

However I could cut out those particular true matches by filtering them out with another independent variable, for example by rejecting close dob matches. And then get the 'clean' u parameters for name pairs.
(Same for exact dob matches, I can reject the true matches in that bin by rejecting close name pairs, and get the u parameters for the dob pairs.)

I browsed through your code for this and didn't see it -- although I may very well have missed it.
I was wondering if this is possible in your current setup? (If it's not possible it might be a good feature.)

Thanks!

RobinL · 2023-10-11T18:16:33Z

Maintainer

If I understand your idea, it would be to have the option of specifying deterministic rules when using linker.estimate_u_using_random_sampling that would allow you to filter out true matches.

For example, in your case, you may provide the rule l.full_name = r.full_name and l.dob = r.dob, which would be saying 'I believe those are true matches, so they shouldn't be counted as non-matches'.

I agree that's a good idea.

Worth noting that in many cases it's unlikely to make a substantive difference. I think the cases where it may make a difference are when you're working with fairly small datasets.

Let's take an example. Supposing you're linking 1,000 records vs 1,000, and it's a 1 to 1 link, so there are 1,000 true matches.

The first thing linker.estimate_u_using_random_sampling does is create the cartesian product of records - it compares every record to every other.

That's 1,000,000 comparisons, of which 1,000 are a link.

It then asks the question 'what proportion of these comparisons are a match'. The u probability will be out by 1,000 out of 1 million (because it has counted 1,000 of the million comparisons as non-matches when they are in fact matches).

In some cases I guess being out by 1/1000 could significantly reduce the quality of predictions.

By extension, if you had 1m input records, you'd only be out by 1/1m, which is unlikely to make much of a difference.
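The scaling in this example can be checked with a few lines of arithmetic. This is only a minimal sketch, assuming a 1 to 1 link between two lists of equal size; the function name is made up for illustration and is not part of Splink.

```python
def true_match_fraction(n: int) -> float:
    """Fraction of the n * n cartesian pairs that are true matches,
    assuming a 1-to-1 link between two lists of n records each."""
    total_pairs = n * n   # every left record compared with every right record
    true_matches = n      # assumption: exactly one true partner per record
    return true_matches / total_pairs


for n in (1_000, 1_000_000):
    print(f"n={n:>9,}: {true_match_fraction(n):.0e} of pairs are true matches")
```

The fraction falls as 1/n, which is why the contamination matters most for small inputs.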

4 replies

RobinL Oct 11, 2023
Maintainer

In terms of implementation, it would be reasonably straightforward. We generate SQL like:

          select
            "l"."unique_id" as "unique_id_l", "r"."unique_id" as "unique_id_r",
            "l"."first_name" as "first_name_l", "r"."first_name" as "first_name_r",
            "l"."surname" as "surname_l", "r"."surname" as "surname_r",
            "l"."dob" as "dob_l", "r"."dob" as "dob_r",
            "l"."city" as "city_l", "r"."city" as "city_r",
            "l"."email" as "email_l", "r"."email" as "email_r",
            "l"."cluster" as "cluster_l", "r"."cluster" as "cluster_r"
            , '0' as match_key

            from __splink__df_concat_with_tf_sample as l
            inner join __splink__df_concat_with_tf_sample as r
            on
            (1=1)

            where l."unique_id" < r."unique_id"

to create the cartesian product of records. It really just needs another 'where' condition representing the deterministic rule(s).
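To make the extra 'where' condition concrete, here is a rough sketch using Python's built-in sqlite3 rather than Splink itself. The table name `records`, the toy rows and the `deterministic_rule` string are all made up for illustration; this is not how Splink builds its SQL, just the same idea in miniature.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table records (unique_id integer, full_name text, dob text)")
con.executemany(
    "insert into records values (?, ?, ?)",
    [
        (1, "john smith", "1980-01-01"),
        (2, "john smith", "1980-01-01"),  # assumed to be a true match of record 1
        (3, "jane doe", "1975-06-30"),
        (4, "john smith", "1992-11-12"),  # same name, different person
    ],
)

# Cartesian product of records, mirroring the generated SQL above
base_sql = """
select l.unique_id as unique_id_l, r.unique_id as unique_id_r
from records as l
inner join records as r
on (1=1)
where l.unique_id < r.unique_id
"""

# Hypothetical deterministic rule: pairs satisfying it are treated as true matches
deterministic_rule = "l.full_name = r.full_name and l.dob = r.dob"
filtered_sql = base_sql + f"and not ({deterministic_rule})"

all_pairs = con.execute(base_sql).fetchall()
non_match_pairs = con.execute(filtered_sql).fetchall()
print(len(all_pairs), "pairs in total,", len(non_match_pairs), "after applying the rule")
```

One thing to watch: if a column in the rule is null, `not (...)` can itself evaluate to null and silently drop the pair, so a real implementation would need to decide how nulls should be treated.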

If you're interested in having a go, those rules would go somewhere here:

splink/splink/blocking.py (line 200 in 0b3cae1):

    def block_using_rules_sql(linker: Linker):

which is called from

splink/splink/estimate_u.py (line 109 in 0b3cae1):

    sql = block_using_rules_sql(training_linker)

(the weird 1=1 thing is just so we can re-use the same code we use for blocking easily for a cartesian product)

mbaak Oct 11, 2023
Author

Thanks for the reply, and thanks for the pointers. I will have a look.

Perhaps I'm overlooking something, but I think the problem can be large in a situation where the probability of two random people having a close (full) name match is small, let's say 10^-5. For 1000 x 1000 that gives 10 fake candidates with a close name match, and perhaps a hundred true candidates with a close name match. If those end up together in one bin, and "assuming no true matches", there would be a large overestimate of that bin's u parameter. So there the assumption of no true matches would be unwarranted, I think, and it'd be useful if many/most of the true matches can be rejected with one or more independent cuts.
(Indirectly it would also affect the estimation of the m parameter of that bin and the precision in that bin.)

(For other name-matching bins (say with low cosine similarity) the true/fake ratio would be much lower, and the assumption of no true matches is fine.)
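The back-of-envelope numbers above can be written out as a short calculation. This is a sketch under the stated guesses only (a 1 to 1 link of 1000 x 1000 records, a 10^-5 chance of a close name match between different people, and about a hundred true matches landing in the close-name bin); all variable names are illustrative.

```python
n_left = n_right = 1_000
total_pairs = n_left * n_right        # 1,000,000 random pairs
true_matches = 1_000                  # assumed 1-to-1 link
p_close_name_nonmatch = 1e-5          # guess: two different people share a close name

true_in_bin = 100                     # guess: true matches with a close name match
non_match_pairs = total_pairs - true_matches
fake_in_bin = p_close_name_nonmatch * non_match_pairs   # roughly 10 pairs

# Naive estimate treats every random pair as a non-match
u_naive = (fake_in_bin + true_in_bin) / total_pairs
# Clean estimate uses only genuine non-matches
u_clean = fake_in_bin / non_match_pairs

print(f"naive u = {u_naive:.2e}, clean u = {u_clean:.2e}, ratio = {u_naive / u_clean:.1f}")
```

Under these guesses the naive u for the close-name bin comes out roughly an order of magnitude too large, even though only 0.1% of all pairs are true matches.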

Happy to hear if I'm missing something though?

RobinL Oct 11, 2023
Maintainer

No, I think I pretty much agree with you. I think the rationale is just that the number of record comparisons rises quadratically whereas the number of matching comparisons rises linearly, so for large datasets we can tend to ignore the matches. But yes, it's right to point out that there are cases where that logic and those assumptions don't hold.

Ultimately you end up with estimated positive match weights that are not as positive as they should be, so you lose some accuracy, but even then the model may still perform fairly well.
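As a rough illustration of how much less positive the weight becomes, the sketch below takes the match weight of a comparison level to be log2(m / u) and plugs in the numbers from the 1000 x 1000 example earlier in the thread. The m value of 0.9 is purely hypothetical, and in practice the m estimate could shift as well.

```python
import math


def match_weight(m: float, u: float) -> float:
    """Match weight for one comparison level: log2 of the Bayes factor m / u."""
    return math.log2(m / u)


m = 0.9               # hypothetical m probability for the close-name level
u_clean = 1e-5        # u with true matches excluded (10 chance matches per 1e6 pairs)
u_polluted = 1.1e-4   # u if about 100 true matches are counted alongside those 10

w_clean = match_weight(m, u_clean)
w_polluted = match_weight(m, u_polluted)
print(f"clean: {w_clean:.2f}, polluted: {w_polluted:.2f}, lost: {w_clean - w_polluted:.2f}")
```

The loss in weight is log2(u_polluted / u_clean) whatever m is, so here it equals log2(11).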

RobinL Oct 12, 2023
Maintainer

One more sort of interesting point: reviewing the maths here, what we're actually calculating as the 'u' value is the bottom part of this fraction.

So arguably we should just use it that way, although that'd be a lot more work to change.
