Replies: 1 comment
**Summary**

The reason we think the current methodology is 'good enough' is that getting
**Your idea**

Your idea is definitely workable and would probably improve the accuracy of this parameter. The only reason we don't do it is that it adds a little complexity to model training. But there's no reason not to add an additional method to `linker` that implements it. From empirical experience, I would suggest the following steps would work best for training:
Note that this final step is exactly equivalent to running a single EM iteration with a training blocking rule of `(1=1)`. In fact, the only reason we don't estimate

Why did I only say it would 'probably' increase the accuracy of the parameter? Because it assumes the

Finally, note that instead of using "If match score is > .5 it's a match, else it's a non-match", it's more accurate to take the sum of probabilities as the estimate of the number of matches.

**Detail**

Yes, one interpretation is that, when you call

Since you need to choose a threshold match probability (or match weight) to decide which predictions to treat as matches, then in some sense the value of

However, getting the value right is important for two reasons:
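The 'sum of probabilities' point can be illustrated with a toy example (the probability values below are made up for illustration):

```python
# Hypothetical match probabilities for six sampled record pairs
probs = [0.95, 0.60, 0.55, 0.45, 0.10, 0.05]

# Thresholding at 0.5 treats borderline pairs as all-or-nothing
threshold_count = sum(p > 0.5 for p in probs)  # 3 pairs counted as matches

# Summing the probabilities gives the expected number of matches,
# which keeps the information carried by borderline scores
expected_matches = sum(probs)  # 2.7

print(threshold_count, expected_matches)
```

The expected-value estimate (2.7) differs from the thresholded count (3) precisely because of the near-0.5 pairs.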
> What are the implications of me choosing a poor `recall` param? Say the recall is actually 40%, but I said 80%, is `probability_two_random_records_match` going to be off by a factor of 2, or way more?

It doesn't have a huge effect, because the effects of the match weights are usually far, far stronger than a small change in the

Talking about it as a 'factor of 2' is a little tricky, for the following reason: 'if the probability is 99%, and it becomes twice as likely, what's the new probability?' You get this kind of 'bunching up' for probabilities close to 0% and close to 100%. The answer is that 99% turns into about 99.5%. And therein lies the explanation for why the difference between e.g. 1/10k and 1/5k isn't that great: in this example, the probability shifts by only 0.5%. The largest possible effect would be for a prediction that started at 50%: then the probability would move to 66.6%. The calculator under the 'bayes factor' heading here does the maths.

If you want to play around with scenarios, it's easiest to run everything through these functions:

Line 16 in bc76458

i.e. turn the prior probability into a bayes factor, then multiply all the bayes factors, before turning the final bayes factor back into a probability.
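A minimal sketch of the odds arithmetic described above. The function names are chosen to match the description; the actual helpers in the Splink source linked above may differ:

```python
def prob_to_bayes_factor(prob):
    # Convert a probability into odds (a Bayes factor)
    return prob / (1 - prob)

def bayes_factor_to_prob(bf):
    # Convert odds back into a probability
    return bf / (1 + bf)

# Doubling the odds of a 50% prediction moves it to ~66.7%...
print(bayes_factor_to_prob(prob_to_bayes_factor(0.50) * 2))  # 0.666...

# ...but doubling the odds of a 99% prediction barely moves it
print(bayes_factor_to_prob(prob_to_bayes_factor(0.99) * 2))  # ~0.995
```

This is why a factor-of-2 error in the prior matters most for predictions sitting near 50%, and hardly at all for predictions near 0% or 100%.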
I read through #462 to try to understand the methodology behind how we determine `probability_two_random_records_match`, but that issue didn't make much sense to me. So I have two Qs:

**Explanation of current method**
Can you explain the implications of `probability_two_random_records_match`? Does it just set the prior for every comparison? Or will a bad fit have implications elsewhere?

What are the implications of me choosing a poor `recall` param? Say the recall is actually 40%, but I said 80%, is `probability_two_random_records_match` going to be off by a factor of 2, or way more?

**Idea of new method**
EDIT: per #1085, taking the cartesian product and then sampling actually materializes the entire cartesian product and is therefore intractable. So the following doesn't work.
This seems too simple, I feel like I must be missing something here, but can we:

1. Take the cartesian product and sample K pairs from it (I know we need `probability_two_random_records_match` before we can do EM, so we have a chicken-and-egg problem, and is that why we don't use this method?)
2. Run `.predict()` on those K. If match score is > .5 it's a match, else it's a non-match.
3. Now we directly have `probability_two_random_records_match`.
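The sampling idea can be sketched roughly like this. `predict_pair` is a hypothetical stand-in for a per-pair scorer; Splink's real `.predict()` scores blocked pairs in bulk rather than one pair at a time:

```python
import random

def estimate_prob_two_random_records_match(records, k, predict_pair, threshold=0.5):
    """Estimate probability_two_random_records_match by sampling k random
    pairs and taking the fraction whose predicted score exceeds threshold."""
    matches = 0
    for _ in range(k):
        a, b = random.sample(records, 2)  # a random pair of distinct records
        if predict_pair(a, b) > threshold:
            matches += 1
    return matches / k

# Toy usage with a fake scorer: pairs with the same name score 0.9
records = [{"name": "alice"}, {"name": "alice"}, {"name": "bob"}, {"name": "carol"}]
score = lambda a, b: 0.9 if a["name"] == b["name"] else 0.1
print(estimate_prob_two_random_records_match(records, 1000, score))  # ~1/6
```

Per the reply above, summing the predicted probabilities instead of thresholding them would give a lower-variance estimate of the number of matches.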
EDIT: I tried this by subclassing `Linker`. This isn't directly runnable because I am using Ibis and have some custom util methods, but it should be easily tweakable. It seems to work; at least it finds "Estimated probability of random match: 0.000085". IDK if this is principled or not, though.