Multiple blocking rules #1051

checkbook-org · 2023-02-16T00:03:11Z

Hi good splink people (splinkies?),
I am confused about how multiple blocking rules work.

In the settings object provided to a (DuckDB) Linker object, it is possible to set multiple blocking rules.
Most methods work with settings that contain multiple blocking rules, including estimate_m_from_label_column and estimate_u_using_random_sampling, and count_num_comparisons_from_blocking_rule also seems to accept multiple blocking rules.

However, when calling estimate_parameters_using_expectation_maximisation I am only able to specify one blocking rule. This generates an output that would seem to be specific to that rule and not to all the rules together. I can run estimate_parameters_using_expectation_maximisation again and generate another results object, but I'm unclear on the effect this has.
My goal is to use multiple overlapping blocking rules to address the lack of any single "perfect" rule, as outlined in the examples.

My questions are:

1. Do sequential calls to estimate_parameters_using_expectation_maximisation train the linker on multiple rules, or do they overwrite the previous training?
2. Is there some way to run estimate_parameters_using_expectation_maximisation on multiple rules at once?

Any suggestions or clarifications would be much appreciated.

Thanks!

Eric

RossKen · 2023-03-07T10:17:07Z

Maintainer

Hey @checkbook-org

Blocking rules are used in two different ways in Splink, and I suspect that is where the confusion comes from.

Blocking rules for runtime

The first use of blocking in a Splink pipeline is to reduce the number of record comparisons that are considered when generating predictions, which keeps computation time manageable. These are the rules listed in the settings object. This is where count_num_comparisons_from_blocking_rule is useful, because it shows how many comparisons your set of blocking rules will generate. Multiple rules are allowed here because you want to capture all true matches at this stage: a pair of records is compared if it satisfies any one of the rules, so each additional rule can recover matches that the others miss.
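
As a rough illustration of why several rules are combined at this stage, the sketch below applies two blocking rules to a tiny made-up dataset and takes the union of the candidate pairs. It is plain Python rather than Splink code: the records, the pairs_for_rule helper and the two rule functions are all hypothetical, and Splink itself expresses rules as SQL conditions and generates the pairs in the database.

```python
from itertools import combinations

# Made-up records for illustration only.
records = [
    {"id": 1, "first_name": "eric", "surname": "smith", "dob": "1990-01-01"},
    {"id": 2, "first_name": "eric", "surname": "smyth", "dob": "1990-01-01"},
    {"id": 3, "first_name": "erik", "surname": "smith", "dob": "1990-01-01"},
    {"id": 4, "first_name": "anna", "surname": "jones", "dob": "1985-05-05"},
]


def same_first_name(left, right):
    """Hypothetical blocking rule: exact match on first_name."""
    return left["first_name"] == right["first_name"]


def same_surname(left, right):
    """Hypothetical blocking rule: exact match on surname."""
    return left["surname"] == right["surname"]


def pairs_for_rule(rows, rule):
    """Return the set of (id_l, id_r) pairs that satisfy one blocking rule."""
    return {
        (left["id"], right["id"])
        for left, right in combinations(rows, 2)
        if rule(left, right)
    }


by_first_name = pairs_for_rule(records, same_first_name)
by_surname = pairs_for_rule(records, same_surname)

# A pair is kept if ANY rule admits it, so the rules are combined as a union.
candidate_pairs = by_first_name | by_surname

all_pairs = len(list(combinations(records, 2)))
print(f"{len(candidate_pairs)} of {all_pairs} possible pairs are compared")
```

Neither rule finds both candidate pairs on its own, and records 2 and 3 (which differ slightly on both name columns) are still missed by the union, which is the kind of gap a further rule, for example on dob, would close.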

Blocking for Expectation Maximisation algorithm

To estimate the m values in the model, we run the Expectation Maximisation (EM) algorithm several times, once per training session. In each session we "block" on certain columns, and m values are generated for the other columns. (For intuition, I like to think of each session as a classification problem: if we block on "forename", that becomes the target variable, and the EM algorithm generates the m values for all of the other columns that would predict "forename".)

In that case no m values are generated for "forename" itself, so the model is still incomplete and you have to run further sessions with different blocking rules. Each call to estimate_parameters_using_expectation_maximisation takes one blocking rule, and the m values from every call are stored rather than overwritten. Where a column ends up with several estimates, the model will take some sort of average (I can't remember what type off the top of my head). For example, in this demo splink job there is a block on name followed by a block on date of birth.
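
The bookkeeping described above can be sketched in a few lines of plain Python. This is not Splink code: the column names, the sessions list and the columns_estimated helper are hypothetical, and the sketch only tracks which columns could receive m estimates in each training session, not the EM algorithm itself.

```python
# Columns the (hypothetical) model compares.
comparison_columns = {"first_name", "surname", "dob", "city", "email"}

# Each training session is represented by the columns its blocking rule
# constrains, mirroring "block on name, then block on date of birth".
sessions = [
    {"first_name", "surname"},  # session 1: block on name
    {"dob"},                    # session 2: block on date of birth
]


def columns_estimated(all_columns, blocked):
    """Columns that can get m estimates when `blocked` columns are held fixed."""
    return all_columns - blocked


# Count how many sessions produce an estimate for each column.
estimate_counts = {column: 0 for column in comparison_columns}
for blocked in sessions:
    for column in columns_estimated(comparison_columns, blocked):
        estimate_counts[column] += 1

# Columns that no session was able to estimate.
never_estimated = {c for c, n in estimate_counts.items() if n == 0}

for column in sorted(estimate_counts):
    print(f"{column}: {estimate_counts[column]} estimate(s)")
print("columns with no estimate:", sorted(never_estimated))
```

With only the first session in the list, first_name and surname would end up in never_estimated, which is the situation that calls for a second blocking rule.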

To see the m values generated across your training runs, you can generate the parameter_estimate_comparisons_chart, which shows the m values from all of your training sessions. The demo splink job above generates this chart.

In that chart you can see that first name, surname and dob have only one estimate each (as each was blocked on in one of the sessions), but the other columns have multiple values, from which an average will be taken.
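
To make the "multiple values, then an average" idea concrete, here is a small sketch that groups per-session estimates by column and collapses each group to a single number. The figures are invented, and the choice of statistics.median is purely an assumption for illustration, since the reply above does not say which average Splink uses.

```python
from collections import defaultdict
from statistics import median

# Invented (session, column, m estimate) rows for a single comparison level,
# e.g. "exact match". These are NOT outputs of a real model.
estimates = [
    ("block on name", "dob", 0.95),
    ("block on name", "city", 0.80),
    ("block on name", "email", 0.70),
    ("block on dob", "first_name", 0.90),
    ("block on dob", "surname", 0.92),
    ("block on dob", "city", 0.84),
    ("block on dob", "email", 0.74),
]

# Group the estimates by column, much as the chart does visually.
by_column = defaultdict(list)
for _session, column, m_value in estimates:
    by_column[column].append(m_value)

# ASSUMPTION: median is used as the summary; the real aggregation may differ.
combined = {column: median(values) for column, values in by_column.items()}

for column in sorted(combined):
    count = len(by_column[column])
    print(f"{column}: {count} estimate(s) -> {combined[column]:.2f}")
```

Columns that were blocked on in one session keep their single estimate unchanged, while city and email, which were free to vary in both sessions, are summarised from two values.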

2 replies

RobinL Mar 7, 2023
Maintainer

It's a bit hidden in the docs, but there's also an article on the difference between blocking_rules_to_generate_predictions and blocking rules for estimation, because it's a common question (and a bit confusing).

Answer selected by RossKen

RossKen Mar 7, 2023
Maintainer

Ah ha! I knew that was somewhere but I couldn't find it - thanks @RobinL!
