Multiple blocking rules #1051
-
Hi good splink people (splinkies?),

In the settings object provided to a (DuckDB) Linker object, it is possible to set multiple blocking rules. However, when calling

My questions are:
Any suggestions or clarifications would be much appreciated.

Thanks!
Eric
Replies: 1 comment 2 replies
-
Hey @checkbook-org

There are two ways that blocking rules are used in splink, so I suspect that is where the confusion is coming from.

The first use of blocking in a splink pipeline is to reduce the number of comparisons that are considered when generating the model, in order to cut computation time. So, this is where

To estimate the m values in the model, we run multiple instances of the Expectation Maximisation (EM) algorithm. Here we "block" on certain columns, and m values are generated for the other columns. (For intuition, I like to think of each run as a classification problem: if we block on "forename", that becomes the target variable, and the EM algorithm generates the m values for all of the other columns that would predict "forename".) In this case you will not have generated m values for "forename", so you will not be able to score comparisons fully, and you will have to run additional blocking rules.

You can run multiple blocking rules, and the m values will be generated and stored for each of them. Where you have multiple estimates for a column, the model will take some sort of average (I can't remember what type off the top of my head). For example, in this demo splink job there is a block on name followed by one on date of birth.

To see the m values generated across your multiple training runs, you can generate the parameter_estimate_comparisons_chart, which shows the m values across all your different training sessions. The demo splink job above generates this parameter_estimate_comparisons_chart:
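To make the "block on one column, estimate the others" idea concrete, here is a minimal sketch in plain Python. It does not call splink at all: the column names and the `columns_estimated` helper are hypothetical, and it only tracks which training session can supply m estimates for which column.

```python
# Hypothetical comparison columns, for illustration only (not splink code).
COMPARISON_COLUMNS = ["forename", "surname", "dob"]


def columns_estimated(blocked_on, all_columns=COMPARISON_COLUMNS):
    """Return the columns that can receive m estimates in one training
    session, i.e. every comparison column except the ones blocked on."""
    return [c for c in all_columns if c not in blocked_on]


# Two training sessions, each blocking on a different column.
sessions = {
    "block on forename": ["forename"],
    "block on dob": ["dob"],
}

# Record which sessions contribute an estimate for each column.
estimates_per_column = {c: [] for c in COMPARISON_COLUMNS}
for session_name, blocked in sessions.items():
    for col in columns_estimated(blocked):
        estimates_per_column[col].append(session_name)

for col, contributing in estimates_per_column.items():
    print(col, "<-", contributing)
```

In this toy setup "surname" gets an estimate from both sessions (these are the estimates that would be combined), while "forename" and "dob" each get one, from the session that did not block on them. A single session on its own would leave one column with no estimate.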
It's a bit hidden in the docs, but there's also an article on the difference between blocking_rules_to_generate_predictions and blocking rules for estimation, since it's a common (and slightly confusing) question.
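For the prediction side, a rough sketch of what several blocking rules do to the set of candidate pairs may help. This is plain Python rather than splink itself: the records, the rule lambdas, and the `candidate_pairs` helper are all made up for illustration, and it assumes a pair is kept when it satisfies at least one rule and is only counted once.

```python
from itertools import combinations

# Toy records, invented for this example.
records = [
    {"id": 1, "forename": "ann", "surname": "lee", "dob": "1990-01-01"},
    {"id": 2, "forename": "ann", "surname": "li", "dob": "1990-01-01"},
    {"id": 3, "forename": "bob", "surname": "lee", "dob": "1985-05-05"},
    {"id": 4, "forename": "rob", "surname": "ray", "dob": "1985-05-05"},
]

# Python stand-ins for rules such as "l.forename = r.forename".
rules = [
    lambda l, r: l["forename"] == r["forename"],
    lambda l, r: l["dob"] == r["dob"],
]


def candidate_pairs(records, rules):
    """Keep each pair of records that satisfies at least one rule.
    Using a set means a pair matching several rules appears only once."""
    pairs = set()
    for l, r in combinations(records, 2):
        if any(rule(l, r) for rule in rules):
            pairs.add((l["id"], r["id"]))
    return pairs


all_pairs = len(list(combinations(records, 2)))
blocked = candidate_pairs(records, rules)
print(f"{len(blocked)} of {all_pairs} possible pairs are compared: {sorted(blocked)}")
```

Records 1 and 2 satisfy both rules but still produce a single pair, and adding a rule can only add pairs, never remove them. This is a different job from the estimation blocking rules, which are passed one at a time to separate training sessions.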