Best name for new efficient implementation of `count_num_comparisons_from_blocking_rule` #1391

RobinL · 2023-07-04T09:17:23Z

RobinL
Jul 4, 2023
Maintainer

PR #1388 contains a new approach to count_num_comparisons_from_blocking_rule.

The results of this approach are different to the current function.

The new implementation counts the number of comparisons before filter conditions are applied.
The old implementation counts the number of comparisons after filter conditions are applied.

It is not possible (algorithmically) to use the faster approach to count the post-filter comparisons.

We therefore need to make a decision on naming:

Do we retain this function name, despite the different answer?
Could we add an argument to this function that selects the alternative, faster computation?
Do we want a separate function?

The rest of this note explains why this function name is ambiguous.

It makes the distinction between:

equi-join conditions, which can be executed very efficiently
non equi-join (filter) conditions, which are much less efficient to execute

Equi join conditions are the most important driver of performance

Consider the following blocking rule:

l.first_name = r.first_name and levenshtein(l.surname, r.surname) < 3

In this blocking rule:

l.first_name = r.first_name is an 'equi join' condition
levenshtein(l.surname, r.surname) < 3 is a filter condition

In Splink, this will turn into a SQL statement similar to:

SELECT ... as match_probability
FROM df as l
INNER JOIN df as r
ON l.first_name = r.first_name AND levenshtein(l.surname, r.surname) < 3

How is this executed? A naive implementation would:

1. Create all pairwise comparisons (the Cartesian product)
2. As each comparison is created, the blocking condition is applied as a filter
3. Splink computes the match probability from the pairwise comparison

This naive implementation is very inefficient due to the large number of pairwise comparisons that need to be created.

Instead, databases use an optimisation which dramatically improves runtimes, but can only be used for equi-join conditions:

1. Identify equi-join conditions, in this case l.first_name = r.first_name
2. Split the dataset into chunks based on this condition. In this case, all the Johns, all the Janes etc. (Note: multiple equi-joins would just mean more splits, e.g. into John Smith, Jane Jones etc.)
3. Within these chunks, generate the Cartesian product of pairwise comparisons
4. As each comparison is created, apply any filter conditions (in this case `levenshtein(l.surname, r.surname) < 3`)
5. Splink computes the match probability from the pairwise comparison

Since steps 2 and 3 dramatically reduce the number of comparisons created, equi join conditions in a blocking rule are the most important determinant of performance.
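To make the three counts concrete, here is a minimal pure-Python sketch. It is not Splink code: the toy records, the hand-written `levenshtein` helper and the three variable names are all assumptions made for illustration. It counts pairs the way the plain self-join shown earlier would, so self-pairs and both orderings are included; a real linkage would typically exclude these.

```python
from collections import Counter
from itertools import product

# Toy input data (hypothetical, for illustration only)
records = [
    {"first_name": "John", "surname": "Smith"},
    {"first_name": "John", "surname": "Smyth"},
    {"first_name": "John", "surname": "Jones"},
    {"first_name": "Jane", "surname": "Jones"},
    {"first_name": "Jane", "surname": "Jonas"},
]


def levenshtein(a, b):
    """Edit distance between two strings (standard dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # deletion
                    curr[j - 1] + 1,           # insertion
                    prev[j - 1] + (ca != cb),  # substitution
                )
            )
        prev = curr
    return prev[-1]


# Naive approach: every record is paired with every record.
naive_count = len(records) ** 2

# Pre-filter count: only the equi-join condition is used, so the count
# can be derived from chunk sizes without generating any pairs.
chunk_sizes = Counter(r["first_name"] for r in records)
pre_filter_count = sum(n * n for n in chunk_sizes.values())

# Post-filter count: each pair has to be generated so that the filter
# condition can be evaluated on it.
post_filter_count = sum(
    1
    for l, r in product(records, repeat=2)
    if l["first_name"] == r["first_name"]
    and levenshtein(l["surname"], r["surname"]) < 3
)

print(naive_count, pre_filter_count, post_filter_count)  # 25 13 9
```

The pre-filter count only needs the size of each chunk, which is why it can be computed quickly. The post-filter count needs every candidate pair to be materialised before the filter can be evaluated.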

Filter conditions (non equi joins) are also important

Once a comparison is generated, it must be scored by Splink. This is far more computationally intensive than simply creating the comparison, because there are often fuzzy matching functions like jaro_winkler that must be evaluated.

Non equi join (filter) conditions eliminate this computation by filtering a record comparison before it needs to be scored.

Relevance to new functions

So, if we were to be extremely specific, our functions would need to be called something like:

count_num_comparisons_from_blocking_rule_pre_filter_conditions()
count_num_comparisons_from_blocking_rule_post_filter_conditions()

But unfortunately I think this would be meaningless to users.

I think the best solution is to have a switch:

count_num_comparisons_from_blocking_rule(apply_filter_conditions=False)
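As a rough sketch of how such a switch could behave, the standalone function below takes the equi-join key and the filter condition as plain Python callables. The function name, its arguments other than `apply_filter_conditions`, and the toy data are hypothetical; this is not the Splink API or the implementation in the PR, which works on SQL blocking rules.

```python
from collections import defaultdict
from itertools import product


def count_comparisons_sketch(
    records, equi_join_key, filter_condition=None, apply_filter_conditions=False
):
    """Count self-join comparisons for a blocking rule (illustrative only).

    equi_join_key:    function mapping a record to its blocking key
    filter_condition: function (l, r) -> bool, the non equi-join part
    """
    # Split the records into chunks that share the same equi-join key
    blocks = defaultdict(list)
    for record in records:
        blocks[equi_join_key(record)].append(record)

    if not apply_filter_conditions or filter_condition is None:
        # Fast path: the count follows from chunk sizes alone
        return sum(len(block) ** 2 for block in blocks.values())

    # Slow path: generate each pair within a chunk and apply the filter
    return sum(
        1
        for block in blocks.values()
        for l, r in product(block, repeat=2)
        if filter_condition(l, r)
    )


# Hypothetical toy data; the year-of-birth filter stands in for any
# non equi-join condition such as a levenshtein threshold.
people = [
    {"first_name": "John", "birth_year": 1980},
    {"first_name": "John", "birth_year": 1981},
    {"first_name": "John", "birth_year": 1990},
    {"first_name": "Jane", "birth_year": 1975},
    {"first_name": "Jane", "birth_year": 1975},
]


def by_first_name(record):
    return record["first_name"]


def close_birth_year(l, r):
    return abs(l["birth_year"] - r["birth_year"]) <= 1


fast = count_comparisons_sketch(people, by_first_name, close_birth_year)
slow = count_comparisons_sketch(
    people, by_first_name, close_birth_year, apply_filter_conditions=True
)
print(fast, slow)  # 13 9
```

With the switch off, no pairs are generated at all, so a blocking rule that is far too loose still returns a count quickly.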

But I'm interested in views.

RossKen · 2023-07-04T12:17:34Z

RossKen
Jul 4, 2023

I think your parameter solution is probably the route to go down. I am, however, slightly wary of having the default behaviour change. Could we add a warning message, shown when a filter condition is detected, that alerts the user to this change and explains how to revert to the previous behaviour?

Also, the docstrings would need to describe clearly what a filter condition is versus an equi join. I will add this to the topic guides I am writing at the minute, but it should be in the API docs too.

2 replies

RobinL Jul 4, 2023
Maintainer Author

Thanks. I'm also wary of changes in default behaviour. I think the reason I'm inclined to tolerate it is that:

This function doesn't affect actual linkage results
The current implementation is pretty bad from the user's point of view. I use the function most often to ensure that a blocking rule is not too loose prior to running a linkage. But in the case it is too loose, the function takes forever to compute (or blows up).

I agree it needs some sort of message - something like 'this is the number of comparisons the sql engine will generate when applying this blocking rule - the number of rows in your output dataset will differ due to {filter conditions} amongst other things. Please refer to x documentation'
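One possible shape for such a message, sketched with Python's standard `warnings` module. The helper name `format_pre_filter_notice`, the wording and the example values are assumptions for illustration, not what Splink emits.

```python
import warnings


def format_pre_filter_notice(count, filter_conditions):
    """Build a notice explaining what the pre-filter count represents.

    Hypothetical helper: `filter_conditions` is a list of SQL snippets
    that were detected as non equi-join conditions.
    """
    conditions = "; ".join(filter_conditions)
    return (
        f"{count:,} is the number of comparisons the SQL engine will generate "
        "when applying this blocking rule. The number of rows in your output "
        f"dataset will differ due to filter conditions ({conditions}), "
        "amongst other things. Please refer to the documentation."
    )


notice = format_pre_filter_notice(
    13, ["levenshtein(l.surname, r.surname) < 3"]
)

# Only warn when at least one filter condition was detected
warnings.warn(notice, UserWarning, stacklevel=2)
```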

RossKen Jul 4, 2023

Yep agreed, it doesn't have a massive impact so I'm happy for it to go ahead as long as it is communicated properly to users 😊
