I think your parameter solution is probably the route to go down. I am, however, slightly wary of having the default behaviour change. Could we add a warning message, shown when a filter condition is detected, that alerts the user to this change and explains how to revert to the previous behaviour? Also, the docstrings would need to describe clearly what a filter condition is versus an equi join. I will add this to the topic guides I am writing at the minute, but it should be in the API docs too.
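As a rough illustration of the warning idea, here is a minimal sketch. The helper names `split_conditions` and `warn_if_filter_conditions` are hypothetical and not part of Splink, and the clause detection is a crude regex heuristic rather than real SQL parsing; the `apply_filter_conditions` argument mentioned in the message is the switch proposed in this thread.

```python
import re
import warnings

# Assumption: a clause of the form `l.col = r.col` is an equi join condition,
# and anything else is treated as a filter condition. Real SQL needs a parser.
EQUI_JOIN = re.compile(r"^l\.\w+\s*=\s*r\.\w+$")


def split_conditions(blocking_rule):
    """Split a blocking rule on AND into (equi join clauses, filter clauses).

    Crude: this would mis-split expressions such as `x BETWEEN a AND b`.
    """
    clauses = [
        clause.strip()
        for clause in re.split(r"\s+and\s+", blocking_rule, flags=re.IGNORECASE)
    ]
    equi = [c for c in clauses if EQUI_JOIN.match(c)]
    filters = [c for c in clauses if not EQUI_JOIN.match(c)]
    return equi, filters


def warn_if_filter_conditions(blocking_rule):
    """Emit a UserWarning if the rule contains any filter conditions."""
    _, filters = split_conditions(blocking_rule)
    if filters:
        warnings.warn(
            "Blocking rule contains filter conditions "
            f"({'; '.join(filters)}). The count returned is pre-filter. "
            "Pass apply_filter_conditions=True for the post-filter count.",
            UserWarning,
            stacklevel=2,
        )
    return filters


rule = "l.first_name = r.first_name and levenshtein(l.surname, r.surname) < 3"
warn_if_filter_conditions(rule)
```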
PR #1388 contains a new approach to count_num_comparisons_from_blocking_rule. The results of this approach are different to the current function.
It is not possible (algorithmically) to use the faster approach to count the post-filter comparisons.
We therefore need to make a decision on naming:
The rest of this note explains why this function name is ambiguous.
It makes the distinction between equi join conditions and filter (non equi join) conditions in a blocking rule.
Equi join conditions are the most important driver of performance
Consider the following blocking rule:
l.first_name = r.first_name and levenshtein(l.surname, r.surname) < 3
In this blocking rule:
l.first_name = r.first_name is an 'equi join' condition
levenshtein(l.surname, r.surname) < 3 is a filter condition
In Splink, this will turn into a SQL statement similar to:
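A sketch of what such a statement might look like, run against an in-memory SQLite database so it is self-contained. The table name `df`, the column names, the `l.unique_id < r.unique_id` deduplication clause and the toy data are all assumptions for illustration, not the exact SQL Splink generates. SQLite has no built-in `levenshtein`, so a plain Python implementation is registered as a user-defined function.

```python
import sqlite3


def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb))
            )
        prev = curr
    return prev[-1]


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call levenshtein(x, y).
conn.create_function("levenshtein", 2, levenshtein)
conn.execute("CREATE TABLE df (unique_id INTEGER, first_name TEXT, surname TEXT)")
conn.executemany(
    "INSERT INTO df VALUES (?, ?, ?)",
    [
        (1, "john", "smith"),
        (2, "john", "smyth"),
        (3, "john", "williams"),
        (4, "mary", "smith"),
    ],
)

# The blocking rule becomes the ON clause of a self join.
sql = """
SELECT l.unique_id AS unique_id_l, r.unique_id AS unique_id_r
FROM df AS l
INNER JOIN df AS r
  ON l.first_name = r.first_name
 AND levenshtein(l.surname, r.surname) < 3
WHERE l.unique_id < r.unique_id
"""
pairs = conn.execute(sql).fetchall()
print(pairs)
```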
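How is this executed? A naive implementation would create every possible pairwise comparison of records and only then discard the pairs that do not satisfy the conditions.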
This naive implementation is very inefficient due to the large number of pairwise comparisons that need to be created.
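To make the cost concrete, here is a minimal sketch of that naive strategy on toy data. The records and the `naive_comparisons` helper are made up for illustration; the point is that every pair is materialised before any condition is checked.

```python
from itertools import combinations

records = [
    {"unique_id": 1, "first_name": "john", "surname": "smith"},
    {"unique_id": 2, "first_name": "john", "surname": "smyth"},
    {"unique_id": 3, "first_name": "john", "surname": "williams"},
    {"unique_id": 4, "first_name": "mary", "surname": "smith"},
]


def naive_comparisons(records, condition):
    """Create all n*(n-1)/2 pairs first, then keep those passing `condition`.

    Returns (number of pairs created, number of pairs kept).
    """
    candidate_pairs = list(combinations(records, 2))
    kept = [(l, r) for l, r in candidate_pairs if condition(l, r)]
    return len(candidate_pairs), len(kept)


created, kept = naive_comparisons(
    records, lambda l, r: l["first_name"] == r["first_name"]
)
print(created, kept)
```

With n records this always creates n*(n-1)/2 pairs, regardless of how selective the condition is.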
Instead, databases use an optimisation which dramatically improves runtimes, but can only be used for equi-join conditions:
l.first_name = r.first_name
Since steps 2 and 3 dramatically reduce the number of comparisons created, equi join conditions in a blocking rule are the most important determinant of performance.
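One common form of this optimisation is a hash join. The sketch below is an assumption about the general technique, not a description of any particular database's internals: records are bucketed by the equi join key, and pairs are only ever created within a bucket. It also hints at why the pre-filter count can be cheap: it can be computed from bucket sizes alone, whereas a filter condition has to be evaluated pair by pair.

```python
from collections import Counter, defaultdict
from itertools import combinations

records = [
    {"unique_id": 1, "first_name": "john", "surname": "smith"},
    {"unique_id": 2, "first_name": "john", "surname": "smyth"},
    {"unique_id": 3, "first_name": "john", "surname": "williams"},
    {"unique_id": 4, "first_name": "mary", "surname": "smith"},
]


def blocked_pairs(records, key):
    """Yield pairs that share the same value of `key` (the equi join column)."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec[key]].append(rec)
    # Pairs are only generated inside each bucket, never across buckets.
    for bucket in buckets.values():
        yield from combinations(bucket, 2)


def count_pre_filter(records, key):
    """Count equi join pairs from bucket sizes, without creating any pair."""
    sizes = Counter(rec[key] for rec in records)
    return sum(n * (n - 1) // 2 for n in sizes.values())


num_generated = len(list(blocked_pairs(records, "first_name")))
num_counted = count_pre_filter(records, "first_name")
print(num_generated, num_counted)
```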
Filter conditions (non equi joins) are also important
Once a comparison is generated, it must be scored by Splink. This is far more computationally intensive than simply creating the comparison, because there are often fuzzy matching functions like jaro_winkler that must be evaluated.
Non equi join (filter) conditions eliminate this computation by filtering a record comparison before it needs to be scored.
Relevance to new functions
So, if we were to be extremely specific, our functions would need to be called something like:
count_num_comparisons_from_blocking_rule_pre_filter_conditions()
count_num_comparisons_from_blocking_rule_post_filter_conditions()
But unfortunately I think this would be meaningless to users.
I think the best solution is to have a switch:
count_num_comparisons_from_blocking_rule(apply_filter_conditions=False)
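A minimal sketch of how such a switch could behave, on toy data. This is not the Splink implementation: `count_comparisons`, its `key` and `filter_condition` arguments, and the stand-in filter (same first letter of surname, used here in place of levenshtein) are all hypothetical. Only the `apply_filter_conditions` flag comes from the proposal above.

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"unique_id": 1, "first_name": "john", "surname": "smith"},
    {"unique_id": 2, "first_name": "john", "surname": "smyth"},
    {"unique_id": 3, "first_name": "john", "surname": "williams"},
    {"unique_id": 4, "first_name": "mary", "surname": "smith"},
]


def count_comparisons(
    records, key, filter_condition=None, apply_filter_conditions=False
):
    """Count comparisons for an equi join on `key`.

    apply_filter_conditions=False: fast path, arithmetic on bucket sizes.
    apply_filter_conditions=True: slow path, every pair in a bucket is
    created and tested against `filter_condition`.
    """
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec[key]].append(rec)

    if not apply_filter_conditions or filter_condition is None:
        return sum(len(b) * (len(b) - 1) // 2 for b in buckets.values())

    return sum(
        1
        for bucket in buckets.values()
        for l, r in combinations(bucket, 2)
        if filter_condition(l, r)
    )


def same_initial(l, r):
    """Stand-in filter condition: surnames start with the same letter."""
    return l["surname"][0] == r["surname"][0]


pre_filter = count_comparisons(records, "first_name", same_initial)
post_filter = count_comparisons(
    records, "first_name", same_initial, apply_filter_conditions=True
)
print(pre_filter, post_filter)
```

Defaulting the flag to False keeps the cheap count as the default, which is the behaviour change the reply in this thread is wary of.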
But interested in views