Replies: 1 comment
-
Hey @RobinL, yea I get your point. I am completely in agreement on the three areas of improvement we need and I am not completely wedded to any one solution - I would much rather get something that works best in the long term and reduces the amount of confusion around blocking. I agree that the My take on it in the first instance was that having a similar API to comparison levels would make the BRL easier to learn as it wouldn't feel like yet another distinct thing users need to learn. Particularly with the confusion between prediction and training blocking rules already. I can see how your @ThomasHepworth @ADBond it would be good to get your take on this |
Beta Was this translation helpful? Give feedback.
-
@RossKen @ThomasHepworth
I've been trying the new functions a bit as part of my work on auto blocking rules. This has highlighted a couple areas where I think there's room for improvement.
Apologies in advance if some of the below is already in the pipeline.
I should start by saying I think it's a great idea to work on improving blocking rules - I think there room for improvement in these areas (at minimum):
At the moment, I think there's a bit of a gap between the new functions and these aims.
My starting point is that overwhelmingly (maybe 99% of the time) the most common use of blocking rules should be using equijoin conditions, very often on more than one column.
Where we see users using conditions used such as
levenshtein
orOR
conditions, more often than not it's because the user doesn't understand the performance implications, rather than because it's a good idea.Examples
1. Common and good practice
l.first_name = r.first_name
: vanilla equi joinl.first_name = r.first_name and l.surname = r.surname
: vanilla equi joinl.first_name = r.first_name and substr(l.surname,1,2) = substr(r.surname,1,2)
: equi join with function2. Less common but still good practice equi-joins
l.first_name = r.surname
: equi join with different columnssubstr(l.surname,1,2) = substr(r.surname,4,5)
: equi join with different function3. Less common and occasionally good practice, but often misused
l.first_name = r.first_name and levenshtein(l.surname, r.surname) < 2
: equi join onfirst_name
and filter on surname. This creates cartesian product within first_name. It may occasionally be valid if blocking onfirst_name
andsurname
is too tight, and there's no set of blocking rule that can achieve this flexibility.4. Almost never good practice
OR
condition, since this creates a full cartesian join. Instead, multiple blocking rules should be usedPossible solutions
Given these usage patterns, I think the new API is a bit verbose, and perhaps could also be clarified.
Looking at (1), by far the most common 'user need', we have:
Old
l.first_name = r.first_name and l.surname = r.surname
New
The
new
is easier to read in isolution than the current syntax, but quite verbose.But also I think it's quite hard to read once integrated into a settings object where there are multiple rules:
I think it's possible to achieve something that's both more succinct and more readable.
One suggestion would be a function like:
name_tbc(['first_name', 'substr(surname,1,2)'])
which would be the equivalent of:
"l.first_name = r.first_name and substr(l.surname,1,2) = substr(r.surname,1,2)"
The fn name
name_tbc
could be, perhaps,block_on()
.Finally, at the moment I'm not sure it's necessary to cover the more advance use cases (2-4 in the above list). Because these are quite advanced functionality, and I think having an abstraction may be quite verbose for little benefit (and also result in lots of more code to maintain). For advanced users, it may be clearer simply for the user to write the SQL string, partly because I'm not sure a library could adequately cover all possible rules.
I also wonder whether library is really the right way of thinking about this, particularly because I don't think it's really comparable to our comparison libraries and therefore could be a source of confusion. It's also not as necessary to abstract over the multiple backends.
Finally - sorry for not mentioning this sooner - it's only really since using them I've been thinking about this in depth; and also, obviously this is just one opinion!
Beta Was this translation helpful? Give feedback.
All reactions