Hi, to express a blocking rule requiring that two sizes be within 20% of each other, I use a formula with a division. This is similar to the Linking financial transactions example. The data in the code snippet below is such that every pair formed from the three rows satisfies the rule, yet the rules containing a division generate only 2 comparisons instead of 3.

import pandas as pd
df = pd.DataFrame({
    'unique_id': [347, 1349, 1503],
    'text_size': [11990, 10384, 11020]
})
blocking_rules = [
    # generates 3 comparisons as expected
    "l.text_size > 10000 and r.text_size > 10000",
    # all of the below generate only 2 comparisons
    "l.text_size > 10000 and r.text_size > 10000 and l.text_size / r.text_size >= 0.1 and l.text_size / r.text_size <= 2.0",
    "l.text_size > 10000 and r.text_size > 10000 and l.text_size / r.text_size >= 0.83 and l.text_size / r.text_size <= 1.25",
    "l.text_size > 10000 and r.text_size > 10000 and l.text_size / r.text_size between 0.83 and 1.25",
    "l.text_size > 10000 and r.text_size > 10000 and l.text_size / r.text_size between 1.0/(1.0 + 0.2) and 1.0/(1.0 - 0.2)",
]
from splink.duckdb.duckdb_linker import DuckDBLinker
settings = { "link_type": "dedupe_only" }
linker = DuckDBLinker(df, settings)
for rule in blocking_rules:
    count = linker.count_num_comparisons_from_blocking_rule(rule)
    print(f"{count:,.0f} comparisons generated by '{rule}'")

Kind regards
Replies: 1 comment 1 reply
I believe the issue here is that the column `text_size` is made of integers, and so duckdb is performing integer division: in 2 cases the result truncates to 1, and in the third it truncates to 0, which is why you are getting only 2 comparisons for those blocking rules.

You can either convert your initial column to a floating-point type (e.g. `df["text_size"] = df["text_size"].astype(float)`), which is probably preferable if you aim to perform these kinds of calculations, or use an explicit `cast` in your blocking rules (e.g. changing `l.text_size / r.text_size` to `CAST(l.text_size AS double)/r.text_size`), although in this case you will need to explicitly cast anywhere else you may want to do divisions such a…
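To make the arithmetic concrete, here is a minimal pure-Python sketch that reproduces the counts from the question without running splink or duckdb. It assumes that each pair is compared once with the lower `unique_id` on the left, and it uses Python's `//` operator to stand in for the integer division described above. The helper `count_pairs` is hypothetical and exists only for this illustration.

```python
from itertools import combinations

# Same sizes as in the question, keyed by unique_id.
sizes = {347: 11990, 1349: 10384, 1503: 11020}

def count_pairs(divide, low=0.83, high=1.25):
    """Count pairs (lower unique_id on the left) whose size ratio lies in [low, high]."""
    count = 0
    for l_id, r_id in combinations(sorted(sizes), 2):
        ratio = divide(sizes[l_id], sizes[r_id])
        if low <= ratio <= high:
            count += 1
    return count

# Integer division: 11990 // 10384 == 1, 11990 // 11020 == 1, 10384 // 11020 == 0.
int_count = count_pairs(lambda l, r: l // r)

# Floating-point division keeps the fractional part, so all three ratios fall in range.
float_count = count_pairs(lambda l, r: l / r)

print(int_count, float_count)
```

Under those assumptions the integer version counts 2 pairs and the floating-point version counts 3, which matches the behaviour reported in the question.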