-
To make the algorithm as general as possible, we assume that users want to assign a unique ID to every entity in the input dataset, not just those that are part of a cluster. So, under the hood, we supplement df_predict with the original input records so that every record is recovered. Therefore, the result of cluster_pairwise_predictions_at_threshold includes every input record: any input record that is not linked to any other record is assigned its own unique cluster ID. For example, consider the following code:

import pandas as pd
from splink.duckdb.linker import DuckDBLinker
from splink.duckdb.comparison_library import exact_match
data = [
    {"unique_id": 1, "first_name": "John", "surname": "Smith"},
    {"unique_id": 2, "first_name": "John", "surname": "Smith"},
    {"unique_id": 3, "first_name": "Lucy", "surname": "Jones"},
]
settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [],
    "comparisons": [
        exact_match("first_name"),
        exact_match("surname"),
    ],
}
linker = DuckDBLinker(pd.DataFrame(data), settings)
df_predict = linker.predict()
print(df_predict.as_pandas_dataframe().to_markdown(index=False))
clusters = linker.cluster_pairwise_predictions_at_threshold(df_predict, 0.9)
print(clusters.as_pandas_dataframe().to_markdown(index=False))

Which results in df_predict:
And clusters:

In the above, there is no blocking (the full Cartesian product of records is compared), but you get the same result if you block on, e.g., first_name:

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": ["l.first_name = r.first_name"],
    "comparisons": [
        exact_match("first_name"),
        exact_match("surname"),
    ],
}
linker = DuckDBLinker(pd.DataFrame(data), settings)
df_predict = linker.predict()
print(df_predict.as_pandas_dataframe().to_markdown(index=False))
clusters = linker.cluster_pairwise_predictions_at_threshold(df_predict, 0.9)
print(clusters.as_pandas_dataframe().to_markdown(index=False))
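
To illustrate the behaviour described above, here is a minimal pure-Python sketch of connected-components clustering in which every input record receives a cluster ID. It is not Splink's implementation. The function name cluster_at_threshold, the union-find approach, the match probabilities, and the use of >= for the threshold comparison are all assumptions made for illustration.

```python
def cluster_at_threshold(record_ids, scored_pairs, threshold):
    """Connected components via union-find; every record gets a cluster ID.

    Hypothetical helper for illustration only, not part of Splink.
    """
    # Seed every input record as its own cluster, so records that never
    # appear in a qualifying pair still end up in the output.
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right, prob in scored_pairs:
        # Only pairs at or above the threshold act as edges (assumed >=).
        if prob >= threshold:
            root_l, root_r = find(left), find(right)
            if root_l != root_r:
                # Attach the larger root to the smaller one, so the smallest
                # ID in each component becomes its cluster ID.
                parent[max(root_l, root_r)] = min(root_l, root_r)

    return {rid: find(rid) for rid in record_ids}


record_ids = [1, 2, 3]
# Made-up match probabilities for the three pairs of the example records.
scored_pairs = [(1, 2, 0.99), (1, 3, 0.01), (2, 3, 0.01)]
clusters = cluster_at_threshold(record_ids, scored_pairs, 0.9)
print(clusters)  # {1: 1, 2: 1, 3: 3}
```

Passing only the pair that survives blocking on first_name, [(1, 2, 0.99)], gives the same assignment, because record 3 is seeded as its own cluster whether or not it appears in any pair.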
-
Hi. I'm trying to figure out how cluster_pairwise_predictions_at_threshold works.
The API docs at https://moj-analytical-services.github.io/splink/linker.html?h=cluster_pairwise_predictions_at_threshold#splink.linker.Linker.cluster_pairwise_predictions_at_threshold state that it:
"Clusters the pairwise match predictions that result from linker.predict() into groups of connected record using the connected components graph clustering algorithm"
The API is given as:
cluster_pairwise_predictions_at_threshold(df_predict, threshold_match_probability=None, pairwise_formatting=False, filter_pairwise_format_for_clusters=True)
This suggests that only entities contained within the df_predict dataframe should appear in the clusters output. However, I have observed that entities that are not in df_predict (because they are not present in any pair scoring above the linker's threshold_match_probability) are also included. These non-paired entities are each assigned their own cluster.
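
One way to check this observation is to compare the IDs in the clustering output against the IDs that appear in the pairwise predictions. The sketch below uses small hand-built pandas dataframes as stand-ins for the real outputs. The column names unique_id_l, unique_id_r, match_probability, and cluster_id, as well as the values, are assumptions for illustration and should be adjusted to match the actual dataframes.

```python
import pandas as pd

# Hypothetical stand-in for df_predict.as_pandas_dataframe():
# a single scored pair (records 1 and 2).
pairs = pd.DataFrame(
    {"unique_id_l": [1], "unique_id_r": [2], "match_probability": [0.99]}
)

# Hypothetical stand-in for the clustering output: one row per input record.
clusters = pd.DataFrame({"cluster_id": [1, 1, 3], "unique_id": [1, 2, 3]})

# Every ID that appears on either side of any scored pair.
ids_in_pairs = set(pairs["unique_id_l"]) | set(pairs["unique_id_r"])

# Records present in the clustering output but absent from all pairs.
unpaired = clusters[~clusters["unique_id"].isin(ids_in_pairs)]
unpaired_ids = unpaired["unique_id"].tolist()
print(unpaired_ids)  # [3]
```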
Does cluster_pairwise_predictions_at_threshold only attempt to cluster entities that are present within df_predict, and then also assign clusters to all the additional entities that were passed to the original linker but are not found within df_predict? If so, does it repeat the linking process and reapply the prediction blocking rules?
Is it correct to say that the threshold_match_probability passed to cluster_pairwise_predictions_at_threshold is intended as a secondary threshold (equal to or higher than the original threshold applied when generating df_predict) that controls which entity pairs are included in a cluster?
Hope this makes sense. Thanks in advance for any help,
Michael.