How does cluster_pairwise_predictions_at_threshold work with df_predict? #1501

mshearer0 · 2023-08-06T16:00:35Z

Hi. I'm trying to figure out how cluster_pairwise_predictions_at_threshold works.

The API docs at https://moj-analytical-services.github.io/splink/linker.html?h=cluster_pairwise_predictions_at_threshold#splink.linker.Linker.cluster_pairwise_predictions_at_threshold state that it:

"Clusters the pairwise match predictions that result from linker.predict() into groups of connected record using the connected components graph clustering algorithm"

The API is given as:

cluster_pairwise_predictions_at_threshold(df_predict, threshold_match_probability=None, pairwise_formatting=False, filter_pairwise_format_for_clusters=True)

This suggests that only entities contained within the df_predict dataframe should appear in the clusters output. However, I have observed that entities that are not in df_predict (because they are not present in any pair scoring above the linker's threshold_match_probability) are also included. Each of these non-paired entities is assigned its own cluster.

Does cluster_pairwise_predictions_at_threshold only attempt to cluster entities that are present within df_predict, and then also assign clusters to all the additional entities that were passed to the original linker but are not found within df_predict? If so, does it repeat the linking process and reapply the prediction blocking rules?

Is it correct to say that the threshold_match_probability passed to cluster_pairwise_predictions_at_threshold is intended to be a secondary threshold, equal to or higher than the original threshold applied when generating df_predict, that controls which entity pairs are included within a cluster?

Hope this makes sense. Thanks in advance for any help,

Michael.

RobinL · 2023-08-07T17:49:16Z

Maintainer

To make the algorithm as general as possible, we assume that users want to assign a cluster ID to every entity in the input dataset, not just those that are linked to another record. So under the hood we supplement df_predict with the original input records, so that every record appears in the output.

Therefore, the result of cluster_pairwise_predictions_at_threshold has the same length (number of records) as the input dataset(s), irrespective of:

your blocking settings,
the threshold_match_probability used on predict,
the threshold_match_probability used on cluster_pairwise_predictions_at_threshold.

Any input record that is not linked to any other record will be assigned a unique cluster ID.

For example, consider the following code:

import pandas as pd
from splink.duckdb.linker import DuckDBLinker
from splink.duckdb.comparison_library import exact_match

# Records 1 and 2 share the same name; record 3 matches neither of them.
data = [
    {"unique_id": 1, "first_name": "John", "surname": "Smith"},
    {"unique_id": 2, "first_name": "John", "surname": "Smith"},
    {"unique_id": 3, "first_name": "Lucy", "surname": "Jones"},
]

settings = {
    "link_type": "dedupe_only",
    # No blocking rules: every pair of records is compared.
    "blocking_rules_to_generate_predictions": [],
    "comparisons": [
        exact_match("first_name"),
        exact_match("surname"),
    ],
}

linker = DuckDBLinker(pd.DataFrame(data), settings)

# Score the candidate pairs (no threshold is applied at this stage).
df_predict = linker.predict()
# Note: DataFrame.to_markdown requires the tabulate package.
print(df_predict.as_pandas_dataframe().to_markdown(index=False))
# Cluster the scored pairs at a match probability threshold of 0.9.
clusters = linker.cluster_pairwise_predictions_at_threshold(df_predict, 0.9)
print(clusters.as_pandas_dataframe().to_markdown(index=False))

Which results in df_predict:

match_weight	match_probability	unique_id_l	unique_id_r	first_name_l	first_name_r	gamma_first_name	surname_l	surname_r	gamma_surname
6.71243	0.990554	1	2	John	John	1	Smith	Smith	1
-23.2876	9.7666e-08	1	3	John	Lucy	0	Smith	Jones	0
-23.2876	9.7666e-08	2	3	John	Lucy	0	Smith	Jones	0

And clusters:

cluster_id	unique_id	first_name	surname
1	1	John	Smith
1	2	John	Smith
3	3	Lucy	Jones
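
As a rough illustration of the behaviour described above (and not Splink's actual implementation), the clustering step can be mimicked in plain Python with a small union-find. The function name cluster_records, the use of the smallest unique_id as the cluster ID, and the >= comparison against the threshold are all assumptions made for this sketch; the pairs and probabilities are copied from the df_predict output above.

```python
def cluster_records(record_ids, scored_pairs, threshold):
    """Toy connected-components clustering (illustrative only, not Splink's code)."""
    # Every input record starts as its own cluster, so records that never
    # appear in a qualifying pair still end up in the output.
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for left, right, probability in scored_pairs:
        # Assumption: a pair becomes an edge when it is at or above the threshold.
        if probability >= threshold:
            root_l, root_r = find(left), find(right)
            if root_l != root_r:
                # Assumption: the smaller ID becomes the cluster ID.
                parent[max(root_l, root_r)] = min(root_l, root_r)

    # Map each record to its cluster ID.
    return {rid: find(rid) for rid in record_ids}


record_ids = [1, 2, 3]
scored_pairs = [(1, 2, 0.990554), (1, 3, 9.7666e-08), (2, 3, 9.7666e-08)]
clusters = cluster_records(record_ids, scored_pairs, threshold=0.9)
print(clusters)  # {1: 1, 2: 1, 3: 3}
```

Record 3 never takes part in an edge above the threshold, yet it still appears in the output with its own cluster ID, and the output has one entry per input record.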

In the example above there is no blocking (every pair in the full Cartesian product is compared), but you get the same clusters if you block on, for example, first_name:

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": ["l.first_name = r.first_name"],
    "comparisons": [
        exact_match("first_name"),
        exact_match("surname"),
    ],
}

linker = DuckDBLinker(pd.DataFrame(data), settings)

df_predict = linker.predict()
print(df_predict.as_pandas_dataframe().to_markdown(index=False))
clusters = linker.cluster_pairwise_predictions_at_threshold(df_predict, 0.9)
print(clusters.as_pandas_dataframe().to_markdown(index=False))

Which results in df_predict:

match_weight	match_probability	unique_id_l	unique_id_r	first_name_l	first_name_r	gamma_first_name	surname_l	surname_r	gamma_surname
6.71243	0.990554	1	2	John	John	1	Smith	Smith	1

And clusters:

cluster_id	unique_id	first_name	surname
1	1	John	Smith
1	2	John	Smith
3	3	Lucy	Jones
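
To make the effect of the blocking rule concrete, here is a minimal plain-Python sketch of candidate pair generation. It only approximates what a blocking rule does and is not how Splink generates pairs internally; the helper candidate_pairs and the lambda standing in for "l.first_name = r.first_name" are hypothetical names introduced for this example.

```python
from itertools import combinations

records = [
    {"unique_id": 1, "first_name": "John", "surname": "Smith"},
    {"unique_id": 2, "first_name": "John", "surname": "Smith"},
    {"unique_id": 3, "first_name": "Lucy", "surname": "Jones"},
]


def candidate_pairs(records, blocking_rule=None):
    """Return (unique_id_l, unique_id_r) pairs, optionally restricted by a rule."""
    pairs = []
    for left, right in combinations(records, 2):
        # With no rule, every pair is kept (the full Cartesian product).
        if blocking_rule is None or blocking_rule(left, right):
            pairs.append((left["unique_id"], right["unique_id"]))
    return pairs


all_pairs = candidate_pairs(records)
blocked_pairs = candidate_pairs(
    records, lambda left, right: left["first_name"] == right["first_name"]
)
print(all_pairs)      # [(1, 2), (1, 3), (2, 3)]
print(blocked_pairs)  # [(1, 2)]

# Records that are never compared under the blocking rule.
compared = {uid for pair in blocked_pairs for uid in pair}
never_compared = [r["unique_id"] for r in records if r["unique_id"] not in compared]
print(never_compared)  # [3]
```

Record 3 is absent from the blocked pairs, which is why it is missing from the second df_predict but still receives its own cluster ID in the clusters output.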

2 replies

mshearer0 Aug 7, 2023
Author

Thanks Robin. That explains my results so far.

So if a threshold_match_probability is specified in the generation of df_predict, then this value isn't used for the purposes of clustering.

I presume linker.predict persists the pairwise match probabilities (for pairs that pass blocking), and it's these values that are picked up by cluster_pairwise_predictions_at_threshold and compared against the newly specified match threshold to determine whether these pairs, as edges (and nodes), should be included in the output clusters. If an entity isn't present in any pair (because it was never compared due to blocking rules), or all the pairs that contain it fall below the new threshold, then it gets assigned its own individual cluster.

Is that right?

RobinL Aug 7, 2023
Maintainer

Yes, that's correct.

If you use threshold_match_probability to produce df_predict, any pairwise comparisons below the threshold will be filtered out of df_predict before being persisted.

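In turn, this means that the threshold_match_probability given to cluster_pairwise_predictions_at_threshold should be equal to or higher than the value given to linker.predict(): pairs scoring below the predict threshold are no longer in df_predict, so a lower clustering threshold cannot bring them back.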

Also, as you say, it's possible that a true match is missed by the blocking rules. For the purpose of clustering, these absent pairs are implicitly treated as if they have match_probability = 0.
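
The interaction between the two thresholds can be sketched in plain Python. This is a simplified model under stated assumptions rather than Splink's code: the helper edges_used_for_clustering is a hypothetical name, the >= comparisons are assumed, and the scores are invented for the illustration.

```python
def edges_used_for_clustering(scored_pairs, predict_threshold, cluster_threshold):
    """Return the pairs that would act as edges after both thresholds apply."""
    # Step 1 (predict): pairs below predict_threshold are dropped, not persisted.
    persisted = [pair for pair in scored_pairs if pair[2] >= predict_threshold]
    # Step 2 (cluster): only persisted pairs can be compared with cluster_threshold.
    return [
        (left, right)
        for left, right, probability in persisted
        if probability >= cluster_threshold
    ]


# Hypothetical scores, invented for this illustration.
scored_pairs = [(1, 2, 0.99), (2, 3, 0.80), (3, 4, 0.60)]

# A clustering threshold below the predict threshold adds no edges...
lower = edges_used_for_clustering(scored_pairs, predict_threshold=0.9, cluster_threshold=0.5)
equal = edges_used_for_clustering(scored_pairs, predict_threshold=0.9, cluster_threshold=0.9)
print(lower == equal)  # True

# ...whereas a higher clustering threshold removes edges found in df_predict.
higher = edges_used_for_clustering(scored_pairs, predict_threshold=0.5, cluster_threshold=0.9)
print(higher)  # [(1, 2)]
```

In this model the effective cut-off is the larger of the two thresholds, and a pair that was never scored (such as records 1 and 4 here) can never become an edge, which is equivalent to treating it as match_probability = 0.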
