Due to speed and memory constraints we can often turn off `retain_matching_columns` and `retain_intermediate_calculation_columns` to gain a performance bump. However, this impacts a lot of functionality that is useful for QA (waterfall chart, comparison chart, parts of cluster studio, etc.). Some of it can be worked around by re-enabling these flags, or at least could be in Splink 3, but the outputs rightly reflect the fact that the information is missing.

One issue I'm seeing is that I want to quickly obtain predictions with column retention off, and then do a proper QA spot-check on a subset of these predictions (say, those with a `match_weight` between some values).
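For context, the kind of setup I mean is roughly the following (a minimal sketch assuming the Splink 4 API; the input dataframe, comparisons and blocking rules are placeholders):

```python
from splink import DuckDBAPI, Linker, SettingsCreator, block_on
import splink.comparison_library as cl

# Retention flags switched off for speed/memory
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
    ],
    blocking_rules_to_generate_predictions=[block_on("surname")],
    retain_matching_columns=False,
    retain_intermediate_calculation_columns=False,
)

linker = Linker(df, settings, db_api=DuckDBAPI())

# (model training steps omitted)
# With retention off there are no comparison vector / intermediate columns
# in the output to QA against
df_predict = linker.inference.predict()
```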
I don't know if I'm doing this in a sensible way, but currently I'm selecting my subset of edges from `df_predict`, and then iterating over the pairs to obtain the record dictionary for each `unique_id`. I then call `compare_two_records()` on that pair to obtain the comparison vector. My implementation currently goes back and forth between pandas and Splink, but I wonder if there's an easier way to do this in Splink, almost like a `linker.inference.compare_two_records()` that takes in an edge list of `unique_id`s (with source dataset)? My current approach looks roughly like the sketch below.
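(A minimal sketch, assuming a Splink 4 dedupe model; `df_source` is a placeholder for the pandas dataframe the linker was built from, and the `match_weight` band is illustrative.)

```python
import pandas as pd

# Select the QA band of interest from the prediction output
edges = (
    df_predict.as_pandas_dataframe()
    .loc[lambda d: d["match_weight"].between(2, 8)]
)

# Look up the full record dictionary for each unique_id
records = df_source.set_index("unique_id").to_dict(orient="index")

comparison_vectors = []
for _, edge in edges.iterrows():
    rec_l = {"unique_id": edge["unique_id_l"], **records[edge["unique_id_l"]]}
    rec_r = {"unique_id": edge["unique_id_r"], **records[edge["unique_id_r"]]}
    # Re-score this single pair to recover its comparison vector
    cv = linker.inference.compare_two_records(rec_l, rec_r)
    comparison_vectors.append(cv.as_pandas_dataframe())

df_qa = pd.concat(comparison_vectors, ignore_index=True)
```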
Calling `predict()` with column retention off gives us essentially just the core output columns (`unique_id_l`, `unique_id_r`, `match_weight`, `match_probability`). In a link model we also get columns for `source_dataset_l` and `source_dataset_r`, which are also easy to add in for dedupe-only models.

A new method in inference could take in something like a list of (`unique_id_l`, `unique_id_r`) pairs (with source dataset where relevant) and return comparison vectors for only those edges, which can then be used for QA etc.
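For illustration only, the shape of the thing I have in mind (this method does not exist in Splink today; the name and signature are hypothetical):

```python
# Hypothetical -- not an existing Splink method; name/signature just
# illustrate the proposal.
edges = [
    # (unique_id_l, unique_id_r); for a link model these would also carry
    # source_dataset_l / source_dataset_r
    ("id-001", "id-047"),
    ("id-013", "id-112"),
]

# Would re-score only the requested pairs and return their comparison vectors
# (ideally with the gamma/bf intermediate columns), ready for the waterfall
# chart, comparison viewer, etc.
df_cv = linker.inference.compare_records_from_edge_list(edges)
```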