Ensuring Consistent cluster_id Assignments Across Multiple Linkage Runs in Splink #2355

Ahosseinzadeh723 · 2024-08-26T16:47:38Z

I'm currently using Splink for record linkage and I'm looking for best practices to ensure that cluster_ids remain consistent across multiple linkage runs. Specifically, I want to make sure that once a cluster_id is assigned to an individual or record, it will stay the same in future runs, even as new data is introduced.

My context:

I have a set of enrollment_ids that I need to link.
I’ve already performed some linkage runs, and cluster_ids have been assigned.
I always use Splink's dedupe_only link type, not link_and_dedupe or link_only.
Now, I’m planning to run the linkage process again with new data, but I want to maintain the consistency of cluster_ids for the existing records.

Additional consideration:

I understand that for most groups of matches, the cluster_id remains the same (likely the lowest one). However, in rare cases where two different clusters merge into one due to the introduction of a new record, what happens to the cluster_id? Will the new combined cluster receive the lowest cluster_id among the unique IDs involved?
What is the best practice for handling updates in such cases to ensure consistency and accuracy?

My question:

What is the recommended approach or best practice in Splink to ensure that cluster_ids for previously linked records remain consistent across multiple runs?
How should I structure my process or data to prevent reassigning new cluster_ids to existing records?
Are there any built-in features or common strategies within Splink or similar tools to address this issue, particularly in scenarios where clusters merge?

w-logan-downing · 2024-12-06T19:58:09Z

Not sure if you ever found your answer, but I stumbled across your question while looking for the answer myself, before finding this other post. In it, Robin indicates that the cluster ID selected will be the minimum cluster ID. On your question about merged clusters, you are correct that the combined cluster would get the lowest ID in that cluster grouping. As for strategies to manage this, I don't have any suggestions. Personally, I am working on accepting that cluster IDs can't be treated as stable and should be expected to change. This may not work for your use case, though.
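
For anyone who does need stable IDs, one possible workaround is to keep your own persistent ID outside Splink and re-map each new run onto it. Below is a minimal sketch in plain Python, not a Splink feature. The function name carry_forward_cluster_ids and the two input dicts are hypothetical, and you would build them yourself from the clustering output of the previous and current runs. It assumes persistent IDs are record IDs, and that clusters only grow or merge between runs. If an old cluster splits, both pieces would inherit the same ID, so that case needs separate handling.

```python
from collections import defaultdict


def carry_forward_cluster_ids(previous, current):
    """Map a new run's cluster labels onto persistent IDs (hypothetical helper).

    previous: dict of record ID -> persistent cluster ID from earlier runs
    current:  dict of record ID -> cluster label produced by the new run
    Returns (assignments, merges), where merges maps each retired
    persistent ID to the ID that absorbed it.
    """
    # Group the new run's records by the label the new run gave them.
    members = defaultdict(list)
    for record_id, label in current.items():
        members[label].append(record_id)

    assignments = {}
    merges = {}
    for record_ids in members.values():
        # Persistent IDs already held by records in this cluster.
        inherited = {previous[r] for r in record_ids if r in previous}
        if inherited:
            # Keep the lowest existing ID; log any others as merged into it.
            persistent = min(inherited)
            for retired in inherited - {persistent}:
                merges[retired] = persistent
        else:
            # Cluster made only of new records: use its lowest record ID.
            persistent = min(record_ids)
        for r in record_ids:
            assignments[r] = persistent
    return assignments, merges


# Previous run: two separate clusters, identified by "e1" and "e3".
previous = {"e1": "e1", "e2": "e1", "e3": "e3", "e4": "e3"}
# New run: record "e5" links both old clusters; "e6" is unrelated.
current = {"e1": "a", "e2": "a", "e3": "a", "e4": "a", "e5": "a", "e6": "b"}

assignments, merges = carry_forward_cluster_ids(previous, current)
print(assignments)
print(merges)
```

The merges dict is the useful part for downstream consumers: it records which old ID was retired and which ID replaced it, so anything that stored the old ID can be redirected rather than silently orphaned.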
