Ensuring Consistent cluster_id Assignments Across Multiple Linkage Runs in Splink #2355
Ahosseinzadeh723
started this conversation in
Ideas
Not sure if you ever found your answer, but I stumbled across your question while looking for the answer myself, before finding this other post. Robin indicates that the cluster ID selected will be the minimum cluster ID. In your question about merged clusters, you are correct that it would be the lowest ID in that cluster grouping. As for strategies to manage this, I don't have any suggestions. Personally, I am working on accepting that cluster IDs can't be treated as stable and should be expected to change. That may not work for your use case, though.
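To make the "lowest ID wins" behaviour concrete, here is a toy sketch in plain Python. It is not Splink's implementation. It only assumes that a cluster is a connected component of matched pairs and that the component is labelled with its smallest record ID, as described above. The function name `cluster_min_id` and the sample IDs are made up for illustration.

```python
def cluster_min_id(record_ids, edges):
    """Label each connected component with its smallest record ID (toy union-find)."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Always attach the larger root under the smaller one,
            # so every root is the minimum ID of its component.
            low, high = min(root_a, root_b), max(root_a, root_b)
            parent[high] = low

    return {r: find(r) for r in record_ids}


# Run 1: two separate clusters, {1, 2} and {3, 4}.
run1 = cluster_min_id([1, 2, 3, 4], [(1, 2), (3, 4)])

# Run 2: new record 5 matches both 2 and 4, bridging the two clusters.
run2 = cluster_min_id([1, 2, 3, 4, 5], [(1, 2), (3, 4), (2, 5), (5, 4)])

print(run1)  # {1: 1, 2: 1, 3: 3, 4: 3}
print(run2)  # {1: 1, 2: 1, 3: 1, 4: 1, 5: 1}
```

In the second run, records 3 and 4 move from cluster 3 to cluster 1, which is exactly the kind of change that makes the raw cluster ID unreliable as a long-lived key.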
-
I'm currently using Splink for record linkage and I'm looking for best practices to ensure that cluster_ids remain consistent across multiple linkage runs. Specifically, I want to make sure that once a cluster_id is assigned to an individual or record, it will stay the same in future runs, even as new data is introduced.
My context:
I have a set of enrollment_ids that I need to link.
I’ve already performed some linkage runs, and cluster_ids have been assigned.
I always use the dedupe_only link type in Splink, not link_and_dedupe or link_only.
Now, I’m planning to run the linkage process again with new data, but I want to maintain the consistency of cluster_ids for the existing records.
Additional consideration:
I understand that for most groups of matches, the cluster_id remains the same (likely the lowest one). However, in rare cases where two different clusters merge into one due to the introduction of a new record, what happens to the cluster_id? Will the new combined cluster receive the lowest cluster_id among the unique IDs involved?
What is the best practice for handling updates in such cases to ensure consistency and accuracy?
My question:
What is the recommended approach or best practice in Splink to ensure that cluster_ids for previously linked records remain consistent across multiple runs?
How should I structure my process or data to prevent reassigning new cluster_ids to existing records?
Are there any built-in features or common strategies within Splink or similar tools to address this issue, particularly in scenarios where clusters merge?
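For illustration, one strategy that sits outside Splink is to keep a separate crosswalk from record ID to a persistent ID and reconcile it after every run. The sketch below is an assumption about how that could look, not a Splink feature. `assign_stable_ids`, the `previous` and `current` dictionaries, and the "keep the lowest surviving ID on merge" policy are all hypothetical choices.

```python
from collections import defaultdict
from itertools import count


def assign_stable_ids(previous, current, next_id):
    """Carry persistent IDs forward from an earlier run (hypothetical helper).

    previous: record_id -> persistent ID stored after the last run
    current:  record_id -> cluster_id produced by the new run
    next_id:  first unused persistent ID
    Returns (record_id -> persistent ID, list of (kept_id, retired_ids)).
    """
    members = defaultdict(list)
    for record_id, cluster_id in current.items():
        members[cluster_id].append(record_id)

    fresh_ids = count(next_id)
    stable, merges, claimed = {}, [], set()

    for cluster_id in sorted(members):
        # Persistent IDs already held by members, minus any used by another cluster.
        old = sorted({previous[r] for r in members[cluster_id] if r in previous} - claimed)
        if old:
            keep = old[0]  # policy assumption: the lowest old ID survives a merge
            if len(old) > 1:
                merges.append((keep, old[1:]))  # log retired IDs for auditing
            claimed.update(old)
        else:
            keep = next(fresh_ids)  # brand-new cluster, or the second half of a split
            claimed.add(keep)
        for record_id in members[cluster_id]:
            stable[record_id] = keep

    return stable, merges


# Earlier run: records a and b shared persistent ID 100, c had 200.
previous = {"a": 100, "b": 100, "c": 200}
# New run: record d bridges the two old clusters; e is a new singleton.
current = {"a": "a", "b": "a", "c": "a", "d": "a", "e": "e"}

stable, merges = assign_stable_ids(previous, current, next_id=300)
print(stable)  # {'a': 100, 'b': 100, 'c': 100, 'd': 100, 'e': 300}
print(merges)  # [(100, [200])]
```

The merge log is the important part: downstream systems that stored ID 200 can be redirected to 100 rather than silently losing the link. Splits (one old cluster becoming two) are handled here by letting the first cluster keep the old ID and minting a new one for the other, which is only one of several reasonable policies.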