Due to speed and memory constraints we can often turn off `retain_matching_columns` and `retain_intermediate_calculation_columns` to gain a performance bump. However, this impacts a lot of functionality that is useful for QA (waterfall chart, comparison chart, parts of cluster studio, etc.). Some of it can be worked around by re-enabling these flags, or at least could be in Splink 3, but the outputs rightly reflect the fact that the information is missing.

One issue I'm seeing is that I want to quickly obtain predictions with column retention off, and then do a proper QA spot-check on a subset of these predictions (say, those with a `match_weight` between some values).
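For context, the kind of setup I mean is roughly the following (a minimal sketch assuming the Splink 4 API; the input dataframe, comparisons and blocking rules are placeholders):

```python
from splink import DuckDBAPI, Linker, SettingsCreator, block_on
import splink.comparison_library as cl

# Retention flags switched off for speed/memory
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
    ],
    blocking_rules_to_generate_predictions=[block_on("surname")],
    retain_matching_columns=False,
    retain_intermediate_calculation_columns=False,
)

linker = Linker(df, settings, db_api=DuckDBAPI())

# (model training steps omitted)
# With retention off there are no comparison vector / intermediate columns
# in the output to QA against
df_predict = linker.inference.predict()
```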
I don't know if I'm doing this in a sensible way, but currently I'm selecting my subset of edges from `df_predict`, and then iterating over the pairs to obtain the record dictionary for each `unique_id`. I then call `compare_two_records()` on that pair to obtain the comparison vector. My implementation currently goes back and forth between pandas and Splink, but I wonder if there's an easier way to do this in Splink, almost like a `linker.inference.compare_two_records()` that takes in an edge list of `unique_id`s (with source dataset)? My current approach looks roughly like the sketch below.
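(A minimal sketch, assuming a Splink 4 dedupe model; `df_source` is a placeholder for the pandas dataframe the linker was built from, and the `match_weight` band is illustrative.)

```python
import pandas as pd

# Select the QA band of interest from the prediction output
edges = (
    df_predict.as_pandas_dataframe()
    .loc[lambda d: d["match_weight"].between(2, 8)]
)

# Look up the full record dictionary for each unique_id
records = df_source.set_index("unique_id").to_dict(orient="index")

comparison_vectors = []
for _, edge in edges.iterrows():
    rec_l = {"unique_id": edge["unique_id_l"], **records[edge["unique_id_l"]]}
    rec_r = {"unique_id": edge["unique_id_r"], **records[edge["unique_id_r"]]}
    # Re-score this single pair to recover its comparison vector
    cv = linker.inference.compare_two_records(rec_l, rec_r)
    comparison_vectors.append(cv.as_pandas_dataframe())

df_qa = pd.concat(comparison_vectors, ignore_index=True)
```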
Calling `predict()` with column retention off gives us essentially just the core output columns (`unique_id_l`, `unique_id_r`, `match_weight`, `match_probability`). In a link model we also get columns for `source_dataset_l` and `source_dataset_r`, which are also easy to add in for dedupe-only models.

A new method in inference could take in something like a list of (`unique_id_l`, `unique_id_r`) pairs (with source dataset where relevant) and return comparison vectors for only those edges, which can then be used for QA etc.
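For illustration only, the shape of the thing I have in mind (this method does not exist in Splink today; the name and signature are hypothetical):

```python
# Hypothetical -- not an existing Splink method; name/signature just
# illustrate the proposal.
edges = [
    # (unique_id_l, unique_id_r); for a link model these would also carry
    # source_dataset_l / source_dataset_r
    ("id-001", "id-047"),
    ("id-013", "id-112"),
]

# Would re-score only the requested pairs and return their comparison vectors
# (ideally with the gamma/bf intermediate columns), ready for the waterfall
# chart, comparison viewer, etc.
df_cv = linker.inference.compare_records_from_edge_list(edges)
```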