Description
When reading the source code for cytominer_eval.operations.precision_recall(), I noticed that the similarity_melted_df variable counts each replicate pair twice, e.g. A1 --> A2 and A2 --> A1.
This becomes a problem because only the first of the two grouping columns, built as replicate_group_cols in lines 49-52, is subsequently used for grouping:
49 replicate_group_cols = [
50 "{x}{suf}".format(x=x, suf=pair_ids[list(pair_ids)[0]]["suffix"]) # [0] keeps only the first of two grouping columns
51 for x in replicate_groups
52 ]
In the next step, each group is passed to calculate_precision_recall():
59 precision_recall_df_at_k = similarity_melted_df.groupby(
60 replicate_group_cols
61 ).apply(lambda x: calculate_precision_recall(x, k=k_))
62 precision_recall_df = precision_recall_df.append(precision_recall_df_at_k)
The effect is that every connection between two samples within a group is counted twice, whereas connections to samples outside the group are counted only once, because groupby filters out one direction.
Let me clarify this with an example. Consider 5 samples, the first 3 from group 'A' and the last 2 from group 'B', both with greater within-group than between-group correlations:
Then what calculate_precision_recall will see is this:
For example, one can see that the sample_pair_a column has a row for A1-->A2 and one for A2-->A1, but only one for A1-->B1. B1-->A1 is missing because of the way the melted data frame is generated and the grouping is performed. One can also see that the similarity metrics for within-group connections appear in duplicate.
Accordingly, the outcome for precision and recall at k=4 is the following:
Precision: all 4 closest connections are from within the group for A (4/4), but only 2 of the 4 for group B (2/4).
Recall: 4 of the 6 within-group connections are found for A (4/6), but both are found for B (2/2).
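The numbers above can be reproduced with a small stand-alone sketch. It does not call cytominer_eval; the column names (pair_a, group_a, ...), the similarity values 0.9 and 0.1, and the helper precision_recall_at_k are assumptions made for illustration only, chosen to mimic a melted data frame that contains both directions of every pair and is grouped on the first sample's group.

```python
import itertools
import pandas as pd

# Hypothetical toy data: 3 samples in group A, 2 samples in group B.
groups = {"A1": "A", "A2": "A", "A3": "A", "B1": "B", "B2": "B"}

# Melt in both directions (A1 --> A2 and A2 --> A1), excluding self-pairs.
# Assumed similarities: 0.9 within a group, 0.1 between groups.
rows = [
    (a, b, groups[a], groups[b], 0.9 if groups[a] == groups[b] else 0.1)
    for a, b in itertools.permutations(groups, 2)
]
melted = pd.DataFrame(
    rows, columns=["pair_a", "pair_b", "group_a", "group_b", "similarity"]
)
melted["is_replicate"] = melted["group_a"] == melted["group_b"]


def precision_recall_at_k(df, k):
    """Illustrative stand-in: rank by similarity and score the top k rows."""
    top_k = df.sort_values("similarity", ascending=False).head(k)
    hits = int(top_k["is_replicate"].sum())
    return hits / k, hits / int(df["is_replicate"].sum())


# Grouping on the first sample's group only: within-group pairs land in the
# group twice, between-group pairs only once.
result = {
    name: precision_recall_at_k(sub, k=4)
    for name, sub in melted.groupby("group_a")
}
print(result)  # A: precision 4/4, recall 4/6; B: precision 2/4, recall 2/2
```

Group A contains 6 within-group rows for only 3 distinct pairs, which is why its recall at k=4 is capped at 4/6 even though every top-ranked row is a replicate.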
In summary, the computations are not entirely correct, especially for smaller groups. Also consider that with odd values of k, only one of the two connections of a symmetric pair is used.
Admittedly, this is a bit mind-boggling. I recommend using a debugger if you want to trace all the steps in detail by yourself.
Proposed solution: I suggest counting each pair only once when creating the melted data frame.
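As a rough sketch of what counting each pair once could look like, the snippet below drops the reversed copy of every pair within each group. This is only one possible reading of the suggestion, not how cytominer_eval would necessarily implement it: the de-duplication here happens per group after grouping rather than at melt time, and the column names and the helper drop_reverse_pairs are hypothetical.

```python
import itertools
import pandas as pd

# Hypothetical toy data: 3 samples in group A, 2 samples in group B.
groups = {"A1": "A", "A2": "A", "A3": "A", "B1": "B", "B2": "B"}
melted = pd.DataFrame(
    [(a, b, groups[a], groups[b]) for a, b in itertools.permutations(groups, 2)],
    columns=["pair_a", "pair_b", "group_a", "group_b"],
)


def drop_reverse_pairs(df):
    """Keep one row per unordered pair, e.g. only one of A1-A2 and A2-A1."""
    keys = pd.Series(
        [tuple(sorted(pair)) for pair in zip(df["pair_a"], df["pair_b"])],
        index=df.index,
    )
    return df.loc[~keys.duplicated()]


def within_between_counts(df):
    """Return (within-group rows, between-group rows) for one group."""
    within = int((df["group_a"] == df["group_b"]).sum())
    return within, len(df) - within


before = {
    name: within_between_counts(sub) for name, sub in melted.groupby("group_a")
}
after = {
    name: within_between_counts(drop_reverse_pairs(sub))
    for name, sub in melted.groupby("group_a")
}
print(before)  # {'A': (6, 6), 'B': (2, 6)}
print(after)   # {'A': (3, 6), 'B': (1, 6)}
```

After de-duplication, within-group and between-group connections are each represented once per group, so neither kind is over-weighted when ranking the top k.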