
precision_recall() counts within group connections twice but not between group connections #62

@FloHu

When reading the source code for cytominer_eval.operations.precision_recall(), I noticed that the similarity_melted_df variable contains each replicate pair twice, once per direction, e.g. A1 --> A2 and A2 --> A1.
This becomes a problem because only the first of the two sets of grouping columns, built in lines 49-52, is subsequently used for grouping:

49 replicate_group_cols = [
50    "{x}{suf}".format(x=x, suf=pair_ids[list(pair_ids)[0]]["suffix"])  # [0] keeps only the first of two grouping columns
51    for x in replicate_groups
52 ]
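
To make the effect of the `[0]` index concrete, here is a minimal sketch. The layout of `pair_ids` and the column name `Metadata_group` are assumptions chosen for illustration; they are not copied from the library.

```python
# Hypothetical stand-in for pair_ids: one entry per side of a sample pair.
pair_ids = {
    "pair_a": {"index": "pair_a_index", "suffix": "_pair_a"},
    "pair_b": {"index": "pair_b_index", "suffix": "_pair_b"},
}
# Hypothetical grouping column.
replicate_groups = ["Metadata_group"]

# Same expression as in lines 49-52: only the first suffix is ever used.
replicate_group_cols = [
    "{x}{suf}".format(x=x, suf=pair_ids[list(pair_ids)[0]]["suffix"])
    for x in replicate_groups
]

print(replicate_group_cols)  # the "_pair_b" column never appears here
```

Under these assumptions, the grouping only ever looks at the group of the first sample in each pair.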

In the next step, each group is passed to calculate_precision_recall():

59 precision_recall_df_at_k = similarity_melted_df.groupby(
60     replicate_group_cols
61 ).apply(lambda x: calculate_precision_recall(x, k=k_))
62 precision_recall_df = precision_recall_df.append(precision_recall_df_at_k)

The effect is that every connection within a group is counted twice. Connections to samples outside the group, however, are counted only once, because groupby on the first grouping column filters out one of the two directions.

Let me clarify this with an example. Consider 5 samples, the first 3 from group 'A' and the other 2 from group 'B', with within-group correlations greater than between-group correlations:

image

Then what calculate_precision_recall will see is this:
image

For example, the sample_pair_a column has one row for A1-->A2 and one for A2-->A1, but only one row for A1-->B1. B1-->A1 is missing because of the way the melted data frame is generated and the grouping is performed. As a result, the similarity values for within-group connections appear in duplicate.

Accordingly the outcome for precision and recall at k=4 is the following:
image
Precision: all 4 closest connections are within-group for A (4/4), but only 2 of the 4 are for B (2/4).
Recall: 4 of the 6 within-group connections are found for A (4/6), but both are found for B (2/2).
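
The numbers above can be reproduced with a small plain-Python sketch that mimics the grouping without pandas. The similarity values (0.9 within a group, 0.1 between groups) are assumed for illustration, and `precision_recall_at_k` is a hypothetical helper, not the library function.

```python
from itertools import permutations

# Toy data from the example: three samples in group A, two in group B.
groups = {"A1": "A", "A2": "A", "A3": "A", "B1": "B", "B2": "B"}


def similarity(a, b):
    # Assumed values: within-group pairs score higher than between-group pairs.
    return 0.9 if groups[a] == groups[b] else 0.1


# Both directions of every pair, as in the melted data frame described above.
melted = [(a, b, similarity(a, b)) for a, b in permutations(groups, 2)]


def precision_recall_at_k(group, k):
    # Mimic grouping on the first pair column only.
    rows = [r for r in melted if groups[r[0]] == group]
    # Highest similarity first (the sort is stable).
    rows.sort(key=lambda r: r[2], reverse=True)
    is_within = [groups[b] == group for _, b, _ in rows]
    hits = sum(is_within[:k])
    # Precision divides by k, recall by the number of within-group rows.
    return hits / k, hits / sum(is_within)


result = {g: precision_recall_at_k(g, k=4) for g in ("A", "B")}
print(result)
```

Group A sees 6 within-group rows for only 3 distinct pairs, which is why its recall at k=4 is capped at 4/6 even though every retrieved connection is correct.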

In summary, the computed precision and recall values are not entirely correct, and the distortion is largest for small groups. Note also that with odd values of k, the cutoff can fall in the middle of a symmetric pair, so that only one of its two directions is counted.
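
The odd-k case can be illustrated with a short sketch for group A alone. The three similarity values are assumed and only serve to give the pairs a clear ranking.

```python
from collections import Counter
from itertools import permutations

# Assumed symmetric similarities for the three within-group pairs of A.
sim = {
    frozenset(("A1", "A2")): 0.9,
    frozenset(("A1", "A3")): 0.8,
    frozenset(("A2", "A3")): 0.7,
}

# Both directions of each pair, sorted by similarity (highest first).
rows = [(a, b, s) for pair, s in sim.items() for a, b in permutations(sorted(pair), 2)]
rows.sort(key=lambda r: r[2], reverse=True)

# With k=3 the cutoff falls between the two directions of the second pair.
top_k = rows[:3]
directions_kept = Counter(frozenset((a, b)) for a, b, _ in top_k)
print(top_k)
print(directions_kept)
```

Here the best pair contributes both of its directions to the top 3, while the second-best pair contributes only one.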

Admittedly, this is a bit mind-boggling. I recommend using a debugger if you want to trace all the steps in detail by yourself.

Proposed solution: I suggest counting each pair only once when creating the melted data frame.
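
A rough sketch of what "each pair only once" could mean, using plain Python and the sample names from the example. The helper `within_group_count` is hypothetical. This only illustrates the deduplication step; how between-group pairs should then be attributed to groups is left open.

```python
from itertools import permutations

samples = ["A1", "A2", "A3", "B1", "B2"]

# Behaviour described in this issue: both directions of each pair.
ordered_pairs = list(permutations(samples, 2))

# Possible alternative: collapse each pair to a direction-independent key.
unique_pairs = sorted({tuple(sorted(pair)) for pair in ordered_pairs})


def within_group_count(pairs, group):
    # The group is the first character of the sample name in this toy example.
    return sum(1 for a, b in pairs if a[0] == group and b[0] == group)


print(len(ordered_pairs), len(unique_pairs))
print(within_group_count(ordered_pairs, "A"), within_group_count(unique_pairs, "A"))
print(within_group_count(ordered_pairs, "B"), within_group_count(unique_pairs, "B"))
```

After deduplication, group A has 3 within-group pairs instead of 6 rows and group B has 1 instead of 2, so the within-group similarities no longer appear in duplicate.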
