Skip to content

Issues with groupby columns  #67

@michaelbornholdt

Description

@michaelbornholdt

Soo,

For prec_recall and Hitk we know have the case that the input groupby columns determines by which columns the similarity df is sorted. This has a important impact on your solution. If you for example sort by something that is not unique, ie not unique in the input df - then you will get internal connections in the sub dataframe that you are grouping.

Lets say you have for example a df with Sampels and different dosages. If you then have groupby_columns = Metadata_broad_sample, then you will sort into sub groups that have several connections within each other (all the different doses). And your precision will have the weird effects that @FloHu described in #62 for example. Similarly, hitk will have weird results because you are now looking at internal connections and not only the nearest neighbors of one sample.

Either we keep it all this way and make users aware of this or we find some workaround here? Maybe the solution is to not allow anything other than unique groupby_cols ?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions