Evaluation metrics for representations #5
We are using both at the moment. We don't plan to drop any metrics for now, but rather to expand them. We have found enrichment analysis to be limited for capturing and interpreting profiling quality with ground-truth connections. I'm happy to have in-depth conversations about it; I'm interested in your feedback and would like to share our observations and results.
Great idea to keep both evaluation metrics! I captured the definition of enrichment score below, for our notes. Once you are set, can you write down exactly how you propose to use PR AUC or Precision@k in evaluating the dataset? Also, LMK if you disagree with the highlighted part in the second paragraph below; it's possible you are setting up the averaging differently.

Enrichment score (from https://www.nature.com/articles/s41467-019-10154-8#Sec4):
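Without reproducing the paper's exact formulation, enrichment analyses of this kind are often computed as an odds ratio (e.g. via Fisher's exact test) comparing same-class pairs among the top-percentile most similar profile pairs against the remaining pairs. A rough sketch under that assumption, with hypothetical inputs `profiles` (an n-by-d array) and `labels` (a list of label sets per perturbation):

```python
# Hedged sketch: enrichment as the odds ratio of "connected" (top-percentile
# similarity) pairs sharing a class label vs. the remaining pairs.
# `profiles` and `labels` are hypothetical inputs; the paper's exact
# definition may differ.
import numpy as np
from scipy.stats import fisher_exact

def enrichment_score(profiles, labels, top_percentile=99):
    sim = np.corrcoef(profiles)                   # pairwise Pearson similarity of profiles
    iu = np.triu_indices(len(profiles), k=1)      # each unordered pair once
    pair_sim = sim[iu]
    same = np.array([len(labels[i] & labels[j]) > 0 for i, j in zip(*iu)])
    connected = pair_sim >= np.percentile(pair_sim, top_percentile)
    table = [[np.sum(connected & same),  np.sum(connected & ~same)],
             [np.sum(~connected & same), np.sum(~connected & ~same)]]
    odds_ratio, _p_value = fisher_exact(table)
    return odds_ratio
```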
The limitations cited in the paper make sense under a 1-NN classification approach (which is the one adopted in Ljosa 2013). In fact, enrichment analysis and 1-NN classification are two extremes of performance evaluation:
What we are exploring in our experiments is an intermediate approach: a ranked list of top connections per class. This is very common in information retrieval problems (e.g. the results page of a Google query), and there are many metrics that can be used to assess relevance, including, but not limited to, precision, recall, F1, and even enrichment analysis. I'll post results here when we have something ready to share!
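For contrast with the ranked-list approach, here is a minimal sketch of the 1-NN extreme (in the spirit of, but not necessarily identical to, the Ljosa 2013 protocol). `profiles` and `labels` are hypothetical inputs, with `labels` a list of label sets per sample:

```python
# Hedged sketch of the 1-NN extreme: only the single nearest neighbor of each
# profile counts. Not the exact Ljosa 2013 protocol, just the general idea.
import numpy as np

def nn_accuracy(profiles, labels):
    normed = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
    sim = normed @ normed.T                 # cosine similarity between profiles
    np.fill_diagonal(sim, -np.inf)          # never match a sample to itself
    nearest = np.argmax(sim, axis=1)
    hits = [len(labels[i] & labels[j]) > 0 for i, j in enumerate(nearest)]
    return float(np.mean(hits))
```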
Thanks @jccaicedo. I've made some notes below for us to discuss later. Looking forward to the results!
Updated 4/1/21 after the discussion with @jccaicedo below: #5 (comment)

We have a weighted graph where the vertices are perturbations with multiple labels (e.g. pathways in the case of genetic perturbations), and edges are the similarity between the vertices (e.g. the cosine similarity between image-based profiles of two CRISPR knockouts).

There are three levels of ranked lists of edges, each of which can produce global metrics (based on binary classification metrics like precision, recall, F1, etc.). These global metrics can be used to compare representations. In all three cases, we pose it as a binary classification problem on the edges.
The three levels of ranked lists of edges, along with the metrics they induce, are:

0. Global: a single ranked list of all edges, with no grouping by class or sample.
1. Class-specific: one ranked list of edges per label.
2. Sample-specific: one ranked list of connections per sample (perturbation).

(Not all the metrics these induce are useful, and some may be very similar to others; I have highlighted the ones I think are useful.)

Note that Rohban does type 0.a, with the global metric being enrichment score. I think this loosely relates to averaging types discussed here.

Update: Juan et al. are doing 2.f, with Precision@K and PR AUC as the sample-specific metrics.
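To make the edge-level framing concrete, here is a rough sketch of the global (level 0) view as I read it, not the repo's actual code: rank every edge by similarity, mark an edge positive if its two perturbations share at least one label, and summarize the ranking with a single score such as the area under the precision-recall curve. `profiles` and `labels` are hypothetical inputs.

```python
# Hedged sketch of a global (level-0-style) evaluation: one ranked list of all
# edges, positives = pairs sharing at least one label, summarized as a
# precision-recall score.
import numpy as np
from sklearn.metrics import average_precision_score

def global_edge_score(profiles, labels):
    normed = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
    sim = normed @ normed.T
    iu = np.triu_indices(len(profiles), k=1)       # each edge once
    edge_scores = sim[iu]
    edge_truth = [len(labels[i] & labels[j]) > 0 for i, j in zip(*iu)]
    # average_precision_score summarizes the precision-recall curve of this ranking
    return average_precision_score(edge_truth, edge_scores)
```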
I think Rohban does 0: global metric, not class-specific (if I understand your classification correctly, but maybe it's just a change in indexing :P). We have discussed and developed 2 (sample-specific) in two different flavors:
A is measured with precision@K, and B is measured with an interpolated Precision-Recall curve. We obtained results applying these to the TA-ORF dataset, using this implementation. We will not do 1 in your list (class-specific evaluation) for now. Pathway and MOA annotations are not multi-class (1 out of N classes) but multi-label (K out of N classes), which can make the connectivity and results tricky to interpret. Happy to discuss this choice further if there is interest in such a measure.
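For reference, one standard way to build an interpolated precision-recall curve for a single sample's ranked list of connections is 11-point interpolation; the implementation linked above may differ in details. In this sketch, `relevant` is a hypothetical boolean array marking, in ranked order, which retrieved connections share a label with the query.

```python
# Hedged sketch: 11-point interpolated precision-recall curve for one sample's
# ranked list. `relevant` is a hypothetical boolean array in ranked order.
import numpy as np

def interpolated_pr_curve(relevant, recall_points=11):
    relevant = np.asarray(relevant, dtype=bool)
    hits = np.cumsum(relevant)
    precision = hits / np.arange(1, len(relevant) + 1)
    recall = hits / max(relevant.sum(), 1)
    grid = np.linspace(0.0, 1.0, recall_points)
    # interpolated precision at recall r: best precision achieved at recall >= r
    interp = np.array([precision[recall >= r].max() if np.any(recall >= r) else 0.0
                       for r in grid])
    return grid, interp
```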
@jccaicedo So exciting to see the new TA ORF results, with trained features being so much better than pre-trained! (And actually, it's even more exciting that both neural features are consistently better than CellProfiler, although I realize you've already moved beyond that :D.) The Precision@K (with low K) is probably the most relevant, so it's really great to see that the gap is very high there.

I assume the plot below is what you're referring to as Precision@K? It says Average Precision, so I was a bit confused. Maybe you meant the average of Precision@K across all samples? To make sure I understand, can you explain, for example, the meaning of the point at X=10 on the green curve (with Y ≈ 0.47)? My interpretation is:
Is this correct? I have updated #5 (comment) with some notes. (Please forgive the tiresome categorization; I'm a bit too much into the weeds right now :D)
You are indeed right; I was off by one :D. Rohban computes a type 0.a metric, with the global metric being enrichment score. It sounds like you are doing type 2.f, with the sample-specific metric being Precision@K (in the example I cite above).
You are right about multi-label; I've updated #5 (comment) to reflect that this is a multi-label problem. For comparing representations, which is the main goal right now, I think your single global metric (e.g. average of Precision@K; a type 2.f, if I've got that right) is perfectly sound. One can debate which one is best (average of Precision@K, mAP, or something else), but it's fine, and even preferable, to report multiple. But, going beyond comparing representations, I think it would be useful to have label-specific metrics, most likely type 1.a or 2.b, because we'd love to know which MOAs or pathways are best captured using a profiling method. Happy to discuss this further now, or later if you prefer :)
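Since mAP comes up as an alternative summary, here is a minimal sketch of one way to compute it in this setting: the mean, over samples, of each sample's average precision, counting a connection as positive if the two samples share at least one label. `sim` (a precomputed similarity matrix) and `label_sets` are hypothetical inputs; this is not necessarily how the repo computes it.

```python
# Hedged sketch of mAP as one candidate global metric: mean over samples of
# each sample's average precision against its ranked list of connections.
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(sim, label_sets):
    n = len(label_sets)
    per_sample_ap = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        y_true = [len(label_sets[i] & label_sets[j]) > 0 for j in others]
        y_score = [sim[i, j] for j in others]
        if any(y_true):                     # AP is undefined when a sample has no positives
            per_sample_ap.append(average_precision_score(y_true, y_score))
    return float(np.mean(per_sample_ap))
```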
Your interpretation is correct, @shntnu! It is the average of Precision@K across all samples.
X=10 means that we look at the top 10 connections for each sample, and on average, Y=0.47 indicates that approximately 47% of them are biologically meaningful (have at least one class label in common). Agreed that label-specific metrics would be useful. I think that in the context of image-based profiling applications, the metrics that make the most sense are sample-specific. The reason is that we usually make a query and expect to retrieve a list of candidates with as high a hit rate as possible. Statistics based on samples are more biologically interpretable, and therefore the metrics in category 2 are more compelling to me. So if performance per label is of interest, I would recommend exploring 2.b. I'll have a look and will report results on TA-ORF when ready.
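For our notes, a minimal sketch of the average Precision@K computation as described here (not necessarily the exact code used for the TA-ORF results): for each sample, take its top-K most similar other samples, count the fraction that share at least one label with it, and average over samples. `profiles` and `label_sets` are hypothetical inputs.

```python
# Hedged sketch: for each sample, the fraction of its top-K connections sharing
# at least one label, averaged over samples. At K=10, a value of ~0.47 would
# mean ~47% of the top-10 connections per sample are hits on average.
import numpy as np

def mean_precision_at_k(profiles, label_sets, k=10):
    normed = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
    sim = normed @ normed.T                    # cosine similarity between profiles
    np.fill_diagonal(sim, -np.inf)             # exclude self-connections
    per_sample = []
    for i in range(len(label_sets)):
        top_k = np.argsort(-sim[i])[:k]        # indices of the K most similar samples
        hits = [len(label_sets[i] & label_sets[j]) > 0 for j in top_k]
        per_sample.append(np.mean(hits))
    return float(np.mean(per_sample))
```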
@jccaicedo can you clarify the metric you are now using for evaluating representations, and how you are reporting it?
IIUC you were previously using this
https://github.com/broadinstitute/DeepProfilerExperiments/blob/master/profiling/quality.py
but are now using precision-based metrics, possibly Average Precision?
https://github.com/broadinstitute/DeepProfilerExperiments/blob/master/profiling/metrics.py
h/t to @gwaygenomics, whose issue sent me here: cytomining/cytominer-eval#17