problem with gseGO() #721

Open
Flu09 opened this issue Aug 31, 2024 · 1 comment

Flu09 commented Aug 31, 2024

gsea_BP_astro_33 <- gseGO(
  gene_list,
  ont = "BP",
  OrgDb = org.Hs.eg.db,
  keyType = "ENSEMBL",
  minGSSize = 10,
  maxGSSize = 500,
  pvalueCutoff = 0.05,
  by = "fgsea",
  seed = TRUE,
  pAdjustMethod = "fdr",
  verbose = TRUE,
  eps = 0,
  nPermSimple = 10000
)
using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

preparing geneSet collections...
GSEA analysis...
leading edge analysis...
done...
Warning messages:
1: In fgseaMultilevel(pathways = pathways, stats = stats, minSize = minSize,  :
  There were 7 pathways for which P-values were not calculated properly due to unbalanced (positive and negative) gene-level statistic values. For such pathways pval, padj, NES, log2err are set to NA. You can try to increase the value of the argument nPermSimple (for example set it nPermSimple = 100000)
2: In fgseaMultilevel(pathways = pathways, stats = stats, minSize = minSize,  :
  For some of the pathways the P-values were likely overestimated. For such pathways log2err is set to NA.

I am working on single-cell data and get this warning message. I am not sure why the gene-level statistics are unbalanced between positive and negative values, and I have a few more questions.

I ran Seurat's FindMarkers() between diseased and healthy fibroblasts to find DEGs, and then ranked the list by avg_log2FC only. Is ranking by avg_log2FC sufficient?

Should I have supplied only the upregulated or only the downregulated genes?

Another question: should I remove the DEGs with p-value > 0.05 before using gseGO(), or keep all genes?

I would appreciate it if you can help. Thanks.

@guidohooiveld

I don't have any experience with single-cell data or Seurat, so I cannot really comment on those questions.

The warning about unbalanced (positive and negative) gene-level statistic values is triggered because your ranked input list contains many more genes with a positive ranking metric than with a negative one (or vice versa). Usually these numbers are balanced.
From a practical perspective: when using, e.g., the logFC as the ranking metric, this means your input consists of far more up-regulated than down-regulated genes.
Since GSEA essentially tests which gene sets are enriched at the top or bottom of the ranked input list, with unbalanced input it is difficult to determine whether a gene set should get a positive or a negative score. The biological interpretation of the results should therefore be done with care. Hence the warning.
You also may want to see this thread: ctlab/fgsea#124
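
A quick way to see how unbalanced the input is would be to count the positive and negative values in the ranked list before calling gseGO. The sketch below is only an illustration in base R: the gene names and values are made up, and `gene_list` stands in for your own named, sorted vector.

```r
# Hypothetical ranked list (named log fold changes), sorted in decreasing
# order as gseGO() expects. Names and values are made up for illustration.
gene_list <- sort(
  c(geneA = 2.1, geneB = 1.7, geneC = 1.2, geneD = 0.8, geneE = 0.3, geneF = -0.4),
  decreasing = TRUE
)

# Count genes on each side of zero and the share of positive values.
n_pos <- sum(gene_list > 0)
n_neg <- sum(gene_list < 0)
frac_pos <- n_pos / (n_pos + n_neg)

cat(sprintf("positive: %d, negative: %d, fraction positive: %.2f\n",
            n_pos, n_neg, frac_pos))
```

A fraction far from 0.5 means the list is dominated by one direction, which is the situation the warning describes.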

gseGO performs a gene set enrichment analysis (based on GO categories), so you should keep all genes! The same applies to gseKEGG (using KEGG gene sets) and the generic function GSEA.
If you are interested in which gene sets are enriched in a subset of the genes you measured, e.g. those with p<0.05, then you should perform a so-called over-representation analysis (ORA) using the function enrichGO (or enrichKEGG, or the generic function enricher).
See also: https://yulab-smu.top/biomedical-knowledge-mining-book/enrichment-overview.html
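
To make the difference between the two inputs concrete, here is a minimal sketch. It assumes a marker table shaped like Seurat's FindMarkers() output, with gene IDs as row names and `avg_log2FC` and `p_val_adj` columns; the table, the placeholder gene IDs, and the helper `run_enrichment` are all hypothetical. Passing all measured genes as `universe` to enrichGO is a choice made for this sketch, not something stated above.

```r
# Hypothetical marker table mimicking Seurat FindMarkers() output.
# Row names are gene IDs; the column names are assumptions of this sketch.
markers <- data.frame(
  avg_log2FC = c(1.8, -0.9, 0.4, -2.2, 0.1, 1.1),
  p_val_adj  = c(0.001, 0.20, 0.60, 0.0004, 0.90, 0.03),
  row.names  = paste0("gene", 1:6)  # placeholder IDs, not real ENSEMBL IDs
)

# GSEA input: ALL genes, as a named numeric vector sorted in decreasing order.
ranked <- sort(setNames(markers$avg_log2FC, rownames(markers)), decreasing = TRUE)

# ORA input: only the significant subset, plus all measured genes as background.
sig_genes <- rownames(markers)[markers$p_val_adj < 0.05]
universe  <- rownames(markers)

# Hypothetical helper, defined but not called here: it needs clusterProfiler,
# org.Hs.eg.db, and real ENSEMBL IDs to return anything meaningful.
run_enrichment <- function(ranked, sig_genes, universe) {
  list(
    gsea = clusterProfiler::gseGO(ranked, ont = "BP",
                                  OrgDb = org.Hs.eg.db::org.Hs.eg.db,
                                  keyType = "ENSEMBL"),
    ora  = clusterProfiler::enrichGO(sig_genes, universe = universe, ont = "BP",
                                     OrgDb = org.Hs.eg.db::org.Hs.eg.db,
                                     keyType = "ENSEMBL")
  )
}
```

The point is that `ranked` keeps every gene regardless of p-value, while `sig_genes` is the filtered subset that only makes sense for ORA.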
