-
Notifications
You must be signed in to change notification settings - Fork 30
Expand file tree
/
Copy pathenrichplot.qmd
More file actions
1117 lines (848 loc) · 37.4 KB
/
enrichplot.qmd
File metadata and controls
1117 lines (848 loc) · 37.4 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Visualization of functional enrichment result {#sec-enrichplot}
```{r}
#| echo: false
#| message: false
#| cache: false
source("_common.R")
library(clusterProfiler)
library(enrichplot)
library(dplyr)
```
The `r Biocpkg("enrichplot")` package implements several visualization methods to help interpret enrichment results. It supports visualizing enrichment results obtained from `r Biocpkg("DOSE")` [@yu_dose_2015], `r Biocpkg("clusterProfiler")` [@yu2012; @wu_clusterprofiler_2021],
`r Biocpkg("ReactomePA")` [@yu_reactomepa_2016] and `r Biocpkg("meshes")` [@yu_meshes_2018]. Both over representation analysis (ORA) and gene set enrichment analysis (GSEA) are supported.
Note: Several visualization methods were first implemented in `r Biocpkg("DOSE")` and rewrote from scratch using `r CRANpkg("ggplot2")`. If you want to use the [old methods](https://www.biostars.org/p/375555), you can use the [doseplot](https://github.com/GuangchuangYu/doseplot) package.
## Bar Plot
Bar plot is the most widely used method to visualize enriched terms. It depicts the enrichment scores (*e.g.* p values) and gene count or ratio as bar height
and color @fig-Barplot. Users can specify the number of terms (most significant) or selected terms (see also the [FAQ](#showing-specific-pathways)) to display via the `showCategory` parameter.
```{r}
#| label: fig-enrichDGN
library(DOSE)
data(geneList)
de <- names(geneList)[abs(geneList) > 2]
edo <- enrichDGN(de)
```
```{r eval=F}
#| label: fig-barplot-eval-false
library(enrichplot)
barplot(edo, showCategory=20)
```
Other variables that derived using `mutate` can also be used as bar height or color as demonstrated in @sec-mutate and @fig-Barplot.
```{r eval=F}
#| label: fig-mutate-barplot-eval-false
library(clusterProfiler)
mutate(edo, qscore = -log(p.adjust, base=10)) |>
barplot(x="qscore")
```
```{r}
#| label: fig-Barplot
#| echo: false
#| fig-width: 11
#| fig-height: 6
#| fig-scap: |
#| Bar plot of enriched terms.
#| fig-cap: |
#| **Bar plot of enriched terms.**
p1 <- barplot(edo, showCategory=20)
p2 <- mutate(edo, qscore = -log(p.adjust, base=10)) |>
barplot(x="qscore")
library(aplot)
plot_list(p1, p2, ncol=2, tag_levels='A')
```
### Split top categories by ontology
When visualizing GO enrichment across all ontologies (`ont = "ALL"`), it is often more informative to show the top terms within each ontology (BP/CC/MF) rather than selecting the top terms globally. `barplot()` forwards additional arguments via `...`, and one useful option is `split`. Setting `split = "ONTOLOGY"` will perform the “top N per ontology” selection and makes it straightforward to color (and later facet) the results by the ontology.
```{r}
#| label: fig-enrichGO-ontology
#| echo: false
#| message: false
#| warning: false
library(DOSE)
data(geneList, package = "DOSE")
de <- names(geneList)[1:300]
library(clusterProfiler)
library(org.Hs.eg.db)
ego_all <- enrichGO(
gene = de,
OrgDb = org.Hs.eg.db,
keyType = "ENTREZID",
ont = "ALL",
readable = TRUE
)
```
```{r}
#| label: fig-barplot-ontology-nofacet
#| fig-width: 12
#| fig-height: 9
#| fig-scap: |
#| Bar plot of GO terms using `split = "ONTOLOGY"` without faceting.
#| fig-cap: |
#| **Bar plot of GO terms using `split = "ONTOLOGY"` (no faceting).**
library(enrichplot)
library(ggplot2)
p_nofacet <- barplot(
ego_all,
x = "Count",
showCategory = 15,
split = "ONTOLOGY"
) +
aes(fill = ONTOLOGY) +
scale_fill_brewer(palette = "Set2")
p_nofacet +
geom_text(aes(label = Count), nudge_x = 2) +
scale_y_discrete()
```
Note that `barplot()` wraps long term descriptions onto multiple lines by default (controlled by `label_format`, default is 30). If you prefer to keep the y-axis labels on a single line, you can override the default scale by adding `+ scale_y_discrete()` as shown above.
```{r}
#| label: fig-barplot-ontology-facet
#| fig-width: 12
#| fig-height: 9
#| fig-scap: |
#| Bar plot of GO terms with `autofacet()` splitting by ontology.
#| fig-cap: |
#| **Bar plot of GO terms with `autofacet()` splitting by ontology.**
p_facet <- barplot(
ego_all,
x = "Count",
showCategory = 15,
split = "ONTOLOGY"
) +
aes(fill = ONTOLOGY) +
scale_fill_brewer(palette = "Set2") +
enrichplot::autofacet(by = "row", scales = "free")
p_facet + scale_y_discrete()
```
If the fortified data contain an `ONTOLOGY` column (as in the output generated by `split = "ONTOLOGY"`), you can facet the plot directly with `+ enrichplot::autofacet()`, and it will automatically split panels by ontology.
## Dot plot
Dot plot is similar to bar plot with the capability to encode another score as dot size.
```{r}
#| label: fig-dotplot-setup
library(ggplot2)
edo2 <- gseDO(geneList)
dotplot(edo, showCategory=30, label_format=NULL) + ggtitle("dotplot for ORA")
dotplot(edo2, showCategory=30, label_format=NULL) + ggtitle("dotplot for GSEA")
```
```{r}
#| label: fig-Dotplotcap
#| echo: false
#| fig-width: 12
#| fig-height: 10
#| fig-scap: |
#| Dot plot of enriched terms.
#| fig-cap: |
#| **Dot plot of enriched terms.**
library(ggplot2)
edo2 <- gseDO(geneList)
p1 <- dotplot(edo, showCategory=30) + ggtitle("dotplot for ORA")
p2 <- dotplot(edo2, showCategory=30) + ggtitle("dotplot for GSEA")
plot_list(p1, p2, ncol=2, tag_levels = 'A')
```
Note: The `dotplot()` function also works with [`compareCluster()` output](#compare-dotplot).
### Highlighting specific pathways
The `label_format` parameter in `dotplot()` accepts a function to format y-axis labels, which can be used to highlight specific pathways. This is useful for emphasizing pathways of interest in publication figures.
First, perform enrichment analysis:
```{r}
#| label: fig-enrichDO
library(DOSE)
data(geneList)
de <- names(geneList)[1:100]
x <- enrichDO(de)
```
Define a function to highlight pathways containing "cancer":
```{r}
#| label: fig-label-format-func
f <- function(ids) {
i <- grepl('cancer', ids)
ids[i] <- paste0("<i style='color:#009E73'>", ids[i], "</i>")
return(ids)
}
```
To render the HTML formatting, use the `ggtext` package with `element_markdown()`:
```{r}
#| label: fig-dotplot-highlight
library(ggplot2)
library(ggtext)
library(enrichplot)
dotplot(x, label_format=f) + theme(axis.text.y = element_markdown())
```
You can also highlight specific pathways by index:
```{r}
#| label: fig-label-format-func2
f2 <- function(ids) {
ids[1] <- paste0("<i style='color:#009E73'>", ids[1], "</i>")
ids[2] <- paste0("<i style='color:#0072B2'>**", ids[2], "**</i>")
return(ids)
}
dotplot(x, label_format=f2) + theme(axis.text.y = element_markdown())
```
Or apply different colors to all pathways:
```{r}
#| label: fig-label-format-func3
f3 <- function(ids) {
# cols <- rcartocolor::carto_pal(length(ids), "Vivid")
# cols <- colorspace::rainbow_hcl(length(ids))
cols <- rainbow(length(ids))
ids <- paste0("<i style='color:", cols, "'>**", ids, "**</i>")
return(ids)
}
dotplot(x, label_format=f3) + theme(axis.text.y = element_markdown())
```
### Formula interface of dotplot
The `x` variable of `dotplot()` supports a formula interface, allowing users to use derived variables. For example, we can calculate `GeneRatio/BgRatio` or `-log(p.adjust)` and use them as the x-axis variable.
```{r}
#| label: fig-dotplot-formula
#| fig-width: 12
#| fig-height: 8
#| fig-cap: |
#| **Dotplot with formula interface.**
#| (A) `x = ~GeneRatio/BgRatio`.
#| (B) `x = ~ -log(p.adjust)`.
p1 <- dotplot(edo, x = ~GeneRatio/BgRatio)
p2 <- dotplot(edo, x = ~ -log(p.adjust))
plot_list(p1, p2, ncol=2, tag_levels='A')
```
This feature provides great flexibility in visualization. For instance, the **Rich Factor** is a common metric defined as the ratio of the number of differentially expressed genes annotated in a pathway to the total number of genes annotated in that pathway.
$$Rich Factor = \frac{Count}{BgRatio \times N}$$
where $N$ is the total number of genes in the background distribution.
We can easily visualize the **Rich Factor** using the formula interface.
```{r eval=FALSE}
#| label: fig-rich-factor
N <- 8007 # total number of genes in the background
dotplot(edo, x = ~Count/(BgRatio * N))
```
Alternatively, you can precompute derived variables with `mutate()` (e.g., add `neglog10p = -log10(p.adjust)` or `richFactor = Count / (BgRatio * N)`) and then pass the new column name to `x`. This approach is useful when you want to reuse the computed variables across multiple plots. See the dedicated section on using dplyr verbs with enrichment results in @sec-mutate.
### Adjusting dot sizes
To increase the size of the dots in the dot plot, we can use the `scale_size` function from the `ggplot2` package. This allows us to specify the range of the dot sizes.
```{r}
#| label: fig-dotplot-size
#| fig-width: 12
#| fig-height: 8
#| fig-cap: |
#| **Dotplot with adjusted dot sizes.**
#| (A) Default dotplot.
#| (B) Dotplot with `scale_size(range=c(2, 20))`.
p1 <- dotplot(edo, showCategory=10) + ggtitle("Default")
p2 <- dotplot(edo, showCategory=10) + scale_size(range=c(2, 20)) + ggtitle("Adjusted size")
plot_list(p1, p2, ncol=2, tag_levels='A')
```
## Dotplot2: Comparing two clusters
The `dotplot2()` function is a specialized version of dotplot designed for comparing two selected clusters from `compareCluster` results. This function provides focused visualization when users want to directly contrast specific biological conditions or experimental groups.
```{r}
#| label: fig-dotplot2
#| fig-width: 12
#| fig-height: 8
#| eval: false
#| fig-cap: |
#| **Dotplot2 for comparing two clusters.**
#| Direct comparison of cluster A and cluster B.
# Example using compareCluster result
library(clusterProfiler)
data(gcSample)
xx <- compareCluster(gcSample, fun="enrichKEGG",
organism="hsa", pvalueCutoff=0.05)
# Compare cluster A and cluster B
dotplot2(xx, vars = c("A", "B"))
```
The `dotplot2()` function accepts the same parameters as `dotplot()` but requires `cluster1` and `cluster2` arguments to specify which clusters to compare. This focused visualization helps in identifying differential enrichment patterns between specific experimental conditions.
## Gene-Concept Network {#cnetplot}
```{r}
#| label: fig-cnet-setup
#| echo: false
#| message: false
#| warning: false
library(clusterProfiler)
library(DOSE)
library(enrichplot)
library(org.Hs.eg.db)
data(geneList, package="DOSE")
de <- names(geneList)[abs(geneList) > 2]
edo <- enrichDGN(de)
```
Both the `barplot()` and `dotplot()` only displayed most significant or selected enriched terms,
while users may want to know which genes are involved in these significant
terms.
To consider the potential biological complexities in which a gene may belong to multiple annotation categories and provide information about numeric changes when available, we developed the `cnetplot()` function to extract the complex associations.
The `cnetplot()` depicts the linkages of genes and biological concepts (*e.g.* GO terms or KEGG pathways) as a network. GSEA result is also supported
with only core enriched genes displayed. For GSEA results, `cnetplot` specifically visualizes the core enriched genes identified through leading edge analysis, which represent the subset of genes most responsible for driving the enrichment signal.
Users can use the 'fc_threshold' parameter in the `cnetplot` function to filter genes to only include those with `|foldChange| > fc_threshold`. This allows users to plot only high fold-change genes (e.g., `fc_threshold = 1`). In addition, users can use the `node_label_gene` parameter to label genes that are shared between groups of terms (i.e. `node_label_gene = "share"`).
```{r}
#| label: fig-Networkplot
#| fig-width: 16
#| fig-height: 18
#| fig-scap: |
#| Network plot of enriched terms.
#| fig-cap: |
#| **Network plot of enriched terms.**
#|
## convert gene ID to Symbol
edox <- setReadable(edo, 'org.Hs.eg.db', 'ENTREZID')
p1 <- cnetplot(edox, foldChange=geneList)
p2 <- cnetplot(edox, categorySizeBy=~ -log10(pvalue), foldChange=geneList)
## only plot high fold-change genes
p3 <- cnetplot(edox, foldChange=geneList, fc_threshold = 2)
p4 <- cnetplot(edox, foldChange=geneList, node_label = "share")
plot_list(p1, p2, p3, p4, ncol=2, tag_levels = 'A')
```
If you would like label subset of the nodes, you can use the `node_label` parameter, which supports 4 possible selections (i.e. "category", "gene", "all" and "none"), as demonstrated in @fig-cnetNodeLabel.
The `node_label` parameter also supports enhanced functionality:
1. **Vector selection**: Specify specific genes to label using a character vector, e.g., `node_label = c("gene1", "gene2")`
2. **'exclusive' labeling**: Label only genes that belong exclusively to one category using `node_label = "exclusive"`
3. **'share' labeling**: Label genes that are shared between multiple categories using `node_label = "share"` (as shown in the example above)
4. **Conditional filtering**: Use comparison operators to filter genes based on fold change, e.g., `node_label = "> 1"` or `node_label = "< -1"` to label genes with absolute fold change greater than 1
```{r}
#| label: fig-cnetNodeLabel
#| fig-width: 12
#| fig-height: 16
#| fig-scap: |
#| Labelling nodes by selected subset.
#| fig-cap: |
#| **Labelling nodes by selected subset.**
#| gene category (A), gene name (B),
#| both gene category and gene name (C, default)
#| and not to label at all (D).
p1 <- cnetplot(edox, node_label="category")
p2 <- cnetplot(edox, node_label="gene")
p3 <- cnetplot(edox, node_label="all")
p4 <- cnetplot(edox, node_label="none",
color_category='firebrick',
color_item='steelblue')
plot_list(p1, p2, p3, p4, ncol=2, tag_levels = 'A')
```
The `cnetplot` function can be used as a general method to visualize data relationships in a network diagram. Please refer to the vignette of `r CRANpkg("ggtangle")`.
### Customizing gene colors in cnetplot
Users can color specific genes by providing a named vector of fold changes containing only those genes.
```{r}
#| label: fig-cnetplot-custom-color
foldChange <- c(rep(1, 6), rep(-1, 4))
names(foldChange) <- c("MARCO", "GZMB", "CXCL11", "CXCL10",
"LAG3", "CCL8", "PDK1", "GABRP", "MELK", "CENPE")
p <- cnetplot(edox, foldChange=foldChange)
p
```
If you want to remove the color legend:
```{r}
#| label: fig-cnetplot-no-legend
p <- p + guides(color='none')
p
```
To create a custom legend, you can use a dummy data frame and `geom_point`:
```{r}
#| label: fig-cnetplot-custom-legend
d <- data.frame(type=c('upregulated', 'downregulated'), x=0, y=0)
g <- p + geom_point(aes(alpha=type, x=x, y=y), data=d, shape=16, size=0)
g + guides(alpha=guide_legend(
override.aes=list(color=c("blue", "red"),
alpha=1,
size=3),
title = "VIP genes",
reverse = TRUE
))
```
Note: The `cnetplot()` function also works with [`compareCluster()` output](#compare-cnetplot).
## Heatmap-like functional classification
The `heatplot` is similar to `cnetplot`, but displays the relationships as a
heatmap. The gene-concept network may become too complicated if users want to
show a large number of significant terms. The `heatplot` can simplify the result
and make it easier to identify expression patterns.
```{r}
#| label: fig-Heatplot
#| fig-width: 12
#| fig-height: 5
#| fig-scap: |
#| Heatmap plot of enriched terms.
#| fig-cap: |
#| **Heatmap plot of enriched terms.**
#| default (A), `foldChange=geneList` (B)
p1 <- heatplot(edox, showCategory=5)
p2 <- heatplot(edox, foldChange=geneList, showCategory=5)
plot_list(p1, p2, ncol=1, tag_levels = 'A')
```
The `showTop` parameter can be used to limit the number of genes displayed in the heatmap. This is particularly useful when dealing with large gene sets where only the top genes based on fold change or significance need to be visualized. For example, `showTop = 20` will display only the top 20 genes in the heatmap.
## Tree plot
The `treeplot()` function performs hierarchical clustering of enriched terms. It relies on the pairwise similarities of the enriched terms calculated by the `pairwise_termsim()` function, which by default uses Jaccard's similarity index (JC). Users can also use semantic similarity values when supported (*e.g.*, [GO](#GOSemSim), [DO](#DOSE-semantic-similarity), and [MeSH](#meshes-semantic-similarity)).
The default agglomeration method in `treeplot()` is `ward.D`, and users can specify other methods via the `cluster_method` parameter (*e.g.*, 'average', 'complete', 'median', 'centroid', *etc.*; see also the documentation of the `hclust()` function). The `treeplot()` function will cut the tree into several subtrees (specified by the `nCluster` parameter, default is 5) and label subtrees using high-frequency words. This reduces the complexity of the enriched result and improves user interpretation ability.
For fine-grained control over text appearance, the `leave_fontsize` and `clade_fontsize` parameters allow users to adjust the font size of leaf labels and clade labels respectively. For example, `leave_fontsize = 8, clade_fontsize = 10` will set leaf labels to 8pt and clade labels to 10pt.
```{r}
#| label: fig-treeplot
#| fig-width: 20
#| fig-height: 10
#| fig-scap: |
#| Tree plot of enriched terms.
#| fig-cap: |
#| **Tree plot of enriched terms.**
#| default (A), `hclust_method = "average"` (B)
library(ggtree)
edox2 <- pairwise_termsim(edox)
p1 <- treeplot(edox2, cladelab_offset=8, tiplab_offset=.3, fontsize_cladelab =5) +
hexpand(.2)
p2 <- treeplot(edox2, cluster_method = "average",
cladelab_offset=14, tiplab_offset=.3, fontsize_cladelab =5) +
hexpand(.3)
aplot::plot_list(p1, p2, tag_levels='A', ncol=2)
```
## Semantic Space Plot {#ssplot}
While `treeplot()` visualizes semantic similarity using a hierarchical structure, the `ssplot()` (Semantic Space Plot) projects enriched terms into a low-dimensional space (e.g., using Multidimensional Scaling, MDS). This provides a complementary spatial view where terms with high semantic similarity are clustered together.
Like `treeplot()`, `ssplot()` requires `pairwise_termsim()` to be run first. It automatically groups terms into clusters (default `nCluster` is determined automatically) and labels them with representative words.
```{r}
#| label: fig-ssplot
#| fig-width: 15
#| fig-height: 10
#| fig-scap: |
#| Semantic space plot of enriched terms.
#| fig-cap: |
#| **Semantic space plot of enriched terms.**
#| ssplot with nCluster=5.
ssplot(edox2, nCluster=5) + ggtitle("ssplot (nCluster=5)")
```
`ssplot()` is particularly useful for visualizing the overall semantic landscape of enrichment results and identifying distinct functional modules in a continuous space.
## Enrichment Map
Enrichment map organizes enriched terms into a network with edges connecting
overlapping gene sets. In this way, mutually overlapping gene sets are tend to
cluster together, making it easy to identify functional module.
### Handling GO term redundancy
GO annotations often contain redundant terms that can dominate enrichment results, potentially obscuring other biological stories. The `simplify()` function from the `clusterProfiler` package uses semantic similarity (via the GOSemSim package) to remove redundant GO terms, providing a clearer view of distinct functional modules.
```{r}
#| label: fig-simplify-go
# Load required packages
library(clusterProfiler)
library(enrichplot)
library(DOSE)
# Prepare example data
data(geneList, package="DOSE")
de <- names(geneList)[abs(geneList) > 2]
# Perform GO enrichment analysis
ego <- enrichGO(de, OrgDb = "org.Hs.eg.db", ont="BP", readable=TRUE)
# Remove redundant GO terms using simplify()
ego_simplified <- simplify(ego, cutoff=0.7, by="p.adjust", select_fun=min)
# Visualize both original and simplified results
ego <- pairwise_termsim(ego)
ego_simplified <- pairwise_termsim(ego_simplified)
p1 <- emapplot(ego, node_label_size=.8, size_edge=.5) +
scale_fill_continuous(low = "#e06663", high = "#327eba", name = "p.adjust",
guide = guide_colorbar(reverse = TRUE, order=1), trans='log10') +
ggtitle("Original GO terms")
p2 <- emapplot(ego_simplified, node_label_size=.8, size_edge=.5) +
scale_fill_continuous(low = "#e06663", high = "#327eba", name = "p.adjust",
guide = guide_colorbar(reverse = TRUE, order=1), trans='log10') +
ggtitle("After removing redundant terms")
# Combine plots
library(patchwork)
p1 + p2 + plot_layout(ncol = 2)
```
The `simplify()` function removes redundant GO terms based on semantic similarity (default cutoff = 0.7). This reveals distinct functional modules that might be obscured by redundant terms in the original enrichment results. The `pairwise_termsim()` function calculates pairwise similarities between terms, which is required for `emapplot()` visualization.
The `emapplot` function supports results obtained from hypergeometric test and gene set enrichment analysis.
The `size_category` parameter can be used to resize nodes and the `layout` parameter can adjust the layout, as demonstrated in @fig-Enrichment.
```{r}
#| label: fig-Enrichment
#| fig-width: 15
#| fig-height: 12
#| out-width: 95%
#| fig-scap: |
#| The `emapplot`.
#| fig-cap: |
#| **The `emapplot`.**
#| default (A), `node_label="none"` (B), `size_category=1.5` (C), and `layout="with_fr"` (D)
edo <- pairwise_termsim(edo)
p1 <- emapplot(edo) # node_label = "category" (default)
p2 <- emapplot(edo, node_label = "none")
p3 <- emapplot(edo, node_label = "none", size_category=1.5)
p4 <- emapplot(edo, node_label = "none", layout="with_fr")
plot_list(p1, p2, p3, p4,
ncol=2, tag_levels = 'A',
design="AAAAAA\nBBCCDD",
heights = c(1, .3))
```
The `node_label` parameter controls how the labels were displayed. The enriched terms will be displayed by default with `node_label="category"` and it can be disabled by setting `node_label="none"`.
The `node_label_size` parameter allows users to adjust the font size of node labels in the enrichment map. This is particularly useful when dealing with many overlapping terms or when labels need to be more readable. For example, `node_label_size = 3` will increase the label size compared to the default.
If `node_label="group"`, the `emapplot` function will cluster the enriched terms into different groups and only group names (determined by wordcloud) will be displayed. If `node_label="all"`, then the enriched terms and group names will be displayed simultaneously.
```{r}
#| label: fig-Enrichment-node-label
#| fig-width: 16
#| fig-height: 21
#| fig-scap: |
#| `emapplot` with enriched terms clustering.
#| fig-cap: |
#| **`emapplot` with enriched terms clustering.**
#| `node_label="group"` (A) and
#| `node_label="all"` (B).
p5 <- emapplot(edo, node_label = "group")
p6 <- emapplot(edo, node_label = "all")
plot_list(p5, p6,
ncol=1, tag_levels = 'A')
```
## Biological theme comparison
The `emapplot` function also supports results obtained from `compareCluster` function of `clusterProfiler` package. In addition to `size_category` and `layout` parameters, the number of circles in the bottom left corner can be adjusted using the `legend_n` parameteras, and proportion of clusters in the pie chart can be adjusted using the `pie` parameter, when `pie="count"`, the proportion of clusters in the pie chart is determined by the number of genes, as demonstrated in @fig-Enrichment2.
```{r}
#| label: fig-Enrichment2
#| fig-width: 16
#| fig-height: 18
#| fig-scap: |
#| Plot for results obtained from `compareCluster` function of `clusterProfiler` package.
#| fig-cap: |
#| **Plot for results obtained from `compareCluster` function of `clusterProfiler` package.**
#| default (A), `legend_n=2` (B), `pie="count"` (C) and `pie="count",
#| size_category=1.5, layout="kk"` (D).
library(clusterProfiler)
data(gcSample)
xx <- compareCluster(gcSample, fun="enrichKEGG",
organism="hsa", pvalueCutoff=0.05)
xx <- pairwise_termsim(xx)
p1 <- emapplot(xx)
p2 <- emapplot(xx)
p3 <- emapplot(xx, pie="count")
p4 <- emapplot(xx, pie="count", size_category=1.5, layout="kk")
plot_list(p1, p2, p3, p4, ncol=2, tag_levels = 'A')
```
## UpSet Plot
The `upsetplot` is an alternative to `cnetplot` for visualizing the complex
association between genes and gene sets. It emphasizes the gene overlapping
among different gene sets.
```{r}
#| label: fig-upsetORA
#| fig-width: 12
#| fig-height: 5
#| fig-scap: |
#| Upsetplot for over-representation analysis.
#| fig-cap: |
#| **Upsetplot for over-representation analysis.**
upsetplot(edo)
```
For over-representation analysis, `upsetplot` will calculate the overlaps among different gene sets as demonstrated in @fig-upsetORA. For GSEA result, it will plot the fold change distributions of different categories (e.g. unique to pathway, overlaps among different pathways).
```{r}
#| label: fig-upsetGSEA
#| fig-width: 12
#| fig-height: 5
#| fig-scap: |
#| Upsetplot for gene set enrichment analysis.
#| fig-cap: |
#| **Upsetplot for gene set enrichment analysis.**
kk2 <- gseKEGG(geneList = geneList,
organism = 'hsa',
minGSSize = 120,
pvalueCutoff = 0.05,
verbose = FALSE)
upsetplot(kk2)
```
## ridgeline plot for expression distribution of GSEA result
The `ridgeplot` will visualize expression distributions of core enriched genes
for GSEA enriched categories. It helps users to interpret up/down-regulated pathways.
```{r}
#| label: fig-ridgeplot
#| fig-width: 12
#| fig-height: 12
#| fig-scap: |
#| Ridgeplot for gene set enrichment analysis.
#| fig-cap: |
#| **Ridgeplot for gene set enrichment analysis.**
ridgeplot(edo2)
```
## running score and preranked list of GSEA result
Running score and preranked list are traditional methods for visualizing GSEA
result. The `r Biocpkg("enrichplot")` package supports both of them to visualize
the distribution of the gene set and the enrichment score.
```{r}
#| label: fig-gseaplot
#| fig-width: 12
#| fig-height: 16
#| fig-scap: |
#| gseaplot for GSEA result(`by = "runningScore"`).
#| fig-cap: |
#| **gseaplot for GSEA result(`by = "runningScore"`).**
#| `by = "runningScore"` (A), `by = "preranked"` (B), default (C)
p1 <- gseaplot(edo2, geneSetID = 1, by = "runningScore", title = edo2$Description[1])
p2 <- gseaplot(edo2, geneSetID = 1, by = "preranked", title = edo2$Description[1])
p3 <- gseaplot(edo2, geneSetID = 1, title = edo2$Description[1])
plot_list(p1, p2, p3, ncol=1, tag_levels='A')
```
The `gseaplot` function also allows users to customize the colors of the running score line and the rank line.
```{r}
#| label: fig-gseaplot-custom-color
#| fig-width: 12
#| fig-height: 8
#| fig-cap: |
#| **gseaplot with custom colors.**
#| (A) Default color scheme.
#| (B) Customized colors using `color`, `color.line`, and `color.vline`.
p_default <- gseaplot(edo2, geneSetID = 1, title = edo2$Description[1])
p_custom <- gseaplot(edo2, geneSetID = 1, color="#DAB546", color.line='firebrick',
color.vline="steelblue", title = edo2$Description[1])
plot_list(p_default, p_custom, ncol=2, tag_levels='A')
```
### Unified color setting with `set_enrichplot_color()`
For consistent color settings across different plot types in `enrichplot`, the `set_enrichplot_color()` function provides a unified approach to customize color mappings. This avoids repetitive color code specifications in different plotting functions.
#### Function introduction
The `set_enrichplot_color()` function can be added to any `enrichplot` visualization using the `+` operator (ggplot2 style). It offers flexible control over color transformations and palettes.
#### Parameters
- `type`: Type of color aesthetic to modify (default: `"fill"`, can also be `"color"` for line colors)
- `transform`: Transformation to apply to the values used for color mapping
- `"log10"`: Apply log10 transformation (default for recent versions)
- `"identity"`: Use original values without transformation
- `colors`: Custom color palette as a vector of colors (e.g., `c("red", "blue")`)
#### Usage examples
```{r}
#| label: fig-set-enrichplot-color
#| fig-width: 12
#| fig-height: 8
#| fig-cap: |
#| **Color customization using `set_enrichplot_color()`.**
#| (A) Default log10 transformation.
#| (B) Custom red-blue palette with log10 transformation.
#| (C) Identity transformation (no transformation).
# Example data
library(DOSE)
data(geneList)
de <- names(geneList)[abs(geneList) > 2]
edo <- enrichDGN(de)
# A: Default log10 transformation
p1 <- dotplot(edo, showCategory=15) +
set_enrichplot_color(type='fill', transform='log10') +
ggtitle("Default log10 transform")
# B: Custom colors with log10 transformation
p2 <- dotplot(edo, showCategory=15) +
set_enrichplot_color(type='fill', transform='log10', colors=c("red", "blue")) +
ggtitle("Red-blue palette")
# C: Identity transformation (no transformation)
p3 <- dotplot(edo, showCategory=15) +
set_enrichplot_color(type='fill', transform='identity') +
ggtitle("Identity transform")
library(aplot)
plot_list(p1, p2, p3, ncol=3, tag_levels='A')
```
**Note**: Starting from enrichplot v1.29.2, log10 transformation of p-values is applied by default in functions like `dotplot()`. The `set_enrichplot_color()` function allows users to override this default behavior or customize color palettes consistently across all visualization types.
Another method to plot GSEA result is the `gseaplot2` function:
```{r}
#| label: fig-gseaplot2
#| fig-width: 12
#| fig-height: 8
#| fig-scap: |
#| Gseaplot2 for GSEA result.
#| fig-cap: |
#| **Gseaplot2 for GSEA result.**
gseaplot2(edo2, geneSetID = 1, title = edo2$Description[1])
```
The `gseaplot2` also supports multile gene sets to be displayed on the same figure:
```{r}
#| label: fig-gseaplot22
#| fig-width: 12
#| fig-height: 8
#| fig-scap: |
#| Gseaplot2 for GSEA result of multile gene sets.
#| fig-cap: |
#| **Gseaplot2 for GSEA result of multile gene sets.**
gseaplot2(edo2, geneSetID = 1:3)
```
User can also displaying the pvalue table on the plot via `pvalue_table`
parameter:
```{r}
#| label: fig-gseaplot23
#| fig-width: 12
#| fig-height: 8
#| fig-scap: |
#| Gseaplot2 for GSEA result of multile gene sets(add pvalue_table).
#| fig-cap: |
#| **Gseaplot2 for GSEA result of multile gene sets(add pvalue_table).**
gseaplot2(edo2, geneSetID = 1:3, pvalue_table = TRUE,
color = c("#E495A5", "#86B875", "#7DB0DD"), ES_geom = "dot")
```
The `pvalue_table` can be customized using the following parameters:
- `pvalue_table_rownames` to specify row names of the table (if NULL, no row names will be displayed)
- `pvalue_table_columns` to specify column names of the table
```{r}
#| label: fig-gseaplot-pvalue
#| fig-width: 12
#| fig-height: 8
#| fig-scap: |
#| Gseaplot2 with customized pvalue table.
#| fig-cap: |
#| **Gseaplot2 with customized pvalue table.**
#|
gseaplot2(edo2, geneSetID = 1, pvalue_table = TRUE,
pvalue_table_rownames = NULL,
pvalue_table_columns = c("ID", "NES", "p.adjust"))
```
User can specify `subplots` to only display a subset of plots:
```{r}
#| label: fig-gseaplot24
#| fig-width: 12
#| fig-height: 8
#| fig-scap: |
#| Gseaplot2 for GSEA result of multile gene sets(add subplots).
#| fig-cap: |
#| **Gseaplot2 for GSEA result of multile gene sets(add subplots).**
#| `subplots = 1` (A),`subplots = 1:2` (B)
#|
p1 <- gseaplot2(edo2, geneSetID = 1:3, subplots = 1)
p2 <- gseaplot2(edo2, geneSetID = 1:3, subplots = 1:2)
plot_list(p1, p2, ncol=1, tag_levels = 'A')
```
### Labeling genes in GSEA plot
Users can use `geom_gsea_gene()` to label specific genes in the GSEA plot.
```{r}
#| label: fig-geom-gsea-gene
#| fig-width: 12
#| fig-height: 8
#| fig-cap: |
#| **Labeling genes in GSEA plot.**
library(ggplot2)
library(ggrepel)
# Get gene set ID
id <- edo2$ID[1]
# Randomly select genes to label
set.seed(123)
genes <- sample(edo2[[id]], 5)
# Label genes on gseaplot2
p <- gseaplot2(edo2, geneSetID = 1, title = edo2$Description[1])
# Add geom_gsea_gene layer to the first subplot (running score)
p[[1]] <- p[[1]] + geom_gsea_gene(genes, geom=geom_label)
p
```
If users prefer to label with gene symbols, they can convert the gene IDs to symbols (e.g., using `setReadable`) before plotting.
```{r}
#| label: fig-geom-gsea-gene-symbol
#| fig-width: 12
#| fig-height: 8
#| fig-cap: |
#| **Labeling gene symbols in GSEA plot.**
#| eval: false
library(clusterProfiler)
# Assuming org.Hs.eg.db is available
if (require("org.Hs.eg.db")) {
edo2_symbol <- setReadable(edo2, 'org.Hs.eg.db', 'ENTREZID')
id <- edo2_symbol$ID[1]
genes_symbol <- sample(edo2_symbol[[id]], 5)
p_symbol <- gseaplot2(edo2_symbol, geneSetID = 1, title = edo2_symbol$Description[1])
p_symbol[[1]] <- p_symbol[[1]] + geom_gsea_gene(genes_symbol, geom=geom_text_repel)
p_symbol
}
```
The `gsearank` function plot the ranked list of genes belong to the specific
gene set.
```{r}
#| label: fig-gsearank
#| fig-width: 12
#| fig-height: 8
#| fig-scap: |
#| Ranked list of genes belong to the specific gene set.
#| fig-cap: |
#| **Ranked list of genes belong to the specific gene set.**
gsearank(edo2, 1, title = edo2[1, "Description"])
```
Multiple gene sets can be aligned using `cowplot`:
```{r}
#| label: fig-gsearank2
#| fig-width: 8
#| fig-height: 12
#| fig-scap: |
#| Gsearank for multiple gene sets.
#| fig-cap: |
#| **Gsearank for multiple gene sets.**
#|
library(ggplot2)
pp <- lapply(1:3, function(i) {
anno <- edo2[i, c("NES", "pvalue", "p.adjust")]
lab <- paste0(names(anno), "=", round(anno, 3), collapse="\n")
gsearank(edo2, i, edo2[i, 2]) + xlab(NULL) +ylab(NULL) +
annotate("text", 10000, edo2[i, "enrichmentScore"] * .75, label = lab, hjust=0, vjust=0)
})
plot_list(gglist=pp, ncol=1)
```
### Extracting data from gsearank plot
The `gsearank` function can also output the data used for plotting by setting `output = "table"`. This allows users to inspect the running score and other metrics for genes in the gene set.
```{r}
#| label: tbl-gsearank
#| tbl-cap: |
#| **Data extracted from gsearank plot.**
gsearank(edo2, 1, output = "table") |> head()
```
Users can also merge this table with gene information (e.g. Symbol) using `bitr` or other methods.
```{r}
#| label: tbl-gsearank-symbol
#| tbl-cap: |
#| **Data extracted from gsearank plot with gene symbols.**
#| eval: false
rank_table <- gsearank(edo2, 1, output = "table")
# Assuming 'gene' column contains Entrez IDs
if (require("org.Hs.eg.db") && require("clusterProfiler")) {
gene_info <- bitr(rank_table$gene, fromType="ENTREZID", toType="SYMBOL", OrgDb="org.Hs.eg.db")