Commit f87d7ab

0 parents  commit f87d7ab

34 files changed

+3371
-0
lines changed

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
1+
_bookdown_files
2+
#COVID19_GeneSets.gmt

01_overview_semantic_similarity.Rmd

Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
1+
# (PART\*) Part I: Semantic Similarity Analysis {-}
2+
3+
4+
# Overview of semantic similarity analysis {#semantic-similarity-overview}
5+
6+
7+
Functional similarity of gene products can be estimated using controlled
8+
biological vocabularies, such as Gene Ontology (GO), Disease Ontology (DO) and Medical Subject Headings (MeSH).
9+
10+
Four methods, proposed by Resnik [@philip_semantic_1999], Jiang [@jiang_semantic_1997], Lin [@lin_information-theoretic_1998] and Schlicker [@schlicker_new_2006], determine the semantic similarity of two GO terms based on the annotation statistics of their common ancestor terms. Wang [@wang_new_2007]
11+
proposed a method to measure the similarity based on the graph structure of GO. Each of these methods has its own advantages and
12+
weaknesses and can be applied to other ontologies that have a similar structure (i.e. a directed acyclic graph).
13+
14+
15+
16+
## Information content-based methods
17+
18+
Four methods proposed by Resnik [@philip_semantic_1999],
19+
Jiang [@jiang_semantic_1997], Lin [@lin_information-theoretic_1998]
20+
and Schlicker [@schlicker_new_2006] are information content (IC) based: they depend on the frequencies of the two GO terms involved and that of their closest common ancestor term in a specific corpus of GO
21+
annotations. The information content of a GO term is computed as the
22+
negative log probability of the term occurring in the GO corpus. A rarely used term therefore carries a greater amount of information.
23+
24+
The frequency of a term t is defined as:
25+
26+
27+
$$p(t) = \frac{n_{t'}}{N} \; \Big| \; t' \in \left\{t, \; \text{children of } t \right\}$$
28+
29+
where $n_{t'}$ is the number of occurrences of term $t'$, and $N$ is the total number of term occurrences in the GO corpus.
30+
31+
Thus the information content is defined as:
32+
33+
$$IC(t) = -\log(p(t))$$
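
As a minimal sketch of the two definitions above, the following base R snippet computes $p(t)$ and $IC(t)$ for a toy hierarchy. The term names, the annotation counts and the `term_and_descendants` mapping are hypothetical and chosen for illustration only. The sketch also assumes that the frequency of a term counts the annotations of the term itself together with those of every term below it, which is one reading of the formula and not a description of how any package implements it.

```r
# Toy annotation counts per term (hypothetical numbers, not real GO data).
# Hierarchy: root -> A -> C and root -> B -> C, so C has two parents.
direct_counts <- c(root = 0, A = 10, B = 30, C = 60)

# Each term mapped to itself plus all terms below it (assumed reading of
# "t and children of t"). Listing the terms explicitly avoids counting C
# twice for the root, although C is reachable through two paths.
term_and_descendants <- list(
  root = c("root", "A", "B", "C"),
  A    = c("A", "C"),
  B    = c("B", "C"),
  C    = "C"
)

N <- sum(direct_counts)  # total number of annotations in the toy corpus

# p(t): share of annotations attached to t or to any term below it
p <- vapply(term_and_descendants,
            function(ts) sum(direct_counts[ts]) / N,
            numeric(1))

# IC(t) = -log(p(t)); the root has p = 1 and therefore IC = 0
ic <- -log(p)

print(round(p, 2))
print(round(ic, 3))
```

In this toy corpus the most specific term `C` has the lowest frequency and hence the highest information content, which matches the statement that rarely used terms carry more information.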
34+
35+
As GO allows multiple parents for each concept, two terms can share
36+
parents by multiple paths. IC-based methods calculate the similarity of two GO terms based on the information content of their closest common ancestor term, also called the most informative common ancestor (MICA).
37+
38+
### Resnik method
39+
40+
The Resnik method is defined as:
41+
42+
43+
$$sim_{Resnik}(t_1,t_2) = IC(MICA)$$
44+
45+
### Lin method
46+
47+
The Lin method is defined as:
48+
49+
$$sim_{Lin}(t_1,t_2) = \frac{2IC(MICA)}{IC(t_1)+IC(t_2)}$$
50+
51+
### Rel method
52+
53+
The Relevance method, which was proposed by Schlicker, combines Resnik's and Lin's methods and is defined as:
54+
55+
$$sim_{Rel}(t_1,t_2) = \frac{2IC(MICA)(1-p(MICA))}{IC(t_1)+IC(t_2)}$$
56+
57+
### Jiang method
58+
59+
Jiang and Conrath's method is defined as:
60+
61+
$$sim_{Jiang}(t_1,t_2) = 1-\min(1, IC(t_1) + IC(t_2) - 2IC(MICA))$$
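
The four IC-based formulas above can be compared side by side with a short base R sketch. The frequencies used here (`p_mica`, `p_t1`, `p_t2`) are hypothetical values picked for illustration, not statistics from a real GO corpus, and the variable names are our own.

```r
# Hypothetical term frequencies (illustration only)
p_mica <- 0.20   # p(MICA): frequency of the most informative common ancestor
p_t1   <- 0.15   # p(t1)
p_t2   <- 0.18   # p(t2)

# Information content, IC(t) = -log(p(t))
ic_mica <- -log(p_mica)
ic_t1   <- -log(p_t1)
ic_t2   <- -log(p_t2)

# Resnik: IC of the MICA (not bounded above by 1)
sim_resnik <- ic_mica

# Lin: IC of the MICA relative to the IC of both terms
sim_lin <- 2 * ic_mica / (ic_t1 + ic_t2)

# Rel (Schlicker): Lin's score weighted by 1 - p(MICA)
sim_rel <- 2 * ic_mica * (1 - p_mica) / (ic_t1 + ic_t2)

# Jiang: one minus the IC-based distance, with the distance capped at 1
sim_jiang <- 1 - min(1, ic_t1 + ic_t2 - 2 * ic_mica)

print(round(c(Resnik = sim_resnik, Lin = sim_lin,
              Rel = sim_rel, Jiang = sim_jiang), 3))
```

Because the MICA is an ancestor of both terms, its information content cannot exceed theirs, so the Lin, Rel and Jiang scores stay within $[0, 1]$ while the Resnik score depends on the scale of the information content itself.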
62+
63+
64+
65+
## Graph-based method
66+
67+
Graph-based methods use the topology of the GO graph structure to
68+
compute semantic similarity. Formally, a GO term A can be represented
69+
as $DAG_{A}=(A,T_{A},E_{A})$ where $T_{A}$ is the set of GO terms in
70+
$DAG_{A}$, including term A and all of its ancestor terms in the GO
71+
graph, and $E_{A}$ is the set of edges connecting the GO terms in
72+
$DAG_{A}$.
73+
74+
### Wang method
75+
76+
To encode the semantics of a GO term in a measurable format that enables quantitative comparison, Wang [@wang_new_2007] first defined the semantic value of term A as the aggregate contribution of all terms in $DAG_{A}$ to the semantics of term A, where terms closer to term A in $DAG_{A}$ contribute more to its semantics. The contribution of a GO term $t$ to the semantics of GO term $A$ is defined as the S-value of GO term $t$ related to term $A$.
77+
78+
For any term $t$ in $DAG_{A}$, its S-value related to term $A$, $S_{A}(\textit{t})$, is defined as:
79+
80+
81+
$$\left\{\begin{array}{l} S_{A}(A)=1 \\ S_{A}(\textit{t})=\max\{w_{e} \times S_{A}(\textit{t}') \mid \textit{t}' \in \text{children of}(\textit{t}) \} \quad \text{if } \textit{t} \ne A \end{array} \right.$$
82+
83+
where $w_{e}$ is the semantic contribution factor for edge $e \in E_{A}$ linking term $t$ with its child term $t'$.
84+
The contribution of term $A$ to its own semantics is defined as 1. After obtaining the S-values for all terms in $DAG_{A}$,
85+
the semantic value of GO term A, $SV(A)$, is calculated as:
86+
87+
$$SV(A)=\displaystyle\sum_{t \in T_{A}} S_{A}(t)$$
88+
89+
Thus given two GO terms A and B, the semantic similarity between these two terms is defined as:
90+
91+
$$sim_{Wang}(A, B) = \frac{\displaystyle\sum_{t \in T_{A} \cap T_{B}}{\left(S_{A}(t) + S_{B}(t)\right)}}{SV(A) + SV(B)}$$
92+
93+
where $S_{A}(\textit{t})$ is the S-value of GO term $t$ related to term $A$
94+
and $S_{B}(\textit{t})$ is the S-value of GO term $t$ related to term $B$.
95+
96+
This method proposed by Wang [@wang_new_2007] determines the semantic
97+
similarity of two GO terms based on both the locations of these terms
98+
in the GO graph and their relations with their ancestor terms.
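
The definitions above can be traced on a small example. The following base R sketch computes S-values, semantic values and $sim_{Wang}$ for a toy graph. The term names, the `parents` list, the helper functions `s_values()` and `wang_sim()`, and the single contribution factor `w = 0.8` applied to every edge are all assumptions made for illustration; they are not taken from any package.

```r
# Toy DAG given as child -> parents (hypothetical terms, not real GO IDs)
parents <- list(
  root = character(0),
  X    = "root",
  Y    = "root",
  A    = c("X", "Y"),
  B    = "X"
)
w <- 0.8  # assumed semantic contribution factor, identical for every edge

# S-values of all terms in DAG_term: start from S(term) = 1 and propagate
# upwards, keeping for each ancestor the largest value over all paths
s_values <- function(term, parents, w) {
  s <- setNames(1, term)
  frontier <- term
  while (length(frontier) > 0) {
    next_frontier <- character(0)
    for (child in frontier) {
      for (p in parents[[child]]) {
        candidate <- w * s[[child]]
        # record the ancestor if it is new or if this path gives a larger value
        if (is.na(s[p]) || candidate > s[[p]]) {
          s[p] <- candidate
          next_frontier <- union(next_frontier, p)
        }
      }
    }
    frontier <- next_frontier
  }
  s
}

# sim_Wang: S-values of the shared ancestors over the two semantic values
wang_sim <- function(a, b, parents, w) {
  sa <- s_values(a, parents, w)
  sb <- s_values(b, parents, w)
  common <- intersect(names(sa), names(sb))
  sum(sa[common] + sb[common]) / (sum(sa) + sum(sb))
}

print(round(s_values("A", parents, w), 2))
print(round(wang_sim("A", "B", parents, w), 3))
```

Here terms `A` and `B` share the ancestors `X` and `root`, so only those two terms contribute to the numerator, while the denominator sums the S-values of every term in both sub-graphs.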
99+
100+
101+
102+
## Combine methods
103+
104+
Since a gene product can be annotated by multiple GO terms, semantic similarity among gene products needs to be aggregated from the semantic similarity scores of the multiple GO terms associated with the genes. Four methods are available for this aggregation: `max`, `avg`, `rcmax` and `BMA`.
105+
106+
### max
107+
108+
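The `max` method calculates the maximum semantic similarity score over all pairs of GO terms between the two GO term sets, where $go_{1i}$ ($1 \le i \le m$) and $go_{2j}$ ($1 \le j \le n$) denote the GO terms annotating genes $g_1$ and $g_2$, respectively.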
109+
110+
111+
$$sim_{max}(g_1, g_2) = \displaystyle\max_{1 \le i \le m, 1 \le j \le n} sim(go_{1i}, go_{2j})$$
112+
113+
### avg
114+
115+
The `avg` method calculates the average semantic similarity score over all pairs of GO terms.
116+
117+
118+
$$sim_{avg}(g_1, g_2) = \frac{\displaystyle\sum_{i=1}^m\sum_{j=1}^nsim(go_{1i}, go_{2j})}{m \times n}$$
119+
120+
### rcmax
121+
122+
The similarities between two sets of GO terms form a matrix. The `rcmax` method uses the maximum of `RowScore` and `ColumnScore`, where `RowScore` (or `ColumnScore`) is the average of the maximum similarities on each row (or column).
123+
124+
125+
$$sim_{rcmax}(g_1, g_2) = \max(\frac{\displaystyle\sum_{i=1}^m \max_{1 \le j \le n} sim(go_{1i}, go_{2j})}{m},\frac{\displaystyle\sum_{j=1}^n \max_{1 \le i \le m} sim(go_{1i},go_{2j})}{n})$$
126+
127+
### BMA
128+
129+
The `BMA` method uses the **B**est-**M**atch **A**verage strategy: it calculates the average of all maximum similarities on each row and column, and is defined as:
130+
131+
132+
$$sim_{BMA}(g_1, g_2) = \frac{\displaystyle\sum_{i=1}^m \max_{1 \le j \le n}sim(go_{1i}, go_{2j}) + \displaystyle\sum_{j=1}^n \max_{1 \le i \le m}sim(go_{1i}, go_{2j})} {m+n}$$
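
To make the difference between the four strategies concrete, the sketch below applies each formula to one small similarity matrix in base R. The matrix values and the term labels are hypothetical; rows stand for the $m = 3$ GO terms of $g_1$ and columns for the $n = 2$ GO terms of $g_2$.

```r
# Hypothetical pairwise similarities between the GO terms of two genes
sim <- matrix(c(0.9, 0.2,
                0.4, 0.6,
                0.1, 0.3),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("go11", "go12", "go13"),
                              c("go21", "go22")))

row_max <- apply(sim, 1, max)  # best match for each GO term of g1
col_max <- apply(sim, 2, max)  # best match for each GO term of g2

combined <- c(
  max   = max(sim),                                   # single best pair
  avg   = mean(sim),                                  # mean over all m * n pairs
  rcmax = max(mean(row_max), mean(col_max)),          # larger of RowScore and ColumnScore
  BMA   = (sum(row_max) + sum(col_max)) / (nrow(sim) + ncol(sim))  # best-match average
)

print(round(combined, 3))
```

On this matrix `avg` is pulled down by the many weakly related pairs, `max` reflects only the single best pair, and `rcmax` and `BMA` fall in between because they average the best matches only.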
133+
134+
135+
## Summary
136+
137+
The idea behind semantic similarity measurement is that genes with similar functions should have similar annotation vocabulary and lie close together in the ontology structure. Measuring similarity is critical for expanding knowledge, since similar objects tend to behave similarly. This supports many bioinformatics applications, including the inference of gene/protein function, miRNA function, genetic interaction, protein-protein interaction, miRNA-mRNA interaction and cellular localization.
138+
139+
140+
We developed several Bioconductor packages, including `r Biocpkg("GOSemSim")` [@yu2010; @yu_gosemsim_2020] for computing semantic similarity among GO terms, sets of GO terms, gene products and gene clusters (see also [Chapter 2](#GOSemSim)), `r Biocpkg("DOSE")` [@yu_dose_2015] for Disease Ontology (DO) (see also [Chapter 3](#DOSE-semantic-similarity)) and `r Biocpkg("meshes")` [@yu_meshes_2018] for Medical Subject Headings (MeSH) (see also [Chapter 4](#meshes-semantic-similarity)).
141+
142+

02_GOSimSim.Rmd

Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
1+
# GO semantic similarity analysis {#GOSemSim}
2+
3+
`r Biocpkg("GOSemSim")` implements all methods described in [Chapter 1](#semantic-similarity-overview), including four IC-based methods and one graph-based method.
4+
5+
## Semantic data {#semantic-data}
6+
7+
To measure semantic similarity, we need to prepare GO annotations, including the GO structure (i.e. GO term relationships) and the gene to GO mapping. For IC-based methods, the information content of a GO term is species-specific, so we need to calculate `IC` for all GO terms of a species before we measure semantic similarity.
8+
9+
10+
`r Biocpkg("GOSemSim")` provides the `godata()` function to prepare semantic data to support measuring GO and gene similarity. It internally uses the `r Biocpkg("GO.db")` package to obtain the GO structure and `OrgDb` for the gene to GO mapping.
11+
12+
13+
```{r godata}
library(GOSemSim)
# Prepare semantic data for the Molecular Function (MF) ontology from the human OrgDb
hsGO <- godata('org.Hs.eg.db', ont="MF")
```
17+
18+
Users can set `computeIC=FALSE` if they only want to use Wang's method, which relies on the graph structure and does not need information content.
19+
20+
21+
## Supported organisms {#gosemsim-supported-organisms}
22+
23+
`r Biocpkg("GOSemSim")` supports all organisms that have an `OrgDb` object available.
24+
25+
Bioconductor already provides `OrgDb` packages for [about 20 species](http://bioconductor.org/packages/release/BiocViews.html#___OrgDb).
26+
27+
We can query `OrgDb` online via the `r Biocpkg("AnnotationHub")` package. For example:
28+
29+
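```{r eval=FALSE}
library(AnnotationHub)
hub <- AnnotationHub()
# Search the hub for records matching the genus name
q <- query(hub, "Cricetulus")
# Take the ID of the last record returned by the query and download it
id <- q$ah_id[length(q)]
Cgriseus <- hub[[id]]
```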
36+
37+
If an organism is not supported by `r Biocpkg("AnnotationHub")`, users can use the `r Biocpkg("AnnotationForge")` package to build an `OrgDb` manually.
38+
39+
Once we have an `OrgDb`, we can build the annotation data needed by `r Biocpkg("GOSemSim")` via the `godata()` function described previously.
40+
41+
42+
## GO semantic similarity measurement {#go-semantic-simiarlity}
43+
44+
The `goSim()` function calculates semantic similarity between two GO terms, while the `mgoSim()` function calculates semantic similarity between two sets of GO terms.
45+
46+
47+
```{r gosemsim-gosim}
# Similarity between two GO terms, with an IC-based and a graph-based measure
goSim("GO:0004022", "GO:0005515", semData=hsGO, measure="Jiang")
goSim("GO:0004022", "GO:0005515", semData=hsGO, measure="Wang")

# Similarity between two sets of GO terms
go1 <- c("GO:0004022","GO:0004024","GO:0004174")
go2 <- c("GO:0009055","GO:0005515")
# combine=NULL leaves the pairwise scores uncombined; "BMA" aggregates them
mgoSim(go1, go2, semData=hsGO, measure="Wang", combine=NULL)
mgoSim(go1, go2, semData=hsGO, measure="Wang", combine="BMA")
```
55+
56+
## Gene semantic similarity measurement {#gene-go-semantic-similarity}
57+
58+
On the basis of semantic similarity between GO terms, [GOSemSim](https://www.bioconductor.org/packages/GOSemSim) can
59+
also compute semantic similarity among sets of GO terms, gene products, and gene clusters.
60+
61+
Suppose we have gene $g_1$ annotated by the GO term set $GO_{1}=\{go_{11},go_{12} \cdots go_{1m}\}$
62+
and $g_2$ annotated by $GO_{2}=\{go_{21},go_{22} \cdots go_{2n}\}$. `r Biocpkg("GOSemSim")` implements four combine methods, __*max*__, __*avg*__, __*rcmax*__, and __*BMA*__, to aggregate the semantic similarity scores of multiple GO terms (see also [section 1.3](#combine-methods)). The similarities
63+
among gene products and gene clusters that are annotated by multiple GO
64+
terms are also calculated by these combine methods.
65+
66+
67+
`r Biocpkg("GOSemSim")` provides `geneSim()` to calculate semantic similarity between two gene products, and `mgeneSim()` to calculate semantic similarity among multiple gene products.
68+
69+
```{r gosemsim-genesim}
# Similarity between two gene products
geneSim("241", "251", semData=hsGO, measure="Wang", combine="BMA")
# Similarities among multiple gene products, with Wang's and the Rel measure
mgeneSim(genes=c("835", "5261","241", "994"),
         semData=hsGO, measure="Wang", verbose=FALSE)
mgeneSim(genes=c("835", "5261","241", "994"),
         semData=hsGO, measure="Rel", verbose=FALSE)
```
76+
77+
By default, the `godata()` function uses `ENTREZID` as the `keytype`, so the input IDs are expected to be of type `ENTREZID`. Users can use other ID types such as `ENSEMBL`, `UNIPROT`, `REFSEQ`, `ACCNUM` and `SYMBOL`.
78+
79+
Here as an example, we use `SYMBOL` as `keytype` and calculate semantic similarities among several genes by using their gene symbol as input.
80+
81+
```{r gosemsim-mgeneSim}
# Use gene symbols as keys; IC is skipped because only Wang's method is used
hsGO2 <- godata('org.Hs.eg.db', keytype = "SYMBOL", ont="MF", computeIC=FALSE)
genes <- c("CDC45", "MCM10", "CDC20", "NMU", "MMP1")
mgeneSim(genes, semData=hsGO2, measure="Wang", combine="BMA", verbose=FALSE)
```
86+
87+
Users can also use [`clusterProfiler::bitr`](#bitr) to translate biological IDs.
88+
89+
## Gene cluster semantic similarity measurement {#gene-cluster-go-semantic-similarity}
90+
91+
92+
`r Biocpkg("GOSemSim")` also supports calculating semantic similarity between two gene clusters using the `clusterSim()` function and measuring semantic similarity among multiple gene clusters using the `mclusterSim()` function.
93+
94+
```{r gosemsim-clusterSim}
# Similarity between two gene clusters
gs1 <- c("835", "5261","241", "994", "514", "533")
gs2 <- c("578","582", "400", "409", "411")
clusterSim(gs1, gs2, semData=hsGO, measure="Wang", combine="BMA")

# Similarity among several clusters of randomly sampled genes with GO annotation
library(org.Hs.eg.db)
x <- org.Hs.egGO
hsEG <- mappedkeys(x)
set.seed(123)  # call set.seed() so that the sampling is reproducible
clusters <- list(a=sample(hsEG, 20), b=sample(hsEG, 20), c=sample(hsEG, 20))
mclusterSim(clusters, semData=hsGO, measure="Wang", combine="BMA")
```
106+
107+
108+
109+
110+
<!--
111+
112+
113+
## Applications
114+
115+
[GOSemSim](https://www.bioconductor.org/packages/GOSemSim) was cited by more than [200 papers](https://scholar.google.com.hk/scholar?oi=bibs&hl=en&cites=9484177541993722322,17633835198940746971,18126401808149291947) and had been applied to many research domains, including:
116+
117+
+ [Disease or Drug analysis](https://guangchuangyu.github.io/software/GOSemSim/featuredArticles/#diease-or-drug-analysis)
118+
+ [Gene/Protein functional analysis](https://guangchuangyu.github.io/software/GOSemSim/featuredArticles/#geneprotein-functional-analysis)
119+
+ [Protein-Protein interaction](https://guangchuangyu.github.io/software/GOSemSim/featuredArticles/#protein-protein-interaction)
120+
+ [miRNA-mRNA interaction](https://guangchuangyu.github.io/software/GOSemSim/featuredArticles/#mirna-mrna-interaction)
121+
+ [sRNA regulation](https://guangchuangyu.github.io/software/GOSemSim/featuredArticles/#srna-regulation)
122+
+ [Evolution](https://guangchuangyu.github.io/software/GOSemSim/featuredArticles/#evolution)
123+
124+
Find out more on <https://guangchuangyu.github.io/software/GOSemSim/featuredArticles/>.
125+
126+
127+
# GO enrichment analysis
128+
129+
GO enrichment analysis can be supported by our package [clusterProfiler](https://www.bioconductor.org/packages/clusterProfiler)[@yu2012], which supports hypergeometric test and Gene Set Enrichment Analysis (GSEA). Enrichment results across different gene clusters can be compared using __*compareCluster*__ function.
130+
131+
# Disease Ontology Semantic and Enrichment analysis
132+
133+
Disease Ontology (DO) annotates human genes in the context of disease. DO is an important annotation in translating molecular findings from high-throughput data to clinical relevance.
134+
[DOSE](https://www.bioconductor.org/packages/DOSE)[@yu_dose_2015] supports semantic similarity computation among DO terms and genes.
135+
Enrichment analysis including hypergeometric model and GSEA are also implemented to support discovering disease associations of high-throughput biological data.
136+
137+
# MeSH enrichment and semantic analyses
138+
139+
MeSH (Medical Subject Headings) is the NLM controlled vocabulary used to manually index articles for MEDLINE/PubMed. [meshes](https://www.bioconductor.org/packages/meshes) supports enrichment (hypergeometric test and GSEA) and semantic similarity analyses for more than 70 species.
140+
141+
142+
-->
