|
| 1 | +# GO semantic similarity analysis {#GOSemSim} |
| 2 | + |
| 3 | +`r Biocpkg("GOSemSim")` implements all methods described in [Chapter 1](#semantic-similarity-overview), including four IC-based methods and one graph-based method. |
| 4 | + |
| 5 | +## Semantic data {#semantic-data} |
| 6 | + |
| 7 | +To measure semantic similarity, we need to prepare GO annotations, including the GO structure (i.e. GO term relationships) and the gene-to-GO mapping. For IC-based methods, the information content (IC) of a GO term is species specific, so we need to calculate `IC` for all GO terms of a species before we measure semantic similarity. |
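
To make the species dependence concrete, here is a minimal base-R sketch of an information content calculation. It assumes the common definition $IC(t) = -\log p(t)$, where $p(t)$ is the fraction of annotations that fall on term $t$ or its descendants. The term names and counts are made up for illustration, and the actual computation inside `godata()` may differ in detail.

```r
# Toy annotation counts (hypothetical term names, not real GO IDs).
# Each count is assumed to include annotations to the term's descendants,
# so the root term carries the total number of annotations.
term_counts <- c(root = 200, binding = 120, kinase_activity = 15)

# p(t): probability of observing term t in this annotation set.
p <- term_counts / term_counts[["root"]]

# IC(t) = -log(p(t)): rarer (more specific) terms are more informative.
ic <- -log(p)
print(round(ic, 3))
```

Because the counts come from one species' annotations, the same GO term can receive a different IC in another species.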
| 8 | + |
| 9 | + |
| 10 | +`r Biocpkg("GOSemSim")` provides the `godata()` function to prepare semantic data to support measuring GO and gene similarity. It internally uses the `r Biocpkg("GO.db")` package to obtain the GO structure and an `OrgDb` for the gene-to-GO mapping. |
| 11 | + |
| 12 | + |
| 13 | +```{r godata} |
| 14 | +library(GOSemSim) |
| 15 | +hsGO <- godata('org.Hs.eg.db', ont="MF") |
| 16 | +``` |
| 17 | + |
| 18 | +Users can set `computeIC=FALSE` if they only want to use Wang's method, which is graph-based and does not require IC. |
| 19 | + |
| 20 | + |
| 21 | +## Supported organisms {#gosemsim-supported-organisms} |
| 22 | + |
| 23 | +`r Biocpkg("GOSemSim")` supports all organisms that have an `OrgDb` object available. |
| 24 | + |
| 25 | +Bioconductor already provides `OrgDb` packages for [about 20 species](http://bioconductor.org/packages/release/BiocViews.html#___OrgDb). |
| 26 | + |
| 27 | +We can query `OrgDb` online via the `r Biocpkg("AnnotationHub")` package. For example: |
| 28 | + |
| 29 | +```{r eval=FALSE} |
| 30 | +library(AnnotationHub) |
| 31 | +hub <- AnnotationHub() |
| 32 | +q <- query(hub, "Cricetulus") |
| 33 | +id <- q$ah_id[length(q)] |
| 34 | +Cgriseus <- hub[[id]] |
| 35 | +``` |
| 36 | + |
| 37 | +If an organism is not supported by `r Biocpkg("AnnotationHub")`, users can use the `r Biocpkg("AnnotationForge")` package to build an `OrgDb` manually. |
| 38 | + |
| 39 | +Once we have an `OrgDb`, we can build the annotation data needed by `r Biocpkg("GOSemSim")` via the `godata()` function described previously. |
| 40 | + |
| 41 | + |
| 42 | +## GO semantic similarity measurement {#go-semantic-simiarlity} |
| 43 | + |
| 44 | +The `goSim()` function calculates semantic similarity between two GO terms, while the `mgoSim()` function calculates semantic similarity between two sets of GO terms. |
| 45 | + |
| 46 | + |
| 47 | +```{r gosemsim-gosim} |
| 48 | +goSim("GO:0004022", "GO:0005515", semData=hsGO, measure="Jiang") |
| 49 | +goSim("GO:0004022", "GO:0005515", semData=hsGO, measure="Wang") |
| 50 | +go1 <- c("GO:0004022","GO:0004024","GO:0004174") |
| 51 | +go2 <- c("GO:0009055","GO:0005515") |
| 52 | +mgoSim(go1, go2, semData=hsGO, measure="Wang", combine=NULL) |
| 53 | +mgoSim(go1, go2, semData=hsGO, measure="Wang", combine="BMA") |
| 54 | +``` |
| 55 | + |
| 56 | +## Gene semantic similarity measurement {#gene-go-semantic-similarity} |
| 57 | + |
| 58 | +On the basis of semantic similarity between GO terms, [GOSemSim](https://www.bioconductor.org/packages/GOSemSim) can |
| 59 | +also compute semantic similarity among sets of GO terms, gene products, and gene clusters. |
| 60 | + |
| 61 | +Suppose we have gene $g_1$ annotated by the GO term set $GO_{1}=\{go_{11},go_{12} \cdots go_{1m}\}$ |
| 62 | +and $g_2$ annotated by $GO_{2}=\{go_{21},go_{22} \cdots go_{2n}\}$. `r Biocpkg("GOSemSim")` implements four combine methods, __*max*__, __*avg*__, __*rcmax*__, and __*BMA*__, to aggregate the semantic similarity scores of multiple GO terms (see also [section 1.3](#combine-methods)). The similarities |
| 63 | +among gene products and gene clusters that are annotated by multiple GO |
| 64 | +terms are also calculated by these combine methods. |
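
The four combine methods can be illustrated on a small term-by-term similarity matrix. The following base-R sketch uses made-up scores and assumes the usual definitions: *max* and *avg* take the maximum and mean over all pairs, *rcmax* takes the larger of the mean row maximum and the mean column maximum, and *BMA* averages the row and column maxima together. It is an illustration only, not the package's internal code.

```r
# Hypothetical pairwise similarities between the GO terms of g1 (rows)
# and the GO terms of g2 (columns); the values are invented.
scores <- matrix(c(0.8, 0.2, 0.4,
                   0.3, 0.9, 0.5),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("go11", "go12"),
                                 c("go21", "go22", "go23")))

# Best match of each term against the other gene's term set.
row_max <- apply(scores, 1, max)
col_max <- apply(scores, 2, max)

combined <- c(
  max   = max(scores),                          # single best pair
  avg   = mean(scores),                         # mean over all pairs
  rcmax = max(mean(row_max), mean(col_max)),    # larger of the two mean maxima
  BMA   = (sum(row_max) + sum(col_max)) /
          (nrow(scores) + ncol(scores))         # best-match average
)
print(round(combined, 3))
```

*max* is the most optimistic summary and *avg* the most conservative, while *rcmax* and *BMA* only reward each term's best counterpart.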
| 65 | + |
| 66 | + |
| 67 | +`r Biocpkg("GOSemSim")` provides `geneSim()` to calculate semantic similarity between two gene products, and `mgeneSim()` to calculate semantic similarity among multiple gene products. |
| 68 | + |
| 69 | +```{r gosemsim-genesim} |
| 70 | +geneSim("241", "251", semData=hsGO, measure="Wang", combine="BMA") |
| 71 | +mgeneSim(genes=c("835", "5261","241", "994"), |
| 72 | + semData=hsGO, measure="Wang",verbose=FALSE) |
| 73 | +mgeneSim(genes=c("835", "5261","241", "994"), |
| 74 | + semData=hsGO, measure="Rel",verbose=FALSE) |
| 75 | +``` |
| 76 | + |
| 77 | +By default, the `godata()` function uses `ENTREZID` as the keytype, so the input ID type is `ENTREZID`. Users can use other ID types such as `ENSEMBL`, `UNIPROT`, `REFSEQ`, `ACCNUM`, `SYMBOL`, etc. |
| 78 | + |
| 79 | +Here, as an example, we use `SYMBOL` as the `keytype` and calculate semantic similarities among several genes by using their gene symbols as input. |
| 80 | + |
| 81 | +```{r gosemsim-mgeneSim} |
| 82 | +hsGO2 <- godata('org.Hs.eg.db', keytype = "SYMBOL", ont="MF", computeIC=FALSE) |
| 83 | +genes <- c("CDC45", "MCM10", "CDC20", "NMU", "MMP1") |
| 84 | +mgeneSim(genes, semData=hsGO2, measure="Wang", combine="BMA", verbose=FALSE) |
| 85 | +``` |
| 86 | + |
| 87 | +Users can also use [`clusterProfiler::bitr`](#bitr) to translate biological IDs. |
| 88 | + |
| 89 | +## Gene cluster semantic similarity measurement {#gene-cluster-go-semantic-similarity} |
| 90 | + |
| 91 | + |
| 92 | +`r Biocpkg("GOSemSim")` also supports calculating semantic similarity between two gene clusters using the `clusterSim()` function and measuring semantic similarity among multiple gene clusters using the `mclusterSim()` function. |
| 93 | + |
| 94 | +```{r gosemsim-clusterSim} |
| 95 | +gs1 <- c("835", "5261","241", "994", "514", "533") |
| 96 | +gs2 <- c("578","582", "400", "409", "411") |
| 97 | +clusterSim(gs1, gs2, semData=hsGO, measure="Wang", combine="BMA") |
| 98 | +
|
| 99 | +library(org.Hs.eg.db) |
| 100 | +x <- org.Hs.egGO |
| 101 | +hsEG <- mappedkeys(x) |
| 102 | +set.seed(123) |
| 103 | +clusters <- list(a=sample(hsEG, 20), b=sample(hsEG, 20), c=sample(hsEG, 20)) |
| 104 | +mclusterSim(clusters, semData=hsGO, measure="Wang", combine="BMA") |
| 105 | +``` |
| 106 | + |
| 107 | + |
| 108 | + |
| 109 | + |
| 110 | +<!-- |
| 111 | +
|
| 112 | +
|
| 113 | +## Applications |
| 114 | +
|
| 115 | +[GOSemSim](https://www.bioconductor.org/packages/GOSemSim) was cited by more than [200 papers](https://scholar.google.com.hk/scholar?oi=bibs&hl=en&cites=9484177541993722322,17633835198940746971,18126401808149291947) and had been applied to many research domains, including: |
| 116 | +
|
| 117 | ++ [Disease or Drug analysis](https://guangchuangyu.github.io/software/GOSemSim/featuredArticles/#diease-or-drug-analysis) |
| 118 | ++ [Gene/Protein functional analysis](https://guangchuangyu.github.io/software/GOSemSim/featuredArticles/#geneprotein-functional-analysis) |
| 119 | ++ [Protein-Protein interaction](https://guangchuangyu.github.io/software/GOSemSim/featuredArticles/#protein-protein-interaction) |
| 120 | ++ [miRNA-mRNA interaction](https://guangchuangyu.github.io/software/GOSemSim/featuredArticles/#mirna-mrna-interaction) |
| 121 | ++ [sRNA regulation](https://guangchuangyu.github.io/software/GOSemSim/featuredArticles/#srna-regulation) |
| 122 | ++ [Evolution](https://guangchuangyu.github.io/software/GOSemSim/featuredArticles/#evolution) |
| 123 | +
|
| 124 | +Find out more on <https://guangchuangyu.github.io/software/GOSemSim/featuredArticles/>. |
| 125 | +
|
| 126 | +
|
| 127 | +# GO enrichment analysis |
| 128 | +
|
| 129 | +GO enrichment analysis can be supported by our package [clusterProfiler](https://www.bioconductor.org/packages/clusterProfiler)[@yu2012], which supports hypergeometric test and Gene Set Enrichment Analysis (GSEA). Enrichment results across different gene clusters can be compared using __*compareCluster*__ function. |
| 130 | +
|
| 131 | +# Disease Ontology Semantic and Enrichment analysis |
| 132 | +
|
| 133 | +Disease Ontology (DO) annotates human genes in the context of disease. DO is an important annotation in translating molecular findings from high-throughput data to clinical relevance. |
| 134 | +[DOSE](https://www.bioconductor.org/packages/DOSE)[@yu_dose_2015] supports semantic similarity computation among DO terms and genes. |
| 135 | +Enrichment analysis including hypergeometric model and GSEA are also implemented to support discovering disease associations of high-throughput biological data. |
| 136 | +
|
| 137 | +# MeSH enrichment and semantic analyses |
| 138 | +
|
| 139 | +MeSH (Medical Subject Headings) is the NLM controlled vocabulary used to manually index articles for MEDLINE/PubMed. [meshes](https://www.bioconductor.org/packages/meshes) supports enrichment (hypergeometric test and GSEA) and semantic similarity analyses for more than 70 species. |
| 140 | +
|
| 141 | +
|
| 142 | +--> |