diff --git a/episodes/03-import-annotate.Rmd b/episodes/03-import-annotate.Rmd index 2ae1993a..e0a72772 100644 --- a/episodes/03-import-annotate.Rmd +++ b/episodes/03-import-annotate.Rmd @@ -46,7 +46,7 @@ If you get any error messages about `there is no package called 'XXXX'` it means ## Load data -In the last episode, we used R to download 4 files from the internet and saved them on our computer. But we do not have these files loaded into R yet so that we can work with them. The original experimental design in [Blackmore et al. 2017](https://pubmed.ncbi.nlm.nih.gov/28696309/) was fairly complex: 8-12 weeks old male and female mice, with two different genetic backgrounds (wild-type and D2D), were collected at Day 0 (before influenza infection), Day 4 and Day 8 after influenza infection. From each mouse, cerebellum and spinal cord tissues were taken for RNA-seq. There were originally 4 mice per 'Sex X Day x Genotype' group, but a few were lost along the way resulting in a total of 45 samples. For this workshop, we are going to simplify the analysis by only using the 22 cerebellum samples. Expression quantification was done using STAR to align to the mouse genome and then counting reads that map to genes. In addition to the counts per gene per sample, we also need information on which sample belongs to which Sex/Time point/Replicate. And for the genes, it is helpful to have extra information called annotation. +In the last episode, we used R to download 4 files from the internet and saved them on our computer. But we do not have these files loaded into R yet so that we can work with them. The original experimental design in [Blackmore et al. 2017](https://pubmed.ncbi.nlm.nih.gov/28696309/) was fairly complex: 8 week old male and female C57BL/6 mice were collected at Day 0 (before influenza infection), Day 4 and Day 8 after influenza infection. From each mouse, cerebellum and spinal cord tissues were taken for RNA-seq. There were originally 4 mice per 'Sex x Time x Tissue' group, but a few were lost along the way resulting in a total of 45 samples. For this workshop, we are going to simplify the analysis by only using the 22 cerebellum samples. Expression quantification was done using STAR to align to the mouse genome and then counting reads that map to genes. In addition to the counts per gene per sample, we also need information on which sample belongs to which Sex/Time point/Replicate. And for the genes, it is helpful to have extra information called annotation. Let's read in the data files that we downloaded in the last episode and start to explore them: @@ -59,7 +59,7 @@ dim(counts) # View(counts) ``` -Genes are in rows and samples are in columns, so we have counts for 41,786 genes and 22 samples. The `View()` command has been comment out for the website, but running it will open a tab in RStudio that lets us look at the data and even sort the table by a particular column. However, the viewer cannot change the data inside the `counts` object, so we can only look, not permanently sort nor edit the entries. When finished, close the viewer using the X in the tab. Looks like the rownames are gene symbols and the column names are the GEO sample IDs, which are not very informative for telling us which sample is what. +Genes are in rows and samples are in columns, so we have counts for 41,786 genes and 22 samples. The `View()` command has been commented out for the website, but running it will open a tab in RStudio that lets us look at the data and even sort the table by a particular column. However, the viewer cannot change the data inside the `counts` object, so we can only look, not permanently sort nor edit the entries. When finished, close the viewer using the X in the tab. Looks like the rownames are gene symbols and the column names are the GEO sample IDs, which are not very informative for telling us which sample is what. ### Sample annotations @@ -349,7 +349,7 @@ manual curation. You can find more information in Bioconductor [Annotation Workshop](https://jmacdon.github.io/Bioc2022Anno/articles/AnnotationWorkshop.html) material. -Bioconductor has many packages and functions that can help you to get additional annotation information for your genes. The available resources are covered in more detail in [Episode 6 Gene set enrichment analysis](https://carpentries-incubator.github.io/bioc-rnaseq/06-gene-set-analysis.html#gene-set-resources). +Bioconductor has many packages and functions that can help you to get additional annotation information for your genes. The available resources are covered in more detail in [Episode 7 Gene set enrichment analysis](https://carpentries-incubator.github.io/bioc-rnaseq/07-gene-set-analysis.html#gene-set-resources). Here, we will introduce one of the gene ID mapping functions, `mapIds`: ``` @@ -357,6 +357,7 @@ mapIds(annopkg, keys, column, keytype, ..., multiVals) ``` Where + - *annopkg* is the annotation package - *keys* are the IDs that we **know** - *column* is the value we **want** diff --git a/episodes/04-exploratory-qc.Rmd b/episodes/04-exploratory-qc.Rmd index e5d3bdab..ef918abe 100644 --- a/episodes/04-exploratory-qc.Rmd +++ b/episodes/04-exploratory-qc.Rmd @@ -169,7 +169,7 @@ meanSdPlot(assay(dds), ranks = FALSE) ``` There are two ways around this: either we develop methods specifically adapted to count data, or we adapt (transform) the count data so that the existing methods are applicable. -Both ways have been explored; however, at the moment the second approach is arguably more widely applied in practice. We can transform our data using DESeq2's variance stablizing transformation and then verify that is had removed the correlation between average read count and variance. +Both ways have been explored; however, at the moment the second approach is arguably more widely applied in practice. We can transform our data using DESeq2's variance stabilizing transformation and then verify that it has removed the correlation between average read count and variance. ```{r mean-sd-plot-vst} vsd <- DESeq2::vst(dds, blind = TRUE) @@ -228,7 +228,7 @@ ggplot(pcaData, aes(x = PC1, y = PC2)) + 1. Assume you are mainly interested in expression changes associated with the time after infection (Reminder Day0 -> before infection). What do you need to consider in downstream analysis? -2. Consider an experimental design where you have multiple sample from the same donor. You are still interested in differences by time and observe the following PCA plot. What does this PCA plot suggest? +2. Consider an experimental design where you have multiple samples from the same donor. You are still interested in differences by time and observe the following PCA plot. What does this PCA plot suggest? ```{r pca-exercise, eval=TRUE, echo=FALSE, include=TRUE} vsd_ex <- vsd @@ -270,7 +270,7 @@ ggplot(pcaData, aes(x = PC1, y = PC2)) + ## Challenge: Plot the PCA colored by library sizes. -Compare before and after variance stablizing transformation. +Compare before and after variance stabilizing transformation. *Hint: The `DESeq2::plotPCA` expect an object of the class `DESeqTransform` as input. You can transform a `SummarizedExperiment` object using `plotPCA(DESeqTransform(se))`* @@ -311,7 +311,7 @@ ggplot(pcaData, aes(x = PC1, y = PC2)) + ## Interactive exploratory data analysis Often it is useful to look at QC plots in an interactive way to directly explore different experimental factors or get insides from someone without coding experience. -Useful tools for interactive exploratory data analysis for RNA-seq are [glimma](https://bioconductor.org/packages/release/bioc/html/Glimma.html) and [iSEE](https://bioconductor.org/packages/release/bioc/html/iSEE.html) +Useful tools for interactive exploratory data analysis for RNA-seq are [Glimma](https://bioconductor.org/packages/release/bioc/html/Glimma.html) and [iSEE](https://bioconductor.org/packages/release/bioc/html/iSEE.html) ::::::::::::::::::::::::::::::::::::::: challenge diff --git a/episodes/05-differential-expression.Rmd b/episodes/05-differential-expression.Rmd index 6996b084..2e6087e4 100644 --- a/episodes/05-differential-expression.Rmd +++ b/episodes/05-differential-expression.Rmd @@ -120,7 +120,7 @@ The within-group variance of the counts for a gene following a negative binomial $var = \mu + \theta \mu^2$ -$\theta$ represents the gene-specific __dispersion__, a measure of variability or spread in the data. As a second step, we need to estimate gene-wise dispersions to get the expected within-group variance and test for group differences. Good dispersion estimates are challenging with a few sample per group only. Thus, information from genes with similar expression pattern are "borrowed". Gene-wise dispersion estimates are *shrinked* towards center values of the observed distribution of dispersions. With `DESeq2` we can get dispersion estimates using the `estimateDispersions()` function. +$\theta$ represents the gene-specific __dispersion__, a measure of variability or spread in the data. As a second step, we need to estimate gene-wise dispersions to get the expected within-group variance and test for group differences. Good dispersion estimates are challenging with a few samples per group only. Thus, information from genes with similar expression pattern are "borrowed". Gene-wise dispersion estimates are *shrinked* towards center values of the observed distribution of dispersions. With `DESeq2` we can get dispersion estimates using the `estimateDispersions()` function. We can visualize the effect of *shrinkage* using `plotDispEsts()`: ```{r estimate-dispersions} @@ -130,11 +130,11 @@ plotDispEsts(dds) ### Testing -We can use the `nbinomWaldTest()`function of `DESeq2` to fit a *generalized linear model (GLM)* and compute *log2 fold changes* (synonymous with "GLM coefficients", "beta coefficients" or "effect size") corresponding to the variables of the *design matrix*. The *design matrix* is directly related to the *design formular* and automatically derived from it. Assume a design formular with one variable (`~ treatment`) and two factor level (treatment and control). The mean expression $\mu_{j}$ of a specific gene in sample $j$ will be modeled as following: +We can use the `nbinomWaldTest()`function of `DESeq2` to fit a *generalized linear model (GLM)* and compute *log2 fold changes* (synonymous with "GLM coefficients", "beta coefficients" or "effect size") corresponding to the variables of the *design matrix*. The *design matrix* is directly related to the *design formula* and automatically derived from it. Assume a design formula with one variable (`~ treatment`) and two factor levels (treatment and control). The mean expression $\mu_{j}$ of a specific gene in sample $j$ will be modeled as following: $log(μ_j) = β_0 + x_j β_T$, - with $β_T$ corresponding to the log2 fold change of the treatment group, $x_j$ = 1, if $j$ belongs to the treatment group and $x_j$ = 0, if $j$ belongs to the control group. + with $β_T$ corresponding to the log2 fold change of the treatment groups, $x_j$ = 1, if $j$ belongs to the treatment group and $x_j$ = 0, if $j$ belongs to the control group. Finally, the estimated log2 fold changes are scaled by their standard error and tested for being significantly different from 0 using the *Wald test*. @@ -165,18 +165,18 @@ dds <- nbinomWaldTest(dds) ## Explore results for specific contrasts -The `results()` function can be used to extract gene-wise test statistics, such as log2 fold changes and (adjusted) p-values. The comparison of interest can be defined using contrasts, which are linear combinations of the model coefficients (equivalent to combinations of columns within the *design matrix*) and thus directly related to the design formular. A detailed explanation of design matrices and how to use them to specify different contrasts of interest can be found in the section on the [exploration of design matrices](../episodes/06-extra-design.Rmd). In the `results()` function a contrast can be represented by the variable of interest (reference variable) and the related level to compare using the `contrast` argument. By default the reference variable will be the __last variable__ of the design formular, the *reference level* will be the first factor level and the *last level* will be used for comparison. You can also explicitly specify a contrast by the `name` argument of the `results()` function. Names of all available contrasts can be accessed using `resultsNames()`. +The `results()` function can be used to extract gene-wise test statistics, such as log2 fold changes and (adjusted) p-values. The comparison of interest can be defined using contrasts, which are linear combinations of the model coefficients (equivalent to combinations of columns within the *design matrix*) and thus directly related to the design formula. A detailed explanation of design matrices and how to use them to specify different contrasts of interest can be found in the section on the [exploration of design matrices](../episodes/06-extra-design.Rmd). In the `results()` function a contrast can be represented by the variable of interest (reference variable) and the related level to compare using the `contrast` argument. By default the reference variable will be the __last variable__ of the design formula, the *reference level* will be the first factor level and the *last level* will be used for comparison. You can also explicitly specify a contrast by the `name` argument of the `results()` function. Names of all available contrasts can be accessed using `resultsNames()`. ::::::::::::::::::::::::::::::::::::: challenge What will be the default __contrast__, __reference level__ and __"last level"__ for comparisons when running `results(dds)` for the example used in this lesson? -*Hint: Check the design formular used to build the object.* +*Hint: Check the design formula used to build the object.* :::::::::::::::::::::::: solution -In the lesson example the last variable of the design formular is `time`. +In the lesson example the last variable of the design formula is `time`. The __reference level__ (first in alphabetical order) is `Day0` and the __last level__ is `Day8` ```{r check-time-levels} @@ -235,7 +235,7 @@ head(resSex[order(resSex$pvalue), ]) ### Multiple testing correction -Due to the high number of tests (one per gene) our DE results will contain a substantial number of __false positives__. For example, if we tested 20,000 genes at a threshold of $\alpha = 0.05$ we would expect 1000 significant DE genes with no differential expression. +Due to the high number of tests (one per gene) our DE results will contain a substantial number of __false positives__. For example, if we tested 20,000 genes at a threshold of $\alpha = 0.05$ we would expect 1,000 significant DE genes with no differential expression. To account for this expected high number of false positives, we can correct our results for __multiple testing__. By default `DESeq2` uses the [Benjamini-Hochberg procedure](https://link.springer.com/referenceworkentry/10.1007/978-1-4419-9863-7_1215) to calculate __adjusted p-values__ (padj) for DE results. diff --git a/episodes/07-gene-set-analysis.Rmd b/episodes/07-gene-set-analysis.Rmd index e587c280..3d6a3853 100644 --- a/episodes/07-gene-set-analysis.Rmd +++ b/episodes/07-gene-set-analysis.Rmd @@ -115,7 +115,7 @@ between genders. The following code performs **DESeq2** analysis which you should have already learnt in the previous episode. In the end, we have a list of DE genes filtered by FDR < 0.05, and save it in the object `sexDEgenes`. -The file `data/GSE96870_se.rds` contains a `RangedSummarizedExperiment` contains RNA-Seq counts that were downloaded in [Episode 2](../episodes/02-setup.Rmd) and constructed in [Episode 3](../episodes/03-import-annotate.Rmd) (minimal codes for downloading and constructing in the script +The file `data/GSE96870_se.rds` contains a `RangedSummarizedExperiment` that contains RNA-Seq counts that were downloaded in [Episode 2](../episodes/02-setup.Rmd) and constructed in [Episode 3](../episodes/03-import-annotate.Rmd) (minimal codes for downloading and constructing in the script [`download_data.R`](https://github.com/carpentries-incubator/bioc-rnaseq/blob/main/episodes/download_data.R). In the following code, there are also comments that explain every step of the analysis. @@ -183,7 +183,7 @@ consistent in the two. Later in this episode, we will learn how to perform gene ID conversion in the ORA analysis. -Since the DE genes and the gene set can be mathematically thought as two sets, +Since the DE genes and the gene set can be mathematically thought of as two sets, a natural way is to first visualize them with a Venn diagram. ```{r venn-diagram, fig.width = 5*2, fig.height = 5*2, dpi = 72*2} @@ -208,7 +208,7 @@ go to the next section. ### Fisher's exact test To statistically measure the enrichment, the relationship of DE genes and the -gene set is normally formatted into the following 2x2 contigency table, where +gene set is normally formatted into the following 2x2 contingency table, where in the table are the numbers of genes in different categories. $n_{+1}$ is the size of the _XY gene set_ (i.e. the number of member genes), $n_{1+}$ is the number of DE genes, $n$ is the number of total genes. @@ -248,7 +248,7 @@ matrix(c(n_11, n_12, n_10, n_21, n_22, n_20, n_01, n_02, n), nrow = 3, byrow = TRUE) ``` -And we fill these numbers into the 2x2 contigency table: +And we fill these numbers into the 2x2 contingency table: