From b91e3a3b1794c97e20ec37428079577572c788d6 Mon Sep 17 00:00:00 2001
From: candicechu
Date: Sun, 13 Mar 2016 18:06:56 -0500
Subject: [PATCH] Updated CTEHR Training (markdown)

---
 CTEHR-Training.md | 74 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 73 insertions(+), 1 deletion(-)

diff --git a/CTEHR-Training.md b/CTEHR-Training.md
index 433cf09..8ebdaba 100644
--- a/CTEHR-Training.md
+++ b/CTEHR-Training.md
@@ -101,5 +101,77 @@ Run `map.sh`:
     $ cd ../sequencing-pipeline/
     $ main-scripts/map.sh lists/candice_list 2> err.log | tee out.log
 
+>If there is a problem processing the post ribosomal reads, Julia needs to be set up in your home directory. If you are on nfsc-oracle, do the following:
+cd ~
+julia
+julia> Pkg.update()
+julia> Pkg.add("HDF5")
+julia> Pkg.add("JLD")
+This will set up the ~/.julia directory, update the main Julia packages, and add the HDF5 and JLD packages needed to run misc-scripts/ribo.jl, which does the post ribosomal processing.
+
 #6. Analysis Summarization
-#7. Gene Differential Expression Analysis
\ No newline at end of file
+>To tell how the mapping went, we would like to see an overview (summary). Often, an experiment will have additional 'metadata' describing the experimental conditions of each sample. It is typically useful to have this information alongside the sequencing summary to determine whether there are any patterns (perhaps all treatment samples have a lower number of mapped reads). Such patterns are often difficult to spot using only sample names.
+
+>To do this, we need to create a `key file` as a comma-delimited file (.csv). The first column must contain your sample names and have the title `sample`. The other columns may indicate the different treatment groups, metadata, etc. It is usually easiest to do this in Excel (use one of the existing key files for guidance) and then save the output as a .csv file.
+This key file will also be used when running edgeR for statistical comparisons (differential expression) between groups.
+
+Create a key file:
+
+    $ vim candice-key.csv
+
+candice-key.csv:
+
+    sample, type
+    DMSO1, control
+    DMSO2, control
+    DMSO3, control
+    TCDD1, treatment
+    TCDD2, treatment
+    TCDD3, treatment
+
+    $ main-scripts/summary.py lists/candice_list keys/candice-key.csv
+
+>This will create output in the analysis/ directory. There you can find the raw counts of the experimental samples in both samples x genes and genes x samples format (in the -count.csv and -count.T.csv files, respectively). You can also find a summary spreadsheet in .summary.csv. There is also an .h5 file, which can be read efficiently from Python using the pandas library.
+
+>The columns of the `summary.csv` spreadsheet are as follows:
+total-reads: Number of reads in the FASTQ file.
+uniq-reads: Number of unique reads in the FASTQ file.
+grch38-reads: Number of mapped reads (counting each multi-mapper separately).
+grch38-uniq: Number of reads which mapped uniquely (once) to the reference genome (this is separate from uniq-reads above).
+grch38-multi: Number of reads which mapped to multiple locations in the reference.
+grch38-annotated-reads: Number of resulting reads mapped to annotated regions of the genome (genes, lncRNA, etc.).
+mito-reads: Number of reads mapping to mitochondrial regions of the reference.
+ercc-reads: Number of reads which mapped to ERCC transcript sequences.
+rrna-reads: Number of reads which mapped to ribosomal sequences.
+htseq-0-genes: Number of genes which had at least 1 read.
+htseq-3-genes: Number of genes which had at least 4 reads.
+htseq-10-genes: Number of genes which had at least 11 reads.
+
+#7. Gene Differential Expression Analysis
+>Now you are ready to find the differentially expressed genes. You will need to edit `run-edger.R` to set up the comparisons you want to run.
+Edit `/home/candice/Assignment/sequencing-pipeline/main-scripts/run-edger.R`
+    $ vim run-edger.R
+Press `esc`, then type `:set nu` to display line numbers.
+>The list file and key file locations need to be specified inside the script, on line 19. Replace this line with one containing the actual list and key files to be used.
+
+    args = c("lists/candice_list","keys/candice-key.csv")
+
+>Then provide an experiment name (result directory name) that is specific to the analysis by modifying line 45.
+
+    ename = "edger-pair-treatment"
+
+>Add the keys (columns in the key file) across which the differential expression analysis has to be performed. This is done by modifying line 46. When only one column is selected, `drop=FALSE` keeps `factors` a data frame so that the following lines still work.
+
+    factors = key[order(rownames(key)), c("type"), drop=FALSE]
+
+>Here, "type" is the column used. Replace it with the relevant column(s) in your key file. Also change line 49 to use the same column(s).
+
+    design = model.matrix(~type, data=factors)
+
+>The factor for the pairwise analysis is specified on line 50. "type" in this line has to be replaced with the factor against which the differential expression analysis is performed.
+
+    groups = factors$type
+
+>Run run-edger.R.
+
+    $ Rscript main-scripts/run-edger.R
+
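Reviewer note on the key file from section 6: before running `summary.py` or `run-edger.R`, it may help to check that the key file has the layout the text describes. The following is a minimal Python sketch and is not part of the pipeline; `load_key` and `group_sizes` are hypothetical helper names, and the only assumptions are the ones stated in the patch (a comma-delimited file whose first column is titled `sample`, with further columns holding groups or metadata).

```python
import csv
import io


def load_key(handle):
    """Parse a key file into (factor column names, {sample: {column: value}})."""
    # skipinitialspace tolerates the "sample, type" style with a space after each comma.
    reader = csv.reader(handle, skipinitialspace=True)
    header = [name.strip() for name in next(reader, [])]
    if not header or header[0] != "sample":
        raise ValueError("first column must be titled 'sample'")
    rows = {}
    for fields in reader:
        if not any(field.strip() for field in fields):
            continue  # skip blank lines
        fields = [field.strip() for field in fields]
        if len(fields) != len(header):
            raise ValueError(f"wrong number of fields in row: {fields}")
        if fields[0] in rows:
            raise ValueError(f"duplicate sample name: {fields[0]}")
        rows[fields[0]] = dict(zip(header[1:], fields[1:]))
    return header[1:], rows


def group_sizes(rows, column):
    """Count how many samples fall into each level of one key column."""
    counts = {}
    for values in rows.values():
        counts[values[column]] = counts.get(values[column], 0) + 1
    return counts


# The example key file from section 6, inlined so the sketch runs on its own.
example = """sample, type
DMSO1, control
DMSO2, control
DMSO3, control
TCDD1, treatment
TCDD2, treatment
TCDD3, treatment
"""

columns, rows = load_key(io.StringIO(example))
print(columns)                     # ['type']
print(group_sizes(rows, "type"))   # {'control': 3, 'treatment': 3}
```

To check a real file, pass an open file handle instead of the inlined string, for example `load_key(open("keys/candice-key.csv", newline=""))`. A column with only one level in `group_sizes` would leave nothing to compare in the pairwise analysis, so it is worth looking at the counts before editing `run-edger.R`.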