Skip to content

Latest commit

 

History

History
481 lines (393 loc) · 27.2 KB

Workflow_simplified.md

File metadata and controls

481 lines (393 loc) · 27.2 KB

Differential microRNA expression

The goal of this project was, starting from the sequencing files provided by the Sequencing Core of WTCHG, to generate a list of candidate microRNAs differentially expressed in healthy controls versus patients with ankylosing spondylitis for further experimental validation. Sequencing project number is P150188.

Experimental Design

Two groups of cells -- Th17 and non-Th17 -- were separately sequenced for healthy controls and AS patients. This table provides the names of the samples and the names of the corresponding sequencing files. Note that one of the samples -- AS3-Th17 was not sequenced due to failed library prep (message from the core: One sample, AS3-TH17, failed the library prep and could not be included.)

The miRNA samples were run on a HiSeq2500 Rapid mode. With Rapid mode the libraries are run on a single sequencing flowcell that has two lanes, so two sequencing files will be produced for one sample (message from the core: the data can be merged between the lanes).

In this table AS stands for AS patients and HC stands for healthy controls.

Sample type File names
HC1-nonTh17 WTCHG_189135_274, WTCHG_189136_274
HC2-nonTh17 WTCHG_189135_276, WTCHG_189136_276
HC3-nonTh17 WTCHG_189135_278, WTCHG_189136_278
HC4-nonTh17 WTCHG_189135_280, WTCHG_189136_280
HC1-Th17 WTCHG_189135_275, WTCHG_189136_275
HC2-Th17 WTCHG_189135_277, WTCHG_189136_277
HC3-Th17 WTCHG_189135_279, WTCHG_189136_279
HC4-Th17 WTCHG_189135_281, WTCHG_189136_281
AS1-nonTh17 WTCHG_189135_282, WTCHG_189136_282
AS2-nonTh17 WTCHG_189135_284, WTCHG_189136_284
AS3-nonTh17 WTCHG_189135_286, WTCHG_189136_286
AS4-nonTh17 WTCHG_189135_288, WTCHG_189136_288
AS1-Th17 WTCHG_189135_283, WTCHG_189136_283
AS2-Th17 WTCHG_189135_285, WTCHG_189136_285
AS4-Th17 WTCHG_189135_289, WTCHG_189136_289

Bioinformatic pipeline to analyze miRNA sequencing data

Sequencing data is stored in fastq files -- text files of a certain format containing sequenced reads' names, sequences and sequencing quality scores.

First eight lines of WTCHG_189135_286_1.fastq:

@HISEQ2500-09:311:H3K5LBCXX:1:1101:1471:1980 1:N:0:CTCAGCTG # read name
NAGGGAGGACTTCTCTGAGGAGATCGGAAGAGCACACGTCTGAACTCCAGT         # read sequence
+                                                           # delimiter line
#<DDDGHHHIIIIIIIIHIIHIIIIIIIIIIIIIIIIIIIIIIHIIIIIHI         # sequencing quality
@HISEQ2500-09:311:H3K5LBCXX:1:1101:3278:1986 1:N:0:CTCAGCTG # read name
NAAGTGGACGTATAGGGTGTGACGTAGATCGGAAGAGCACACGTCTGAACT         # read sequence
+                                                           # delimiter line
#<DDDIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIHHHIIIHIIIIIIIF         # sequencing quality

Analysis workflow starting from the fastq files:

  • Checking quality of the initial fastq files: checking whether any Illumina adapters used for sequencing can be found; whether reads contain low quality bases; what is the length of the reads. This is done with a tool called fastQC.
  • Cleaning the fastq files: removing adapters ("Overrepresented sequences" in the fastQC report) found during the previous analysis step (bash command); trimming low quality bases at the end of the reads; discarding reads shorter than 15nt or longer than 35nt (tool cutadapt).
  • Rerunning quality control on trimmed fastq files.
  • Aligning trimmed fastq files to the database of known human miRNA sequences with bowtie.
  • Checking the quality of alignment: how many reads were aligned perfectly (no mismatches) and uniquely (tool samtools).
  • Extracting numbers of reads mapped to various microRNAs.
  • Calculating expression of microRNAs -- normalizing the numbers of mapped reads by the number of sequenced reads (library size) and performing statistical test to find significant differential expression (tool DESeq).

Analysis workflow

Checking quality control on initial fastq files

An example report of quality checks on initial fastq files for the sample AS2-Th17 can be found here. From the report we can see that the following checks failed:

  • per base sequence content: this is expected, as the tool was initially designed to work with whole genome reads. MicroRNAs have a sequence content different from the whole genome, this is why this check is expected to fail. Also, as we will see later, adapters were not trimmed yet and are overrepresented -- their sequence also biases the sequence content of the reads.
  • sequence duplication level: this is expected, as the tool was designed to work with DNA data, and RNA (and miRNA) has expression, so fragments of certain reads are expected to be highly represented.
  • overrepresented sequences and adapter content: this failed because the Sequencing Core has not trimmed Illumina sequencing adapters after sequencing. We will remove these adapters.

Cleaning the fastq files

Identified adapters were removed from the 5' end of reads, reads shorter than 15 or longer than 35 nucleotides were discarded.

Checking quality control on trimmed fastq files

An example report of quality checks on trimmed fastq files for the sample AS2-Th17 can be found here. We can see that only sequence duplication level (and for some files -- per nucleotide sequence content) still fails -- as expected (see the explanation above). Note that reads which are too short or too long after removing adapters will be discarded, so the number of reads in a file can change. This table contains the information on the number of reads before and after cleaning.

Sample type File name1 Before trim After trim File name2 Before trim After trim
HC1-nonTh17 WTCHG_189135_274 16,658,312 10,307,964 WTCHG_189136_274 17,111,904 10,735,176
HC2-nonTh17 WTCHG_189135_276 23,180,836 12,294,948 WTCHG_189136_276 23,817,544 12,692,556
HC3-nonTh17 WTCHG_189135_278 28,641,100 14,340,808 WTCHG_189136_278 29,263,204 14,677,208
HC4-nonTh17 WTCHG_189135_280 27,021,828 14,380,188 WTCHG_189136_280 27,634,528 14,715,880
HC1-Th17 WTCHG_189135_275 18,712,164 11,088,016 WTCHG_189136_275 19,226,884 11,487,684
HC2-Th17 WTCHG_189135_277 25,634,288 12,997,776 WTCHG_189136_277 26,178,792 13,286,152
HC3-Th17 WTCHG_189135_279 24,934,788 12,534,560 WTCHG_189136_279 25,451,552 12,808,388
HC4-Th17 WTCHG_189135_281 36,458,196 20,089,092 WTCHG_189136_281 37,117,612 20,519,576
AS1-nonTh17 WTCHG_189135_282 77,242,220 43,030,708 WTCHG_189136_282 79,433,628 44,313,332
AS2-nonTh17 WTCHG_189135_284 20,299,204 13,918,564 WTCHG_189136_284 20,800,688 14,275,212
AS3-nonTh17 WTCHG_189135_286 23,372,832 14,729,640 WTCHG_189136_286 23,979,804 15,319,740
AS4-nonTh17 WTCHG_189135_288 31,656,996 16,691,080 WTCHG_189136_288 32,283,960 17,033,808
AS1-Th17 WTCHG_189135_283 30,916,908 16,576,272 WTCHG_189136_283 31,723,492 17,048,232
AS2-Th17 WTCHG_189135_285 23,164,272 19,867,816 WTCHG_189136_285 23,761,664 20,393,544
AS4-Th17 WTCHG_189135_289 39,985,232 30,642,800 WTCHG_189136_289 41,047,460 31,462,132

After trimming and discarding reads, all samples have number of reads sufficient for the downstream analysis.

Aligning reads to the database of human microRNAs

This table contains numbers of reads mapped to miRNAs (merged for two sequenced files). Column RNA contains the information about the input material (number of ng).

Sample type File name1 File name2 Number of mapped reads RNA, ng
HC1-nonTh17 WTCHG_189135_274 WTCHG_189136_274 42,748 90.83
HC2-nonTh17 WTCHG_189135_276 WTCHG_189136_276 995,035 100.00
HC3-nonTh17 WTCHG_189135_278 WTCHG_189136_278 677,173 100.00
HC4-nonTh17 WTCHG_189135_280 WTCHG_189136_280 450,159 100.00
HC1-Th17 WTCHG_189135_275 WTCHG_189136_275 86,821 85.03
HC2-Th17 WTCHG_189135_277 WTCHG_189136_277 955,892 69.76
HC3-Th17 WTCHG_189135_279 WTCHG_189136_279 926,892 74.42
HC4-Th17 WTCHG_189135_281 WTCHG_189136_281 153,514 49.08
AS1-nonTh17 WTCHG_189135_282 WTCHG_189136_282 1,188,284 100.00
AS2-nonTh17 WTCHG_189135_284 WTCHG_189136_284 298,505 36.46
AS3-nonTh17 WTCHG_189135_286 WTCHG_189136_286 665,606 46.10
AS4-nonTh17 WTCHG_189135_288 WTCHG_189136_288 1,305,340 77.02
AS1-Th17 WTCHG_189135_283 WTCHG_189136_283 574,407 43.61
AS2-Th17 WTCHG_189135_285 WTCHG_189136_285 10,110 33.27
AS4-Th17 WTCHG_189135_289 WTCHG_189136_289 706,156 37.87

Correlation between to sequencing runs is 0.99. Correlation between number of mapped reads and the amount of RNA input material is 0.39.

Here you can see how the aligner was chosen and all bioinformatic details including commands.

Checking alignment quality

Here you can see other alignment results -- alignment to miRNAs from other organisms than human, to human DNA, to all non-coding human RNAs.

Extracting reads mapped to miRNAs

This table was generated by the Sequencing Core and contains numbers of reads mapped to various human miRNAs. Note that this is not expression data.

Looking at the number of reads mapped to miRNAs, one sample has too low number of reads mapped to miRNAs and was excluded from further analysis -- sample 285 a.k.a. AS2-Th17.

alt text

Here is the file containing numbers of reads mapped to microRNAs, excluding sample 285.

Differential miRNA expression analysis

Differential expression was performed by the DESeq package. Here you can find more bioinformatic details including commands (R code).

Firstly, pairwise differential expression analysis between each pair of conditions was performed. Unfortunately this analysis produced only one statistically significant difference in miRNA expressions (see the description after the links to the files with differential expression analysis results). The absence of almost any differentially expressed miRNAs is most likely due to the low amount of identified miRNAs and mainly low number of reads mapped to miRNAs.

These two figures demonstrate how low number of identified miRNAs affect the analysis. The first row of the figures contains scatter plots with the dispersion estimation. The second row of figures contains log2 fold change of gene expression against the mean (average) of normalized counts (significant differential expression is highlighted in red).

Taejong's data Public data
alt text alt text
alt text alt text

We have four groups, HC_Th17, HC_nonTh17, AS_Th17, AS_nonTh17. The results of pairwise differential miRNA expression analysis for these groups can be found here:

One statistically significant hit: in the comparison of HC Th17 vs AS Th17.

name_of_miRNA_	baseMean	baseMeanA	baseMeanB	foldChange	log2FoldChange	pvalue	padj
hsa-miR-10b-5p	29.11241	0.1963728	86.944483	442.752059	8.790355204839	1.5e-05	0.013

More information about the hsa-miR-10b-5p microRNA can be found here. Limited information about upregulation of miR-10b in cancers is available. In this data, this microRNA is upregulated in AS versus HC. However, the expression of this microRNA in HC is probably too low to select it as a candidate for experimental validation with qPCR (qPCR might not be sensivite enough). Raw numbers of reads mapped to this microRNA:

mirna         	HC1-Th17	HC2-Th17	HC3-Th17	HC4-Th17	AS1-Th17	AS4-nonTh17
hsa-miR-10b-5p	0       	0       	1       	0       	97      	22

Secondly, we performed ANOVA test across all four groups of samples. It is done with an R package limma, and the details (including commands) can be found here.

When running limma for only reasonably expressed microRNAs (at least 750 reads mapped to a microRNA in all samples in total), all microRNAs were identified as significantly differentially. expressed Even when run for 1500 microRNAs, (at least 1 read mapped to each microRNA in at least one of the samples), limma gave significant pvalues. For absolutely each microRNA.

As any other statistical test, ANOVA works best when the assumptions -- in case of ANOVA, normal distribution of the data and comparable standard deviation -- are true for the analyzed dataset. Unfortunately, for this dataset the number of expressed microRNAs and the levels of expression are very low, which makes the data non-randomly distributed (maybe rather a Poisson distribution?). Additionally, the level of noise increases when the signal (number of mapped reads) is so low.

Thirdly, we ignored pvalues and applied cutoffs on miRNA expression under each condition and the fold change to create a list of miRNA candidates for future experimental validation.

Filtering steps:

  • remove lines containing NA -- this means that miRNA expression was zero under both conditions.
  • remove lines containing Inf -- this means that zero reads was mapped in one of the samples. Genes and miRNAs never get absolute zero expression (unless they are knocked out), so when zero reads map to an miRNA, this indicates that not enough miRNA was sequenced or this particular miRNA was not captured -- but not that it is 100% silenced.
  • log2foldChange should be either below -2 or above 2 -- such filter is common practice in any differential expression analysis (also for genes), as it is believed/assumed that qPCR won''t capture smaller differences in expression.
  • average miRNA expression under one condition (baseMeanA) should be above 5 -- lower expression probably won't be accurately measured by qPCR.
  • average miRNA expression under the second condition (baseMeanB) should be above 5.

Commands used for filtering and creating candidate lists can be found here.

This figure illustrates which fold changes we select. All log2 of fold changes are shown in black, the selected ones (above 2 or below -2) are shown in red.

alt text

This table contains number of miRNAs remaining after each filtering step in each pairwise comparison.

Filtering steps HC Th17 vs HC nonTh17* AS Th17 vs AS nonTh17 HC Th17 vs AS Th17 HC nonTh17 vs AS nonTh17 HC nonTh17 vs AS Th17 HC Th17 vs AS nonTh17
initially 2576 2576 2576 2576 2576 2576
no NA 897 888 845 965 880 953
no Inf 632 546 504 649 514 616
log2FoldChange 70 63 70 66 90 93
baseMeanA 4 12 27 13 28 15
baseMeanB 2 1 11 5 13 8

* used cutoff of 4 instead of 5 at the baseMeanA step (one of the potential candidate miRNAs had expression of 4.87).

Lists of miRNA candidates for validation can be found here:

Final list of candidates with relaxed filtering criteria

Previously, we asked for the relative miRNA expression under both conditions to be at least 5. Here, we ask for the expression at least just one condition to be above 5. This resulted in four extra miRNA candidates for the HC_Th17 vs AS_Th17 comparison; and two extra miRNA candidates for the HC_nonTh17 vs AS_Th17 comparison.

Lists of final miRNA candidates for validation can be found here:

Results of Kruskal-Wallis test

Kruskal-Wallis test , a rank-based non-parametric test (one-way ANOVA on ranks) run to determine is there is a statistically significant difference between two or more groups, was run per miRNA. Unfortunately the test did not produce any p-value lower than 0.5.You can find the code here and the results here.

miRNA targets

Here is the list of miRNAs and their targets (target gene names). Only targets with strong experimental evidence (shown with a reporter assay, Western blot or qPCR -- at least one of the three) are listed here. When the last column of the file is empty, that means that no targets with strong experimental evidence were ever published and only some predictions from some microarray or RNA-Seq data are available. Information about the targets was obtained from the miRNA targets database miRTarBase.

Compare RNA-Seq and qPCR results

Four microRNAs -- miR-155-5p, miR-146a-5p, miR-210-3p and miR-21-5p -- were previously experimentally validated with qPCR. None of these microRNAs showed significant differences in expression across different conditions due to the low fold change. However, we decided to check expression levels of these miRNAs. Here we compared changes in relative expression of these microRNAs in qPCR vs RNA-Seq data (results of qPCR experiments were provided by Taejong Kim). Unfortunately the results of miRNA sequencing do not confirm the resuts of qPCR.

Different number of samples was used for qPCR and for RNA-Seq

miR-155-5p HC Th17 HC nonTh17 AS Th17 AS non Th17
RNA-Seq 34,084 28,681 19,189 11,178
qPCR 7.1 1.5 48 1
miR-146a-5p HC Th17 HC nonTh17 AS Th17 AS non Th17
RNA-Seq 7,641 4,664 17,742 14,838
qPCR 8.8 1.5 4.2 1
miR-210-3p HC Th17 HC nonTh17 AS Th17 AS non Th17
RNA-Seq 33 43 40 35
qPCR 11.2 2 25.5 2.1
miR-21-5p HC Th17 HC nonTh17 AS Th17 AS non Th17
RNA-Seq 286,653 175,452 239,449 211,160
qPCR 9.5 2.4 1.4 0.7

alt text

alt text

Same samples were used for qPCR and for RNA-Seq

miR-155-5p HC Th17 HC nonTh17 AS Th17 AS non Th17
RNA-Seq 34,084 28,681 19,189 11,178
qPCR 10.97 1.45 3.49 1.13
miR-146a-5p HC Th17 HC nonTh17 AS Th17 AS non Th17
RNA-Seq 7,641 4,664 17,742 14,838
qPCR 18.26 0.74 3.28 1.78
miR-210-3p HC Th17 HC nonTh17 AS Th17 AS non Th17
RNA-Seq 33 43 40 35
qPCR 11.43 2.51 4.64 2.95
miR-21-5p HC Th17 HC nonTh17 AS Th17 AS non Th17
RNA-Seq 286,653 175,452 239,449 211,160
qPCR 36.78 2.13 2.57 1.02
miR-10b-5p HC Th17 HC nonTh17 AS Th17 AS non Th17
RNA-Seq 0.25 2 111.5 34.75
qPCR 11.42 2.47 25.7 5.79

alt text

alt text

Designed by Irina Pulyakhina [email protected]