Differential microRNA expression

The goal of this project was, starting from the sequencing files provided by the Sequencing Core of WTCHG, to generate a list of candidate microRNAs differentially expressed in healthy controls versus patients with ankylosing spondylitis for further experimental validation. Sequencing project number is P150188.

Experimental Design

Two groups of cells -- Th17 and non-Th17 -- were separately sequenced for healthy controls and AS patients. This table provides the names of the samples and the names of the corresponding sequencing files. Note that one of the samples -- AS3-Th17 was not sequenced due to failed library prep (message from the core: One sample, AS3-TH17, failed the library prep and could not be included.)

The miRNA samples were run on a HiSeq2500 Rapid mode. With Rapid mode the libraries are run on a single sequencing flowcell that has two lanes, so two sequencing files will be produced for one sample (message from the core: the data can be merged between the lanes).

In this table AS stands for AS patients and HC stands for healthy controls.

Sample type	File names
HC1-nonTh17	WTCHG_189135_274, WTCHG_189136_274
HC2-nonTh17	WTCHG_189135_276, WTCHG_189136_276
HC3-nonTh17	WTCHG_189135_278, WTCHG_189136_278
HC4-nonTh17	WTCHG_189135_280, WTCHG_189136_280

HC1-Th17	WTCHG_189135_275, WTCHG_189136_275
HC2-Th17	WTCHG_189135_277, WTCHG_189136_277
HC3-Th17	WTCHG_189135_279, WTCHG_189136_279
HC4-Th17	WTCHG_189135_281, WTCHG_189136_281

AS1-nonTh17	WTCHG_189135_282, WTCHG_189136_282
AS2-nonTh17	WTCHG_189135_284, WTCHG_189136_284
AS3-nonTh17	WTCHG_189135_286, WTCHG_189136_286
AS4-nonTh17	WTCHG_189135_288, WTCHG_189136_288

AS1-Th17	WTCHG_189135_283, WTCHG_189136_283
AS2-Th17	WTCHG_189135_285, WTCHG_189136_285
AS4-Th17	WTCHG_189135_289, WTCHG_189136_289

Bioinformatic pipeline to analyze miRNA sequencing data

Sequencing data is stored in fastq files -- text files of a certain format containing sequenced reads' names, sequences and sequencing quality scores.

First eight lines of WTCHG_189135_286_1.fastq:

@HISEQ2500-09:311:H3K5LBCXX:1:1101:1471:1980 1:N:0:CTCAGCTG # read name
NAGGGAGGACTTCTCTGAGGAGATCGGAAGAGCACACGTCTGAACTCCAGT         # read sequence
+                                                           # delimiter line
#<DDDGHHHIIIIIIIIHIIHIIIIIIIIIIIIIIIIIIIIIIHIIIIIHI         # sequencing quality
@HISEQ2500-09:311:H3K5LBCXX:1:1101:3278:1986 1:N:0:CTCAGCTG # read name
NAAGTGGACGTATAGGGTGTGACGTAGATCGGAAGAGCACACGTCTGAACT         # read sequence
+                                                           # delimiter line
#<DDDIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIHHHIIIHIIIIIIIF         # sequencing quality

Analysis workflow starting from the fastq files:

Checking quality of the initial fastq files: checking whether any Illumina adapters used for sequencing can be found; whether reads contain low quality bases; what is the length of the reads. This is done with a tool called fastQC.
Cleaning the fastq files: removing adapters ("Overrepresented sequences" in the fastQC report) found during the previous analysis step (bash command); trimming low quality bases at the end of the reads; discarding reads shorter than 15nt or longer than 35nt (tool cutadapt).
Rerunning quality control on trimmed fastq files.
Aligning trimmed fastq files to the database of known human miRNA sequences with bowtie.
Checking the quality of alignment: how many reads were aligned perfectly (no mismatches) and uniquely (tool samtools).
Extracting numbers of reads mapped to various microRNAs.
Calculating expression of microRNAs -- normalizing the numbers of mapped reads by the number of sequenced reads (library size) and performing statistical test to find significant differential expression (tool DESeq).

Analysis workflow

Checking quality control on initial fastq files

An example report of quality checks on initial fastq files for the sample AS2-Th17 can be found here. From the report we can see that the following checks failed:

per base sequence content: this is expected, as the tool was initially designed to work with whole genome reads. MicroRNAs have a sequence content different from the whole genome, this is why this check is expected to fail. Also, as we will see later, adapters were not trimmed yet and are overrepresented -- their sequence also biases the sequence content of the reads.
sequence duplication level: this is expected, as the tool was designed to work with DNA data, and RNA (and miRNA) has expression, so fragments of certain reads are expected to be highly represented.
overrepresented sequences and adapter content: this failed because the Sequencing Core has not trimmed Illumina sequencing adapters after sequencing. We will remove these adapters.

Cleaning the fastq files

Identified adapters were removed from the 5' end of reads, reads shorter than 15 or longer than 35 nucleotides were discarded.

Checking quality control on trimmed fastq files

An example report of quality checks on trimmed fastq files for the sample AS2-Th17 can be found here. We can see that only sequence duplication level (and for some files -- per nucleotide sequence content) still fails -- as expected (see the explanation above). Note that reads which are too short or too long after removing adapters will be discarded, so the number of reads in a file can change. This table contains the information on the number of reads before and after cleaning.

_{Sample type}	_{File name1}	_{Before trim}	_{After trim}	_{File name2}	_{Before trim}	_{After trim}
_HC1-nonTh17	_{WTCHG_189135_274}	_16,658,312	_10,307,964	_{WTCHG_189136_274}	_17,111,904	_10,735,176
_HC2-nonTh17	_{WTCHG_189135_276}	_23,180,836	_12,294,948	_{WTCHG_189136_276}	_23,817,544	_12,692,556
_HC3-nonTh17	_{WTCHG_189135_278}	_28,641,100	_14,340,808	_{WTCHG_189136_278}	_29,263,204	_14,677,208
_HC4-nonTh17	_{WTCHG_189135_280}	_27,021,828	_14,380,188	_{WTCHG_189136_280}	_27,634,528	_14,715,880

_HC1-Th17	_{WTCHG_189135_275}	_18,712,164	_11,088,016	_{WTCHG_189136_275}	_19,226,884	_11,487,684
_HC2-Th17	_{WTCHG_189135_277}	_25,634,288	_12,997,776	_{WTCHG_189136_277}	_26,178,792	_13,286,152
_HC3-Th17	_{WTCHG_189135_279}	_24,934,788	_12,534,560	_{WTCHG_189136_279}	_25,451,552	_12,808,388
_HC4-Th17	_{WTCHG_189135_281}	_36,458,196	_20,089,092	_{WTCHG_189136_281}	_37,117,612	_20,519,576

_AS1-nonTh17	_{WTCHG_189135_282}	_77,242,220	_43,030,708	_{WTCHG_189136_282}	_79,433,628	_44,313,332
_AS2-nonTh17	_{WTCHG_189135_284}	_20,299,204	_13,918,564	_{WTCHG_189136_284}	_20,800,688	_14,275,212
_AS3-nonTh17	_{WTCHG_189135_286}	_23,372,832	_14,729,640	_{WTCHG_189136_286}	_23,979,804	_15,319,740
_AS4-nonTh17	_{WTCHG_189135_288}	_31,656,996	_16,691,080	_{WTCHG_189136_288}	_32,283,960	_17,033,808

_AS1-Th17	_{WTCHG_189135_283}	_30,916,908	_16,576,272	_{WTCHG_189136_283}	_31,723,492	_17,048,232
_AS2-Th17	_{WTCHG_189135_285}	_23,164,272	_19,867,816	_{WTCHG_189136_285}	_23,761,664	_20,393,544
_AS4-Th17	_{WTCHG_189135_289}	_39,985,232	_30,642,800	_{WTCHG_189136_289}	_41,047,460	_31,462,132

After trimming and discarding reads, all samples have number of reads sufficient for the downstream analysis.

Aligning reads to the database of human microRNAs

This table contains numbers of reads mapped to miRNAs (merged for two sequenced files). Column RNA contains the information about the input material (number of ng).

_{Sample type}	_{File name1}	_{File name2}	_{Number of mapped reads}	_{RNA, ng}
_HC1-nonTh17	_{WTCHG_189135_274}	_{WTCHG_189136_274}	_42,748	_90.83
_HC2-nonTh17	_{WTCHG_189135_276}	_{WTCHG_189136_276}	_995,035	_100.00
_HC3-nonTh17	_{WTCHG_189135_278}	_{WTCHG_189136_278}	_677,173	_100.00
_HC4-nonTh17	_{WTCHG_189135_280}	_{WTCHG_189136_280}	_450,159	_100.00

_HC1-Th17	_{WTCHG_189135_275}	_{WTCHG_189136_275}	_86,821	_85.03
_HC2-Th17	_{WTCHG_189135_277}	_{WTCHG_189136_277}	_955,892	_69.76
_HC3-Th17	_{WTCHG_189135_279}	_{WTCHG_189136_279}	_926,892	_74.42
_HC4-Th17	_{WTCHG_189135_281}	_{WTCHG_189136_281}	_153,514	_49.08

_AS1-nonTh17	_{WTCHG_189135_282}	_{WTCHG_189136_282}	_1,188,284	_100.00
_AS2-nonTh17	_{WTCHG_189135_284}	_{WTCHG_189136_284}	_298,505	_36.46
_AS3-nonTh17	_{WTCHG_189135_286}	_{WTCHG_189136_286}	_665,606	_46.10
_AS4-nonTh17	_{WTCHG_189135_288}	_{WTCHG_189136_288}	_1,305,340	_77.02

_AS1-Th17	_{WTCHG_189135_283}	_{WTCHG_189136_283}	_574,407	_43.61
_AS2-Th17	_{WTCHG_189135_285}	_{WTCHG_189136_285}	_10,110	_33.27
_AS4-Th17	_{WTCHG_189135_289}	_{WTCHG_189136_289}	_706,156	_37.87

Correlation between to sequencing runs is 0.99. Correlation between number of mapped reads and the amount of RNA input material is 0.39.

Here you can see how the aligner was chosen and all bioinformatic details including commands.

Checking alignment quality

Here you can see other alignment results -- alignment to miRNAs from other organisms than human, to human DNA, to all non-coding human RNAs.

Extracting reads mapped to miRNAs

This table was generated by the Sequencing Core and contains numbers of reads mapped to various human miRNAs. Note that this is not expression data.

Looking at the number of reads mapped to miRNAs, one sample has too low number of reads mapped to miRNAs and was excluded from further analysis -- sample 285 a.k.a. AS2-Th17.

Here is the file containing numbers of reads mapped to microRNAs, excluding sample 285.

Differential miRNA expression analysis

Differential expression was performed by the DESeq package. Here you can find more bioinformatic details including commands (R code).

Firstly, pairwise differential expression analysis between each pair of conditions was performed. Unfortunately this analysis produced only one statistically significant difference in miRNA expressions (see the description after the links to the files with differential expression analysis results). The absence of almost any differentially expressed miRNAs is most likely due to the low amount of identified miRNAs and mainly low number of reads mapped to miRNAs.

These two figures demonstrate how low number of identified miRNAs affect the analysis. The first row of the figures contains scatter plots with the dispersion estimation. The second row of figures contains log2 fold change of gene expression against the mean (average) of normalized counts (significant differential expression is highlighted in red).

Taejong's data	Public data

We have four groups, HC_Th17, HC_nonTh17, AS_Th17, AS_nonTh17. The results of pairwise differential miRNA expression analysis for these groups can be found here:

HC_Th17 vs HC_nonTh17
HC_Th17 vs AS_Th17
HC_Th17 vs AS_nonTh17
HC_nonTh17 vs AS_Th27
HC_nonTh17 vs AS_nonTh27
AS_Th17 vs AS_nonTh17
HC vs AS (all HC vs all AS)

One statistically significant hit: in the comparison of HC Th17 vs AS Th17.

name_of_miRNA_	baseMean	baseMeanA	baseMeanB	foldChange	log2FoldChange	pvalue	padj
hsa-miR-10b-5p	29.11241	0.1963728	86.944483	442.752059	8.790355204839	1.5e-05	0.013

More information about the hsa-miR-10b-5p microRNA can be found here. Limited information about upregulation of miR-10b in cancers is available. In this data, this microRNA is upregulated in AS versus HC. However, the expression of this microRNA in HC is probably too low to select it as a candidate for experimental validation with qPCR (qPCR might not be sensivite enough). Raw numbers of reads mapped to this microRNA:

mirna         	HC1-Th17	HC2-Th17	HC3-Th17	HC4-Th17	AS1-Th17	AS4-nonTh17
hsa-miR-10b-5p	0       	0       	1       	0       	97      	22

Secondly, we performed ANOVA test across all four groups of samples. It is done with an R package limma, and the details (including commands) can be found here.

When running limma for only reasonably expressed microRNAs (at least 750 reads mapped to a microRNA in all samples in total), all microRNAs were identified as significantly differentially. expressed Even when run for 1500 microRNAs, (at least 1 read mapped to each microRNA in at least one of the samples), limma gave significant pvalues. For absolutely each microRNA.

As any other statistical test, ANOVA works best when the assumptions -- in case of ANOVA, normal distribution of the data and comparable standard deviation -- are true for the analyzed dataset. Unfortunately, for this dataset the number of expressed microRNAs and the levels of expression are very low, which makes the data non-randomly distributed (maybe rather a Poisson distribution?). Additionally, the level of noise increases when the signal (number of mapped reads) is so low.

Thirdly, we ignored pvalues and applied cutoffs on miRNA expression under each condition and the fold change to create a list of miRNA candidates for future experimental validation.

Filtering steps:

remove lines containing NA -- this means that miRNA expression was zero under both conditions.
remove lines containing Inf -- this means that zero reads was mapped in one of the samples. Genes and miRNAs never get absolute zero expression (unless they are knocked out), so when zero reads map to an miRNA, this indicates that not enough miRNA was sequenced or this particular miRNA was not captured -- but not that it is 100% silenced.
log2foldChange should be either below -2 or above 2 -- such filter is common practice in any differential expression analysis (also for genes), as it is believed/assumed that qPCR won''t capture smaller differences in expression.
average miRNA expression under one condition (baseMeanA) should be above 5 -- lower expression probably won't be accurately measured by qPCR.
average miRNA expression under the second condition (baseMeanB) should be above 5.

Commands used for filtering and creating candidate lists can be found here.

This figure illustrates which fold changes we select. All log2 of fold changes are shown in black, the selected ones (above 2 or below -2) are shown in red.

This table contains number of miRNAs remaining after each filtering step in each pairwise comparison.

_{Filtering steps}	_{HC Th17 vs HC nonTh17}*	_{AS Th17 vs AS nonTh17}	_{HC Th17 vs AS Th17}	_{HC nonTh17 vs AS nonTh17}	_{HC nonTh17 vs AS Th17}	_{HC Th17 vs AS nonTh17}
initially	2576	2576	2576	2576	2576	2576
no NA	897	888	845	965	880	953
no Inf	632	546	504	649	514	616
log2FoldChange	70	63	70	66	90	93
baseMeanA	4	12	27	13	28	15
baseMeanB	2	1	11	5	13	8

* used cutoff of 4 instead of 5 at the baseMeanA step (one of the potential candidate miRNAs had expression of 4.87).

Lists of miRNA candidates for validation can be found here:

HC_Th17 vs HC_nonTh17
HC_Th17 vs AS_Th17
HC_Th17 vs AS_nonTh17
HC_nonTh17 vs AS_Th27
HC_nonTh17 vs AS_nonTh27
AS_Th17 vs AS_nonTh17

Final list of candidates with relaxed filtering criteria

Previously, we asked for the relative miRNA expression under both conditions to be at least 5. Here, we ask for the expression at least just one condition to be above 5. This resulted in four extra miRNA candidates for the HC_Th17 vs AS_Th17 comparison; and two extra miRNA candidates for the HC_nonTh17 vs AS_Th17 comparison.

Lists of final miRNA candidates for validation can be found here:

HC_Th17 vs HC_nonTh17
HC_Th17 vs AS_Th17
HC_Th17 vs AS_nonTh17
HC_nonTh17 vs AS_Th27
HC_nonTh17 vs AS_nonTh27
AS_Th17 vs AS_nonTh17

Results of Kruskal-Wallis test

Kruskal-Wallis test , a rank-based non-parametric test (one-way ANOVA on ranks) run to determine is there is a statistically significant difference between two or more groups, was run per miRNA. Unfortunately the test did not produce any p-value lower than 0.5.You can find the code here and the results here.

miRNA targets

Here is the list of miRNAs and their targets (target gene names). Only targets with strong experimental evidence (shown with a reporter assay, Western blot or qPCR -- at least one of the three) are listed here. When the last column of the file is empty, that means that no targets with strong experimental evidence were ever published and only some predictions from some microarray or RNA-Seq data are available. Information about the targets was obtained from the miRNA targets database miRTarBase.

Compare RNA-Seq and qPCR results

Four microRNAs -- miR-155-5p, miR-146a-5p, miR-210-3p and miR-21-5p -- were previously experimentally validated with qPCR. None of these microRNAs showed significant differences in expression across different conditions due to the low fold change. However, we decided to check expression levels of these miRNAs. Here we compared changes in relative expression of these microRNAs in qPCR vs RNA-Seq data (results of qPCR experiments were provided by Taejong Kim). Unfortunately the results of miRNA sequencing do not confirm the resuts of qPCR.

Different number of samples was used for qPCR and for RNA-Seq

miR-155-5p	HC Th17	HC nonTh17	AS Th17	AS non Th17
RNA-Seq	34,084	28,681	19,189	11,178
qPCR	7.1	1.5	48	1

miR-146a-5p	HC Th17	HC nonTh17	AS Th17	AS non Th17
RNA-Seq	7,641	4,664	17,742	14,838
qPCR	8.8	1.5	4.2	1

miR-210-3p	HC Th17	HC nonTh17	AS Th17	AS non Th17
RNA-Seq	33	43	40	35
qPCR	11.2	2	25.5	2.1

miR-21-5p	HC Th17	HC nonTh17	AS Th17	AS non Th17
RNA-Seq	286,653	175,452	239,449	211,160
qPCR	9.5	2.4	1.4	0.7

Same samples were used for qPCR and for RNA-Seq

miR-155-5p	HC Th17	HC nonTh17	AS Th17	AS non Th17
RNA-Seq	34,084	28,681	19,189	11,178
qPCR	10.97	1.45	3.49	1.13

miR-146a-5p	HC Th17	HC nonTh17	AS Th17	AS non Th17
RNA-Seq	7,641	4,664	17,742	14,838
qPCR	18.26	0.74	3.28	1.78

miR-210-3p	HC Th17	HC nonTh17	AS Th17	AS non Th17
RNA-Seq	33	43	40	35
qPCR	11.43	2.51	4.64	2.95

miR-21-5p	HC Th17	HC nonTh17	AS Th17	AS non Th17
RNA-Seq	286,653	175,452	239,449	211,160
qPCR	36.78	2.13	2.57	1.02

miR-10b-5p	HC Th17	HC nonTh17	AS Th17	AS non Th17
RNA-Seq	0.25	2	111.5	34.75
qPCR	11.42	2.47	25.7	5.79

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Workflow_simplified.md

Workflow_simplified.md

Differential microRNA expression

Experimental Design

Bioinformatic pipeline to analyze miRNA sequencing data

Analysis workflow

Final list of candidates with relaxed filtering criteria

Results of Kruskal-Wallis test

miRNA targets

Compare RNA-Seq and qPCR results

Designed by Irina Pulyakhina [email protected]

Files

Workflow_simplified.md

Latest commit

History

Workflow_simplified.md

File metadata and controls

Differential microRNA expression

Experimental Design

Bioinformatic pipeline to analyze miRNA sequencing data

Analysis workflow

Final list of candidates with relaxed filtering criteria

Results of Kruskal-Wallis test

miRNA targets

Compare RNA-Seq and qPCR results

Designed by Irina Pulyakhina [email protected]