The goal of this project was, starting from the sequencing files provided by the Sequencing Core of WTCHG, to generate a list of candidate microRNAs differentially expressed in healthy controls versus patients with ankylosing spondylitis for further experimental validation. Sequencing project number is P150188.
Two groups of cells -- Th17 and non-Th17 -- were separately sequenced for healthy controls and AS patients. This table provides the names of the samples and the names of the corresponding sequencing files. Note that one of the samples -- AS3-Th17 was not sequenced due to failed library prep (message from the core: One sample, AS3-TH17, failed the library prep and could not be included.)
The miRNA samples were run on a HiSeq2500 Rapid mode. With Rapid mode the libraries are run on a single sequencing flowcell that has two lanes, so two sequencing files will be produced for one sample (message from the core: the data can be merged between the lanes).
In this table AS
stands for AS patients and HC
stands
for healthy controls.
Sample type | File names |
---|---|
HC1-nonTh17 | WTCHG_189135_274, WTCHG_189136_274 |
HC2-nonTh17 | WTCHG_189135_276, WTCHG_189136_276 |
HC3-nonTh17 | WTCHG_189135_278, WTCHG_189136_278 |
HC4-nonTh17 | WTCHG_189135_280, WTCHG_189136_280 |
HC1-Th17 | WTCHG_189135_275, WTCHG_189136_275 |
HC2-Th17 | WTCHG_189135_277, WTCHG_189136_277 |
HC3-Th17 | WTCHG_189135_279, WTCHG_189136_279 |
HC4-Th17 | WTCHG_189135_281, WTCHG_189136_281 |
AS1-nonTh17 | WTCHG_189135_282, WTCHG_189136_282 |
AS2-nonTh17 | WTCHG_189135_284, WTCHG_189136_284 |
AS3-nonTh17 | WTCHG_189135_286, WTCHG_189136_286 |
AS4-nonTh17 | WTCHG_189135_288, WTCHG_189136_288 |
AS1-Th17 | WTCHG_189135_283, WTCHG_189136_283 |
AS2-Th17 | WTCHG_189135_285, WTCHG_189136_285 |
AS4-Th17 | WTCHG_189135_289, WTCHG_189136_289 |
Sequencing data is stored in fastq
files -- text files
of a certain
format
containing sequenced reads' names, sequences and
sequencing
quality scores.
First eight lines of WTCHG_189135_286_1.fastq
:
@HISEQ2500-09:311:H3K5LBCXX:1:1101:1471:1980 1:N:0:CTCAGCTG # read name
NAGGGAGGACTTCTCTGAGGAGATCGGAAGAGCACACGTCTGAACTCCAGT # read sequence
+ # delimiter line
#<DDDGHHHIIIIIIIIHIIHIIIIIIIIIIIIIIIIIIIIIIHIIIIIHI # sequencing quality
@HISEQ2500-09:311:H3K5LBCXX:1:1101:3278:1986 1:N:0:CTCAGCTG # read name
NAAGTGGACGTATAGGGTGTGACGTAGATCGGAAGAGCACACGTCTGAACT # read sequence
+ # delimiter line
#<DDDIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIHHHIIIHIIIIIIIF # sequencing quality
Analysis workflow starting from the fastq
files:
- Checking quality of the initial
fastq
files: checking whether any Illumina adapters used for sequencing can be found; whether reads contain low quality bases; what is the length of the reads. This is done with a tool calledfastQC
. - Cleaning the
fastq
files: removing adapters ("Overrepresented sequences" in thefastQC
report) found during the previous analysis step (bash command); trimming low quality bases at the end of the reads; discarding reads shorter than 15nt or longer than 35nt (tool cutadapt). - Rerunning quality control on trimmed
fastq
files. - Aligning
trimmed
fastq
files to the database of known human miRNA sequences with bowtie. - Checking the quality of alignment: how many reads were aligned perfectly (no mismatches) and uniquely (tool samtools).
- Extracting numbers of reads mapped to various microRNAs.
- Calculating expression of microRNAs -- normalizing the numbers of mapped reads by the number of sequenced reads (library size) and performing statistical test to find significant differential expression (tool DESeq).
Checking quality control on initial fastq files
An example report of quality checks on initial fastq
files for the sample AS2-Th17 can be found
here.
From the report we can see that the following checks failed:
- per base sequence content: this is expected, as the tool was initially designed to work with whole genome reads. MicroRNAs have a sequence content different from the whole genome, this is why this check is expected to fail. Also, as we will see later, adapters were not trimmed yet and are overrepresented -- their sequence also biases the sequence content of the reads.
- sequence duplication level: this is expected, as the tool was designed to work with DNA data, and RNA (and miRNA) has expression, so fragments of certain reads are expected to be highly represented.
- overrepresented sequences and adapter content: this failed because the Sequencing Core has not trimmed Illumina sequencing adapters after sequencing. We will remove these adapters.
Cleaning the fastq files
Identified adapters were removed from the 5' end of reads, reads shorter than 15 or longer than 35 nucleotides were discarded.
Checking quality control on trimmed fastq files
An example report of quality checks on trimmed fastq
files for the sample AS2-Th17 can be found
here.
We can see that only sequence duplication level (and for some
files -- per nucleotide sequence content) still fails -- as
expected (see the explanation above). Note that reads which are too short or
too long after removing adapters will be discarded, so the
number of reads in a file can change. This table contains the
information on the number of reads before and after cleaning.
Sample type | File name1 | Before trim | After trim | File name2 | Before trim | After trim |
---|---|---|---|---|---|---|
HC1-nonTh17 | WTCHG_189135_274 | 16,658,312 | 10,307,964 | WTCHG_189136_274 | 17,111,904 | 10,735,176 |
HC2-nonTh17 | WTCHG_189135_276 | 23,180,836 | 12,294,948 | WTCHG_189136_276 | 23,817,544 | 12,692,556 |
HC3-nonTh17 | WTCHG_189135_278 | 28,641,100 | 14,340,808 | WTCHG_189136_278 | 29,263,204 | 14,677,208 |
HC4-nonTh17 | WTCHG_189135_280 | 27,021,828 | 14,380,188 | WTCHG_189136_280 | 27,634,528 | 14,715,880 |
HC1-Th17 | WTCHG_189135_275 | 18,712,164 | 11,088,016 | WTCHG_189136_275 | 19,226,884 | 11,487,684 |
HC2-Th17 | WTCHG_189135_277 | 25,634,288 | 12,997,776 | WTCHG_189136_277 | 26,178,792 | 13,286,152 |
HC3-Th17 | WTCHG_189135_279 | 24,934,788 | 12,534,560 | WTCHG_189136_279 | 25,451,552 | 12,808,388 |
HC4-Th17 | WTCHG_189135_281 | 36,458,196 | 20,089,092 | WTCHG_189136_281 | 37,117,612 | 20,519,576 |
AS1-nonTh17 | WTCHG_189135_282 | 77,242,220 | 43,030,708 | WTCHG_189136_282 | 79,433,628 | 44,313,332 |
AS2-nonTh17 | WTCHG_189135_284 | 20,299,204 | 13,918,564 | WTCHG_189136_284 | 20,800,688 | 14,275,212 |
AS3-nonTh17 | WTCHG_189135_286 | 23,372,832 | 14,729,640 | WTCHG_189136_286 | 23,979,804 | 15,319,740 |
AS4-nonTh17 | WTCHG_189135_288 | 31,656,996 | 16,691,080 | WTCHG_189136_288 | 32,283,960 | 17,033,808 |
AS1-Th17 | WTCHG_189135_283 | 30,916,908 | 16,576,272 | WTCHG_189136_283 | 31,723,492 | 17,048,232 |
AS2-Th17 | WTCHG_189135_285 | 23,164,272 | 19,867,816 | WTCHG_189136_285 | 23,761,664 | 20,393,544 |
AS4-Th17 | WTCHG_189135_289 | 39,985,232 | 30,642,800 | WTCHG_189136_289 | 41,047,460 | 31,462,132 |
After trimming and discarding reads, all samples have number of reads sufficient for the downstream analysis.
Aligning reads to the database of human microRNAs
This table contains numbers of reads mapped to miRNAs (merged for two
sequenced files). Column RNA
contains the information about the
input material (number of ng).
Sample type | File name1 | File name2 | Number of mapped reads | RNA, ng |
---|---|---|---|---|
HC1-nonTh17 | WTCHG_189135_274 | WTCHG_189136_274 | 42,748 | 90.83 |
HC2-nonTh17 | WTCHG_189135_276 | WTCHG_189136_276 | 995,035 | 100.00 |
HC3-nonTh17 | WTCHG_189135_278 | WTCHG_189136_278 | 677,173 | 100.00 |
HC4-nonTh17 | WTCHG_189135_280 | WTCHG_189136_280 | 450,159 | 100.00 |
HC1-Th17 | WTCHG_189135_275 | WTCHG_189136_275 | 86,821 | 85.03 |
HC2-Th17 | WTCHG_189135_277 | WTCHG_189136_277 | 955,892 | 69.76 |
HC3-Th17 | WTCHG_189135_279 | WTCHG_189136_279 | 926,892 | 74.42 |
HC4-Th17 | WTCHG_189135_281 | WTCHG_189136_281 | 153,514 | 49.08 |
AS1-nonTh17 | WTCHG_189135_282 | WTCHG_189136_282 | 1,188,284 | 100.00 |
AS2-nonTh17 | WTCHG_189135_284 | WTCHG_189136_284 | 298,505 | 36.46 |
AS3-nonTh17 | WTCHG_189135_286 | WTCHG_189136_286 | 665,606 | 46.10 |
AS4-nonTh17 | WTCHG_189135_288 | WTCHG_189136_288 | 1,305,340 | 77.02 |
AS1-Th17 | WTCHG_189135_283 | WTCHG_189136_283 | 574,407 | 43.61 |
AS2-Th17 | WTCHG_189135_285 | WTCHG_189136_285 | 10,110 | 33.27 |
AS4-Th17 | WTCHG_189135_289 | WTCHG_189136_289 | 706,156 | 37.87 |
Correlation between to sequencing runs is 0.99. Correlation between number of mapped reads and the amount of RNA input material is 0.39.
Here you can see how the aligner was chosen and all bioinformatic details including commands.
Checking alignment quality
Here you can see other alignment results -- alignment to miRNAs from other organisms than human, to human DNA, to all non-coding human RNAs.
Extracting reads mapped to miRNAs
This table was generated by the Sequencing Core and contains numbers of reads mapped to various human miRNAs. Note that this is not expression data.
Looking at the number of reads mapped to miRNAs, one sample has too low number of reads mapped to miRNAs and was excluded from further analysis -- sample 285 a.k.a. AS2-Th17.
Here is the file containing numbers of reads mapped to microRNAs, excluding sample 285.
Differential miRNA expression analysis
Differential expression was performed by the DESeq
package.
Here
you can find more bioinformatic details including commands (R
code).
Firstly, pairwise differential expression analysis between each pair of conditions was performed. Unfortunately this analysis produced only one statistically significant difference in miRNA expressions (see the description after the links to the files with differential expression analysis results). The absence of almost any differentially expressed miRNAs is most likely due to the low amount of identified miRNAs and mainly low number of reads mapped to miRNAs.
These two figures demonstrate how low number of identified miRNAs affect the analysis. The first row of the figures contains scatter plots with the dispersion estimation. The second row of figures contains log2 fold change of gene expression against the mean (average) of normalized counts (significant differential expression is highlighted in red).
Taejong's data | Public data |
---|---|
![]() |
![]() |
![]() |
![]() |
We have four groups, HC_Th17, HC_nonTh17, AS_Th17, AS_nonTh17. The results of pairwise differential miRNA expression analysis for these groups can be found here:
- HC_Th17 vs HC_nonTh17
- HC_Th17 vs AS_Th17
- HC_Th17 vs AS_nonTh17
- HC_nonTh17 vs AS_Th27
- HC_nonTh17 vs AS_nonTh27
- AS_Th17 vs AS_nonTh17
- HC vs AS (all HC vs all AS)
One statistically significant hit: in the comparison of HC Th17 vs AS Th17.
name_of_miRNA_ baseMean baseMeanA baseMeanB foldChange log2FoldChange pvalue padj
hsa-miR-10b-5p 29.11241 0.1963728 86.944483 442.752059 8.790355204839 1.5e-05 0.013
More information about the hsa-miR-10b-5p microRNA can be found
here.
Limited information about upregulation of miR-10b in cancers is available.
In this data, this microRNA is upregulated in AS
versus HC
. However,
the expression of this microRNA in HC
is probably too low to select it
as a candidate for experimental validation with qPCR (qPCR might not be
sensivite enough). Raw numbers of reads mapped to this microRNA:
mirna HC1-Th17 HC2-Th17 HC3-Th17 HC4-Th17 AS1-Th17 AS4-nonTh17
hsa-miR-10b-5p 0 0 1 0 97 22
Secondly, we performed ANOVA test across all four groups
of samples. It is done with an R
package limma
, and the
details (including commands) can be found
here.
When running limma for only reasonably expressed microRNAs (at least 750 reads mapped to a microRNA in all samples in total), all microRNAs were identified as significantly differentially. expressed Even when run for 1500 microRNAs, (at least 1 read mapped to each microRNA in at least one of the samples), limma gave significant pvalues. For absolutely each microRNA.
As any other statistical test, ANOVA works best when the assumptions -- in case of ANOVA, normal distribution of the data and comparable standard deviation -- are true for the analyzed dataset. Unfortunately, for this dataset the number of expressed microRNAs and the levels of expression are very low, which makes the data non-randomly distributed (maybe rather a Poisson distribution?). Additionally, the level of noise increases when the signal (number of mapped reads) is so low.
Thirdly, we ignored pvalues and applied cutoffs on miRNA expression under each condition and the fold change to create a list of miRNA candidates for future experimental validation.
Filtering steps:
- remove lines containing
NA
-- this means that miRNA expression was zero under both conditions. - remove lines containing
Inf
-- this means that zero reads was mapped in one of the samples. Genes and miRNAs never get absolute zero expression (unless they are knocked out), so when zero reads map to an miRNA, this indicates that not enough miRNA was sequenced or this particular miRNA was not captured -- but not that it is 100% silenced. - log2foldChange should be either below -2 or above 2 -- such filter is common practice in any differential expression analysis (also for genes), as it is believed/assumed that qPCR won''t capture smaller differences in expression.
- average miRNA expression under one condition (
baseMeanA
) should be above 5 -- lower expression probably won't be accurately measured by qPCR. - average miRNA expression under the second condition (
baseMeanB
) should be above 5.
Commands used for filtering and creating candidate lists can be found here.
This figure illustrates which fold changes we select. All log2 of fold changes are shown in black, the selected ones (above 2 or below -2) are shown in red.
This table contains number of miRNAs remaining after each filtering step in each pairwise comparison.
Filtering steps | HC Th17 vs HC nonTh17* | AS Th17 vs AS nonTh17 | HC Th17 vs AS Th17 | HC nonTh17 vs AS nonTh17 | HC nonTh17 vs AS Th17 | HC Th17 vs AS nonTh17 |
---|---|---|---|---|---|---|
initially | 2576 | 2576 | 2576 | 2576 | 2576 | 2576 |
no NA | 897 | 888 | 845 | 965 | 880 | 953 |
no Inf | 632 | 546 | 504 | 649 | 514 | 616 |
log2FoldChange | 70 | 63 | 70 | 66 | 90 | 93 |
baseMeanA | 4 | 12 | 27 | 13 | 28 | 15 |
baseMeanB | 2 | 1 | 11 | 5 | 13 | 8 |
*
used cutoff of 4 instead of 5 at the baseMeanA step (one
of the potential candidate miRNAs had expression of 4.87).
Lists of miRNA candidates for validation can be found here:
- HC_Th17 vs HC_nonTh17
- HC_Th17 vs AS_Th17
- HC_Th17 vs AS_nonTh17
- HC_nonTh17 vs AS_Th27
- HC_nonTh17 vs AS_nonTh27
- AS_Th17 vs AS_nonTh17
Previously, we asked for the relative miRNA expression under both conditions to be at least 5. Here, we ask for the expression at least just one condition to be above 5. This resulted in four extra miRNA candidates for the HC_Th17 vs AS_Th17 comparison; and two extra miRNA candidates for the HC_nonTh17 vs AS_Th17 comparison.
Lists of final miRNA candidates for validation can be found here:
- HC_Th17 vs HC_nonTh17
- HC_Th17 vs AS_Th17
- HC_Th17 vs AS_nonTh17
- HC_nonTh17 vs AS_Th27
- HC_nonTh17 vs AS_nonTh27
- AS_Th17 vs AS_nonTh17
Kruskal-Wallis test , a rank-based non-parametric test (one-way ANOVA on ranks) run to determine is there is a statistically significant difference between two or more groups, was run per miRNA. Unfortunately the test did not produce any p-value lower than 0.5.You can find the code here and the results here.
Here is the list of miRNAs and their targets (target gene names). Only targets with strong experimental evidence (shown with a reporter assay, Western blot or qPCR -- at least one of the three) are listed here. When the last column of the file is empty, that means that no targets with strong experimental evidence were ever published and only some predictions from some microarray or RNA-Seq data are available. Information about the targets was obtained from the miRNA targets database miRTarBase.
Four microRNAs -- miR-155-5p, miR-146a-5p, miR-210-3p and miR-21-5p -- were previously experimentally validated with qPCR. None of these microRNAs showed significant differences in expression across different conditions due to the low fold change. However, we decided to check expression levels of these miRNAs. Here we compared changes in relative expression of these microRNAs in qPCR vs RNA-Seq data (results of qPCR experiments were provided by Taejong Kim). Unfortunately the results of miRNA sequencing do not confirm the resuts of qPCR.
Different number of samples was used for qPCR and for RNA-Seq
miR-155-5p | HC Th17 | HC nonTh17 | AS Th17 | AS non Th17 |
---|---|---|---|---|
RNA-Seq | 34,084 | 28,681 | 19,189 | 11,178 |
qPCR | 7.1 | 1.5 | 48 | 1 |
miR-146a-5p | HC Th17 | HC nonTh17 | AS Th17 | AS non Th17 |
---|---|---|---|---|
RNA-Seq | 7,641 | 4,664 | 17,742 | 14,838 |
qPCR | 8.8 | 1.5 | 4.2 | 1 |
miR-210-3p | HC Th17 | HC nonTh17 | AS Th17 | AS non Th17 |
---|---|---|---|---|
RNA-Seq | 33 | 43 | 40 | 35 |
qPCR | 11.2 | 2 | 25.5 | 2.1 |
miR-21-5p | HC Th17 | HC nonTh17 | AS Th17 | AS non Th17 |
---|---|---|---|---|
RNA-Seq | 286,653 | 175,452 | 239,449 | 211,160 |
qPCR | 9.5 | 2.4 | 1.4 | 0.7 |
Same samples were used for qPCR and for RNA-Seq
miR-155-5p | HC Th17 | HC nonTh17 | AS Th17 | AS non Th17 |
---|---|---|---|---|
RNA-Seq | 34,084 | 28,681 | 19,189 | 11,178 |
qPCR | 10.97 | 1.45 | 3.49 | 1.13 |
miR-146a-5p | HC Th17 | HC nonTh17 | AS Th17 | AS non Th17 |
---|---|---|---|---|
RNA-Seq | 7,641 | 4,664 | 17,742 | 14,838 |
qPCR | 18.26 | 0.74 | 3.28 | 1.78 |
miR-210-3p | HC Th17 | HC nonTh17 | AS Th17 | AS non Th17 |
---|---|---|---|---|
RNA-Seq | 33 | 43 | 40 | 35 |
qPCR | 11.43 | 2.51 | 4.64 | 2.95 |
miR-21-5p | HC Th17 | HC nonTh17 | AS Th17 | AS non Th17 |
---|---|---|---|---|
RNA-Seq | 286,653 | 175,452 | 239,449 | 211,160 |
qPCR | 36.78 | 2.13 | 2.57 | 1.02 |
miR-10b-5p | HC Th17 | HC nonTh17 | AS Th17 | AS non Th17 |
---|---|---|---|---|
RNA-Seq | 0.25 | 2 | 111.5 | 34.75 |
qPCR | 11.42 | 2.47 | 25.7 | 5.79 |