Utility of RNA sequencing for the diagnosis and genomic classification of paediatric acute lymphoblastic leukaemia
Additional data and scripts for the paper.
Gene expression tables for the underlying 126 ALL RNA-seq samples are available in the counts
directory:
- counts_raw.txt - gene counts
Script to perform the IKZF1 deletion detection are described below:
- bedtools
- salmon > v0.12
Additional data:
- reference genome
- reference transcriptome
For the gene of interest, IKZF1, we define a BED12 format .bed
file with the exons to be retained for each of del2-8, del2-7, del4-7 and del4-8, and a modified sequence name to indicate the unique transcript modification (col 4) in a deletions.bed
file:
chr7 50304715 50405101 ENST00000331340_del27 0 + 50319061 50400627 0 2 207,5184, 0,95202,
chr7 50304715 50405101 ENST00000331340_del28 0 + 50319061 50400627 0 2 207,1, 0,208,
chr7 50304715 50405101 ENST00000331340_del47 0 + 50319061 50400627 0 4 207,54,120,5184, 0,14332,22922,95202,
chr7 50304715 50405101 ENST00000331340_del48 0 + 50319061 50400627 0 3 207,54,120, 0,14332,22922,
Next, create the transcripts with bedtools
:
bedtools getfasta -fi reference_genome.fasta -bed deletions.bed -split -name > deletions.fasta
Note: due to limitations with the getfasta
a description of one exon fail to produce a truncated sequence. One workaround is to define a second small exon and then manually delete it from the resulting FASTA, as we have done for del28.
Example .bed
and .fasta
files are provided in the bed
and fasta
directories respectivley for IKZF1.
These deletion transcripts are then appended to the reference transcriptome:
cat reference_transcriptome.fasta deletion.fasta > extended_transcriptome.fasta
Note: if there are specific existing transcripts that contain the missing exons, as was the case for del4-7 and ENST00000426121.1, then the reference transcriptome should be modified to remove them first.
Once created, the extended transcriptome can be indexed in the usual way. Using Salmon:
salmon index -t extended_transcriptome.fasta -i extended_transcriptome.idx/ -k 31
Samples can then be quantified using this extended custom transcriptome, for example with paired end reads:
salmon quant --dumpEq -i extended_transcriptome.idx/ -l A -1 r1.fastq.gz -2 r2.fastq.gz -o output_directory
The primary output file from Salmon, quant.sf
can now be analysed to extract the deletion abundances. Since all the deletion transcripts are named as an extension of a canonical IKZF1 transcript, then simply selecting all IKZF1 transcripts will capture the extended transcripts as well:
Using a file such as ikzf1_transcripts.txt
with each transcript name:
ENST00000331340
ENST00000343574
ENST00000346667
ENST00000349824
ENST00000357364
ENST00000359197
ENST00000413698
ENST00000426121
ENST00000438033
ENST00000439701
ENST00000440768
ENST00000462201
ENST00000471793
ENST00000484847
ENST00000492119
ENST00000492782
ENST00000612658
ENST00000615491
ENST00000641948
ENST00000642219
ENST00000645066
ENST00000646110
can then be used to get the quantifications for the gene of interest:
grep -f ikzf1_transcripts.txt output/quant.sf > ikzf1.quant.sf
These results can then be analysed for the relative rank of the deletion transcripts, e.g. the fourth column of TPM:
cat ikzf1.quant.sf | sort -nrk4
Empirically, when the del4-7 transcript contributes 5% or more of the total TPM for IKZF1 a biochemcial assay via qPCR confirmed the deletion.
An R script, filter_del47.R
takes two arguments (the threshold, and a complete quant.sf
) and will calculate if a sample exceeds this threshold.
R --vanilla --args --del47 0.05 --files ikzf1.quant.sf < filter_del47.R
Multiple files can be seperated by :
.
It will return two files: ikzf1_tpms.csv
which contains the reference and extended TPM values for each sample, and threshold_results.csv
which will indicate TRUE/FALSE for each sample if the del4-7 transcript exceeded the threshold.
A Bpipe (http://bpipe.org) pipeline for testing samples against a custom index is provided in ikzf1deletions.groovy
and can be invoked:
bpipe -n NUM_THREADS ikzf1deletions.groovy <fastq_files>
which will quantify the samples against the custom index with Salmon and run filer_del47.R
.
By defining a .bed
file with the expected deletion transcripts and collating all original and modified transcripts of a gene, the approaches used here can be modified for any gene type.
Extending the deletion methods described here to any gene of interest automatically is also in development as Toblerone (https://github.com/Oshlack/Toblerone/).