Build bioinformatics pipelines to improve efficiency and reduce time with Docker and Nextflow in AWS
Sam (Cheng-Hsiang) Lu
Email: [email protected]
Dr. Venkata
Email: [email protected]
B-cell acute lymphoblastic leukemia (B-ALL) is a cancer of the white blood cells that starts in the bone marrow, where blood cells are made. In people with B-ALL, the bone marrow makes too many immature B-cells, a type of white blood cell. These cells cannot function properly, and they can build up in the blood and bone marrow, crowding out healthy blood cells. B-ALL can be treated with chemotherapy and other medications, but it can be a serious and life-threatening condition. B-ALL has several subtypes, which are distinguished by the specific genetic changes present in the cancer cells.
This project uses ALLSorts and MiXCR to analyze data from B-ALL patients on an AWS EC2 instance. Each package has its own additional packages and dependencies to download, and one package's version may not be compatible with the others. Time is another issue: processing multiple samples one after another takes far longer than processing a single sample. We therefore combine the packages with Docker and Nextflow, which solves the compatibility issue and saves time by processing multiple samples at the same time.
ALLSorts is a package that classifies B-ALL subtypes. In this project, I use RNA-seq data to classify samples into the 18 known B-ALL subtypes and 5 meta-subtypes. Here is the link to the ALLSorts GitHub page: ALLSorts
MiXCR is a package that analyzes raw T or B cell receptor repertoire sequencing data. Here is the link to the MiXCR GitHub page: MiXCR
With Docker, people can use packages without installing the packages and their dependencies themselves. With Nextflow, we can automatically run different kinds of code by connecting each step's inputs and outputs. By combining the packages with Docker and Nextflow, people can generate their results with one line of code and save time as well.
wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.11.0-Linux-x86_64.sh
You can install Git by following these steps:
https://cloudaffaire.com/how-to-install-git-in-aws-ec2-instance/
You can find the original installation steps in this link: ALLSorts installation
- Create a folder in your terminal and use `git clone https://github.com/Oshlack/ALLSorts.git` to download ALLSorts.
- Go to the folder where you cloned ALLSorts and run `conda env create -f env/allsorts.yml`. It will create the "allsorts" environment.
- Activate the "allsorts" environment with either `source activate allsorts` or `conda activate allsorts`.
- Then install ALLSorts with `pip install .` (notice that you have to include the "." in this command).
- After much trial and error, I found that you have to remove numba with `conda uninstall numba --force` and then change its version with `conda install numba=0.52.0` to avoid further errors.
- Before you run a test, create a folder to store all your results. For example, run `mkdir ../test_results` from the ALLSorts root.
- You can now run a test with `python ALLSorts -samples tests/counts/test_counts.csv -destination ../test_results`.
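The steps above can be collected into one script. This is a minimal sketch, not the official ALLSorts installer: `DRY_RUN`, `run`, and `install_allsorts` are hypothetical names, `conda run -n allsorts` stands in for activating the environment, and the `-n allsorts` and `-y` flags are additions so the script does not stop for prompts. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch only: the ALLSorts setup steps above, collected into one function.
# DRY_RUN, run and install_allsorts are hypothetical names, not part of ALLSorts.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

install_allsorts() {
    run git clone https://github.com/Oshlack/ALLSorts.git
    run cd ALLSorts
    run conda env create -f env/allsorts.yml
    # "conda run -n allsorts" runs a command inside the environment,
    # which stands in for activating it in an interactive shell.
    run conda run -n allsorts pip install .
    # -n and -y are assumptions added so the steps target the environment without prompts.
    run conda uninstall -n allsorts numba --force -y
    run conda install -n allsorts numba=0.52.0 -y
    run mkdir -p ../test_results
    run conda run -n allsorts python ALLSorts -samples tests/counts/test_counts.csv -destination ../test_results
}

install_allsorts
```

Review the printed commands first, then rerun with `DRY_RUN=0` to execute them.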
Install Docker on the AWS EC2 instance:
sudo yum update -y
sudo amazon-linux-extras install docker
sudo service docker start
sudo systemctl enable docker
sudo usermod -a -G docker ec2-user
docker -v
- After installing Docker, I have to create a Dockerfile for ALLSorts. You can click here to check my ALLSorts Dockerfile.
- Create the ALLSorts container and push it to your DockerHub. If you don't have a DockerHub account yet, you can create one here: DockerHub. (Or use mine that is already built: My DockerHub)
# Create the ALLSorts container (the image name must match the one used in "docker tag" below)
docker build -t allsorts_dockerfile_083022 .
docker image ls
docker run -it allsorts_dockerfile_083022:latest bash
# Push and pull a container
docker images
docker login ## type in your user name and password
docker tag allsorts_dockerfile_083022:latest chenghsianglu/allsorts_dockerfile_083022
docker push chenghsianglu/allsorts_dockerfile_083022
docker pull chenghsianglu/allsorts_dockerfile_083022:latest
docker rmi chenghsianglu/allsorts_dockerfile_083022 ## if you want to remove it
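Because the same image name has to appear in the build, tag, and push commands, it can help to define it once. The sketch below is one way to do that. `image_ref`, `IMAGE_NAME`, and `REMOTE` are hypothetical names introduced for illustration, and the script only prints the Docker commands so they can be checked before running.

```shell
# Hypothetical helper: build a DockerHub reference from its parts.
image_ref() {
    # $1 = DockerHub user, $2 = image name, $3 = tag (defaults to latest)
    printf '%s/%s:%s\n' "$1" "$2" "${3:-latest}"
}

IMAGE_NAME="allsorts_dockerfile_083022"
REMOTE="$(image_ref chenghsianglu "$IMAGE_NAME")"

# Print the commands instead of running them, so they can be reviewed first.
echo "docker build -t $IMAGE_NAME ."
echo "docker tag $IMAGE_NAME:latest $REMOTE"
echo "docker push $REMOTE"
```

Replace `chenghsianglu` with your own DockerHub user name if you push your own copy.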
Install nextflow:
curl -s https://get.nextflow.io | bash
chmod +x nextflow
./nextflow
vi .bashrc
export PATH="/home/ec2-user:$PATH"
source ~/.bashrc
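If `nextflow` is still not found after these steps, one thing to check is whether the folder containing it is really on your `PATH`. The helper below is a small sketch. `path_has_dir` is a hypothetical name, and `/home/ec2-user` is taken from the export line above.

```shell
# Hypothetical helper: succeed if directory $1 is an entry of the PATH-style string $2.
path_has_dir() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

if path_has_dir /home/ec2-user "$PATH"; then
    echo "/home/ec2-user is on PATH"
else
    echo "/home/ec2-user is missing from PATH; add the export line to ~/.bashrc"
fi
```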
- Once you have installed Nextflow, you can start writing your ALLSorts Nextflow script, called `main.nf`. This is my script: Click here.
- Download `Rscript.R`, `gtf.txt`, and `df.txt` into the same location as `main.nf`. Find them with this link: Click here
- Also download `gencode.v38.annotation.gtf`. Download
- Use `mkdir Files` to make a folder that holds your input files.
- In `main.nf`, the first process, `run_R`, takes every DRAGEN file in the `Files` folder whose name ends with `quant.genes.sf`. My `Rscript.R` converts these files into a gene expression counts file, `counts.csv`.
- The second process, `run_Allsorts`, takes each `counts.csv`, activates the allsorts environment, runs ALLSorts, and stores all B-ALL subtype predictions in the `Results` folder.
After you install Docker and Nextflow and put all your `quant.genes.sf` files in the `Files` folder, which sits in the same location as `main.nf`, `Rscript.R`, `gtf.txt`, and `df.txt`, you can run the ALLSorts pipeline with the single line of code below.
nextflow run main.nf -with-docker chenghsianglu/allsorts_dockerfile_083022
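Before launching the pipeline, it can save time to confirm that every file it expects is in place. The function below is a hypothetical pre-flight check, not part of the pipeline itself. The file names come from the steps above, and `check_allsorts_inputs` is a name introduced here for illustration.

```shell
# Hypothetical pre-flight check; the file names come from the steps above.
check_allsorts_inputs() {
    dir="${1:-.}"
    missing=0
    for f in main.nf Rscript.R gtf.txt df.txt; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    # Count the quantification files that the first process would pick up.
    n=0
    for q in "$dir"/Files/*quant.genes.sf; do
        if [ -e "$q" ]; then
            n=$((n + 1))
        fi
    done
    echo "samples found: $n"
    if [ "$n" -eq 0 ]; then
        missing=1
    fi
    return "$missing"
}
```

For example, `check_allsorts_inputs . && nextflow run main.nf -with-docker chenghsianglu/allsorts_dockerfile_083022` starts the run only when nothing is missing.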
You can find the original installation steps in this link: MiXCR installation
- This time, I use Homebrew to install MiXCR package:
brew install milaboratory/all/mixcr
- Upgrade your MiXCR to the latest version:
brew upgrade mixcr
Install Docker (you can skip this step if you have already installed it):
sudo yum update -y
sudo amazon-linux-extras install docker
sudo service docker start
sudo systemctl enable docker
sudo usermod -a -G docker ec2-user
docker -v
- Create a Dockerfile for MiXCR. Click here
- Create the MiXCR container and push it to your DockerHub. (Or use mine that is already built: my DockerHub)
# Create the MiXCR container
docker build -t mixcr_dockerfile_082322 .
docker image ls
docker run -it mixcr_dockerfile_082322:latest bash
# Push and pull a container
docker images
docker login ## type in your user name and password
docker tag mixcr_dockerfile_082322:latest chenghsianglu/mixcr_dockerfile_082322
docker push chenghsianglu/mixcr_dockerfile_082322
docker pull chenghsianglu/mixcr_dockerfile_082322:latest
docker rmi chenghsianglu/mixcr_dockerfile_082322 ## if you want to remove it
Install Nextflow (you can skip this step if you have already installed it):
curl -s https://get.nextflow.io | bash
chmod +x nextflow
./nextflow
vi .bashrc
export PATH="/home/ec2-user:$PATH"
source ~/.bashrc
- Start writing your MiXCR Nextflow script, `main.nf`. This is my script: Click here.
- Use `mkdir Files` to create a folder that holds your input files. In my case, these are paired-end fastq files.
- The first process in my script, `run_mixcr_align`, takes one pair of paired-end fastq files from the `Files` folder whose names end with `fastq.gz`. This step aligns the raw sequencing data against a reference database of V, D, J, and C gene segments for the specified species and generates `alignments.vdjca` as its output.
- The second process, `run_mixcr_assemblePartial_1`, overlaps alignments that come from the same molecule and partially cover CDR3 regions.
- The third process, `run_mixcr_assemblePartial_2`, repeats the second one, because the MiXCR authors strongly recommend it: two consecutive rounds of assemblePartial sometimes increase efficiency.
- The fourth process, `run_mixcr_extend`, is typically used as part of a non-targeted RNA-Seq analysis pipeline for T cells, to recover some useful TCRs. The command takes the alignments (.vdjca) file as input and generates `clones.clns` as output.
- Last, the process `run_mixcr_export` exports clonotypes or raw alignments in tabular form. I export three different outputs: `clones.txt`, `clones.TRB.txt`, and `clones.IGH.txt`.
You can also find full details at MiLaboratories.
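To make the order of the processes concrete, the sketch below prints the MiXCR commands for one sample. It assumes MiXCR 3.x command names and options, so other versions may differ, and it may not match my `main.nf` exactly. The intermediate file names, the `hsa` species default, and the separate `assemble` step that writes `clones.clns` are assumptions. `print_mixcr_steps` is a hypothetical helper that only prints and does not run MiXCR.

```shell
# Sketch only, assuming MiXCR 3.x command names; the real main.nf may differ.
# print_mixcr_steps is a hypothetical helper that prints one sample's commands.
print_mixcr_steps() {
    r1="$1"                 # forward reads, e.g. sample_R1.fastq.gz
    r2="$2"                 # reverse reads, e.g. sample_R2.fastq.gz
    species="${3:-hsa}"     # assumption: human data by default
    echo "mixcr align -s $species -p rna-seq -OallowPartialAlignments=true $r1 $r2 alignments.vdjca"
    # Two consecutive rounds of assemblePartial, as described above.
    echo "mixcr assemblePartial alignments.vdjca rescued_1.vdjca"
    echo "mixcr assemblePartial rescued_1.vdjca rescued_2.vdjca"
    echo "mixcr extend rescued_2.vdjca extended.vdjca"
    # Assumption: clones.clns is produced by an assemble step after extend.
    echo "mixcr assemble extended.vdjca clones.clns"
    echo "mixcr exportClones clones.clns clones.txt"
    echo "mixcr exportClones -c TRB clones.clns clones.TRB.txt"
    echo "mixcr exportClones -c IGH clones.clns clones.IGH.txt"
}
```

Calling `print_mixcr_steps sample_R1.fastq.gz sample_R2.fastq.gz` lists the eight commands in order, which is handy for comparing against the MiLaboratories documentation for your installed version.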
After you install Docker and Nextflow and put all your paired-end fastq files in the `Files` folder, which sits in the same location as `main.nf`, you can run the MiXCR pipeline with the single line of code below. (If you have only 8 cores on your AWS EC2 instance, it is suggested to run 4 paired-end fastq files at a time.)
nextflow run main.nf -with-docker chenghsianglu/mixcr_dockerfile_082322
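Since the pipeline expects complete read pairs, a quick check of the `Files` folder can catch a missing mate before the run starts. The sketch below assumes the files are named `<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz`. That naming scheme and the helper name `list_fastq_pairs` are assumptions for illustration, so adjust the suffixes to your own data.

```shell
# Hypothetical helper: report whether each R1 file in a folder has its R2 mate.
# Assumes the naming scheme <sample>_R1.fastq.gz / <sample>_R2.fastq.gz.
list_fastq_pairs() {
    dir="${1:-Files}"
    for r1 in "$dir"/*_R1.fastq.gz; do
        # Skip the unexpanded pattern when the folder has no matching files.
        [ -e "$r1" ] || continue
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        sample="$(basename "${r1%_R1.fastq.gz}")"
        if [ -f "$r2" ]; then
            echo "$sample paired"
        else
            echo "$sample missing_R2"
        fi
    done
}
```

Running `list_fastq_pairs Files | grep -c ' paired$'` counts the complete pairs, which you can compare against the batch size suggested above for your instance.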
- Numbat pipelines
[1] Arber, D. A., Orazi, A., Hasserjian, R., Thiele, J., Borowitz, M. J., Le Beau, M. M., … Vardiman, J. W. (2016). The 2016 revision to the World Health Organization classification of myeloid neoplasms and acute leukemia. Blood, 127(20), 2391–2405.
[2] Gu, Z., Churchman, M. L., Roberts, K. G., Moore, I., Zhou, X., Nakitandwe, J., … Mullighan, C. G. (2019). PAX5-driven subtypes of B-progenitor acute lymphoblastic leukemia. Nature Genetics, 51(2), 296–307.
[3] Bolotin, D. A., Poslavsky, S., Mitrophanov, I., Shugay, M., Mamedov, I. Z., Putintseva, E. V., & Chudakov, D. M. (2015). MiXCR: software for comprehensive adaptive immunity profiling. Nature Methods, 12(5), 380–381.
[4] Bolotin, D. A., Poslavsky, S., Davydov, A. N., Frenkel, F. E., Fanchi, L., Zolotareva, O. I., Hemmers, S., Putintseva, E. V., Obraztsova, A. S., Shugay, M., Ataullakhanov, R. I., Rudensky, A. Y., Schumacher, T. N., & Chudakov, D. M. (2017). Antigen receptor repertoire profiling from RNA-seq data. Nature Biotechnology, 35, 908–911.