Improve cleanups, finish reviewing denovo assembly, and add visidata to denovo assembly conda env
jfy133 committed Oct 4, 2023
1 parent 7867a78 commit da654d8
Showing 11 changed files with 82 additions and 37 deletions.
2 changes: 1 addition & 1 deletion accessing-ancientmetagenomic-data.qmd
@@ -443,7 +443,7 @@ ENTER!
:::

```bash
rm -r /<PATH>/<TO>/accessing-metagenomic-data
rm -r /<PATH>/<TO>/accessing-metagenomic-data*
```

Once deleted you can move elsewhere (e.g. `cd ~`).
17 changes: 17 additions & 0 deletions ancient-metagenomic-pipelines.qmd
@@ -10,8 +10,25 @@ For this chapter's exercises, if not already performed, you will need to create
```bash
conda activate ancient-metagenomic-pipelines
```

To download the data for this chapter, please download the following archive, extract the tar, and change into the resulting directory.

For example:

```bash
wget -O ancient-metagenomic-pipelines.tar.gz "https://www.dropbox.com/scl/fi/jyz9n3h8yt8jeovviizf4/phylogenomics.tar.gz?rlkey=1aowh0grdvphht2h8h6qox0fa&dl=0" -P /<PATH>/<TO>/<SOMEWHERE_TO_STORE>/
tar -xzf ancient-metagenomic-pipelines.tar.gz
cd ancient-metagenomic-pipelines
```
:::

::: {.callout-warning}
There are additional software requirements for this chapter.

Please see the relevant chapter section in [Before you start](/before-you-start.qmd) before continuing with this chapter.
:::


## Introduction

A **pipeline** is a series of linked computational steps, where the output of one process becomes the input of the next. Pipelines are critical for managing the huge quantities of data that are now being generated regularly as part of ancient DNA analyses. In this chapter we will go through three dedicated ancient DNA pipelines - all with some (or all!) functionality geared to ancient metagenomics - to show you how you can speed up the more routine aspects of the basic analyses we've learnt about earlier in this textbook through workflow automation.
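
To make the idea concrete, here is a minimal sketch of such a chain written as plain bash, using two tools covered elsewhere in this book; the file names are hypothetical placeholders rather than part of any chapter's dataset.

```bash
## A minimal 'pipeline' by hand: the trimmed reads produced by fastp
## become the input of the MEGAHIT assembly step.
## Input/output file names are hypothetical placeholders.
fastp --in1 sample_R1.fastq.gz --in2 sample_R2.fastq.gz \
      --out1 sample_trimmed_R1.fastq.gz --out2 sample_trimmed_R2.fastq.gz

megahit -1 sample_trimmed_R1.fastq.gz -2 sample_trimmed_R2.fastq.gz \
        -o assembly/
```

Dedicated pipeline tools automate exactly this kind of chaining, and additionally handle logging, resuming, and scaling the same steps across many samples.
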
29 changes: 15 additions & 14 deletions assets/envs/denovo-assembly.yml
@@ -4,17 +4,18 @@ channels:
- bioconda
- defaults
dependencies:
- fastp
- megahit
- bowtie2
- samtools
- bioawk
- diamond
- metabat2
- maxbin2
- concoct
- checkm-genome
- gunc
- pydamage
- prokka
- k8
- bioconda::fastp
- bioconda::megahit
- bioconda::bowtie2
- bioconda::samtools
- bioconda::bioawk
- bioconda::diamond
- bioconda::metabat2
- bioconda::maxbin2
- bioconda::concoct
- bioconda::checkm-genome
- bioconda::gunc
- bioconda::pydamage
- bioconda::prokka
- conda-forge::k8
- conda-forge::visidata
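
If you created this environment before the change, a sketch like the following would pick up the newly added visidata dependency; it assumes the environment is named `denovo-assembly` (the YAML's `name:` field is not shown in this hunk).

```bash
## Update an existing conda environment from the revised YAML so that
## visidata becomes available; assumes the environment is called 'denovo-assembly'.
conda env update -n denovo-assembly -f assets/envs/denovo-assembly.yml
conda activate denovo-assembly
vd --version   ## quick check that visidata is now installed
```
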
2 changes: 1 addition & 1 deletion authentication-decontamination.qmd
@@ -796,7 +796,7 @@ ENTER!
:::

```bash
rm -r /<PATH>/<TO>/authentication-decontamination
rm -rf /<PATH>/<TO>/authentication-decontamination*
```

Once deleted you can move elsewhere (e.g. `cd ~`).
2 changes: 1 addition & 1 deletion bare-bones-bash.qmd
@@ -720,7 +720,7 @@ ENTER!

```bash
cd ~ ## We shouldn't delete a directory while we are still in it. (It is possible though).
rm -r /<PATH>/<TO>/BareBonesBash
rm -r /<PATH>/<TO>/BareBonesBash*
```

We can also get out of the `conda` environment with
57 changes: 42 additions & 15 deletions denovo-assembly.qmd
@@ -400,15 +400,18 @@ During its bin refinement, metaWRAP first combines the results of all binning to

Using this approach, the authors of metaWRAP could show that it outperforms the individual binning tools and other bin refinement algorithms regarding both the **completeness** and the **contamination** estimated for the MAGs (@fig-denovoassembly-metawrap).

Due to the large memory requirements of the bin refinement module, we have already prepared the results of metaWRAP's bin refinement algorithm, `metawrap_50_10_bins.stats`, for you, which can be found in the folder `/<PATH>/<TO>/denovo-assembly/seqdata_files`.

::: {.callout-caution}
Running metaWRAP's bin refinement module requires about 72 GB of memory because it has to load a larger reference database containing the lineage-specific marker genes of checkM.

If your computer infrastructure cannot provide so much memory, I have prepared the results of metaWRAP's bin refinement algorithm, `metawrap_50_10_bins.stats`, which can be found in the folder `/<PATH>/<TO>/denovo-assembly/seqdata_files`.
:::
If you have sufficient memory on your machine, you can run the following steps to perform the bin refinement yourself.

::: {.callout-warning title="Example commands - do not run!" collapse="true"}
To apply metaWRAP's bin refinement to the bins that we obtained from metaBAT2 and MaxBin2, we first need to install the software checkM [@Parks2015] that will provide the lineage-specific marker gene catalogue:

```{bash, eval = F}
## Only run if you have at least 72 GB of memory!
mkdir checkM
wget -O checkM/checkm_data_2015_01_16.tar.gz \
https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
@@ -417,11 +420,11 @@ tar xvf checkM/checkm_data_2015_01_16.tar.gz -C checkM
echo checkM | checkm data setRoot checkM
```

<!-- up to here -->

Afterwards, we can execute metaWRAP's bin refinement module:

```{bash, eval = F}
## Only run if you have at least 72 GB of memory!
mkdir -p metawrap/BIN_REFINEMENT/2612
metawrap bin_refinement -o metawrap/BIN_REFINEMENT/2612 \
-t 14 \
@@ -436,8 +439,11 @@ Once finished you can deactivate the metawrap conda environment.
```bash
conda deactivate
```
:::

The latter step will produce a summary file, `metawrap_50_10_bins.stats`, that lists all retained bins and some key characteristics, such as the genome size, the completeness estimate, and the contamination estimate. The latter two can be used to assign a quality score according to the Minimum Information for MAG (MIMAG; see info box).
:::


::: {.callout-note title="The Minimum Information for MAG (MIMAG)"}

@@ -458,7 +464,7 @@ As these two steps will run rather long and need a large amount of memory and di

**How many bins were retained after the refinement with metaWRAP? How many high-quality and medium-quality MAGs did the refinement yield following the MIMAG criteria?**

Hint: You can more easily visualise tables on the terminal using the Python program `visidata`. After installing it with `pip install visidata`, you can open a table using `vd -f tsv /<PATH>/<TO>/denovo-assembly/metawrap_50_10_bins.stats`. Next to separating the columns nicely, it allows you to perform a lot of operations like sorting conveniently. Check the cheat sheet [here](https://jsvine.github.io/visidata-cheat-sheet/en/).
Hint: You can more easily visualise tables on the terminal using the Python program `visidata`. You can open a table using `vd -f tsv /<PATH>/<TO>/denovo-assembly/metawrap_50_10_bins.stats` (press <kbd>ctrl</kbd>+<kbd>q</kbd> to exit). As well as separating the columns nicely, it allows you to perform a lot of operations, such as sorting, conveniently. Check the cheat sheet [here](https://jsvine.github.io/visidata-cheat-sheet/en/). A command-line alternative is sketched after this question.

:::
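
If you prefer to stay on the command line rather than use visidata, a rough sketch along these lines tallies candidate high-quality bins. It assumes the usual metaWRAP `.stats` layout with completeness and contamination in the second and third columns (check the header line first), and uses the MIMAG thresholds of >90% completeness and <5% contamination (high quality additionally requires rRNA/tRNA presence).

```bash
## Count candidate high-quality bins (completeness > 90 %, contamination < 5 %).
## Assumes completeness and contamination are columns 2 and 3 of the tab-separated
## .stats file, as in metaWRAP's standard output -- verify with `head -1` first.
awk -F'\t' 'NR > 1 && $2 > 90 && $3 < 5 {n++} END {print n " candidate high-quality bins"}' \
    /<PATH>/<TO>/denovo-assembly/metawrap_50_10_bins.stats
```
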

@@ -495,14 +501,17 @@ While all of these tools can do the job, I typically prefer to use the program M
::: {.callout-caution}
The latest version of the pre-computed GTDB reference database (r207) requires about 105 GB of hard disk space and 700 GB of memory to run.

As our computer infrastructure does not provide so much memory for every user, I pre-computed the results. You can find the results `2612.mmseqs2_gtdb.tsv` in the folder `/<PATH>/<TO>/denovo-assembly/seqdata_files`.
Therefore, for those on smaller computational infrastructure, I have pre-computed the results. You can find the result file `2612.mmseqs2_gtdb.tsv` in the folder `/<PATH>/<TO>/denovo-assembly/seqdata_files`.

An alternative for users with a less powerful infrastructure is the program [KrakenUniq](https://github.com/fbreitwieser/krakenuniq).
:::
An alternative for users with a less powerful infrastructure is the program [KrakenUniq](https://github.com/fbreitwieser/krakenuniq). However, if your infrastructure has sufficient computational resources, or you want to see how you would run `mmseqs2`, see the collapsed block.


::: {.callout-warning title="Example commands - do not run!" collapse="true"}

Before running MMSeqs2's _taxonomy_ workflow against the GTDB reference database, we need to download and install the database:

```{bash, eval = F}
## Only run if you have at least 105 GB of disk space!
mkdir -p refdbs/mmseqs2/gtdb
mmseqs databases GTDB \
refdbs/mmseqs2/gtdb /tmp --threads 14
@@ -511,6 +520,7 @@ mmseqs databases GTDB \
Subsequently, we can align all the contigs of the sample 2612 against the GTDB r207 with MMSeqs2:

```{bash, eval = F}
## Only run if you have at least 700 GB of memory!
mkdir mmseqs2
mmseqs createdb alignment/2612.fa mmseqs2/2612.contigs
mmseqs taxonomy mmseqs2/2612.contigs \
@@ -525,12 +535,16 @@ mmseqs taxonomy mmseqs2/2612.contigs \
mmseqs createtsv mmseqs2/2612.contigs \
mmseqs2/2612.mmseqs2_gtdb \
```
:::
:::

Let's inspect the file `2612.mmseqs2_gtdb.tsv` in the folder `/<PATH>/<TO>/denovo-assembly/seqdata_files`.

::: {.callout-tip title="Question" appearence="simple"}

**What is the proportion of contigs that could be assigned to the different taxonomic ranks, such as species or genus? What are the dominant taxa?**

Hint: You can access this information easily by opening the file using visidata: `vd 2612.mmseqs2_gtdb.tsv`
Hint: You can access this information easily by opening the file using visidata: `vd seqdata/2612.mmseqs2_gtdb.tsv` (press <kbd>ctrl</kbd>+<kbd>q</kbd> to exit). A command-line alternative is sketched after this question.

:::
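
As a command-line alternative to visidata for the question above, a sketch like this counts contigs per assigned rank; it assumes the rank sits in the third column of the `mmseqs createtsv` output (query, taxid, rank, name), so double-check your file's columns and path first.

```bash
## Tally contigs per taxonomic rank assigned by MMseqs2.
## Assumes column 3 of the TSV holds the rank; adjust the path to wherever the file lives.
cut -f 3 seqdata_files/2612.mmseqs2_gtdb.tsv | sort | uniq -c | sort -rn
```
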

@@ -564,12 +578,16 @@ To infer which taxa our five reconstructed MAGs represent, we can run the GTDBTK
::: {.callout-caution}
The latest version of the database used by GTDBTK (r207) requires about 70 GB of hard disk space and 80 GB of memory to run.

As our computer infrastructure does not provide so much memory for every user, I pre-computed the results. You can find the results `2612.gtdbtk_archaea.tsv` and `2612.gtdbtk_bacteria.tsv` in the folder `/<PATH>/<TO>/denovo-assembly/seqdata_files`.
:::
Therefore, for those on smaller computational infrastructure, I have pre-computed the results. You can find the result files `2612.gtdbtk_archaea.tsv` and `2612.gtdbtk_bacteria.tsv` in the folder `/<PATH>/<TO>/denovo-assembly/seqdata_files`.

If you want to see how you would run `gtdbtk classify_wf`, see the collapsed block.

::: {.callout-warning title="Example commands - do not run!" collapse="true"}

First, we need to install the GTDB database:

```{bash, eval = F}
## Only run if you have at least 70 GB disk space!
mkdir -p refdbs/gtdbtk
wget -O refdbs/gtdbtk/gtdbtk_v2_data.tar.gz \
https://data.gtdb.ecogenomic.org/releases/latest/auxillary_files/gtdbtk_v2_data.tar.gz
Expand All @@ -579,18 +597,23 @@ tar xvf refdbs/gtdbtk/gtdbtk_v2_data.tar.gz -C refdbs/gtdbtk
Afterwards, we can run GTDBTK's _classify_ workflow:

```{bash, eval = F}
## Only run if you have at least 80 GB of memory!
mkdir gtdbtk
GTDBTK_DATA_PATH="$PWD/refdbs/gtdbtk/gtdbtk_r207_v2" \
gtdbtk classify_wf --cpu 14 --extension fa \
--genome_dir metawrap/BIN_REFINEMENT/2612/metawrap_50_10_bins \
--out_dir gtdbtk
```
:::
:::

Let's inspect the files `2612.gtdbtk_archaea.tsv` and `2612.gtdbtk_bacteria.tsv` in the folder `/<PATH>/<TO>/denovo-assembly/seqdata_files`.

::: {.callout-tip title="Question" appearence="simple"}

**Do the classifications obtained by GTDBTK match the classifications that were assigned to the contigs using MMSeqs2? Would you expect these taxa given the archaeological context of the samples?**

Hint: You can access the classification results of GTDBTK easily by opening the file using visidata: `vd 2612.gtdbtk_archaea.tsv` and `vd 2612.gtdbtk_bacteria.tsv`
Hint: You can access the classification results of GTDBTK easily by opening the files using visidata: `vd 2612.gtdbtk_archaea.tsv` and `vd 2612.gtdbtk_bacteria.tsv` (press <kbd>ctrl</kbd>+<kbd>q</kbd> to exit). A command-line alternative is sketched after this question.

:::
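
For a quick look without visidata, the GTDB-Tk summary tables normally carry the bin name in the first column and the full classification string in the second, so a sketch like the following prints just those two columns (verify the header if your files differ).

```bash
## Print bin name and GTDB classification string from the two summary tables.
## Assumes the standard GTDB-Tk summary layout (user_genome, classification, ...).
cut -f 1,2 2612.gtdbtk_archaea.tsv 2612.gtdbtk_bacteria.tsv | column -t -s $'\t'
```
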

@@ -633,7 +656,7 @@ contigs! How many contigs does pyDamage consider to show evidence for ancient DN
power (prediction accuracy) does it have for this decision? Which MAGs are strongly "ancient" or
"modern"?**

Hint: You can access pyDamage's results easily by opening the file using visidata: `vd pydamage_results/pydamage_results.tsv`
Hint: You can access pyDamage's results easily by opening the file using visidata: `vd pydamage_results/pydamage_results.csv` (press <kbd>ctrl</kbd> + <kbd>q</kbd> to exit). A command-line alternative is sketched after this question.

:::
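
If you would rather count on the command line, a rough sketch like this tallies contigs that pyDamage flags at a q-value below 0.05; it assumes the CSV header contains a column literally named `qvalue`, so adjust the name if your file differs.

```bash
## Count contigs with a significant ancient DNA damage signal (q-value < 0.05).
## Assumes a comma-separated file with a header column called 'qvalue'.
awk -F',' 'NR == 1 {for (i = 1; i <= NF; i++) if ($i == "qvalue") c = i; next}
           $c < 0.05 {n++}
           END {print n " contigs with q-value < 0.05"}' \
    pydamage_results/pydamage_results.csv
```
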

@@ -658,15 +681,19 @@ One of the most commonly used annotation tools for MAGs is Prokka [@Seemann2014]

As well as returning information on the protein-coding genes, Prokka also returns the annotation of RNA genes (tRNAs and rRNAs), which will help us evaluate the quality of MAGs against the MIMAG criteria.

For this practical course, we will use Prokka and we will focus on annotating the MAG `bin.3.fa` that we reconstructed from the sample 2612.
For this practical course, we will use Prokka, and we will focus on annotating the refined MAG bin `bin.3.fa` (from the pre-computed metaWRAP refinement) that we reconstructed from the sample 2612.

```{bash, eval = F}
prokka --outdir prokka \
--prefix 2612_003 \
--compliant --metagenome --cpus 14 \
metawrap_50_10_bins/bin.3.fa
seqdata_files/metawrap_50_10_bins/bin.3.fa
```

:::{.callout-note}
If you performed your own metaWRAP bin refinement, you would find `bin.3.fa` in the `metawrap/BIN_REFINEMENT` directory.
:::
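
Once Prokka has finished, one way to get a first overview for the question below is to list the output directory and look at the plain-text summary Prokka writes alongside the annotation; the file name follows the `--prefix` we set above.

```bash
## List Prokka's output files and print its summary statistics,
## which include the number of CDS, rRNA, and tRNA features detected.
ls prokka/
cat prokka/2612_003.txt
```
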

::: {.callout-tip title="Question" appearence="simple"}

**Prokka has annotated our MAG. What type of files does Prokka return? How many genes/tRNAs/rRNAs were detected?**
@@ -705,7 +732,7 @@ ENTER!
:::

```bash
rm -r /<PATH>/<TO>/denovo-assembly
rm -rf /<PATH>/<TO>/denovo-assembly*
```

Once deleted you can move elsewhere (e.g. `cd ~`).
2 changes: 1 addition & 1 deletion genome-mapping.qmd
@@ -435,7 +435,7 @@ ENTER!
:::

```bash
rm -r /<PATH>/<TO>/genome-mapping
rm -r /<PATH>/<TO>/genome-mapping*
```

Once deleted you can move elsewhere (e.g. `cd ~`).
2 changes: 1 addition & 1 deletion phylogenomics.qmd
@@ -472,7 +472,7 @@ ENTER!
:::

```bash
rm -r /<PATH>/<TO>/phylogenomics
rm -r /<PATH>/<TO>/phylogenomics*
```

Once deleted you can move elsewhere (e.g. `cd ~`).
2 changes: 1 addition & 1 deletion python-pandas.qmd
@@ -925,7 +925,7 @@ ENTER!
:::

```bash
rm -r /<PATH>/<TO>/python-pandas
rm -r /<PATH>/<TO>/python-pandas*
```

Once deleted you can move elsewhere (e.g. `cd ~`).
2 changes: 1 addition & 1 deletion r-tidyverse.qmd
@@ -968,7 +968,7 @@ ENTER!
:::

```bash
rm -r /<PATH>/<TO>/r-tidyverse
rm -r /<PATH>/<TO>/r-tidyverse*
```

Once deleted you can move elsewhere (e.g. `cd ~`).
2 changes: 1 addition & 1 deletion taxonomic-profiling.qmd
@@ -1478,7 +1478,7 @@ ENTER!
:::
```bash
rm -r /<PATH>/<TO>/taxonomic-profiling
rm -r /<PATH>/<TO>/taxonomic-profiling*
```
Once deleted you can move elsewhere (e.g. `cd ~`).
