Skip to content

Commit

Permalink
Merge pull request #58 from SPAAM-community/post-session-cleanup
Browse files Browse the repository at this point in the history
Finish reviewing denovo asembly
  • Loading branch information
jfy133 authored Oct 4, 2023
2 parents fd4e4f4 + da654d8 commit d10670d
Show file tree
Hide file tree
Showing 11 changed files with 82 additions and 37 deletions.
2 changes: 1 addition & 1 deletion accessing-ancientmetagenomic-data.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -443,7 +443,7 @@ ENTER!
:::

```bash
rm -r /<PATH>/<TO>/accessing-metagenomic-data
rm -r /<PATH>/<TO>/accessing-metagenomic-data*
```

Once deleted you can move elsewhere (e.g. `cd ~`).
Expand Down
17 changes: 17 additions & 0 deletions ancient-metagenomic-pipelines.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -10,8 +10,25 @@ For this chapter's exercises, if not already performed, you will need to create
```bash
conda activate ancient-metagenomic-pipelines
```

To download the data for this chapter, please download following archive with, extract the tar, and change into the directory.

For example

```bash
wget -O ancient-metagenomic-pipelines.tar.gz https://www.dropbox.com/scl/fi/jyz9n3h8yt8jeovviizf4/phylogenomics.tar.gz?rlkey=1aowh0grdvphht2h8h6qox0fa&dl=0 -P /<PATH>/<TO>/<SOMEWHERE_TO_STORE>/
tar -xzf ancient-metagenomic-pipelines.tar.gz
cd ancient-metagenomic-pipelines
```
:::

::: {.callout-warning}
There are additional software requirements for this chapter

Please see the relevant chapter section in [Before you start](/before-you-start.qmd) before continuing with this chapter.
:::


## Introduction

A **pipeline** is a series of linked computational steps, where the output of one process becomes the input of the next. Pipelines are critical for managing the huge quantities of data that are now being generated regularly as part of ancient DNA analyses. In this chapter we will go through three dedicated ancient DNA pipelines - all with some (or all!) functionality geared to ancient metagenomics - to show you how you can speed up the more routine aspects of the basic analyses we've learnt about earlier in this text book through workflow automation.
Expand Down
29 changes: 15 additions & 14 deletions assets/envs/denovo-assembly.yml
Original file line number Diff line number Diff line change
Expand Up @@ -4,17 +4,18 @@ channels:
- bioconda
- defaults
dependencies:
- fastp
- megahit
- bowtie2
- samtools
- bioawk
- diamond
- metabat2
- maxbin2
- concoct
- checkm-genome
- gunc
- pydamage
- prokka
- k8
- bioconda::fastp
- bioconda::megahit
- bioconda::bowtie2
- bioconda::samtools
- bioconda::bioawk
- bioconda::diamond
- bioconda::metabat2
- bioconda::maxbin2
- bioconda::concoct
- bioconda::checkm-genome
- bioconda::gunc
- bioconda::pydamage
- bioconda::prokka
- conda-forge::k8
- conda-forge::visidata
2 changes: 1 addition & 1 deletion authentication-decontamination.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -796,7 +796,7 @@ ENTER!
:::

```bash
rm -r /<PATH>/<TO>/authentication-decontamination
rm -rf /<PATH>/<TO>/authentication-decontamination*
```

Once deleted you can move elsewhere (e.g. `cd ~`).
Expand Down
2 changes: 1 addition & 1 deletion bare-bones-bash.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -720,7 +720,7 @@ ENTER!

```bash
cd ~ ## We shouldn't delete a directory while we are still in it. (It is possible though).
rm -r /<PATH>/<TO>/BareBonesBash
rm -r /<PATH>/<TO>/BareBonesBash*
```

We can also get out of the `conda` environment with
Expand Down
57 changes: 42 additions & 15 deletions denovo-assembly.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -400,15 +400,18 @@ During its bin refinement, metaWRAP first combines the results of all binning to

Using this approach, the authors of metaWRAP could show that they can outperform the individual binning tools and other bin refinement algorithms both regarding the **completeness** and **contamination** that was estimated for the MAGs (@fig-denovoassembly-metawrap).

Due to the large memory requirements of the the bin refinment module, we have prepared the results of metaWRAP's bin refinement algorithm, `metawrap_50_10_bins.stats` for you already, which can be found in the folder `/<PATH>/<TO>/denovo-assembly/seqdata_files`.

::: {.callout-caution}
Running metaWRAP's bin refinement module requires about 72 GB of memory because it has to load a larger reference database containing the lineage-specific marker genes of checkM.

If your computer infrastructure cannot provide so much memory, I have prepared the results of metaWRAP's bin refinement algorithm, `metawrap_50_10_bins.stats`, which can be found in the folder `/<PATH>/<TO>/denovo-assembly/seqdata_files`.
:::
If you have sufficient computational memory resources, you can run the following steps to run the bin refinement yourself.

::: {.callout-warning title="Example commands - do not run!" collapse="true"}
To apply metaWRAP's bin refinement to the bins that we obtained from metaBAT2 and MaxBin2, we first need to install the software checkM [@Parks2015] that will provide the lineage-specific marker gene catalogue:

```{bash, eval = F}
## Only run if you have at least 72 GB of memory!
mkdir checkM
wget -O checkM/checkm_data_2015_01_16.tar.gz \
https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
Expand All @@ -417,11 +420,11 @@ tar xvf checkM/checkm_data_2015_01_16.tar.gz -C checkM
echo checkM | checkm data setRoot checkM
```

<!-- up to here -->

Afterwards, we can execute metaWRAP's bin refinement module:

```{bash, eval = F}
## Only run if you have at least 72 GB of memory!
mkdir -p metawrap/BIN_REFINEMENT/2612
metawrap bin_refinement -o metawrap/BIN_REFINEMENT/2612 \
-t 14 \
Expand All @@ -436,8 +439,11 @@ Once finished you can deactivate the metawrap conda environment.
```bash
conda deactivate
```
:::

The latter step will produce a summary file, `metawrap_50_10_bins.stats`, that lists all retained bins and some key characteristics, such as the genome size, the completeness estimate, and the contamination estimate. The latter two can be used to assign a quality score according to the Minimum Information for MAG (MIMAG; see info box).
:::


::: {.callout-note title="The Minimum Information for MAG (MIMAG)"}

Expand All @@ -458,7 +464,7 @@ As these two steps will run rather long and need a large amount of memory and di

**How many bins were retained after the refinement with metaWRAP? How many high-quality and medium-quality MAGs did the refinement yield following the MIMAG criteria?**

Hint: You can more easily visualise tables on the terminal using the Python program `visidata`. After installing it with `pip install visidata`, you can open a table using `vd -f tsv /<PATH>/<TO>/denovo-assembly/metawrap_50_10_bins.stats`. Next to separating the columns nicely, it allows you to perform a lot of operations like sorting conveniently. Check the cheat sheet [here](https://jsvine.github.io/visidata-cheat-sheet/en/).
Hint: You can more easily visualise tables on the terminal using the Python program `visidata`. You can open a table using `vd -f tsv /<PATH>/<TO>/denovo-assembly/metawrap_50_10_bins.stats`. (press <kdb>ctrl</kbd>+<kbd>q</kbd> to exit). Next to separating the columns nicely, it allows you to perform a lot of operations like sorting conveniently. Check the cheat sheet [here](https://jsvine.github.io/visidata-cheat-sheet/en/).

:::

Expand Down Expand Up @@ -495,14 +501,17 @@ While all of these tools can do the job, I typically prefer to use the program M
::: {.callout-caution}
The latest version of the pre-computed GTDB reference database (r207) requires about 105 GB of harddisk space and 700 GB of memory for running.

As our computer infrastructure does not provide so much memory for every user, I pre-computed the results. You can find the results `2612.mmseqs2_gtdb.tsv` in the folder `/<PATH>/<TO>/denovo-assembly/seqdata_files`.
Therefore for those on smaller computational infrastructure, I have pre-computed the results. You can find the results `2612.mmseqs2_gtdb.tsv` in the folder `/<PATH>/<TO>/denovo-assembly/seqdata_files`.

An alternative for users with a less powerful infrastructure is the program [KrakenUniq](https://github.com/fbreitwieser/krakenuniq).
:::
An alternative for users with a less powerful infrastructure is the program [KrakenUniq](https://github.com/fbreitwieser/krakenuniq), however if your infrastructure has sufficient computational resources, or you want to see how you would run `mmseqs2` see the collapsed block.


::: {.callout-warning title="Example commands - do not run!" collapse="true"}

Before running MMSeqs2's _taxonomy_ workflow against the GTDB reference database, we need to install it.

```{bash, eval = F}
## Only run if you have at least 105 GB of disk space!
mkdir -p refdbs/mmseqs2/gtdb
mmseqs databases GTDB \
refdbs/mmseqs2/gtdb /tmp --threads 14
Expand All @@ -511,6 +520,7 @@ mmseqs databases GTDB \
Subsequently, we can align all the contigs of the sample 2612 against the GTDB r207 with MMSeqs2:

```{bash, eval = F}
## Only run if you have at least 700 GB of memory!
mkdir mmseqs2
mmseqs createdb alignment/2612.fa mmseqs2/2612.contigs
mmseqs taxonomy mmseqs2/2612.contigs \
Expand All @@ -525,12 +535,16 @@ mmseqs taxonomy mmseqs2/2612.contigs \
mmseqs createtsv mmseqs2/2612.contigs \
mmseqs2/2612.mmseqs2_gtdb \
```
:::
:::

Lets inspect the file `2612.mmseqs2_gtdb.tsv` in the folder `/<PATH>/<TO>/denovo-assembly/seqdata_files`.

::: {.callout-tip title="Question" appearence="simple"}

**What is the proportion of contigs that could be assigned to the different taxonomic ranks, such as species or genus? What are the dominant taxa?**

Hint: You can access this information easily by opening the file using visidata: `vd 2612.mmseqs2_gtdb.tsv`
Hint: You can access this information easily by opening the file using visidata: `vd seqdata/2612.mmseqs2_gtdb.tsv` (press <kdb>ctrl</kbd>+<kbd>q</kbd> to exit)

:::

Expand Down Expand Up @@ -564,12 +578,16 @@ To infer which taxa our five reconstructed MAGs represent, we can run the GTDBTK
::: {.callout-caution}
The latest version of the database used by GTDBTK (r207) requires about 70 GB of harddisk space and 80 GB of memory for running.

As our computer infrastructure does not provide so much memory for every user, I pre-computed the results. You can find the results `2612.gtdbtk_archaea.tsv` and `2612.gtdbtk_bacteria.tsv` in the folder `/<PATH>/<TO>/denovo-assembly/seqdata_files`.
:::
Therefore for those on smaller computational infrastructure, I have pre-computed the results. You can find the results `2612.gtdbtk_archaea.tsv` and `2612.gtdbtk_bacteria.tsv` in the folder `/<PATH>/<TO>/denovo-assembly/seqdata_files`.

If you want to see how you would run the `gtdbtk classify_wf` see the collapsed block.

::: {.callout-warning title="Example commands - do not run!" collapse="true"}

First, we need to install the GTDB database:

```{bash, eval = F}
## Only run if you have at least 70 GB disk space!
mkdir -p refdbs/gtdbtk
wget -O refdbs/gtdbtk/gtdbtk_v2_data.tar.gz \
https://data.gtdb.ecogenomic.org/releases/latest/auxillary_files/gtdbtk_v2_data.tar.gz
Expand All @@ -579,18 +597,23 @@ tar xvf refdbs/gtdbtk/gtdbtk_v2_data.tar.gz -C refdbs/gtdbtk
Afterwards, we can run GTDBTK's _classify_ workflow:

```{bash, eval = F}
## Only run if you have at least 80 GB of memory!
mkdir gtdbtk
GTDBTK_DATA_PATH="$PWD/refdbs/gtdbtk/gtdbtk_r207_v2" \
gtdbtk classify_wf --cpu 14 --extension fa \
--genome_dir metawrap/BIN_REFINEMENT/2612/metawrap_50_10_bins \
--out_dir gtdbtk
```
:::
:::

Lets inspect files `2612.gtdbtk_archaea.tsv` and `2612.gtdbtk_bacteria.tsv` in the folder `/<PATH>/<TO>/denovo-assembly/seqdata_files`.

::: {.callout-tip title="Question" appearence="simple"}

**Do the classifications obtained by GTDBTK match the classifications that were assigned to the contigs using MMSeqs2? Would you expect these taxa given the archaeological context of the samples?**

Hint: You can access the classification results of GTDBTK easily by opening the file using visidata: `vd 2612.gtdbtk_archaea.tsv` and `vd 2612.gtdbtk_bacteria.tsv`
Hint: You can access the classification results of GTDBTK easily by opening the file using visidata: `vd 2612.gtdbtk_archaea.tsv` and `vd 2612.gtdbtk_bacteria.tsv` (press <kdb>ctrl</kbd>+<kbd>q</kbd> to exit)

:::

Expand Down Expand Up @@ -633,7 +656,7 @@ contigs! How many contigs does pyDamage consider to show evidence for ancient DN
power (prediction accuracy) does it have for this decision? Which MAGs are strongly "ancient" or
"modern"?**

Hint: You can access pyDamage's results easily by opening the file using visidata: `vd pydamage_results/pydamage_results.tsv`
Hint: You can access pyDamage's results easily by opening the file using visidata: `vd pydamage_results/pydamage_results.csv` (press <kbd>ctrl</kbd> + <kbd>q</kbd> to exit)

:::

Expand All @@ -658,15 +681,19 @@ One of the most commonly used annotation tools for MAGs is Prokka [@Seemann2014]

Next to returning information on the protein-coding genes, Prokka also returns the annotation of RNA genes (tRNAs and rRNAs), which will help us to evaluate the quality of MAGs regarding the MIMAG criteria.

For this practical course, we will use Prokka and we will focus on annotating the MAG `bin.3.fa` that we reconstructed from the sample 2612.
For this practical course, we will use Prokka and we will focus on annotating the pre-refined MAG bin `bin.3.fa` that we reconstructed from the sample 2612.

```{bash, eval = F}
prokka --outdir prokka \
--prefix 2612_003 \
--compliant --metagenome --cpus 14 \
metawrap_50_10_bins/bin.3.fa
seqdata_files/metawrap_50_10_bins/bin.3.fa
```

:::{.callout-note}
If you performed your own metaWRAP bin refineent, you would find the `bin.3.fa` in the `metawrap/BIN_REFINEMENT` directory.
:::

::: {.callout-tip title="Question" appearence="simple"}

**Prokka has annotated our MAG. What type of files does Prokka return? How many genes/tRNAs/rRNAs were detected?**
Expand Down Expand Up @@ -705,7 +732,7 @@ ENTER!
:::

```bash
rm -r /<PATH>/<TO>/denovo-assembly
rm -rf /<PATH>/<TO>/denovo-assembly*
```

Once deleted you can move elsewhere (e.g. `cd ~`).
Expand Down
2 changes: 1 addition & 1 deletion genome-mapping.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -435,7 +435,7 @@ ENTER!
:::

```bash
rm -r /<PATH>/<TO>/genome-mapping
rm -r /<PATH>/<TO>/genome-mapping*
```

Once deleted you can move elsewhere (e.g. `cd ~`).
Expand Down
2 changes: 1 addition & 1 deletion phylogenomics.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -472,7 +472,7 @@ ENTER!
:::

```bash
rm -r /<PATH>/<TO>/phylogenomics
rm -r /<PATH>/<TO>/phylogenomics*
```

Once deleted you can move elsewhere (e.g. `cd ~`).
Expand Down
2 changes: 1 addition & 1 deletion python-pandas.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -925,7 +925,7 @@ ENTER!
:::

```bash
rm -r /<PATH>/<TO>/python-pandas
rm -r /<PATH>/<TO>/python-pandas*
```

Once deleted you can move elsewhere (e.g. `cd ~`).
Expand Down
2 changes: 1 addition & 1 deletion r-tidyverse.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -968,7 +968,7 @@ ENTER!
:::

```bash
rm -r /<PATH>/<TO>/r-tidyverse
rm -r /<PATH>/<TO>/r-tidyverse*
```

Once deleted you can move elsewhere (e.g. `cd ~`).
Expand Down
2 changes: 1 addition & 1 deletion taxonomic-profiling.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -1478,7 +1478,7 @@ ENTER!
:::
```bash
rm -r /<PATH>/<TO>/taxonomic-profiling
rm -r /<PATH>/<TO>/taxonomic-profiling*
```
Once deleted you can move elsewhere (e.g. `cd ~`).
Expand Down

0 comments on commit d10670d

Please sign in to comment.