diff --git a/_freeze/denovo-assembly/execute-results/html.json b/_freeze/denovo-assembly/execute-results/html.json index f6a43351..b0cdd6da 100644 --- a/_freeze/denovo-assembly/execute-results/html.json +++ b/_freeze/denovo-assembly/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "26247914269cb2c3703a6206ca22171b", + "hash": "ae566a52313aa7a89a0a4650ac0091c5", "result": { - "markdown": "---\ntitle: Introduction to _de novo_ Genome Assembly\nauthor: Alexander Hübner\nbibliography: assets/references/chapters/denovo-assembly/references.bib\n---\n\n::: {.cell}\n\n:::\n\n\n\n::: {.callout-tip}\nFor this chapter's exercises, if not already performed, you will need to create the [conda environment](before-you-start.qmd#creating-a-conda-environment) from the `yml` file in the following [link](https://github.com/SPAAM-community/intro-to-ancient-metagenomics-book/raw/main/assets/envs/denovo-assembly.yml) (right click and save as to download), and once created, activate the environment with:\n\n```bash\nconda activate denovo-assembly\n```\n:::\n\n### Lecture\n\nLecture slides and video from the [2022 edition of the summer school](https://www.spaam-community.org/wss-summer-school/#/2022/README).\n\n\n\nPDF version of these slides can be downloaded from [here](https://github.com/SPAAM-community/wss-summer-school/raw/main/docs/assets/slides/2022/4c-intro-to-denovoassembly/SPAAM%20Summer%20School%202022%20-%204C%20-%20Genome%20Assembly.pdf).\n\n\n\n## Introduction\n\nFirst of all, what is a **metagenomic** sample? Metagenomic sample is a sample that consists of DNA from more than one source. The number and the type of sources might vary widely between different samples. Typical sources for ancient remains are e.g. the host organism and the microbial species. The important part is that we generally do not know the origin of a DNA molecule prior to analysing the sequencing data generated from the DNA library of this sample. In the example presented in (@fig-denovoassembly-overview), our metagenomic sample has DNA from three different sources, here coloured in blue, red, and yellow.\n\n\n![Overview of the ways how to analyse metagenomic sequencing data.](assets/images/chapters/denovo-assembly/metagenome_assembly_scheme.png){#fig-denovoassembly-overview}\n\nHow can we determine the sources of the DNA that we have in our metagenomic sample? There are three main options whose _pros_ and _cons_ are summarised in (@tbl-denovoassembly-proscons).\n\n\n::: {#tbl-denovoassembly-proscons .cell .tbl-cap-location-top tbl-cap='Pros and cons of the major methods for determining the sources of metagenomic DNA.'}\n\n------------------------------------------------------------------------------------------\n method pros cons \n--------------------------- ------------------------------ -------------------------------\n reference-based alignment highly sensitive, applicable requires all sources to be \n to aDNA samples represented in database \n\n single-genome assembly high-quality genomes from not available for ancient DNA \n cultivated bacteria samples \n\n metagenome assembly able to recover unknown highly dependent on \n diversity present in sample preservation of ancient DNA \n------------------------------------------------------------------------------------------\n:::\n\n\nUntil recently, the only option for ancient DNA samples was to take the short-read sequencing data and align them against some known reference genomes. 
However, this approach is heavily relying on whether all sources of our samples are represented in the available reference genomes. If a source is missing in the reference database - in our toy example, this is the case for the yellow source (@fig-denovoassembly-overview) -, then we won't be able to detect it using this reference database.\n\nWhile there is a potential workaround for modern metagenomic samples, _single-genome assembly_ relies on being able to cultivate a microbial species to obtain an isolate. This is unfeasible for ancient metagenomic samples because there are no more viable microbial cells available that could be cultivated.\n\nAround 2015, a technical revolution started when the first programs, e.g. MEGAHIT [@LiMegahit2015] and metaSPAdes [@Nurk2017], were published that could successfully perform *de novo* assembly from metagenomic data. Since then, tens of thousands metagenomic samples have been assembled and it was revealed that even well studied environments, such as the human gut microbiome, have a lot of additional microbial diversity that has not been observed previously via culturing and isolation [@Almeida2021].\n\nThe technical advancement of being able to perform *de novo* assembly on metagenomic samples led to an explosion of studies that analysed samples that were considered almost impossible to study beforehand. For researchers that are exposed to ancient DNA, the imminent question arises: can we apply the same methods to ancient DNA data? In this practical course, we will walk through all required steps that are necessary to successfully perform _de novo_ assembly from ancient DNA metagenomic sequencing data and show you what you can do once you have obtained the data.\n\n## Practical course\n\n### Sample overview\n\nFor this practical course, I selected a palaeofaeces sample from the study by @Maixner2021, who generated deep metagenomic sequencing data for four palaeofaeces samples that were excavated from an Austrian salt mine in Hallstatt and were associated with the Celtic Iron Age. We will focus on the youngest sample, **2612**, which was dated to be just a few hundred years old (@fig-denovoassembly-maixner).\n\n![The graphical abstract of Maixner et al. (2021).](assets/images/chapters/denovo-assembly/Maixner2021_GraphicalAbstract.jpg){#fig-denovoassembly-maixner}\n\nHowever, because the sample was very deeply sequenced for more than 250 million paired-end reads and we do not want to wait for days for the analysis to finish, we will not use all data but just a sub-sample of them.\n\nYou can access the sub-sampled data by changing to the folder \n\n```bash\nln -s /vol/volume/denovo_assembly/2612_R1.fastq.gz 2612_R1.fastq.gz\nln -s /vol/volume/denovo_assembly/2612_R2.fastq.gz 2612_R2.fastq.gz\n```\n\non the cluster. 
If you would like to repeat the practical on your own computational infrastructure, you can download the sequencing data from here:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nwget https://share.eva.mpg.de/index.php/s/CtLq2R9iqEcAFyg/download/2612_R1.fastq.gz\nwget https://share.eva.mpg.de/index.php/s/mc5JrpDWdL4rC24/download/2612_R2.fastq.gz\n```\n:::\n\n\n::: {.callout-tip title=\"Question\" appearance=\"simple\"}\n\n**How many sequences are in each FastQ file?**\n\nHint: You can run either `seqtk size ` or `bioawk -c fastx 'END{print NR}' ` to find this out.\n\n:::\n\n::: {.callout-note collapse=\"true\" title=\"Answer\"}\nThere are about 3.25 million paired-end sequences in these files.\n:::\n\n### Preparing the sequencing data for _de novo_ assembly\n\nBefore running the actual assembly, we need to pre-process our sequencing data. Typical pre-processing steps include the trimming of adapter sequences and barcodes from the sequencing data and the removal of host or contaminant sequences, such as the bacteriophage PhiX, which is commonly sequenced as a quality control.\n\nMany assembly pipelines, such as [nf-core/mag](https://nf-co.re/mag), have these steps automatically included, however, these steps can be performed prior to it, too. For this practical course, I have performed these steps for us and we could directly continue with the _de novo_ assembly.\n\n::: {.callout-important}\nThe characteristic of ancient DNA samples that pre-determines the success of the _de novo_ assembly is the **distribution of the DNA molecule length**. Determine this distribution prior to running the _de novo_ assembly to be able to predict the results of the _de novo_ assembly.\n:::\n\nHowever, since the average length of the insert size of sequencing data (@fig-denovoassembly-illumina) is highly correlated with the success of the assembly, we want to first evaluate it. For this we can make use of the program fastp [@Chen2018].\n\n![Scheme of a read pair of Illumina sequencing data.](assets/images/chapters/denovo-assembly/insert_size_scheme.png){#fig-denovoassembly-illumina}\n\nFastp will try to overlap the two mates of a read pair and, if this is possible, return the length of the merged sequence, which is identical to insert size or the DNA molecule length. If the two mates cannot be overlapped, we will not be able to know the exact DNA molecule length but only know that it is longer than 290 bp (each read has a length of 150 bp and FastP requires a 11 bp overlap between the reads).\n\nThe final histogram of the insert sizes that is returned by FastP can tell us how well preserved the DNA of an ancient sample is (@fig-denovoassembly-lendist). The more the distribution is skewed to the right, i.e. the longer the DNA molecules are, the more likely we are to obtain long contiguous DNA sequences from the _de novo_ assembly. A distribution that is skewed to the left means that the DNA molecules are more highly degraded and this lowers our chances for obtaining long continuous sequences.\n\n![Example of a DNA molecule length distribution of a well-preserved ancient DNA\nsample. This histogram belongs to a 700,000-year-old horse that was preserved in permafrost, as reported in @Orlando2013, Fig. 
S4.](assets/images/chapters/denovo-assembly/Orlando2013_FigS4.5.png){#fig-denovoassembly-lendist}\n\nTo infer the distribution of the DNA molecules, we can run the command\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nfastp --in1 2612_R1.fastq.gz \\\n --in2 2612_R2.fastq.gz \\\n --stdout --merge -A -G -Q -L --json /dev/null --html overlaps.html \\\n> /dev/null\n```\n:::\n\n\n::: {.callout-tip title=\"Question\" appearance=\"simple\"}\n\n**Infer the distribution of the DNA molecule length of the sequencing data. Is this sample well-preserved?**\n\nHint: You can easily inspect the distribution by opening the HTML report `overlaps.html`.\n\n:::\n\n::: {.callout-note collapse=\"true\" title=\"Answer\"}\n\nHere is the histogram of the insert sizes determined by fastp (@fig-denovoassembly-insertsize). By default, fastp will only keep reads that are longer than 30 bp and requires an overlap between the read mates of 30 bp. The maximum read length is 150 bp, therefore, the histogram only spreads from 31 to 271 bp in total.\n\n![Histogram of the insert sizes determined by fastp](assets/images/chapters/denovo-assembly/2612_insertsize_hist.png){#fig-denovoassembly-insertsize}\n\nThe sequencing data for the sample **2612** were generated across eight different sequencing runs, which differed in their nominal length. Some sequencing runs were 2x 100 bp, while others were 2x 150 bp. This is the reason why we observe two peaks just short of 100 and 150 bp. The difference to the nominal length is caused by the quality trimming of the data.\n\nOverall, we have almost no short DNA molecules (< 50 bp) but most DNA molecules are longer than 80 bp. Additionally, there were > 200,000 read pairs that could not be overlapped. Therefore, we can conclude that the sample **2612** is moderately degraded ancient DNA sample and has many long DNA molecules.\n:::\n\n### _De novo_ assembly\n\nNow, we will actual perform the _de novo_ assembly on the sequencing data. For this, we will use the program MEGAHIT [@LiMegahit2015], a _de Bruijn_-graph assembler. \n\n![Overview of inferring _k_-mers from a DNA sequence. Credit:\n[https://medium.com/swlh/bioinformatics-1-k-mer-counting-8c1283a07e29](https://medium.com/swlh/bioinformatics-1-k-mer-counting-8c1283a07e29)](assets/images/chapters/denovo-assembly/kmer_counting.png){#fig-denovoassembly-kmers}\n\n_De Bruijn_ graph assemblers are together with overlap assemblers the predominant group of assemblers. They use _k_-mers (see @fig-denovoassembly-kmers for an example of _4_-mers) extracted from the short-read sequencing data to build the graph. For this, they connect each _k_-mer to adjacent _k_-mer using a directional edge. By walking along the edges and visiting each _k_-mer or node once, longer continuous sequences are reconstructed. 
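To build some intuition for this _k_-mer decomposition, you can list all _k_-mers of a toy sequence directly in the shell; the following is only an illustrative sketch (the sequence and the _k_-mer length of 4 are arbitrary) and not part of the assembly workflow itself:\n\n```bash\n# toy example: print all overlapping 4-mers of a short sequence\nseq=ATGGCGTGCA\nk=4\nfor ((i=0; i<=${#seq}-k; i++)); do\n    echo \"${seq:i:k}\"\ndone\n```\n\n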
This is a very rough explanation and I would advise you to watch this excerpt of a [lecture](https://www.youtube.com/watch?v=OY9Q_rUCGDw) by Rob Edwards from San Diego State University and a [Coursera lecture](https://youtu.be/TNYZZKrjCSk?t=112) by Ben Langmead from Johns Hopkins University, if you would like to learn more about it.\n\nAll _de Bruijn_ graph assemblers work in a similar way so the question is why do we use MEGAHIT and not other programs, such as metaSPAdes?\n\n::: {.callout-note title=\"Pros and cons of MEGAHIT\"}\n\n**Pros**:\n\n - low-memory footprint: can be run on computational infrastructure that does not have large memory resources \n - can assembly both single-end and paired-end sequencing data\n - the assembly algorithm can cope with the presence of high amounts of ancient DNA damage\n\n**Cons**:\n\n- lower assembly quality on modern comparative data, particularly a higher rate of mis-assemblies (CAMI II challenge)\n\n:::\n\nIn the [Warinner group](https://christinawarinner.com/about-us/), we realised after some tests that MEGAHIT has a clear advantage when ancient DNA damage is present at higher quantities. While it produced a higher number of mis-assemblies compared to metaSPAdes when being evaluated on simulated modern metagenomic data [Critical Assessment of Metagenome Interpretation II challenge, @Meyer2022], it produces more long contigs when ancient DNA damage is present.\n\nTo _de novo_ assemble the short-read sequencing data of the sample **2612** using MEGAHIT, we can run the command\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nmegahit -1 2612_R1.fastq.gz \\\n -2 2612_R2.fastq.gz \\\n -t 14 --min-contig-len 500 \\\n --out-dir megahit\n```\n:::\n\n\nThis will use the paired-end sequencing data as input and return all contigs that are at least 500 bp long.\n\n::: {.callout-caution}\nWhile MEGAHIT is able to use merged sequencing data, it is advised to use the unmerged paired-end data as input. In tests using simulated data I have observed that MEGAHIT performed slightly better when using the unmerged data and it likely has something to do with its internal algorithm to infer insert sizes from the paired-end data.\n:::\n\nWhile we are waiting for MEGAHIT to finish, here is a question:\n\n::: {.callout-tip title=\"Question\" appearance=\"simple\"}\n\n**Which _k_-mer lengths did MEGAHIT select for the _de novo_ assembly?**\n\n:::\n\n::: {.callout-note collapse=\"true\" title=\"Answer\"}\n\nBased on the maximum read length, MEGAHIT decided to use the _k_-mer lengths of 21, 29, 39, 59, 79, 99, 119, and 141.\n:::\n\nNow, as MEGAHIT has finished, we want to evaluate the quality of the assembly results. MEGAHIT has written the contiguous sequences (contigs) into a single FastA file stored in the folder `megahit`. We will process this FastA file with a small script, calN50, which will count the number of contigs and give us an overview of their length distribution.\n\nTo download the script and run it, we can execute the following commands:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nwget https://raw.githubusercontent.com/lh3/calN50/master/calN50.js\nk8 ./calN50.js megahit/final.contigs.fa\n```\n:::\n\n\n::: {.callout-tip title=\"Question\" appearance=\"simple\"}\n\n**How many contigs were assembled? What is the sum of the lengths of all contigs? 
What is the maximum, the median, and the minimum contig length?**\n\nHint: The maximum contig length is indicated by the label \"N0\", the median by the label \"N50\", and the minimum by the label \"N100\".\n\n:::\n\n::: {.callout-note collapse=\"true\" title=\"Answer\"}\n\nMEGAHIT assembled 3,606 contigs and their lengths sum up to 11.4 Mb. The maximum contig length was 448 kb, the median length was 15.6 kb, and the minimum length was 500 bp.\n\n:::\n\nThere is a final caveat when assembling ancient metagenomic data with MEGAHIT: while it is able to assemble sequencing data with a high percentage of C-to-T substitutions, it tends to introduce these changes into the contig sequences, too.\n\n![_De Bruijn_ graph with a bubble caused by a second allele. Adapted from\n[@Leggett2013, Figure 1a]](assets/images/chapters/denovo-assembly/Leggett2013_Fig1a.png){#fig-denovoassembly-debruijnbubble}\n\nThese C-to-T substitutions are similar to biological single-nucleotide polymorphisms in the sequencing data. Both lead to the introduction of bubbles in the _de Bruijn_ graph when two alleles are present in the k-mer sequences (@fig-denovoassembly-debruijnbubble) and the assembler decides during its pruning steps which allele to keep in the contig sequence. \n\nWhile it does not really matter which allele is kept for biological polymorphisms, it does matter for technical artefacts that are introduced by the presence of ancient DNA damage. In our group we realised that the gene sequences that were annotated on the contigs of MEGAHIT tended to have a higher number of nonsense mutations compared to the known variants of the genes. After manual inspection, we observed that many of these mutations appeared because MEGAHIT chose the damage-derived T allele over the C allele or the damage-derived A allele over a G allele [see @Klapper2023, Figure S1].\n\nTo overcome this issue, my colleagues Maxime Borry, James Fellows Yates and I developed a strategy to replace these damage-derived alleles in the contig sequences. This approach consists of aligning the short-read data against the contigs, performing genotyping along them, and finally replacing all alleles for which we have strong support for an allele that is different from the one selected by MEGAHIT.\n\nWe standardised this approach and added it to the Nextflow pipeline nf-core/mag [@Krakau2022]. It can simply be activated by providing the parameter `--ancient_dna`. \n\n::: {.callout-caution}\nWhile MEGAHIT is able to assemble ancient metagenomic sequencing data with high amounts of ancient DNA damage, it tends to introduce damage-derived T and A alleles into the contig sequences instead of the true C and G alleles. This can lead to a higher number of nonsense mutations in coding sequences. We strongly advise you to correct such mutations, e.g. by using the ancient DNA workflow of the Nextflow pipeline [nf-core/mag](https://nf-co.re/mag).\n:::\n\n### Aligning the short-read data against the contigs\n\nAfter the assembly, the next essential step that is required for many subsequent analyses is the alignment of the short-read sequencing data back to the assembled contigs.\n\nAnalyses that require this alignment information include:\n\n - the correction of the contig sequences to remove damage-derived alleles\n - the non-reference binning of contigs into MAGs, which relies on the coverage along the contigs\n - the quantification of the presence of ancient DNA damage\n\nAligning the short-read data to the contigs requires multiple steps:\n\n 1. 
Build an index from the contigs for the alignment program BowTie2\n 2. Align the short-read data against the contigs using this index with BowTie2\n 3. Calculate the mismatches and deletions of the short-read data against the contig sequences\n 4. Sort the aligned reads by the contig name and the coordinate they were aligned to\n 5. Index the resulting alignment file for faster access\n\nTo execute the alignment step, we can run the following commands:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nmkdir alignment\nbowtie2-build -f megahit/final.contigs.fa alignment/2612\nbowtie2 -p 14 --very-sensitive -N 1 -x alignment/2612 \\\n    -1 2612_R1.fastq.gz -2 2612_R2.fastq.gz | \\\nsamtools view -Sb - | \\\nsamtools calmd -u /dev/stdin megahit/final.contigs.fa | \\\nsamtools sort -o alignment/2612.sorted.calmd.bam -\nsamtools index alignment/2612.sorted.calmd.bam\n```\n:::\n\n\nHowever, these steps are rather time-consuming, even with as little sequencing data as we have for our course example. The alignment is rather slow because we allow a single mismatch in the seeds that are used by the aligner BowTie2 to quickly determine the position of a read along the contig sequences (parameter `-N 1`). This is necessary because otherwise we might not be able to align reads with ancient DNA damage present on them. Secondly, the larger the resulting alignment file is, the longer it takes to sort it by coordinate.\n\nTo save us some time and continue with the more interesting analyses, I prepared the resulting files for us. For this, I also corrected damage-derived alleles in the contig sequences. You can access these files on the cluster by running the following commands:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nmkdir alignment\nln -s /vol/volume/denovo-assembly/2612.sorted.calmd.bam alignment/\nln -s /vol/volume/denovo-assembly/2612.sorted.calmd.bam.bai alignment/\nln -s /vol/volume/denovo-assembly/2612.fa alignment/\n```\n:::\n\n\n\n:::{.callout-note title=\"Self-guided: data preparation\" collapse=\"true\"}\nFor everyone who runs this practical on their own infrastructure, you can download the files:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nmkdir alignment\nwget -O alignment/2612.sorted.calmd.bam \\\n    https://share.eva.mpg.de/index.php/s/bDKgFLj9GpRFdPg/download/2612.sorted.calmd.bam\nwget -O alignment/2612.sorted.calmd.bam.bai \\\n    https://share.eva.mpg.de/index.php/s/HWqg6fJj6ZEEBAL/download/2612.sorted.calmd.bam.bai\nwget -O alignment/2612.fa \\\n    https://share.eva.mpg.de/index.php/s/z6ZAai42RPribX5/download/final.contigs.fa\n```\n:::\n\n:::\n\n\n### Reconstructing metagenome-assembled genomes\n\nThere are typically two major approaches to studying the biological diversity of samples using the results obtained from the _de novo_ assembly. The first one is to reconstruct metagenome-assembled genomes (MAGs) and to study the species diversity.\n\n\"Metagenome-assembled genome\" is a convoluted term that means that we reconstructed a genome from metagenomic data via _de novo_ assembly. While these reconstructed genomes, particularly the high-quality ones, are most likely a good representation of the genomes of the organisms present in the sample, the scientific community refrains from calling them a species or a strain. The reason is that calling a genome a species or a strain would require additional analyses, many of which would include the cultivation of the organism. 
For many samples, this is not feasible and therefore the community stuck to the term MAG instead.\n\nThe most commonly applied method to obtain MAGs is the so-called \"non-reference binning\". Non-reference binning means that we do not try to identify contigs by aligning them against known reference genomes, but only use the characteristics of the contigs themselves to cluster them (@fig-denovoassembly-binning). \n\n![Scheme of binning _de novo_ assembled contigs into metagenome-assembled genomes. During the binning contigs are grouped into clusters based on their characteristics, such as tetra-nucleotide frequency and the coverage along the contigs. Clusters of contigs that fulfil a minimum of quality criteria are then considered as metagenome-assembled genomes. However, depending on the sample, a number of contigs will remain unbinned.](assets/images/chapters/denovo-assembly/mag_binning_scheme.png){#fig-denovoassembly-binning}\n\nThe two most commonly used characteristics are:\n\n - the tetra-nucleotide frequency: the frequency of all 4-mers (e.g. AAAA, AAAC, AAAG etc.) in the contig sequence\n - the coverage along the contig\n\nThe idea here is that two contigs that are derived from the same bacterial genome will likely have a similar nucleotide composition and coverage.\n\nThis approach works very well when the contigs are longer and they strongly differ in the nucleotide composition and the coverage from each other. However, if this is not the case, e.g. there is more than one strain of the same species in the sample, these approaches will likely not be able to assign some contigs to clusters: these contigs remain unbinned. In case there is a high number of unbinned contigs, one can also employ the more sensitive reference-based binning strategies but we will not cover this in this practical course.\n\nThere are a number of different binning tools out there that can be used for this task. Since this number is constantly growing, there have been attempts to standardise the test data sets that these tools are run on so that their performances can be easily compared. The most well-known attempt is the Critical Assessment of Metagenome Interpretation [Critical Assessment of Metagenome Interpretation II challenge, @Meyer2022], which released the latest comparison of commonly used tools for assembly, binning, and bin refinement in 2022. I recommend that you first check the performance of a new tool against the CAMI datasets before testing it to see whether it is worth using it.\n\nThree commonly used binners are:\n\n - metaBAT2 [@Kang2019, more than 1,300 citations]\n - MaxBin2 [@WuMaxbin2016, more than 1,200 citations]\n - CONCOCT [@Alneberg2014, more than 1,000 citation]\n\nEach of these three binners employs a slightly different strategy. While metaBAT2 simply uses the two previously mentioned metrics, the tetra-nucleotide frequency and the coverage along the contigs, MaxBin2 additionally uses single-copy marker genes to infer the taxonomic origin of contigs. In contrast, CONCOCT also just uses the two aforementioned metrics but first performs a Principal Component Analysis (PCA) on these metrics and uses the PCA results for the clustering.\n\nThe easiest way to run all these three programs is the program metaWRAP [@Uritskiy2018]. metaWRAP is in fact a pipeline that allows you to assemble your contigs, bin them, and subsequently refine the resulting MAGs. However, the pipeline is not very well written and does not contain any strategies to deal with ancient metagenomic sequencing data. 
Therefore, I prefer to use different pipelines, such as nf-core/mag for the assembly, and only use metaWRAP for binning and bin refinement.\n\nTo skip the first steps of metaWRAP and start straight with the binning, we need to create the folder structure and files that metaWRAP expects:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nmkdir -p metawrap/INITIAL_BINNING/2612/work_files\nln -s $PWD/alignment/2612.sorted.calmd.bam \\\n    metawrap/INITIAL_BINNING/2612/work_files/2612.bam\nmkdir -p metawrap/faux_reads\necho \"@\" > metawrap/faux_reads/2612_1.fastq\necho \"@\" > metawrap/faux_reads/2612_2.fastq\n```\n:::\n\n\nNow, we can start to run the binning. In this practical course, we will focus on metaBAT2 and MaxBin2. To bin the contigs with these binners, we execute:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nconda activate metawrap-env\nmetawrap binning -o metawrap/INITIAL_BINNING/2612 \\\n    -t 14 \\\n    -a alignment/2612.fa \\\n    --metabat2 --maxbin2 --universal \\\n    metawrap/faux_reads/2612_1.fastq metawrap/faux_reads/2612_2.fastq\nconda deactivate\n```\n:::\n\n\n::: {.callout-caution}\nMetaWRAP is not a well-written Python program and has not been updated for more than three years. It\nstill relies on the deprecated Python v2.7. This is in conflict with many other tools and therefore\nit requires its own conda environment, `metawrap-env`. Do not forget to deactivate this environment\nafterwards!\n:::\n\nMetaWRAP will run metaBAT2 and MaxBin2 for us and store their respective output into sub-folders in\nthe folder `metawrap/INITIAL_BINNING/2612`.\n\n::: {.callout-tip title=\"Question\" appearance=\"simple\"}\n\n**How many bins did metaBAT2 and MaxBin2 reconstruct, respectively? Is there a difference in the genome sizes of these reconstructed bins?**\n\nHint: You can use the previously introduced script `k8 ./calN50.js` to analyse the genome size of the individual bins.\n:::\n\n::: {.callout-note collapse=\"true\" title=\"Answer\"}\n\nmetaBAT2 reconstructed seven bins, while MaxBin2 reconstructed only five bins.\n\nWhen comparing the genome sizes of these bins, we can see that despite having reconstructed fewer bins, MaxBin2's bins have on average a larger genome size and all of them are at least 1.5 Mb. In contrast, five out of seven metaBAT2 bins are shorter than 1.5 Mb.\n\n:::\n\nWhile we could have run these two binning tools manually ourselves, there is another reason why we should use metaWRAP: it has a powerful bin refinement algorithm.\n\nAs we just observed, binning tools come to different results when performing non-reference binning on the same contigs. So how do we know which binning tool delivered the better or even correct results?\n\nA standard approach is to identify single-copy marker genes that are specific for certain taxonomic lineages, e.g. to all members of the family _Prevotellaceae_ or to all members of the domain Archaea. If we find lineage-specific marker genes from more than one lineage in our bin, something likely went wrong. While in certain cases horizontal gene transfer could explain such a finding, it is much more common that a binning tool clustered two contigs from two different taxa.\n\nDuring its bin refinement, metaWRAP first combines the results of all binning tools in all combinations. So it would merge the results of metaBAT2 and MaxBin2, metaBAT2 and CONCOCT, MaxBin2 and CONCOCT, and all three together. 
Afterwards, it evaluates the presence of lineage-specific marker genes on the contigs of the bins for every combination as well as for the individual binning tools themselves. If it finds marker genes of more than one lineage in a bin, it splits the bin into two. After having evaluated everything, metaWRAP selects the refined bins that have the highest quality score across the whole set of bins.\n\n![Performance of metaWRAP's bin refinement algorithm compared to other tools. Adapted from @Uritskiy2018, Fig. 4](assets/images/chapters/denovo-assembly/MetaWrap_Fig4.png){#fig-denovoassembly-metawrap}\n\nUsing this approach, the authors of metaWRAP could show that their method outperforms the individual binning tools and other bin refinement algorithms regarding both the **completeness** and the **contamination** that were estimated for the MAGs (@fig-denovoassembly-metawrap).\n\n::: {.callout-caution}\nRunning metaWRAP's bin refinement module requires about 72 GB of memory because it has to load a large reference database containing the lineage-specific marker genes of checkM.\n\nIf your computer infrastructure cannot provide so much memory, I have prepared the results of metaWRAP's bin refinement algorithm, `metawrap_50_10_bins.stats`, which can be found in the folder `/vol/volume/denovo_assembly`.\n:::\n\nTo apply metaWRAP's bin refinement to the bins that we obtained from metaBAT2 and MaxBin2, we first need to install the software checkM [@Parks2015], which will provide the lineage-specific marker gene catalogue:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nmkdir checkM\nwget -O checkM/checkm_data_2015_01_16.tar.gz \\\n    https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz\ntar xvf checkM/checkm_data_2015_01_16.tar.gz -C checkM\n\necho checkM | checkm data setRoot checkM\n```\n:::\n\n\nAfterwards, we can execute metaWRAP's bin refinement module:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nmkdir -p metawrap/BIN_REFINEMENT/2612\nmetawrap bin_refinement -o metawrap/BIN_REFINEMENT/2612 \\\n    -t 14 \\\n    -c 50 \\\n    -x 10 \\\n    -A metawrap/INITIAL_BINNING/2612/maxbin2_bins \\\n    -B metawrap/INITIAL_BINNING/2612/metabat2_bins\n```\n:::\n\n\nThe latter step will produce a summary file, `metawrap_50_10_bins.stats`, that lists all retained bins and some key characteristics, such as the genome size, the completeness estimate, and the contamination estimate. 
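For a quick first look at this table on the command line, standard Unix tools are sufficient; this is a minimal sketch (adjust the path if you use the pre-computed copy mentioned below):\n\n```bash\n# pretty-print the bin refinement summary produced by metaWRAP\ncolumn -t metawrap/BIN_REFINEMENT/2612/metawrap_50_10_bins.stats\n```\n\n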
The latter two can be used to assign a quality score according to the Minimum Information for MAG (MIMAG; see info box).\n\n::: {.callout-note title=\"The Minimum Information for MAG (MIMAG)\"}\n\nThe two most common metrics to evaluate the quality of MAGs are:\n\n - the **completeness**: how many of the expected lineage-specific single-copy marker genes were present in the MAG?\n - the **contamination**: how many of the expected lineage-specific single-copy marker genes were present more than once in the MAG?\n\nThese metrics are usually calculated using the marker-gene catalogue of checkM [@Parks2015], although other tools such as BUSCO [@Manni2021], GUNC [@Orakov2021] or checkM2 [@Chklovski2022] provide alternative estimates.\n\nDepending on the estimates of completeness and contamination plus the presence of RNA genes, MAGs are assigned to a quality category following the Minimum Information for MAG criteria [@Bowers2017]. You can find the overview [here](https://www.nature.com/articles/nbt.3893/tables/1).\n\n:::\n\nAs these two steps take rather long to run and need a large amount of memory and disk space, I have provided the results of metaWRAP's bin refinement. You can find the file here: `/vol/volume/denovo_assembly/metawrap_50_10_bins.stats`. Be aware that these results are based on the bin refinement of the results of three binning tools and include CONCOCT.\n\n::: {.callout-tip title=\"Question\" appearance=\"simple\"}\n\n**How many bins were retained after the refinement with metaWRAP? How many high-quality and medium-quality MAGs did the refinement yield following the MIMAG criteria?**\n\nHint: You can more easily visualise tables on the terminal using the Python program `visidata`. After installing it with `pip install visidata`, you can open a table using `vd -f tsv /vol/volume/denovo_assembly/metawrap_50_10_bins.stats`. Besides separating the columns nicely, it allows you to conveniently perform a lot of operations, such as sorting. Check the cheat sheet [here](https://jsvine.github.io/visidata-cheat-sheet/en/).\n\n:::\n\n::: {.callout-note collapse=\"true\" title=\"Answer\"}\n\nIn total, metaWRAP retained five bins, similarly to MaxBin2. Of these five bins, the bins `bin.3` and `bin.4` had completeness and contamination estimates that would qualify them as high-quality MAGs. However, we would need to check the presence of rRNA and tRNA genes. The other three bins are medium-quality bins because their completeness estimate was < 90%.\n\n:::\n\n### Taxonomic assignment of contigs\n\nWhat should we do when we simply want to know to which taxon a certain contig most likely belongs?\n\nReconstructing metagenome-assembled genomes requires multiple steps and might not even provide the answer if the contig of interest is not binned into a MAG. Instead, it is often sufficient to perform a sequence search against a reference database. \n\nThere are plenty of tools available for this task, such as:\n\n - BLAST/DIAMOND\n - Kraken2\n - Centrifuge\n - MMSeqs2\n\nFor each tool, we can either use pre-computed reference databases or build our own. The two taxonomic classification systems that are most commonly used are:\n\n - NCBI Taxonomy\n - GTDB\n\nAs for any task that involves the alignment of sequences against a reference database, the chosen reference database should fit the sequences you are searching for. If your reference database does not capture the diversity of your samples, you will not be able to assign a subset of the contigs. 
There is also a trade-off between a large reference database that contains all sequences and its memory requirement. @Wright2023 elaborated on this quite extensively when comparing Kraken2 against MetaPhlAn.\n\nWhile all of these tools can do the job, I typically prefer to use the program MMSeqs2 [@Steinegger2017] because it comes along with a very fast algorithm based on amino acid sequence alignment and implements a lowest common ancestor (LCA) algorithm (@fig-denovoassembly-mmseqs2). Recently, they implemented a _taxonomy_ workflow [@Mirdita2021] that allows contigs to be assigned to taxa efficiently. Luckily, it comes with multiple pre-computed reference databases, such as the GTDB v207 reference database [@Parks2020], and therefore it is even more accessible for users.\n\n![Scheme of the _taxonomy_ workflow implemented into MMSeqs2. Adapted from @Mirdita2021, Fig. 1.](assets/images/chapters/denovo-assembly/MMSeqs2_classify_Fig1.jpeg){#fig-denovoassembly-mmseqs2}\n\n::: {.callout-caution}\nThe latest version of the pre-computed GTDB reference database (r207) requires about 105 GB of hard disk space and 700 GB of memory for running.\n\nAs our computer infrastructure does not provide so much memory for every user, I pre-computed the results. You can find the results `2612.mmseqs2_gtdb.tsv` in the folder `/vol/volume/denovo_assembly`.\n\nAn alternative for users with a less powerful infrastructure is the program [KrakenUniq](https://github.com/fbreitwieser/krakenuniq).\n:::\n\nBefore running MMSeqs2's _taxonomy_ workflow against the GTDB reference database, we need to install the database. \n\n\n::: {.cell}\n\n```{.bash .cell-code}\nmkdir -p refdbs/mmseqs2/gtdb\nmmseqs databases GTDB \\\n    refdbs/mmseqs2/gtdb/mmseqs2_gtdb /tmp --threads 14\n```\n:::\n\n\nSubsequently, we can align all the contigs of the sample 2612 against the GTDB r207 with MMSeqs2:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nmkdir mmseqs2\nmmseqs createdb alignment/2612.fa mmseqs2/2612.contigs\nmmseqs taxonomy mmseqs2/2612.contigs \\\n    refdbs/mmseqs2/gtdb/mmseqs2_gtdb \\\n    mmseqs2/2612.mmseqs2_gtdb \\\n    /tmp \\\n    -a --tax-lineage 1 \\\n    --lca-ranks kingdom,phylum,class,order,family,genus,species \\\n    --orf-filter 1 \\\n    --remove-tmp-files 1 \\\n    --threads 14\nmmseqs createtsv mmseqs2/2612.contigs \\\n    mmseqs2/2612.mmseqs2_gtdb \\\n    mmseqs2/2612.mmseqs2_gtdb.tsv\n```\n:::\n\n\n::: {.callout-tip title=\"Question\" appearance=\"simple\"}\n\n**What is the proportion of contigs that could be assigned to the different taxonomic ranks, such as species or genus? What are the dominant taxa?**\n\nHint: You can access this information easily by opening the file using visidata: `vd 2612.mmseqs2_gtdb.tsv` \n\n:::\n\n::: {.callout-note collapse=\"true\" title=\"Answer\"}\n\nFrom the 3,606 contigs, MMSeqs2's _taxonomy_ workflow assigned 3,523 contigs to a taxon. For the rest, there was not enough information and they remained unassigned.\n\nFrom the 3,523 assigned contigs, 2,013 were assigned to the rank \"species\", while 1,137 could only be assigned to the rank \"genus\".\n\nThe largest number of contigs was assigned to the archaeal species _Halococcus morrhuae_ (n=386), followed by the bacterial species _Olsenella E sp003150175_ (n=298) and _Collinsella sp900768795_ (n=186).\n:::\n\n### Taxonomic assignment of MAGs\n\nMMSeqs2's _taxonomy_ workflow is very useful to classify all contigs taxonomically. 
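If you prefer to summarise the contig-level assignments on the command line instead of in visidata, a simple tally with standard Unix tools works, too. This sketch assumes the column order written by `mmseqs createtsv` (contig, taxon ID, rank, taxon name, followed by the lineage); double-check the columns of your own output before relying on the numbers:\n\n```bash\n# number of contigs assigned per taxonomic rank (column 3)\ncut -f 3 mmseqs2/2612.mmseqs2_gtdb.tsv | sort | uniq -c | sort -rn\n\n# ten most frequently assigned taxa (column 4)\ncut -f 4 mmseqs2/2612.mmseqs2_gtdb.tsv | sort | uniq -c | sort -rn | head\n```\n\n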
However, how would we determine which species we reconstructed by binning our contigs?\n\nThe simplest approach would be to summarise MMSeqs2's taxonomic assignments of all contigs of a MAG and then determine which lineage is the most frequent one. Although this would work in general, there is a more sophisticated approach: the GTDB toolkit [GTDBTK, @Chaumeil2020].\n\nGTDBTK performs three steps to assign a MAG to a taxonomic lineage:\n\n1. **Lineage identification** based on single-copy marker genes using Hidden Markov Models (HMMs)\n2. **Multiple-sequence alignment** of the identified marker genes\n3. Placement of the MAG genome into a **fixed reference tree** at class level\n\nThe last step is particularly clever. Based on the known diversity of a lineage present in the GTDB, it will construct a reference tree with all known taxa of this lineage. Afterwards, the tree structure is fixed and an algorithm attempts to create a new branch in the tree for placing the unknown MAG based on both the tree structure and the multiple-sequence alignment. \n\nIn most cases, both the simple approach of taking the majority of MMSeqs2's _taxonomy_ results and the GTDBTK approach lead to very similar results. However, GTDBTK performs better when determining whether a new MAG potentially belongs to a new genus or even a new family.\n\nTo infer which taxa our five reconstructed MAGs represent, we can run GTDBTK.\n\n::: {.callout-caution}\nThe latest version of the database used by GTDBTK (r207) requires about 70 GB of hard disk space and 80 GB of memory for running.\n\nAs our computer infrastructure does not provide so much memory for every user, I pre-computed the results. You can find the results `2612.gtdbtk_archaea.tsv` and `2612.gtdbtk_bacteria.tsv` in the folder `/vol/volume/denovo_assembly`.\n:::\n\nFirst, we need to install the GTDB database:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nmkdir -p refdbs/gtdbtk\nwget -O refdbs/gtdbtk/gtdbtk_v2_data.tar.gz \\\n    https://data.gtdb.ecogenomic.org/releases/latest/auxillary_files/gtdbtk_v2_data.tar.gz\ntar xvf refdbs/gtdbtk/gtdbtk_v2_data.tar.gz -C refdbs/gtdbtk\n```\n:::\n\n\nAfterwards, we can run GTDBTK's _classify_ workflow:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nmkdir gtdbtk\nGTDBTK_DATA_PATH=\"$PWD/refdbs/gtdbtk/gtdbtk_r207_v2\" \\\ngtdbtk classify_wf --cpu 14 --extension fa \\\n    --genome_dir metawrap/BIN_REFINEMENT/2612/metawrap_50_10_bins \\\n    --out_dir gtdbtk\n```\n:::\n\n\n::: {.callout-tip title=\"Question\" appearance=\"simple\"}\n\n**Do the classifications obtained by GTDBTK match the classifications that were assigned to the contigs using MMSeqs2? Would you expect these taxa given the archaeological context of the samples?**\n\nHint: You can access the classification results of GTDBTK easily by opening the file using visidata: `vd 2612.gtdbtk_archaea.tsv` and `vd 2612.gtdbtk_bacteria.tsv`\n\n:::\n\n::: {.callout-note collapse=\"true\" title=\"Answer\"}\n\nThe five MAGs reconstructed from the sample 2612 were assigned to the taxa:\n\n - `bin.1`: _Agathobacter rectalis_\n - `bin.2`: _Halococcus morrhuae_\n - `bin.3`: _Methanobrevibacter smithii_\n - `bin.4`: _Ruminococcus bromii_\n - `bin.5`: _Bifidobacterium longum_\n\nAll of these species were among the most frequent lineages that were identified by MMSeqs2's _taxonomy_ workflow, highlighting the large overlap between the methods.\n\nWe would expect all five species to be present in our sample. 
All MAGs but `bin.2` were assigned to human gut microbiome commensal species that are typically found in healthy humans. The MAG `bin.2` was assigned to a halophilic archaeal species, which is typically found in salt mines.\n\n:::\n\n### Evaluating the amount of ancient DNA damage\n\nOne of the common questions that remains at this point of our analysis is whether the contigs that we assembled show evidence for the presence of ancient DNA damage. If yes, we could argue that these microbes are indeed ancient, particularly when their DNA fragment length distribution is rather short, too. \n\nMAGs typically consist of either tens or hundreds of contigs. In this case, many of the commonly used tools for quantifying the amount of ancient DNA damage, such as DamageProfiler [@Neukamm2021] or mapDamage2 [@Jonsson2013], are not very well suited because they would require us to manually check each of their generated \"smiley plots\", which visualise the amount of C-to-T substitutions at the end of reads.\n\n![Overview of the model comparison performed by pyDamage. The green line represents the null model, i.e. the absence of ancient DNA damage, while the orange line represents the alternative model, i.e. the presence of ancient DNA damage.](assets/images/chapters/denovo-assembly/Pydamage_NZ_JHCB02000011.1.png){#fig-denovoassembly-pydamage}\n\nInstead, we will use the program pyDamage [@Borry2021], which was written with the particular use-case of metagenome-assembled contigs in mind. Although pyDamage can visualise the amount of C-to-T substitutions at the 5' end of reads, it goes a step further and fits two models to the substitution frequency (@fig-denovoassembly-pydamage). The null hypothesis is that the observed distribution of C-to-T substitutions at the 5' end of reads reflects a flat line, i.e. a case when no ancient DNA damage is present. The alternative model assumes that the distribution resembles an exponential decay, i.e. a case when ancient DNA damage is present. By comparing the fit of these two models to the observed data for each contig, pyDamage can quickly identify contigs that are likely of ancient origin without requiring the user to inspect the plots visually.\n\nWe can run pyDamage directly on the alignment data in BAM format:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\npydamage analyze -w 30 -p 14 alignment/2612.sorted.calmd.bam\n```\n:::\n\n\n::: {.callout-tip title=\"Question\" appearance=\"simple\"}\n\n**Evaluate the pyDamage results with respect to the amount of C-to-T substitutions observed on the\ncontigs! How many contigs does pyDamage consider to show evidence for ancient DNA damage? How much\npower (prediction accuracy) does it have for this decision? Which MAGs are strongly \"ancient\" or\n\"modern\"?**\n\nHint: You can access pyDamage's results easily by opening the file using visidata: `vd pydamage_results/pydamage_results.tsv`\n\n:::\n\n::: {.callout-note collapse=\"true\" title=\"Answer\"}\n\nFrom the 3,606 contigs, pyDamage inferred a q-value, i.e. a p-value corrected for multiple testing, of < 0.05 for 26 contigs. This is partially due to a high fraction of the contigs having insufficient information to discriminate between the models (prediction accuracy of < 0.5). However, the majority of the contigs with a prediction accuracy of > 0.5 still had q-values of 0.05 and higher. This suggests that overall the sample did not show strong evidence of ancient DNA damage.\n\nThis is also reflected in the MAGs. 
Although four of the five MAGs were human gut microbiome taxa, they did not show strong evidence of ancient DNA damage. This suggests that the sample is too young and too well preserved to have accumulated substantial DNA damage.\n:::\n\n### Annotating genomes for function\n\nThe second approach to studying the biological diversity of samples using the assembly results is to compare the reconstructed genes and their functions with each other.\n\nWhile it is very interesting to reconstruct the metagenome-assembled genomes from contigs and speculate about the evolution of a certain taxon, this does not always help when studying the ecology of a microbiome. Studying the ecology of a microbiome means understanding which taxa are present in an environment, what type of community they form, what kind of natural products they produce, etc.\n\nIn the absence of any data from culture isolates, it is rather difficult to determine from which species a reconstructed gene sequence is derived, particularly when the contigs are short. Many microbial species exchange genes with each other via horizontal gene transfer, which leads to multiple copies of a gene being present in our metagenome and increases the level of difficulty further. \n\nBecause of this, many researchers tend to annotate all genes of a MAG and compare the presence and absence of these genes across other genomes that were taxonomically assigned to the same taxon. This analysis, called pan-genome analysis, allows us to estimate the diversity of a microbial species with respect to its repertoire of protein-coding genes.\n\nOne of the most commonly used annotation tools for MAGs is Prokka [@Seemann2014], although it has recently been challenged by Bakta [@Schwengers2021]. The latter provides the same functionality as Prokka but incorporates more up-to-date reference databases for the annotation. Therefore, the scientific community is slowly shifting to Bakta.\n\nIn addition to returning information on the protein-coding genes, Prokka also returns the annotation of RNA genes (tRNAs and rRNAs), which will help us to evaluate the quality of MAGs regarding the MIMAG criteria.\n\nFor this practical course, we will use Prokka and we will focus on annotating the MAG `bin.3.fa` that we reconstructed from the sample 2612. \n\n\n::: {.cell}\n\n```{.bash .cell-code}\nprokka --outdir prokka \\\n    --prefix 2612_003 \\\n    --compliant --metagenome --cpus 14 \\\n    metawrap_50_10_bins/bin.3.fa\n```\n:::\n\n\n::: {.callout-tip title=\"Question\" appearance=\"simple\"}\n\n**Prokka has annotated our MAG. What type of files does Prokka return? How many genes/tRNAs/rRNAs were detected?**\n\nHint: Check the file `prokka/2612_003.txt` for the number of annotated elements.\n\n:::\n\n::: {.callout-note collapse=\"true\" title=\"Answer\"}\n\nProkka returns the following files:\n\n - `.faa`: the amino acid sequences of all identified coding sequences \n - `.ffn`: the nucleotide sequences of all identified coding sequences \n - `.fna`: all contigs in nucleotide sequence format renamed following Prokka's naming scheme\n - `.gbk`: all annotations and sequences in GenBank format\n - `.gff`: all annotations and sequences in GFF format\n - `.tsv`: the tabular overview of all annotations\n - `.txt`: the short summary of the annotation results\n\nProkka found 1,797 coding sequences, 32 tRNAs, but no rRNAs. Finding no rRNAs is a common issue when trying to assemble MAGs without long-read sequencing data and is not specific to ancient DNA samples. 
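If you want to verify these numbers yourself, the feature types can also be tallied from Prokka's tabular output; this small sketch assumes that the second column of the `.tsv` file holds the feature type (`ftype`):\n\n```bash\n# count annotated features by type (CDS, tRNA, rRNA, ...)\ncut -f 2 prokka/2612_003.tsv | tail -n +2 | sort | uniq -c\n```\n\n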
However, this means that we cannot technically call this MAG a high-quality MAG due to the lack of the rRNA genes. \n\n:::\n\n### Summary\n\nIn this practical course you have gone through all the important steps that are necessary for _de novo_ assembling ancient metagenomic sequencing data to obtain contiguous DNA sequences with little error. Furthermore, you have learned how to cluster these sequences into bins without using any references and how to refine them based on lineage-specific marker genes. For these refined bins, you have evaluated their quality regarding common standards set by the scientific community and assigned the MAGs to its most likely taxon. Finally, we learned how to infer the presence of ancient DNA damage and annotate them for RNA genes and protein-coding sequences.\n\n::: {.callout-caution title=\"Cautionary note - sequencing depth\"}\n\nBe aware of the sequencing depth when you assemble your sample. This sample used in this practical course was not obtained by randomly subsampling but I subsampled the sample so that we are able to reconstruct MAGs.\n\nThe original sample had almost 200 million reads, however, I subsampled it to less than 5 million reads. You usually need a lot of sequencing data for _de novo_ assembly and definitely more data than for reference-alignment based profiling. However, it also heavily depends on the complexity of the sample. So the best advice is: **just give it a try**!\n\n:::", + "markdown": "---\ntitle: Introduction to _de novo_ Genome Assembly\nauthor: Alexander Hübner\nbibliography: assets/references/chapters/denovo-assembly/references.bib\n---\n\n::: {.cell}\n\n:::\n\n\n\n::: {.callout-tip}\nFor this chapter's exercises, if not already performed, you will need to create the [conda environment](before-you-start.qmd#creating-a-conda-environment) from the `yml` file in the following [link](https://github.com/SPAAM-community/intro-to-ancient-metagenomics-book/raw/main/assets/envs/denovo-assembly.yml) (right click and save as to download), and once created, activate the environment with:\n\n```bash\nconda activate denovo-assembly\n```\n:::\n\n### Lecture\n\nLecture slides and video from the [2022 edition of the summer school](https://www.spaam-community.org/wss-summer-school/#/2022/README).\n\n\n\nPDF version of these slides can be downloaded from [here](https://github.com/SPAAM-community/wss-summer-school/raw/main/docs/assets/slides/2022/4c-intro-to-denovoassembly/SPAAM%20Summer%20School%202022%20-%204C%20-%20Genome%20Assembly.pdf).\n\n\n\n## Introduction\n\nFirst of all, what is a **metagenomic** sample? Metagenomic sample is a sample that consists of DNA from more than one source. The number and the type of sources might vary widely between different samples. Typical sources for ancient remains are e.g. the host organism and the microbial species. The important part is that we generally do not know the origin of a DNA molecule prior to analysing the sequencing data generated from the DNA library of this sample. In the example presented in (@fig-denovoassembly-overview), our metagenomic sample has DNA from three different sources, here coloured in blue, red, and yellow.\n\n![Overview of the ways how to analyse metagenomic sequencing data.](assets/images/chapters/denovo-assembly/metagenome_assembly_scheme.png){#fig-denovoassembly-overview}\n\nHow can we determine the sources of the DNA that we have in our metagenomic sample? 
There are three main options whose _pros_ and _cons_ are summarised in (@tbl-denovoassembly-proscons).\n\n\n::: {#tbl-denovoassembly-proscons .cell .tbl-cap-location-top tbl-cap='Pros and cons of the major methods for determining the sources of metagenomic DNA.'}\n\n------------------------------------------------------------------------------------------\n method pros cons \n--------------------------- ------------------------------ -------------------------------\n reference-based alignment highly sensitive, applicable requires all sources to be \n to aDNA samples represented in database \n\n single-genome assembly high-quality genomes from not available for ancient DNA \n cultivated bacteria samples \n\n metagenome assembly able to recover unknown highly dependent on \n diversity present in sample preservation of ancient DNA \n------------------------------------------------------------------------------------------\n:::\n\n\nUntil recently, the only option for ancient DNA samples was to take the short-read sequencing data and align them against some known reference genomes. However, this approach is heavily relying on whether all sources of our samples are represented in the available reference genomes. If a source is missing in the reference database - in our toy example, this is the case for the yellow source (@fig-denovoassembly-overview) -, then we won't be able to detect it using this reference database.\n\nWhile there is a potential workaround for modern metagenomic samples, _single-genome assembly_ relies on being able to cultivate a microbial species to obtain an isolate. This is unfeasible for ancient metagenomic samples because there are no more viable microbial cells available that could be cultivated.\n\nAround 2015, a technical revolution started when the first programs, e.g. MEGAHIT [@LiMegahit2015] and metaSPAdes [@Nurk2017], were published that could successfully perform *de novo* assembly from metagenomic data. Since then, tens of thousands metagenomic samples have been assembled and it was revealed that even well studied environments, such as the human gut microbiome, have a lot of additional microbial diversity that has not been observed previously via culturing and isolation [@Almeida2021].\n\nThe technical advancement of being able to perform *de novo* assembly on metagenomic samples led to an explosion of studies that analysed samples that were considered almost impossible to study beforehand. For researchers that are exposed to ancient DNA, the imminent question arises: can we apply the same methods to ancient DNA data? In this practical course, we will walk through all required steps that are necessary to successfully perform _de novo_ assembly from ancient DNA metagenomic sequencing data and show you what you can do once you have obtained the data.\n\n## Practical course\n\n### Sample overview\n\nFor this practical course, I selected a palaeofaeces sample from the study by @Maixner2021, who generated deep metagenomic sequencing data for four palaeofaeces samples that were excavated from an Austrian salt mine in Hallstatt and were associated with the Celtic Iron Age. We will focus on the youngest sample, **2612**, which was dated to be just a few hundred years old (@fig-denovoassembly-maixner).\n\n![The graphical abstract of Maixner et al. 
(2021).](assets/images/chapters/denovo-assembly/Maixner2021_GraphicalAbstract.jpg){#fig-denovoassembly-maixner}\n\nHowever, because the sample was very deeply sequenced for more than 250 million paired-end reads and we do not want to wait for days for the analysis to finish, we will not use all data but just a sub-sample of them.\n\nYou can access the sub-sampled data by changing to the folder \n\n```bash\nln -s /vol/volume/denovo-assembly/2612_R1.fastq.gz 2612_R1.fastq.gz\nln -s /vol/volume/denovo-assembly/2612_R2.fastq.gz 2612_R2.fastq.gz\n```\n\non the cluster. If you would like to repeat the practical on your own computational infrastructure, you will find information on how to access the files in the following boxes with the title \"Self-guided: data preparation\":\n\n\n:::{.callout-note title=\"Self-guided: data preparation\" collapse=\"true\"}\nFor everyone, who runs this practical on their own infrastructure, you can download the sequencing data from here:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nwget https://share.eva.mpg.de/index.php/s/CtLq2R9iqEcAFyg/download/2612_R1.fastq.gz\nwget https://share.eva.mpg.de/index.php/s/mc5JrpDWdL4rC24/download/2612_R2.fastq.gz\n```\n:::\n\n:::\n\n::: {.callout-tip title=\"Question\" appearance=\"simple\"}\n\n**How many sequences are in each FastQ file?**\n\nHint: You can run either `seqtk size ` or `bioawk -c fastx 'END{print NR}' ` to find this out.\n\n:::\n\n::: {.callout-note collapse=\"true\" title=\"Answer\"}\nThere are about 3.25 million paired-end sequences in these files.\n:::\n\n### Preparing the sequencing data for _de novo_ assembly\n\nBefore running the actual assembly, we need to pre-process our sequencing data. Typical pre-processing steps include the trimming of adapter sequences and barcodes from the sequencing data and the removal of host or contaminant sequences, such as the bacteriophage PhiX, which is commonly sequenced as a quality control.\n\nMany assembly pipelines, such as [nf-core/mag](https://nf-co.re/mag), have these steps automatically included, however, these steps can be performed prior to it, too. For this practical course, I have performed these steps for us and we could directly continue with the _de novo_ assembly.\n\n::: {.callout-important}\nThe characteristic of ancient DNA samples that pre-determines the success of the _de novo_ assembly is the **distribution of the DNA molecule length**. Determine this distribution prior to running the _de novo_ assembly to be able to predict the results of the _de novo_ assembly.\n:::\n\nHowever, since the average length of the insert size of sequencing data (@fig-denovoassembly-illumina) is highly correlated with the success of the assembly, we want to first evaluate it. For this we can make use of the program fastp [@Chen2018].\n\n![Scheme of a read pair of Illumina sequencing data.](assets/images/chapters/denovo-assembly/insert_size_scheme.png){#fig-denovoassembly-illumina}\n\nFastp will try to overlap the two mates of a read pair and, if this is possible, return the length of the merged sequence, which is identical to insert size or the DNA molecule length. If the two mates cannot be overlapped, we will not be able to know the exact DNA molecule length but only know that it is longer than 290 bp (each read has a length of 150 bp and FastP requires a 11 bp overlap between the reads).\n\nThe final histogram of the insert sizes that is returned by FastP can tell us how well preserved the DNA of an ancient sample is (@fig-denovoassembly-lendist). 
The more the distribution is skewed to the right, i.e. the longer the DNA molecules are, the more likely we are to obtain long contiguous DNA sequences from the _de novo_ assembly. A distribution that is skewed to the left means that the DNA molecules are more highly degraded and this lowers our chances of obtaining long continuous sequences.\n\n![Example of a DNA molecule length distribution of a well-preserved ancient DNA\nsample. This histogram belongs to a 700,000-year-old horse that was preserved in permafrost, as reported in @Orlando2013, Fig. S4.](assets/images/chapters/denovo-assembly/Orlando2013_FigS4.5.png){#fig-denovoassembly-lendist}\n\nTo infer the distribution of the DNA molecule lengths, we can run the command\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nfastp --in1 2612_R1.fastq.gz \\\n --in2 2612_R2.fastq.gz \\\n --stdout --merge -A -G -Q -L --json /dev/null --html overlaps.html \\\n> /dev/null\n```\n:::\n\n\n::: {.callout-tip title=\"Question\" appearance=\"simple\"}\n\n**Infer the distribution of the DNA molecule length of the sequencing data. Is this sample well-preserved?**\n\nHint: You can easily inspect the distribution by opening the HTML report `overlaps.html`.\n\n:::\n\n::: {.callout-note collapse=\"true\" title=\"Answer\"}\n\nHere is the histogram of the insert sizes determined by fastp (@fig-denovoassembly-insertsize). By default, fastp will only keep reads that are longer than 30 bp and requires an overlap between the read mates of 30 bp. The maximum read length is 150 bp; therefore, the histogram only spans from 31 to 271 bp in total.\n\n![Histogram of the insert sizes determined by fastp](assets/images/chapters/denovo-assembly/2612_insertsize_hist.png){#fig-denovoassembly-insertsize}\n\nThe sequencing data for the sample **2612** were generated across eight different sequencing runs, which differed in their nominal length. Some sequencing runs were 2x 100 bp, while others were 2x 150 bp. This is the reason why we observe two peaks just short of 100 and 150 bp. The difference from the nominal length is caused by the quality trimming of the data.\n\nOverall, we have almost no short DNA molecules (< 50 bp) and most DNA molecules are longer than 80 bp. Additionally, there were > 200,000 read pairs that could not be overlapped. Therefore, we can conclude that the sample **2612** is a moderately degraded ancient DNA sample with many long DNA molecules.\n:::\n\n### _De novo_ assembly\n\nNow we will actually perform the _de novo_ assembly on the sequencing data. For this, we will use the program MEGAHIT [@LiMegahit2015], a _de Bruijn_-graph assembler. \n\n![Overview of inferring _k_-mers from a DNA sequence. Credit:\n[https://medium.com/swlh/bioinformatics-1-k-mer-counting-8c1283a07e29](https://medium.com/swlh/bioinformatics-1-k-mer-counting-8c1283a07e29)](assets/images/chapters/denovo-assembly/kmer_counting.png){#fig-denovoassembly-kmers}\n\n_De Bruijn_ graph assemblers are, together with overlap-based assemblers, the predominant group of assemblers. They use _k_-mers (see @fig-denovoassembly-kmers for an example of _4_-mers) extracted from the short-read sequencing data to build the graph. For this, they connect each _k_-mer to the adjacent _k_-mers using directed edges. By walking along the edges and visiting each _k_-mer or node once, longer continuous sequences are reconstructed. 
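\n\nTo make this more tangible, here is a minimal Python sketch that builds a tiny _de Bruijn_ graph from the _4_-mers of two overlapping toy reads and walks it into a single sequence (a deliberately naive illustration of the principle, not how MEGAHIT is actually implemented):\n\n```python\nfrom collections import defaultdict\n\ndef debruijn_graph(reads, k=4):\n    # each k-mer becomes a directed edge from its (k-1)-mer prefix to its (k-1)-mer suffix\n    graph = defaultdict(list)\n    for read in reads:\n        for i in range(len(read) - k + 1):\n            kmer = read[i:i + k]\n            graph[kmer[:-1]].append(kmer[1:])\n    return graph\n\ndef walk(graph, start):\n    # naive walk: consume one outgoing edge at a time and append the last base of every node\n    contig, node = start, start\n    while graph[node]:\n        node = graph[node].pop()\n        contig += node[-1]\n    return contig\n\n# two overlapping toy reads drawn from the same 10 bp molecule ACGTTGCAAT\nreads = ['ACGTTGCA', 'TTGCAAT']\ngraph = debruijn_graph(reads, k=4)\nprint(walk(graph, 'ACG'))  # prints ACGTTGCAAT\n```\n\n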
This is a very rough explanation and I would advise you to watch this excerpt of a [lecture](https://www.youtube.com/watch?v=OY9Q_rUCGDw) by Rob Edwards from San Diego State University and a [Coursera lecture](https://youtu.be/TNYZZKrjCSk?t=112) by Ben Langmead from Johns Hopkins University, if you would like to learn more about it.\n\nAll _de Bruijn_ graph assemblers work in a similar way so the question is why do we use MEGAHIT and not other programs, such as metaSPAdes?\n\n::: {.callout-note title=\"Pros and cons of MEGAHIT\"}\n\n**Pros**:\n\n - low-memory footprint: can be run on computational infrastructure that does not have large memory resources \n - can assembly both single-end and paired-end sequencing data\n - the assembly algorithm can cope with the presence of high amounts of ancient DNA damage\n\n**Cons**:\n\n- lower assembly quality on modern comparative data, particularly a higher rate of mis-assemblies (CAMI II challenge)\n\n:::\n\nIn the [Warinner group](https://christinawarinner.com/about-us/), we realised after some tests that MEGAHIT has a clear advantage when ancient DNA damage is present at higher quantities. While it produced a higher number of mis-assemblies compared to metaSPAdes when being evaluated on simulated modern metagenomic data [Critical Assessment of Metagenome Interpretation II challenge, @Meyer2022], it produces more long contigs when ancient DNA damage is present.\n\nTo _de novo_ assemble the short-read sequencing data of the sample **2612** using MEGAHIT, we can run the command\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nmegahit -1 2612_R1.fastq.gz \\\n -2 2612_R2.fastq.gz \\\n -t 14 --min-contig-len 500 \\\n --out-dir megahit\n```\n:::\n\n\nThis will use the paired-end sequencing data as input and return all contigs that are at least 500 bp long.\n\n::: {.callout-caution}\nWhile MEGAHIT is able to use merged sequencing data, it is advised to use the unmerged paired-end data as input. In tests using simulated data I have observed that MEGAHIT performed slightly better when using the unmerged data and it likely has something to do with its internal algorithm to infer insert sizes from the paired-end data.\n:::\n\nWhile we are waiting for MEGAHIT to finish, here is a question:\n\n::: {.callout-tip title=\"Question\" appearance=\"simple\"}\n\n**Which _k_-mer lengths did MEGAHIT select for the _de novo_ assembly?**\n\n:::\n\n::: {.callout-note collapse=\"true\" title=\"Answer\"}\n\nBased on the maximum read length, MEGAHIT decided to use the _k_-mer lengths of 21, 29, 39, 59, 79, 99, 119, and 141.\n:::\n\nNow, as MEGAHIT has finished, we want to evaluate the quality of the assembly results. MEGAHIT has written the contiguous sequences (contigs) into a single FastA file stored in the folder `megahit`. We will process this FastA file with a small script, calN50, which will count the number of contigs and give us an overview of their length distribution.\n\nTo download the script and run it, we can execute the following commands:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nwget https://raw.githubusercontent.com/lh3/calN50/master/calN50.js\nk8 ./calN50.js megahit/final.contigs.fa\n```\n:::\n\n\n::: {.callout-tip title=\"Question\" appearance=\"simple\"}\n\n**How many contigs were assembled? What is the sum of the lengths of all contigs? 
What is the maximum, the median, and the minimum contig length?**\n\nHint: The maximum contig length is indicated by the label \"N0\", the median by the label \"N50\", and the minimum by the label \"N100\".\n\n:::\n\n::: {.callout-note collapse=\"true\" title=\"Answer\"}\n\nMEGAHIT assembled 3,606 contigs and their lengths sum up to 11.4 Mb. The maximum contig length was 448 kb, the median length was 15.6 kb, and the minimum length was 500 bp.\n\n:::\n\nThere is a final caveat when assembling ancient metagenomic data with MEGAHIT: while it is able to assemble sequencing data with a high percentage of C-to-T substitutions, it tends to introduce these changes into the contig sequences, too.\n\n![_De Bruijn_ graph with a bubble caused by a second allele. Adapted from\n[@Leggett2013, Figure 1a]](assets/images/chapters/denovo-assembly/Leggett2013_Fig1a.png){#fig-denovoassembly-debruijnbubble}\n\nThese C-to-T substitutions are similar to biological single-nucleotide polymorphisms in the sequencing data. Both lead to the introduction of bubbles in the _de Bruijn_ graph when two alleles are present in the _k_-mer sequences (@fig-denovoassembly-debruijnbubble) and the assembler decides during its pruning steps which allele to keep in the contig sequence. \n\nWhile it does not really matter which allele is kept for biological polymorphisms, it does matter for technical artefacts that are introduced by the presence of ancient DNA damage. In our group, we realised that the gene sequences that were annotated on the contigs of MEGAHIT tended to have a higher number of nonsense mutations compared to the known variants of the genes. After manual inspection, we observed that many of these mutations appeared because MEGAHIT chose the damage-derived T allele over the C allele or the damage-derived A allele over a G allele [see @Klapper2023, Figure S1].\n\nTo overcome this issue, my colleagues Maxime Borry, James Fellows Yates and I developed a strategy to replace these damage-derived alleles in the contig sequences. This approach consists of aligning the short-read data against the contigs, performing genotyping along them, and finally replacing all alleles for which we have strong support for an allele that is different from the one selected by MEGAHIT.\n\nWe standardised this approach and added it to the Nextflow pipeline nf-core/mag [@Krakau2022]. It can simply be activated by providing the parameter `--ancient_dna`. \n\n::: {.callout-caution}\nWhile MEGAHIT is able to assemble ancient metagenomic sequencing data with high amounts of ancient DNA damage, it tends to introduce damage-derived T and A alleles into the contig sequences instead of the true C and G alleles. This can lead to a higher number of nonsense mutations in coding sequences. We strongly advise you to correct such mutations, e.g. by using the ancient DNA workflow of the Nextflow pipeline [nf-core/mag](https://nf-co.re/mag).\n:::\n\n### Aligning the short-read data against the contigs\n\nAfter the assembly, the next essential step that is required for many subsequent analyses is the alignment of the short-read sequencing data back to the assembled contigs.\n\nAnalyses that require this alignment information include:\n\n - the correction of the contig sequences to remove damage-derived alleles\n - the non-reference binning of contigs into MAGs, which uses the coverage along the contigs\n - the quantification of the presence of ancient DNA damage\n\nAligning the short-read data to the contigs requires multiple steps:\n\n 1. 
Build an index from the contigs for the alignment program BowTie2\n 2. Align the short-read data against the contigs using this index with BowTie2\n 3. Calculate the mismatches and deletions of the short-read data against the contig sequences\n 4. Sort the aligned reads by the contig name and the coordinate they were aligned to\n 5. Index the resulting alignment file for faster access\n\nTo execute the alignment steps, we can run the following commands:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nmkdir alignment\nbowtie2-build -f megahit/final.contigs.fa alignment/2612\nbowtie2 -p 14 --very-sensitive -N 1 -x alignment/2612 \\\n -1 2612_R1.fastq.gz -2 2612_R2.fastq.gz | \\\nsamtools view -Sb - | \\\nsamtools calmd -u /dev/stdin megahit/final.contigs.fa | \\\nsamtools sort -o alignment/2612.sorted.calmd.bam -\nsamtools index alignment/2612.sorted.calmd.bam\n```\n:::\n\n\nHowever, these steps are rather time-consuming, even with as little sequencing data as we have for our course example. The alignment is rather slow because we allow a single mismatch in the seeds that are used by the aligner BowTie2 to quickly determine the position of a read along the contig sequences (parameter `-N 1`). This is necessary because otherwise we might not be able to align reads with ancient DNA damage present on them. Secondly, the larger the resulting alignment file, the longer it takes to sort it by coordinate.\n\nTo save us some time and continue with the more interesting analyses, I prepared the resulting files for us. For this, I also corrected damage-derived alleles in the contig sequences. You can access these files on the cluster by running the following commands:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nmkdir alignment\nln -s /vol/volume/denovo-assembly/2612.sorted.calmd.bam alignment/\nln -s /vol/volume/denovo-assembly/2612.sorted.calmd.bam.bai alignment/\nln -s /vol/volume/denovo-assembly/2612.fa alignment/\n```\n:::\n\n\n\n:::{.callout-note title=\"Self-guided: data preparation\" collapse=\"true\"}\nFor everyone who runs this practical on their own infrastructure, you can download the files:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nmkdir alignment\nwget -O alignment/2612.sorted.calmd.bam \\\n https://share.eva.mpg.de/index.php/s/bDKgFLj9GpRFdPg/download/2612.sorted.calmd.bam\nwget -O alignment/2612.sorted.calmd.bam.bai \\\n https://share.eva.mpg.de/index.php/s/HWqg6fJj6ZEEBAL/download/2612.sorted.calmd.bam.bai\nwget -O alignment/2612.fa \\\n https://share.eva.mpg.de/index.php/s/z6ZAai42RPribX5/download/final.contigs.fa\n```\n:::\n\n:::\n\n\n### Reconstructing metagenome-assembled genomes\n\nThere are typically two major approaches to studying the biological diversity of samples using the results obtained from the _de novo_ assembly. The first one is to reconstruct metagenome-assembled genomes (MAGs) and to study the species diversity.\n\n\"Metagenome-assembled genome\" is a convoluted term that means that we reconstructed a genome from metagenomic data via _de novo_ assembly. While these reconstructed genomes, particularly the high-quality ones, are most likely a good representation of the genomes of the organisms present in the sample, the scientific community refrains from calling them a species or a strain. The reason is that calling a genome a species or a strain would require additional analyses, many of which would involve the cultivation of the organism. 
For many samples, this is not feasible and therefore the community stuck to the term MAG instead.\n\nThe most commonly applied method to obtain MAGs is the so-called \"non-reference binning\". Non-reference binning means that we do not try to identify contigs by aligning them against known reference genomes, but only use the characteristics of the contigs themselves to cluster them (@fig-denovoassembly-binning). \n\n![Scheme of binning _de novo_ assembled contigs into metagenome-assembled genomes. During the binning contigs are grouped into clusters based on their characteristics, such as tetra-nucleotide frequency and the coverage along the contigs. Clusters of contigs that fulfil a minimum of quality criteria are then considered as metagenome-assembled genomes. However, depending on the sample, a number of contigs will remain unbinned.](assets/images/chapters/denovo-assembly/mag_binning_scheme.png){#fig-denovoassembly-binning}\n\nThe two most commonly used characteristics are:\n\n - the tetra-nucleotide frequency: the frequency of all 4-mers (e.g. AAAA, AAAC, AAAG etc.) in the contig sequence\n - the coverage along the contig\n\nThe idea here is that two contigs that are derived from the same bacterial genome will likely have a similar nucleotide composition and coverage.\n\nThis approach works very well when the contigs are longer and they strongly differ in the nucleotide composition and the coverage from each other. However, if this is not the case, e.g. there is more than one strain of the same species in the sample, these approaches will likely not be able to assign some contigs to clusters: these contigs remain unbinned. In case there is a high number of unbinned contigs, one can also employ the more sensitive reference-based binning strategies but we will not cover this in this practical course.\n\nThere are a number of different binning tools out there that can be used for this task. Since this number is constantly growing, there have been attempts to standardise the test data sets that these tools are run on so that their performances can be easily compared. The most well-known attempt is the Critical Assessment of Metagenome Interpretation [Critical Assessment of Metagenome Interpretation II challenge, @Meyer2022], which released the latest comparison of commonly used tools for assembly, binning, and bin refinement in 2022. I recommend that you first check the performance of a new tool against the CAMI datasets before testing it to see whether it is worth using it.\n\nThree commonly used binners are:\n\n - metaBAT2 [@Kang2019, more than 1,300 citations]\n - MaxBin2 [@WuMaxbin2016, more than 1,200 citations]\n - CONCOCT [@Alneberg2014, more than 1,000 citation]\n\nEach of these three binners employs a slightly different strategy. While metaBAT2 simply uses the two previously mentioned metrics, the tetra-nucleotide frequency and the coverage along the contigs, MaxBin2 additionally uses single-copy marker genes to infer the taxonomic origin of contigs. In contrast, CONCOCT also just uses the two aforementioned metrics but first performs a Principal Component Analysis (PCA) on these metrics and uses the PCA results for the clustering.\n\nThe easiest way to run all these three programs is the program metaWRAP [@Uritskiy2018]. metaWRAP is in fact a pipeline that allows you to assemble your contigs, bin them, and subsequently refine the resulting MAGs. However, the pipeline is not very well written and does not contain any strategies to deal with ancient metagenomic sequencing data. 
Therefore, I prefer to use different pipelines, such as nf-core/mag for the assembly, and only use metaWRAP for binning and bin refinement.\n\nTo skip the first steps of metaWRAP and start straight with the binning, we need to create the folder structure and files that metaWRAP expects:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nmkdir -p metawrap/INITIAL_BINNING/2612/work_files\nln -s $PWD/alignment/2612.sorted.calmd.bam \\\n metawrap/INITIAL_BINNING/2612/work_files/2612.bam\nmkdir -p metawrap/faux_reads\necho \"@\" > metawrap/faux_reads/2612_1.fastq\necho \"@\" > metawrap/faux_reads/2612_2.fastq\n```\n:::\n\n\nNow, we can start to run the binning. In this practical course, we will focus on metaBAT2 and MaxBin2. To bin the contigs with these binners, we execute:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nconda activate metawrap-env\nmetawrap binning -o metawrap/INITIAL_BINNING/2612 \\\n -t 14 \\\n -a alignment/2612.fa \\\n --metabat2 --maxbin2 --universal \\\n metawrap/faux_reads/2612_1.fastq metawrap/faux_reads/2612_2.fastq\nconda deactivate\n```\n:::\n\n\n::: {.callout-caution}\nMetaWRAP is not a well-written Python software and has not been update for more than three years. It\nstill relies on the deprecated Python v2.7. This is in conflict with many other tools and therefore\nit requires its own conda environment, `metawrap-env`. Do not forget to deactivate this environment\nafterwards again!\n:::\n\nMetaWRAP will run metaBAT2 and MaxBin2 for us and store their respective output into sub-folders in\nthe folder `metawrap/INITIAL_BINNING/2612`.\n\n::: {.callout-tip title=\"Question\" appearence=\"simple\"}\n\n**How many bins did metaBAT2 and MaxBin2 reconstruct, respectively? Is there a difference in the genome sizes of these reconstructed bins?**\n\nHint: You can use the previously introduced script `k8 ./calN50.js` to analyse the genome size of the individual bins.\n:::\n\n::: {.callout-note collapse=\"true\" title=\"Answer\"}\n\nmetaBAT2 reconstructed seven bins, while MaxBin2 reconstructed only five bins.\n\nWhen comparing the genome sizes of these bins, we can see that despite having reconstructed fewer bins, MaxBin2's bins have on average larger genome size and all of them are at least 1.5 Mb. In contrast, five out of seven metaBAT2 bins are shorter than 1.5 Mb.\n\n:::\n\nWhile we could have run these two binning softwares manually ourselves, there is another reason why we should use metaWRAP: it has a powerful bin refinement algorithm.\n\nAs we just observed, binning tools come to different results when performing non-reference binning on the same contigs. So how do we know which binning tool delivered the better or even correct results?\n\nA standard approach is to identify single-copy marker genes that are specific for certain taxonomic lineages, e.g. to all members of the family _Prevotellaceae_ or to all members of the kingdom archaea. If we find lineage-specific marker genes from more than one lineage in our bin, something likely went wrong. While in certain cases horizontal gene transfer could explain such a finding, it is much more common that a binning tool clustered two contigs from two different taxons.\n\nDuring its bin refinement, metaWRAP first combines the results of all binning tools in all combinations. So it would merge the results of metaBAT2 and MaxBin2, metaBAT2 and CONCOCT, MaxBin2 and CONCOCT, and all three together. 
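\n\nAs a small illustration of this combinatorial step (not metaWRAP's actual code, just the enumeration logic, with the three binner names used as placeholders):\n\n```python\nfrom itertools import combinations\n\nbinners = ['metaBAT2', 'MaxBin2', 'CONCOCT']\n\n# every combination of two or more binners yields one additional hybrid bin set to evaluate\nhybrid_sets = [\n    combo\n    for r in range(2, len(binners) + 1)\n    for combo in combinations(binners, r)\n]\nprint(hybrid_sets)\n# [('metaBAT2', 'MaxBin2'), ('metaBAT2', 'CONCOCT'), ('MaxBin2', 'CONCOCT'),\n#  ('metaBAT2', 'MaxBin2', 'CONCOCT')]\n```\n\n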
Afterwards, it evaluates the presence of lineage-specific marker genes on the contigs of the bins for every combination and the individual binning tools themselves. In the case that it would find marker genes of more than one lineage in a bin, it would split the bin into two. After having evaluated everything, metaWRAP selects the refined bins that have the highest quality score across the whole set of bins.\n\n![Performance of metaWRAP's bin refinement algorithm compared to other tools. Adapted from @Uritskiy2018, Fig. 4](assets/images/chapters/denovo-assembly/MetaWrap_Fig4.png){#fig-denovoassembly-metawrap}\n\nUsing this approach, the authors of metaWRAP could show that they can outperform the individual binning tools and other bin refinement algorithms both regarding the **completeness** and **contamination** that was estimated for the MAGs (@fig-denovoassembly-metawrap).\n\n::: {.callout-caution}\nRunning metaWRAP's bin refinement module requires about 72 GB of memory because it has to load a larger reference database containing the lineage-specific marker genes of checkM.\n\nIf your computer infrastructure cannot provide so much memory, I have prepared the results of metaWRAP's bin refinement algorithm, `metawrap_50_10_bins.stats`, which can be found in the folder `/vol/volume/denovo-assembly`.\n:::\n\nTo apply metaWRAP's bin refinement to the bins that we obtained from metaBAT2 and MaxBin2, we first need to install the software checkM [@Parks2015] that will provide the lineage-specific marker gene catalogue:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nmkdir checkM\nwget -O checkM/checkm_data_2015_01_16.tar.gz \\\n https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz\ntar xvf checkM/checkm_data_2015_01_16.tar.gz -C checkM\n\necho checkM | checkm data setRoot checkM\n```\n:::\n\n\nAfterwards, we can execute metaWRAP's bin refinement module:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nmkdir -p metawrap/BIN_REFINEMENT/2612\nmetawrap bin_refinement -o metawrap/BIN_REFINEMENT/2612 \\\n -t 14 \\\n -c 50 \\\n -x 10 \\\n -A metawrap/INITIAL_BINNING/2612/maxbin2_bins \\\n -B metawrap/INITIAL_BINNING/2612/metabat2_bins\n```\n:::\n\n\nThe latter step will produce a summary file, `metawrap_50_10_bins.stats`, that lists all retained bins and some key characteristics, such as the genome size, the completeness estimate, and the contamination estimate. 
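\n\nAs a rough sketch of how one could screen such a summary table with the same thresholds we passed to metaWRAP (`-c 50 -x 10`), assuming tab-separated columns named `bin`, `completeness`, and `contamination` (the real column names may differ, so check the header first):\n\n```python\nimport csv\n\n# keep only bins with >= 50% completeness and <= 10% contamination\nwith open('metawrap_50_10_bins.stats') as handle:\n    for row in csv.DictReader(handle, delimiter='\\t'):\n        completeness = float(row['completeness'])\n        contamination = float(row['contamination'])\n        if completeness >= 50 and contamination <= 10:\n            print(row['bin'], completeness, contamination)\n```\n\n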
The latter two can be used to assign a quality score according to the Minimum Information for MAG (MIMAG; see info box).\n\n::: {.callout-note title=\"The Minimum Information for MAG (MIMAG)\"}\n\nThe two most common metrics to evaluate the quality of MAGs are:\n\n - the **completeness**: how many of the expected lineage-specific single-copy marker genes were present in the MAG?\n - the **contamination**: how many of the expected lineage-specific single-copy marker genes were present more than once in the MAG?\n\nThese metric is usually calculated using the marker-gene catalogue of checkM [@Parks2015], also if there are other estimates from other tools such as BUSCO [@Manni2021], GUNC [@Orakov2021] or checkM2 [@Chklovski2022].\n\nDepending on the estimates on completeness and contamination plus the presence of RNA genes, MAGs are assigned to the quality category following the Minimum Information for MAG criteria [@Bowers2017] You can find the overview [here](https://www.nature.com/articles/nbt.3893/tables/1).\n\n:::\n\nAs these two steps will run rather long and need a large amount of memory and disk space, I have provided the results of metaWRAP's bin refinement. You can find the file here: `/vol/volume/denovo-assembly/metawrap_50_10_bins.stats`. Be aware that these results are based on the bin refinement of the results of three binning tools and include CONCOCT.\n\n::: {.callout-tip title=\"Question\" appearence=\"simple\"}\n\n**How many bins were retained after the refinement with metaWRAP? How many high-quality and medium-quality MAGs did the refinement yield following the MIMAG criteria?**\n\nHint: You can more easily visualise tables on the terminal using the Python program `visidata`. After installing it with `pip install visidata`, you can open a table using `vd -f tsv /vol/volume/denovo-assembly/metawrap_50_10_bins.stats`. Next to separating the columns nicely, it allows you to perform a lot of operations like sorting conveniently. Check the cheat sheet [here](https://jsvine.github.io/visidata-cheat-sheet/en/).\n\n:::\n\n::: {.callout-note collapse=\"true\" title=\"Answer\"}\n\nIn total, metaWRAP retained five bins, similarly to MaxBin2. Of these five bins, the bins `bin.3` and `bin.4` had completeness and contamination estimates that would qualify them for being high-quality MAGs. However, we would need to check the presence of rRNA and tRNA genes. The other three bins are medium-quality bins because their completeness estimate was < 90%.\n\n:::\n\n### Taxonomic assignment of contigs\n\nWhat should we do when we simply want to know to which taxon a certain contig most likely belongs to?\n\nReconstructing metagenome-assembled genomes requires multiple steps and might not even provide the answer in the case that the contig of interest is not binned into a MAG. Instead, it is sufficient to perform a sequence search against a reference database. \n\nThere are plenty of tools available for this task, such as:\n\n - BLAST/DIAMOND\n - Kraken2\n - Centrifuge\n - MMSeqs2\n\nFor each tool, we can either use pre-computed reference databases or compute our own one. The two taxonomic classification systems that are most commonly used are:\n\n - NCBI Taxonomy\n - GTDB\n\nAs for any task that involves the alignment of sequences against a reference database, the chosen reference database should fit the sequences you are searching for. If your reference database does not capture the diversity of your samples, you will not be able to assign a subset of the contigs. 
There is also a trade-off between a large reference database that contains all sequences and its memory requirement. @Wright2023 elaborated on this quite extensively when comparing Kraken2 against MetaPhlAn.\n\nWhile all of these tools can do the job, I typically prefer to use the program MMSeqs2 [@Steinegger2017] because it comes with a very fast algorithm based on amino acid sequence alignment and implements a lowest common ancestor (LCA) algorithm (@fig-denovoassembly-mmseqs2). Recently, a _taxonomy_ workflow [@Mirdita2021] was added that allows contigs to be assigned efficiently to taxa. Luckily, it comes with multiple pre-computed reference databases, such as the GTDB v207 reference database [@Parks2020], and therefore it is even more accessible for users.\n\n![Scheme of the _taxonomy_ workflow implemented into MMSeqs2. Adapted from @Mirdita2021, Fig. 1.](assets/images/chapters/denovo-assembly/MMSeqs2_classify_Fig1.jpeg){#fig-denovoassembly-mmseqs2}\n\n::: {.callout-caution}\nThe latest version of the pre-computed GTDB reference database (r207) requires about 105 GB of hard disk space and 700 GB of memory for running.\n\nAs our computer infrastructure does not provide so much memory for every user, I pre-computed the results. You can find the results `2612.mmseqs2_gtdb.tsv` in the folder `/vol/volume/denovo-assembly`.\n\nAn alternative for users with a less powerful infrastructure is the program [KrakenUniq](https://github.com/fbreitwieser/krakenuniq).\n:::\n\nBefore running MMSeqs2's _taxonomy_ workflow against the GTDB reference database, we need to download and set up the database. \n\n\n::: {.cell}\n\n```{.bash .cell-code}\nmkdir -p refdbs/mmseqs2/gtdb\nmmseqs databases GTDB \\\n refdbs/mmseqs2/gtdb/mmseqs2_gtdb /tmp --threads 14\n```\n:::\n\n\nSubsequently, we can align all the contigs of the sample 2612 against the GTDB r207 with MMSeqs2:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nmkdir mmseqs2\nmmseqs createdb alignment/2612.fa mmseqs2/2612.contigs\nmmseqs taxonomy mmseqs2/2612.contigs \\\n refdbs/mmseqs2/gtdb/mmseqs2_gtdb \\\n mmseqs2/2612.mmseqs2_gtdb \\\n /tmp \\\n -a --tax-lineage 1 \\\n --lca-ranks kingdom,phylum,class,order,family,genus,species \\\n --orf-filter 1 \\\n --remove-tmp-files 1 \\\n --threads 14\nmmseqs createtsv mmseqs2/2612.contigs \\\n mmseqs2/2612.mmseqs2_gtdb \\\n mmseqs2/2612.mmseqs2_gtdb.tsv\n```\n:::\n\n\n::: {.callout-tip title=\"Question\" appearance=\"simple\"}\n\n**What is the proportion of contigs that could be assigned to the different taxonomic ranks, such as species or genus? What are the dominant taxa?**\n\nHint: You can access this information easily by opening the file using visidata: `vd 2612.mmseqs2_gtdb.tsv` \n\n:::\n\n::: {.callout-note collapse=\"true\" title=\"Answer\"}\n\nFrom the 3,606 contigs, MMSeqs2's _taxonomy_ workflow assigned 3,523 contigs to a taxon. For the rest, there was not enough information and they were discarded.\n\nFrom the 3,523 assigned contigs, 2,013 were assigned to the rank \"species\", while 1,137 could only be assigned to the rank \"genus\".\n\nMost contigs were assigned to the archaeal species _Halococcus morrhuae_ (n=386), followed by the bacterial species _Olsenella E sp003150175_ (n=298) and _Collinsella sp900768795_ (n=186).\n:::\n\n### Taxonomic assignment of MAGs\n\nMMSeqs2's _taxonomy_ workflow is very useful for classifying all contigs taxonomically. 
However, how would we determine which species we reconstructed by binning our contigs?\n\nThe simplest approach would be that we could summarise MMSeqs2's taxonomic assignments of all contigs of a MAG and then determine which lineage is the most frequent one. Although this would work in general, there is another approach that is more sophisticated: GTDB toolkit [GTDBTK, @Chaumeil2020].\n\nGTDBTK performs three steps to assign a MAG to a taxonomic lineage:\n\n1. **Lineage identification** based on single-copy marker genes using Hidden Markov Models (HMMs)\n2. **Multi-sequence alignment** of the identified marker genes\n3. Placement of the MAG genome into a **fixed reference tree** at class level\n\nThe last step is particularly clever. Based on the known diversity of a lineage present in the GTDB, it will construct a reference tree with all known taxa of this lineage. Afterwards, the tree structure is fixed and an algorithm attempts to create a new branch in the tree for placing the unknown MAG based on both the tree structure and the multi-sequence alignment. \n\nIn most cases, both the simple approach of taking the majority of the MMSeqs2's _taxonomy_ results and the GTDBTK approach lead to very similar results. However, GTDBTK performs better when determining whether a new MAG potentially belongs to a new genus or even a new family.\n\nTo infer which taxa our five reconstructed MAGs represent, we can run the GTDBTK.\n\n::: {.callout-caution}\nThe latest version of the database used by GTDBTK (r207) requires about 70 GB of harddisk space and 80 GB of memory for running.\n\nAs our computer infrastructure does not provide so much memory for every user, I pre-computed the results. You can find the results `2612.gtdbtk_archaea.tsv` and `2612.gtdbtk_bacteria.tsv` in the folder `/vol/volume/denovo-assembly`.\n:::\n\nFirst, we need to install the GTDB database:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nmkdir -p refdbs/gtdbtk\nwget -O refdbs/gtdbtk/gtdbtk_v2_data.tar.gz \\\n https://data.gtdb.ecogenomic.org/releases/latest/auxillary_files/gtdbtk_v2_data.tar.gz\ntar xvf refdbs/gtdbtk/gtdbtk_v2_data.tar.gz -C refdbs/gtdbtk\n```\n:::\n\n\nAfterwards, we can run GTDBTK's _classify_ workflow:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\nmkdir gtdbtk\nGTDBTK_DATA_PATH=\"$PWD/refdbs/gtdbtk/gtdbtk_r207_v2\" \\\ngtdbtk classify_wf --cpu 14 --extension fa \\\n --genome_dir metawrap/BIN_REFINEMENT/2612/metawrap_50_10_bins \\\n --out_dir gtdbtk\n```\n:::\n\n\n::: {.callout-tip title=\"Question\" appearence=\"simple\"}\n\n**Do the classifications obtained by GTDBTK match the classifications that were assigned to the contigs using MMSeqs2? Would you expect these taxa given the archaeological context of the samples?**\n\nHint: You can access the classification results of GTDBTK easily by opening the file using visidata: `vd 2612.gtdbtk_archaea.tsv` and `vd 2612.gtdbtk_bacteria.tsv`\n\n:::\n\n::: {.callout-note collapse=\"true\" title=\"Answer\"}\n\nThe five MAGs reconstructed from the sample 2612 were assigned to the taxa:\n\n - `bin.1`: _Agathobacter rectalis_\n - `bin.2`: _Halococcus morrhuae_\n - `bin.3`: _Methanobrevibacter smithii_\n - `bin.4`: _Ruminococcus bromii_\n - `bin.5`: _Bifidobacterium longum_\n\nAll of these species were among the most frequent lineages that were identified by MMSeqs2's _taxonomy_ workflow highlighting the large overlap between the methods.\n\nWe would expect all five species to be present in our sample. 
All MAGs but `bin.2` were assigned to human gut microbiome commensal species that are typically found in healthy humans. The MAG `bin.2` was assigned to a halophilic archaeal species, which is typically found in salt mines.\n\n:::\n\n### Evaluating the amount of ancient DNA damage\n\nOne of the common questions that remain at this point of our analysis is whether the contigs that we assembled show evidence for the presence of ancient DNA damage. If yes, we could argue that these microbes are indeed ancient, particularly when their DNA fragment length distribution is rather short, too. \n\nMAGs typically consist of either tens or hundreds of contigs. For this case, many of the commonly used tools for quantifying the amount of ancient DNA damage, such as damageprofiler [@Neukamm2021] or mapDamage2 [@Jonsson2013], are not very well suited because they would require us to manually check each of their generated \"smiley plots\", which visualise the amount of C-to-T substitutions at the end of reads.\n\n![Overview of the model comparison performed by pyDamage. The green line represents the null model, i.e. the absence of ancient DNA damage, while the orange line represents the alternative model, i.e. the presence of ancient DNA damage.](assets/images/chapters/denovo-assembly/Pydamage_NZ_JHCB02000011.1.png){#fig-denovoassembly-pydamage}\n\nInstead, we will use the program pyDamage [@Borry2021] that was written with the particular use-case of metagenome-assembled contigs in mind. Although pyDamage can visualise the amount of C-to-T substitutions at the 5' end of reads, it goes a step further and fits two models upon the substitution frequency (@fig-denovoassembly-pydamage). The null hypothesis is that the observed distribution of C-to-T substitutions at the 5' end of reads reflects a flat line, i.e. a case when no ancient DNA damage is present. The alternative model assumes that the distribution resembles an exponential decay, i.e. a case when ancient DNA damage is present. By comparing the fit of these two models to the observed data for each contig, pyDamage can quickly identify contigs that are likely of ancient origin without requiring the user to inspect the plots visually.\n\nWe can run pyDamage directly on the alignment data in BAM format:\n\n\n::: {.cell}\n\n```{.bash .cell-code}\npydamage analyze -w 30 -p 14 alignment/2612.sorted.calmd.bam\n```\n:::\n\n\n::: {.callout-tip title=\"Question\" appearence=\"simple\"}\n\n**Evaluate the pyDamage results with respect to the amount of C-to-T substitutions observed on the\ncontigs! How many contigs does pyDamage consider to show evidence for ancient DNA damage? How much\npower (prediction accuracy) does it have for this decision? Which MAGs are strongly \"ancient\" or\n\"modern\"?**\n\nHint: You can access pyDamage's results easily by opening the file using visidata: `vd pydamage_results/pydamage_results.tsv`\n\n:::\n\n::: {.callout-note collapse=\"true\" title=\"Answer\"}\n\nFrom the 3,606 contigs, pyDamage inferred a q-value, i.e. a p-value corrected for multiple testing, of < 0.5 to 26 contigs. This is partially due to a high fraction of the contigs with no sufficient information to discriminate between the models (predicted accuracy of < 0.5). However, the majority of the contigs with a prediction accuracy of > 0.5 still had q-values of 0.05 and higher. This suggests that overall the sample did not show large evidence of ancient DNA damage.\n\nThis reflects also on the MAGs. 
Although four of the five MAGs were human gut microbiome taxa, they did not show strong evidence of ancient DNA damage. This suggests that the sample is too young and is well preserved.\n:::\n\n### Annotating genomes for function\n\nThe second approach on how to study biological diversity of samples using the assembly results is to compare the reconstructed genes and their functions with each other.\n\nWhile it is very interesting to reconstruct the metagenome-assembled genomes from contigs and speculate about the evolution of a certain taxon, this does not always help when studying the ecology of a microbiome. Studying the ecology of a microbiome means to understand which taxa are present in an environment, what type of community do they form, what kind of natural products do they produce etc.?\n\nWith the lack of any data from culture isolates, it is rather difficult to discriminate from which species a reconstructed gene sequence is coming from, particularly when the contigs are short. Many microbial species exchange genes among each other via horizontal gene transfer which leads to multiple copies of a gene to be present in our metagenome and increases the level of difficulty further. \n\nBecause of this, many researchers tend to annotate all genes of a MAGs and compare the presence and absence of these genes across other genomes that were taxonomically assigned to the same taxon. This analysis, called pan-genome analysis, allows us to estimate the diversity of a microbial species with respect to their repertoire of protein-coding genes.\n\nOne of the most commonly used annotation tools for MAGs is Prokka [@Seemann2014], although it has recently been challenged by Bakta [@Schwengers2021]. The latter provides the same functionality as Prokka but incorporates more up-to-date reference databases for the annotation. Therefore, the scientific community is slowly shifting to Bakta.\n\nNext to returning information on the protein-coding genes, Prokka also returns the annotation of RNA genes (tRNAs and rRNAs), which will help us to evaluate the quality of MAGs regarding the MIMAG criteria.\n\nFor this practical course, we will use Prokka and we will focus on annotating the MAG `bin.3.fa` that we reconstructed from the sample 2612. \n\n\n::: {.cell}\n\n```{.bash .cell-code}\nprokka --outdir prokka \\\n --prefix 2612_003 \\\n --compliant --metagenome --cpus 14 \\\n metawrap_50_10_bins/bin.3.fa\n```\n:::\n\n\n::: {.callout-tip title=\"Question\" appearence=\"simple\"}\n\n**Prokka has annotated our MAG. What type of files does Prokka return? How many genes/tRNAs/rRNAs were detected?**\n\nHint: Check the file `prokka/2612_003.txt` for the number of annotated elements.\n\n:::\n\n::: {.callout-note collapse=\"true\" title=\"Answer\"}\n\nProkka returns the following files:\n\n - `.faa`: the amino acid sequences of all identified coding sequences \n - `.ffn`: the nucleotide sequences of all identified coding sequences \n - `.fna`: all contigs in nucleotide sequence format renamed following Prokka's naming scheme\n - `.gbk`: all annotations and sequences in GenBank format\n - `.gff`: all annotations and sequences in GFF format\n - `.tsv`: the tabular overview of all annotations\n - `.txt`: the short summary of the annotation results\n\nProkka found 1,797 coding sequences, 32 tRNAs, but no rRNAs. Finding no rRNAs is a common issues when trying to assemble MAGs without long-read sequencing data and is not just characteristic for ancient DNA samples. 
However, this means that we cannot technically call this MAG a high-quality MAG due to the lack of the rRNA genes. \n\n:::\n\n### Summary\n\nIn this practical course you have gone through all the important steps that are necessary for _de novo_ assembling ancient metagenomic sequencing data to obtain contiguous DNA sequences with little error. Furthermore, you have learned how to cluster these sequences into bins without using any references and how to refine them based on lineage-specific marker genes. For these refined bins, you have evaluated their quality regarding common standards set by the scientific community and assigned the MAGs to its most likely taxon. Finally, we learned how to infer the presence of ancient DNA damage and annotate them for RNA genes and protein-coding sequences.\n\n::: {.callout-caution title=\"Cautionary note - sequencing depth\"}\n\nBe aware of the sequencing depth when you assemble your sample. This sample used in this practical course was not obtained by randomly subsampling but I subsampled the sample so that we are able to reconstruct MAGs.\n\nThe original sample had almost 200 million reads, however, I subsampled it to less than 5 million reads. You usually need a lot of sequencing data for _de novo_ assembly and definitely more data than for reference-alignment based profiling. However, it also heavily depends on the complexity of the sample. So the best advice is: **just give it a try**!\n\n:::\n", "supporting": [ "denovo-assembly_files" ], diff --git a/denovo-assembly.qmd b/denovo-assembly.qmd index 55668b41..541b6df6 100644 --- a/denovo-assembly.qmd +++ b/denovo-assembly.qmd @@ -92,8 +92,6 @@ For everyone, who runs this practical on their own infrastructure, you can downl ```{bash, eval = F} wget https://share.eva.mpg.de/index.php/s/CtLq2R9iqEcAFyg/download/2612_R1.fastq.gz wget https://share.eva.mpg.de/index.php/s/mc5JrpDWdL4rC24/download/2612_R2.fastq.gz - - ``` :::