diff --git a/_quarto.yml b/_quarto.yml index fc7f51b6..8e7622ef 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -12,7 +12,7 @@ book: cover-image: "assets/images/cover.png" page-footer: "© 2023 SPAAM Community & Authors with ❤️. Available under [CC-BY 4.0](http://creativecommons.org/licenses/by-sa/4.0/). Source material [here](https://github.com/SPAAM-community/intro-to-ancient-metagenomics-book)." page-navigation: true - downloads: [pdf, epub] + #downloads: [pdf, epub] site-url: https://spaam-community.github.org/intro-to-ancient-metagenomics-book favicon: favicon.png open-graph: true @@ -50,7 +50,6 @@ book: chapters: - accessing-ancientmetagenomic-data.qmd - ancient-metagenomic-pipelines.qmd - - summary.qmd - part: appendices.qmd chapters: - resources.qmd diff --git a/accessing-ancientmetagenomic-data.qmd b/accessing-ancientmetagenomic-data.qmd index a946d385..939108f8 100644 --- a/accessing-ancientmetagenomic-data.qmd +++ b/accessing-ancientmetagenomic-data.qmd @@ -8,23 +8,24 @@ bibliography: assets/references/chapters/accessing-ancientmetagenomic-data/refer --- ::: {.callout-tip} -For this chapter's exercises, if not already performed, you will need to create the [conda environment](before-you-start.qmd#creating-a-conda-environment) from the `yml` file in the following [link](https://github.com/SPAAM-community/intro-to-ancient-metagenomics-book/raw/main/assets/envs/accessing-ancientmetagenomic-data.yml) (right click and save as to download), and once created, activate the environment with: +For this chapter's exercises, if not already performed, you will need to download the chapter's dataset, decompress the archive, and create and activate the conda environment. 
+ +To do this, use `wget` or right click and save as to download this Zenodo archive: [10.5281/zenodo.8413229](https://doi.org/10.5281/zenodo.8413229), and unpack it: ```bash -conda activate accessing-ancientmetagenomic-data +tar xvf accessing-ancientmetagenomic-data.tar.gz +cd accessing-ancientmetagenomic-data/ ``` -To download the data for this chapter, please download following archive with, extract the tar, and change into the directory. - -For example +You can then create and subsequently activate the environment with: ```bash -wget -O accessing-metagenomic-data.tar.gz https://www.dropbox.com/scl/fi/6yc24aqmjklppw59b03kp/accessing-ancientmetagenomic-data.tar.gz?rlkey=rly8e3xv28p5v0c2aixicdf21&dl=0 -P //// -tar -xzf accessing-metagenomic-data.tar.gz -cd accessing-metagenomic-data +conda env create -f accessing-ancientmetagenomic-data.yml +conda activate accessing-ancientmetagenomic-data ``` ::: + ## Introduction In most bioinformatic projects, we need to include publicly available comparative data to expand or compare our newly generated data with. diff --git a/ancient-metagenomic-pipelines.qmd b/ancient-metagenomic-pipelines.qmd index 1243baa9..e4bbb4a6 100644 --- a/ancient-metagenomic-pipelines.qmd +++ b/ancient-metagenomic-pipelines.qmd @@ -6,20 +6,20 @@ csl: american-journal-of-physical-anthropology.csl --- ::: {.callout-tip} -For this chapter's exercises, if not already performed, you will need to create the [conda environment](before-you-start.qmd#creating-a-conda-environment) from the `yml` file in the following [link](https://github.com/SPAAM-community/intro-to-ancient-metagenomics-book/raw/main/assets/envs/ancient-metagenomic-pipelines.yml) (right click and save as to download), and once created, activate the environment with: +For this chapter's exercises, if not already performed, you will need to download the chapter's dataset, decompress the archive, and create and activate the conda environment. 
+ +To do this, use `wget` or right click and save as to download this Zenodo archive: [10.5281/zenodo.8413239](https://doi.org/10.5281/zenodo.8413239), and unpack it: ```bash -conda activate ancient-metagenomic-pipelines +tar xvf ancient-metagenomic-pipelines.tar.gz +cd ancient-metagenomic-pipelines/ ``` -To download the data for this chapter, please download following archive with, extract the tar, and change into the directory. - -For example +You can then create and subsequently activate the environment with: ```bash -wget -O ancient-metagenomic-pipelines.tar.gz https://www.dropbox.com/scl/fi/oxd4ag5haygpihg30grh8/ancient-metagenomic-pipelines.tar.gz?rlkey=p40jtcf6fofs2ak1wfbc1cyg9&dl=0 -P //// -tar -xzf ancient-metagenomic-pipelines.tar.gz -cd ancient-metagenomic-pipelines +conda env create -f ancient-metagenomic-pipelines.yml +conda activate ancient-metagenomic-pipelines ``` ::: @@ -29,7 +29,6 @@ There are additional software requirements for this chapter Please see the relevant chapter section in [Before you start](/before-you-start.qmd) before continuing with this chapter. ::: - ## Introduction A **pipeline** is a series of linked computational steps, where the output of one process becomes the input of the next. Pipelines are critical for managing the huge quantities of data that are now being generated regularly as part of ancient DNA analyses. In this chapter we will go through three dedicated ancient DNA pipelines - all with some (or all!) functionality geared to ancient metagenomics - to show you how you can speed up the more routine aspects of the basic analyses we've learnt about earlier in this text book through workflow automation. 
diff --git a/authentication-decontamination.qmd b/authentication-decontamination.qmd index 3ff096df..56018a67 100644 --- a/authentication-decontamination.qmd +++ b/authentication-decontamination.qmd @@ -5,20 +5,20 @@ bibliography: assets/references/chapters/authentication-decontamination/referenc --- ::: {.callout-tip} -For this chapter's exercises, if not already performed, you will need to create the [conda environment](before-you-start.qmd#creating-a-conda-environment) from the `yml` file in the following [link](https://github.com/SPAAM-community/intro-to-ancient-metagenomics-book/raw/main/assets/envs/authentication-decontamination.yml) (right click and save as to download), and once created, activate the environment with: +For this chapter's exercises, if not already performed, you will need to download the chapter's dataset, decompress the archive, and create and activate the conda environment. + +To do this, use `wget` or right click and save as to download this Zenodo archive: [10.5281/zenodo.8413184](https://doi.org/10.5281/zenodo.8413184), and unpack it: ```bash -conda activate authentication-decontamination +tar xvf authentication-decontamination.tar.gz +cd authentication-decontamination/ ``` -To download the data for this chapter, please download following archive with, extract the tar, and change into the directory. 
- -For example +You can then create and subsequently activate the environment with: ```bash -wget -P //// -O authentication-decontamination.tar.gz https://www.dropbox.com/scl/fi/hzfd2o4ji5zwx7q97diwy/authentication-decontamination.tar.gz?rlkey=84lv4fccfyxaaptqot4k9o2tz&dl=0 -tar -xzf authentication-decontamination.tar.gz -cd authentication-decontamination +conda env create -f authentication-decontamination.yml +conda activate authentication-decontamination ``` ::: @@ -28,7 +28,7 @@ There are additional software requirements for this chapter Please see the relevant chapter section in [Before you start](/before-you-start.qmd) before continuing with this chapter. ::: -# Introduction +## Introduction In ancient metagenomics we typically try to answer two questions: "Who is there?" and "How ancient?", meaning we would like to detect an organism and investigate whether this organism is ancient. There are three typical ways to identify the presence of an organism in a metagenomic sample: @@ -74,7 +74,7 @@ The chapter has the following outline: - Similarity to expected microbiome source (microbial source tracking) -# Simulated ancient metagenomic data +## Simulated ancient metagenomic data In this chapter, we will use 10 pre-simulated metagenomics with [gargammel](https://academic.oup.com/bioinformatics/article/33/4/577/2608651) ancient metagenomic samples from @Pochon2022-hj. \ @@ -106,11 +106,11 @@ Now, after the basic data pre-processing has been done, we can proceed with vali In here you will see a range of directories, each representing different parts of this tutorial. One set of trimmed 'simulated' reads from @Pochon2022-hj in `rawdata/`. -# Genomic hit confirmation +## Genomic hit confirmation Once an organism has been detected in a sample (via alignment, classification or *de-novo* assembly), one needs to take a closer look at multiple quality metrics in order to reliably confirm that the organism is not a false-positive detection and is of ancient origin. 
The methods used for this purpose can be divided into modern validation and ancient-specific validation criteria. Below, we will cover both of them. -## Modern validation criteria +## Modern genomic hit validation criteria The modern validation methods aim at confirming organism presence regardless of its ancient status. The main approaches include evenness / breadth of coverage computation, assessing alignment quality, and monitoring affinity of the DNA reads to the reference genome of the potential host. @@ -160,8 +160,6 @@ done ::: - - Taxonomic k-mer-based classification of the ancient metagenomic reads can be done via KrakenUniq. However as this requires a very large database file, the results from running KrakenUniq on the 10 simulated genomes can be found in. ```bash @@ -359,11 +357,11 @@ Another important way to detect reads that cross-map between related species is In contrast, a large number of multi-allelic sites indicates that the assigned reads originate from more than one species or strain, which can result in symmetric allele frequency distributions (e.g., if two species or strains are present in equal abundance) (panel g) or asymmetric distributions (e.g., if two species or strains are present in unequal abundance) (panel h). A large number of mis-assigned reads from closely related species can result in a large number of multi-allelic sites with low frequencies of the derived allele (panel i). The situations (g-i) correspond to incorrect assignment of the reads to the reference. Please also check the corresponding "Bad alignments" IGV visualization to the right in the figure above. -## Ancient-specific validation criteria +## Ancient-specific genomic hit validation criteria In contrast to modern genomic hit validation criteria, the ancient-specific validation methods concentrate on DNA degradation and damage pattern as ultimate signs of ancient DNA. 
Below, we will discuss deamination profile, read length distribution and post mortem damage (PMD) scores metrics that provide good confirmation of ancient origin of the detected organism. -### Ancient status +### Degradation patterns Checking evenness of coverage and alignment quality can help us to make sure that the organism we are thinking about is really present in the metagenomic sample. However, we still need to address the question "How ancinet?". For this purpose we need to compute **deamination profile** and **read length distribution** of the aligned reads in order to prove that they demonstrate damage pattern and are sufficiently fragmented, which would be a good evidence of ancient origin of the detected organisms. @@ -437,7 +435,7 @@ pydamage analyze -w 30 -p 14 filtered.sorted.bam ``` ::: -# Microbiome contamination correction +## Microbiome contamination correction Modern contamination can severely bias ancient metagenomic analysis. Also, ancient contamination, i.e. entered *post-mortem*, can potentially lead to false biological interpretations. Therefore, a lot of efforts in the ancient metagenomics field are directed on establishing methodology for identification of contaminants. Among them, the use of negative (blank) control samples is perhaps the most reliable and straightforward method. Additionally, one often performs microbial source tracking for predicting environment (including contamination environment) of origin for ancient metagenomic samples. diff --git a/bare-bones-bash.qmd b/bare-bones-bash.qmd index 6270d3b0..bf82ea66 100644 --- a/bare-bones-bash.qmd +++ b/bare-bones-bash.qmd @@ -4,9 +4,19 @@ author: Thiseas C. 
Lamnidis, Aida Andrades Valtueña --- ::: {.callout-tip} -For this chapter's exercises, if not already performed, you will need to [create the conda environment](before-you-start.qmd#creating-a-conda-environment) from the `yml` file in the following [link](https://github.com/SPAAM-community/intro-to-ancient-metagenomics-book/raw/main/assets/envs/bare-bones-bash.yml) (use `wget`, or right click and save as to download). Once created, activate the environment with: +For this chapter's exercises, if not already performed, you will need to download the chapter's dataset, decompress the archive, and create and activate the conda environment. + +To do this, use `wget` or right click and save as to download this Zenodo archive: [10.5281/zenodo.8412661](https://doi.org/10.5281/zenodo.8412661), and unpack it: + +```bash +tar xvf bare-bones-bash.tar.gz +cd bare-bones-bash/ +``` + +You can then create and subsequently activate the environment with: ```bash +conda env create -f bare-bones-bash.yml conda activate bare-bones-bash ``` ::: @@ -735,7 +745,7 @@ To delete the conda environment conda remove --name bare-bones-bash --all -y ``` -### Conclusion +## Summary You should now know the basics of working on the command line, like: diff --git a/before-you-start.qmd b/before-you-start.qmd index a02efcd2..3474a980 100644 --- a/before-you-start.qmd +++ b/before-you-start.qmd @@ -75,13 +75,7 @@ These instructions have been tested on Ubuntu 22.04, but should apply to most Li Once `conda` is installed and `bioconda` configured, at the beginning of each chapter, to create the `conda` environment from the `yml` file, you will need to run the following: -1. Download the `conda` env file the top of the chapter by right clicking on the link and pressing 'save as', or copy and paste the contents of the file to an empty file in your terminal. - - For example: - - ```bash - wget https:///.yml - ``` +1. 
Download and unpack the `conda` env file at the top of the chapter by right clicking on the link and pressing 'save as'. Once uncompressed, change into the directory. 2. Then you can run the following conda command to install the software into it's dedicated environment @@ -114,7 +108,7 @@ You only have to run the environment creation once! To reuse the environment, just run step 4 and 5 as necessary. ::: {.callout-tip} -To delete a conda software environment, just get the path listed on `conda env list` and delete the folder with `rm -rf `. +To delete a conda software environment, run `conda remove --name --all -y` ::: ## Additional Software {.unnumbered} diff --git a/citing-this-book.qmd b/citing-this-book.qmd index 6845e759..9f5557ed 100644 --- a/citing-this-book.qmd +++ b/citing-this-book.qmd @@ -6,7 +6,7 @@ The source material for this book is located on GitHub: If you wish to cite this book, please use the following bibliographic information -> James A. Fellows Yates, Christina Warinner, Alina Hiß, Arthur Kocher, Clemens Schmid, Irina Velsko, Maxime Borry, Megan Michel, Nikolay Oskolkov, Sebastian Duchene, Thiseas Lamnidis, Aida Andrades Valtueña, Alexander Herbig, & Alexander Hübner. (2023). Introduction to Ancient Metagenomics. In Introduction to Ancient Metagenomics (Version 2022). Zenodo. DOI: [10.5281/zenodo.8027281](https://doi.org/10.5281/zenodo.8027281) +> James A. Fellows Yates, Christina Warinner, Alina Hiß, Arthur Kocher, Clemens Schmid, Irina Velsko, Maxime Borry, Megan Michel, Nikolay Oskolkov, Sebastian Duchene, Thiseas Lamnidis, Aida Andrades Valtueña, Alexander Herbig, Alexander Hübner, Kevin Nota, Robin Warner, & Meriam Guellil. (2023). Introduction to Ancient Metagenomics (Edition 2023). Zenodo. 
DOI: [10.5281/zenodo.8027281](https://doi.org/10.5281/zenodo.8027281) diff --git a/denovo-assembly.qmd b/denovo-assembly.qmd index cd42d8e6..2cbf507d 100644 --- a/denovo-assembly.qmd +++ b/denovo-assembly.qmd @@ -10,20 +10,20 @@ library(pander) ``` ::: {.callout-tip} -For this chapter's exercises, if not already performed, you will need to create the [conda environment](before-you-start.qmd#creating-a-conda-environment) from the `yml` file in the following [link](https://github.com/SPAAM-community/intro-to-ancient-metagenomics-book/raw/main/assets/envs/denovo-assembly.yml) (right click and save as to download), and once created, activate the environment with: +For this chapter's exercises, if not already performed, you will need to download the chapter's dataset, decompress the archive, and create and activate the conda environment. + +To do this, use `wget` or right click and save as to download this Zenodo archive: [10.5281/zenodo.8413147](https://doi.org/10.5281/zenodo.8413147), and unpack it: ```bash -conda activate denovo-assembly +tar xvf denovo-assembly.tar.gz +cd denovo-assembly/ ``` -To download the data for this chapter, please download following archive with, extract the tar, and change into the directory. - -For example +You can then create and subsequently activate the environment with: ```bash -wget -O denovo-assembly.tar.gz https://www.dropbox.com/scl/fi/a2ax5retqrhci51gg8nzq/denovo-assembly.tar.gz?rlkey=8j8u18g710omadudhcwz61072&dl=0 -P //// -tar -xzf denovo-assembly.tar.gz -cd denovo-assembly +conda env create -f denovo-assembly.yml +conda activate denovo-assembly ``` ::: @@ -72,9 +72,7 @@ Around 2015, a technical revolution started when the first programs, e.g. MEGAHI The technical advancement of being able to perform *de novo* assembly on metagenomic samples led to an explosion of studies that analysed samples that were considered almost impossible to study beforehand. 
For researchers that are exposed to ancient DNA, the imminent question arises: can we apply the same methods to ancient DNA data? In this practical course, we will walk through all required steps that are necessary to successfully perform _de novo_ assembly from ancient DNA metagenomic sequencing data and show you what you can do once you have obtained the data. -## Practical course - -### Sample overview +## Sample overview For this practical course, I selected a palaeofaeces sample from the study by @Maixner2021, who generated deep metagenomic sequencing data for four palaeofaeces samples that were excavated from an Austrian salt mine in Hallstatt and were associated with the Celtic Iron Age. We will focus on the youngest sample, **2612**, which was dated to be just a few hundred years old (@fig-denovoassembly-maixner). @@ -113,7 +111,7 @@ to find this out. There are about 3.25 million paired-end sequences in these files. ::: -### Preparing the sequencing data for _de novo_ assembly +## Preparing the sequencing data for _de novo_ assembly Before running the actual assembly, we need to pre-process our sequencing data. Typical pre-processing steps include the trimming of adapter sequences and barcodes from the sequencing data and the removal of host or contaminant sequences, such as the bacteriophage PhiX, which is commonly sequenced as a quality control. @@ -162,7 +160,7 @@ The sequencing data for the sample **2612** were generated across eight differen Overall, we have almost no short DNA molecules (< 50 bp) but most DNA molecules are longer than 80 bp. Additionally, there were > 200,000 read pairs that could not be overlapped. Therefore, we can conclude that the sample **2612** is moderately degraded ancient DNA sample and has many long DNA molecules. ::: -### _De novo_ assembly +## _De novo_ assembly Now, we will actual perform the _de novo_ assembly on the sequencing data. 
For this, we will use the program MEGAHIT [@LiMegahit2015], a _de Bruijn_-graph assembler. @@ -258,7 +256,7 @@ We standardised this approach and added it to the Nextflow pipeline nf-core/mag While MEGAHIT is able to assemble ancient metagenomic sequencing data with high amounts of ancient DNA damage, it tends to introduce damage-derived T and A alleles into the contig sequences instead of the true C and G alleles. This can lead to a higher number of nonsense mutations in coding sequences. We strongly advise you to correct such mutations, e.g. by using the ancient DNA workflow of the Nextflow pipeline [nf-core/mag](https://nf-co.re/mag). ::: -### Aligning the short-read data against the contigs +## Aligning the short-read data against the contigs After the assembly, the next detrimental step that is required for many subsequent analyses is the alignment of the short-read sequencing data back to assembled contigs. @@ -303,7 +301,7 @@ samtools index alignment/2612.sorted.calmd.bam ``` ::: -### Reconstructing metagenome-assembled genomes +## Reconstructing metagenome-assembled genomes There are typically two major approaches on how to study biological diversity of samples using the results obtained from the _de novo_ assembly. The first one is to reconstruct metagenome-assembled genomes (MAGs) and to study the species diversity. 
@@ -340,9 +338,6 @@ Make sure you have followed the instructions for setting up the additional softw To skip the first steps of metaWRAP and start straight with the binning, we need to create the folder structure and files that metaWRAP expects: - - - ```{bash, eval = F} mkdir -p metawrap/INITIAL_BINNING/2612/work_files ln -s $PWD/alignment/2612.sorted.calmd.bam \ @@ -420,7 +415,6 @@ tar xvf checkM/checkm_data_2015_01_16.tar.gz -C checkM echo checkM | checkm data setRoot checkM ``` - Afterwards, we can execute metaWRAP's bin refinement module: ```{bash, eval = F} @@ -444,9 +438,7 @@ conda deactivate The latter step will produce a summary file, `metawrap_50_10_bins.stats`, that lists all retained bins and some key characteristics, such as the genome size, the completeness estimate, and the contamination estimate. The latter two can be used to assign a quality score according to the Minimum Information for MAG (MIMAG; see info box). ::: - ::: {.callout-note title="The Minimum Information for MAG (MIMAG)"} - The two most common metrics to evaluate the quality of MAGs are: - the **completeness**: how many of the expected lineage-specific single-copy marker genes were present in the MAG? @@ -455,13 +447,11 @@ The two most common metrics to evaluate the quality of MAGs are: These metric is usually calculated using the marker-gene catalogue of checkM [@Parks2015], also if there are other estimates from other tools such as BUSCO [@Manni2021], GUNC [@Orakov2021] or checkM2 [@Chklovski2022]. Depending on the estimates on completeness and contamination plus the presence of RNA genes, MAGs are assigned to the quality category following the Minimum Information for MAG criteria [@Bowers2017] You can find the overview [here](https://www.nature.com/articles/nbt.3893/tables/1). - ::: As these two steps will run rather long and need a large amount of memory and disk space, I have provided the results of metaWRAP's bin refinement. 
You can find the file here: `///denovo-assembly/metawrap_50_10_bins.stats`. Be aware that these results are based on the bin refinement of the results of three binning tools and include CONCOCT. ::: {.callout-tip title="Question" appearence="simple"} - **How many bins were retained after the refinement with metaWRAP? How many high-quality and medium-quality MAGs did the refinement yield following the MIMAG criteria?** Hint: You can more easily visualise tables on the terminal using the Python program `visidata`. You can open a table using `vd -f tsv ///denovo-assembly/metawrap_50_10_bins.stats`. (press ctrl+q to exit). Next to separating the columns nicely, it allows you to perform a lot of operations like sorting conveniently. Check the cheat sheet [here](https://jsvine.github.io/visidata-cheat-sheet/en/). @@ -474,7 +464,7 @@ In total, metaWRAP retained five bins, similarly to MaxBin2. Of these five bins, ::: -### Taxonomic assignment of contigs +## Taxonomic assignment of contigs What should we do when we simply want to know to which taxon a certain contig most likely belongs to? @@ -494,7 +484,7 @@ For each tool, we can either use pre-computed reference databases or compute our As for any task that involves the alignment of sequences against a reference database, the chosen reference database should fit the sequences you are searching for. If your reference database does not capture the diversity of your samples, you will not be able to assign a subset of the contigs. There is also a trade-off between a large reference database that contains all sequences and its memory requirement. @Wright2023 elaborated on this quite extensively when comparing Kraken2 against MetaPhlAn. -While all of these tools can do the job, I typically prefer to use the program MMSeqs2 [@Steinegger2017] because it comes along with a very fast algorithm based on aminoacid sequence alignment and implements a lowest common ancestor (LCA) algorithm (@fig-denovoassembly-mmseqs2). 
Recently, they implemented a _taxonomy_ workflow [@Mirdita2021] that allows to efficiently assign contigs to taxons. Luckily, it comes with multiple pre-computed reference databases, such as the GTDB v207 reference database [@Parks2020], and therefore it is even more accessible for users. +While all of these tools can do the job, I typically prefer to use the program MMSeqs2 [@Steinegger2017] because it comes along with a very fast algorithm based on amino acid sequence alignment and implements a lowest common ancestor (LCA) algorithm (@fig-denovoassembly-mmseqs2). Recently, they implemented a _taxonomy_ workflow [@Mirdita2021] that allows to efficiently assign contigs to taxons. Luckily, it comes with multiple pre-computed reference databases, such as the GTDB v207 reference database [@Parks2020], and therefore it is even more accessible for users. ![Scheme of the _taxonomy_ workflow implemented into MMSeqs2. Adapted from @Mirdita2021, Fig. 1.](assets/images/chapters/denovo-assembly/MMSeqs2_classify_Fig1.jpeg){#fig-denovoassembly-mmseqs2} @@ -557,7 +547,7 @@ From the 3,523 assigned contigs, 2,013 were assigned to the rank "species", whil The most contigs were assigned the archael species _Halococcus morrhuae_ (n=386), followed by the bacterial species _Olsenella E sp003150175_ (n=298) and _Collinsella sp900768795_ (n=186). ::: -### Taxonomic assignment of MAGs +## Taxonomic assignment of MAGs MMSeqs2's _taxonomy_ workflow is very useful to classify all contigs taxonomically. However, how would we determine which species we reconstructed by binning our contigs? @@ -633,7 +623,7 @@ We would expect all five species to be present in our sample. All MAGs but `bin. ::: -### Evaluating the amount of ancient DNA damage +## Evaluating the amount of ancient DNA damage One of the common questions that remain at this point of our analysis is whether the contigs that we assembled show evidence for the presence of ancient DNA damage. 
If yes, we could argue that these microbes are indeed ancient, particularly when their DNA fragment length distribution is rather short, too. @@ -667,7 +657,7 @@ From the 3,606 contigs, pyDamage inferred a q-value, i.e. a p-value corrected fo This reflects also on the MAGs. Although four of the five MAGs were human gut microbiome taxa, they did not show strong evidence of ancient DNA damage. This suggests that the sample is too young and is well preserved. ::: -### Annotating genomes for function +## Annotating genomes for function The second approach on how to study biological diversity of samples using the assembly results is to compare the reconstructed genes and their functions with each other. @@ -753,7 +743,7 @@ To delete the conda environment conda remove --name denovo-assembly --all -y ``` -### Summary +## Summary In this practical course you have gone through all the important steps that are necessary for _de novo_ assembling ancient metagenomic sequencing data to obtain contiguous DNA sequences with little error. Furthermore, you have learned how to cluster these sequences into bins without using any references and how to refine them based on lineage-specific marker genes. For these refined bins, you have evaluated their quality regarding common standards set by the scientific community and assigned the MAGs to its most likely taxon. Finally, we learned how to infer the presence of ancient DNA damage and annotate them for RNA genes and protein-coding sequences. diff --git a/functional-profiling.qmd b/functional-profiling.qmd index 63813c36..a3530c93 100644 --- a/functional-profiling.qmd +++ b/functional-profiling.qmd @@ -3,28 +3,27 @@ title: Functional Profiling author: Irina Velsko, James A. Fellows Yates --- -::: {.callout-warning} +::: {.callout-important} This chapter has not been updated since the 2022 edition of this book. 
::: ::: {.callout-tip} -For this chapter's exercises, if not already performed, you will need to create the [conda environment](before-you-start.qmd#creating-a-conda-environment) from the `yml` file in the following [link](https://github.com/SPAAM-community/intro-to-ancient-metagenomics-book/raw/main/assets/envs/functional-profiling.yml) (right click and save as to download), and once created, activate the environment with: +For this chapter's exercises, if not already performed, you will need to download the chapter's dataset, decompress the archive, and create and activate the conda environment. + +To do this, use `wget` or right click and save as to download this Zenodo archive: [10.5281/zenodo.6983188](https://doi.org/10.5281/zenodo.6983188), and unpack it: ```bash -conda activate functional-profiling +tar xvf 5c-functional-genomics.tar.gz +cd 5c-functional-genomics/ ``` -To download the data for this chapter, please download following archive with, extract the tar, and change into the directory. - -For example +You can then create and subsequently activate the environment with: ```bash -wget -P . -O functional-profiling.tar.gz https://zenodo.org/record/6983189/files/5c-functional-genomics.tar.gz -tar -xzf functional-profiling.tar.gz -cd functional-profiling/ +conda env create -f day5.yml +conda activate phylogenomics-functional ``` ::: - :::{.callout-note} The above conda environment _does not_ include HUMAnN3 due to conflicts with the R packages in the environment. @@ -39,8 +38,6 @@ conda create -n humann3 -c bioconda humann ## Preparation - - Open R Studio from within the conda environment ```bash @@ -72,7 +69,7 @@ Running HUMAnN3 module requires about 72 GB of memory because it has to load a If you have sufficient computational memory resources, you can run the following steps to run the bin refinement yourself. We will not run HUMANn3 here as it requires very large databases and takes a long time to run, we have already prepared output for you. 
- +::: ::: {.callout-warning title="Example commands - do not run!" collapse="true"} ```bash diff --git a/genome-mapping.qmd b/genome-mapping.qmd index a9706f23..a5fdb727 100644 --- a/genome-mapping.qmd +++ b/genome-mapping.qmd @@ -3,21 +3,21 @@ title: Genome Mapping author: Alexander Herbig and Alina Hiß --- -::: callout-tip -For this chapter's exercises, if not already performed, you will need to create the [conda environment](before-you-start.qmd#creating-a-conda-environment) from the `yml` file in the following [link](https://github.com/SPAAM-community/intro-to-ancient-metagenomics-book/raw/main/assets/envs/genome-mapping.yml) (use `wget`, or right click and save as to download), and once created, activate the environment with: +::: {.callout-tip} +For this chapter's exercises, if not already performed, you will need to download the chapter's dataset, decompress the archive, and create and activate the conda environment. -``` bash -conda activate genome-mapping -``` +To do this, use `wget` or right click and save as to download this Zenodo archive: [10.5281/zenodo.8413204](https://doi.org/10.5281/zenodo.8413204), and unpack it: -To download the data for this chapter, please download following archive with, extract the tar, and change into the directory. +```bash +tar xvf genome-mapping.tar.gz +cd genome-mapping/ +``` -For example +You can then create and subsequently activate the environment with: ```bash -wget --quiet -O genome-mapping.tar.gz https://www.dropbox.com/scl/fi/wghlf22dxyl4imp96wf9x/genome-mapping.tar.gz?rlkey=mw90slu9ep9tdzpypfxrsrfuj&dl=0 -P //// -tar -xzf genome-mapping.tar.gz -cd genome-mapping +conda env create -f genome-mapping.yml +conda activate genome-mapping ``` ::: @@ -452,7 +452,7 @@ To delete the conda environment conda remove --name genome-mapping --all -y ``` -### Conclusions +## Summary - Mapping DNA sequencing reads to a reference genome is a complex procedure that requires multiple steps. - Mapping results are the basis for genotyping, i.e. 
the detection of differences to the reference. diff --git a/git-github.qmd b/git-github.qmd index 22705c51..81d73375 100644 --- a/git-github.qmd +++ b/git-github.qmd @@ -5,9 +5,19 @@ bibliography: assets/references/chapters/git-github/references.bib --- ::: {.callout-tip} -For this chapter's exercises, if not already performed, you will need to create the [conda environment](before-you-start.qmd#creating-a-conda-environment) from the `yml` file in the following [link](https://github.com/SPAAM-community/intro-to-ancient-metagenomics-book/raw/main/assets/envs/git-github.yml) (right click and save as to download), and once created, activate the environment with: +For this chapter's exercises, if not already performed, you will need to download the chapter's dataset, decompress the archive, and create and activate the conda environment. + +To do this, use `wget` (or right click and save as) to download this Zenodo archive: [10.5281/zenodo.8413100](https://doi.org/10.5281/zenodo.8413100), and unpack it: + +```bash +tar xvf git-github.tar.gz +cd git-github/ +``` + +You can then create and subsequently activate the environment with: ```bash +conda env create -f git-github.yml conda activate git-github ``` ::: @@ -521,9 +531,6 @@ You can make a branch from any point in your git history, and also make as many ### Branches - - - There are two ways you can make a branch. The first way is using the GitHub interface, as in @fig-git-github-githubbranch. @@ -711,23 +718,6 @@ Fast-forward ``` ::: -## Summary - -In this chapter, we have gone over the fundamental concepts of Git. - -We've gone through setting up your GitHub account to allow passwordless interaction between the GitHub remote repository and your local copy on your machine with SSH keys. - -Through the GitHub website interface we made a new repository and gone through the 6 basic commands you need for using Git - -1. git clone -2. git add -3. git status -4. git commit -5. git push -6. 
git pull - -We finally covered how to work in sandboxes and collaboratively with branches and pull requests. - ## (Optional) clean-up Let's clean up your working directory by removing all the data and output from this chapter. @@ -759,6 +749,23 @@ To delete the conda environment conda remove --name git-github --all -y ``` +## Summary + +In this chapter, we have gone over the fundamental concepts of Git. + +We've gone through setting up your GitHub account to allow passwordless interaction between the GitHub remote repository and your local copy on your machine with SSH keys. + +Through the GitHub website interface, we made a new repository and went through the six basic commands you need for using Git: + +1. git clone +2. git add +3. git status +4. git commit +5. git push +6. git pull + +We finally covered how to work in sandboxes and collaboratively with branches and pull requests. + ## Questions to think about 1. Why is using a version control software for tracking data and code important? diff --git a/phylogenomics.qmd b/phylogenomics.qmd index eea3b77c..d31dbab2 100644 --- a/phylogenomics.qmd +++ b/phylogenomics.qmd @@ -4,20 +4,20 @@ author: Arthur Kocher and Aida Andrades Valtueña --- ::: {.callout-tip} -For this chapter's exercises, if not already performed, you will need to create the [conda environment](before-you-start.qmd#creating-a-conda-environment) from the `yml` file in the following [link](https://github.com/SPAAM-community/intro-to-ancient-metagenomics-book/raw/main/assets/envs/phylogenomics.yml) (right click and save as to download), and once created, activate the environment with: +For this chapter's exercises, if not already performed, you will need to download the chapter's dataset, decompress the archive, and create and activate the conda environment. 
+ +To do this, use `wget` (or right click and save as) to download this Zenodo archive: [10.5281/zenodo.8413215](https://doi.org/10.5281/zenodo.8413215), and unpack it: ```bash -conda activate phylogenomics +tar xvf phylogenomics.tar.gz +cd phylogenomics/ ``` -To download the data for this chapter, please download following archive with, extract the tar, and change into the directory. - -For example +You can then create and subsequently activate the environment with: ```bash -wget -O phylogenomics.tar.gz https://www.dropbox.com/scl/fi/jyz9n3h8yt8jeovviizf4/phylogenomics.tar.gz?rlkey=1aowh0grdvphht2h8h6qox0fa&dl=0 -P //// -tar -xzf phylogenomics.tar.gz -cd phylogenomics +conda env create -f phylogenomics.yml +conda activate phylogenomics ``` ::: @@ -27,9 +27,7 @@ There are additional software requirements for this chapter Please see the relevant chapter section in [Before you start](/before-you-start.qmd) before continuing with this chapter. ::: -## Practical - -### Preparation +## Preparation The data and conda environment `.yaml` file for this practical session can be downloaded from here: [https://doi.org/10.5281/zenodo.6983184](https://doi.org/10.5281/zenodo.6983184). See instructions on page. @@ -47,7 +45,7 @@ Load the conda environment. conda activate phylogenomics ``` -### Visualize the sequence alignment +## Visualize the sequence alignment In this practical session, we will be working with an alignment produced as you learned in the practical _Genome mapping_. @@ -116,7 +114,7 @@ Once you know this, can you already tell by looking at the alignment which seque We can easily see that the last sequence in the alignment (Y. pseudotuberculosis) contains more disagreements with the consensus. This is normal since this is the only genome not belonging to the _Y. 
pestis_ species: we will use it as an outgroup. ::: -### Distance-based phylogeny: Neighbour Joining +## Distance-based phylogeny: Neighbour Joining The Neighbour Joining (NJ) method is an agglomerative algorithm which can be used to derive a phylogenetic tree from a pairwise distance matrix. In essence, this method groups the taxa with the shortest distance together first, and does this iteratively until all the taxa/sequences included in your alignment have been placed in a tree. @@ -148,7 +146,6 @@ As said above, we will explore our own NJ tree in _FigTree_. Open the software by ty ![](assets/images/chapters/phylogenomics/11.png) - Note that even though a root is displayed by default in _FigTree_, NJ trees are actually **unrooted**. We know that _Yersinia pseudotuberculosis_ (labelled here as _Y. pseudotuberculosis_) is an outgroup to _Yersinia pestis_. You can reroot the tree by selecting _Y.pseudotuberculosis_ and pressing _Reroot_. ![](assets/images/chapters/phylogenomics/14.png) @@ -181,7 +178,7 @@ Do they form a monophyletic group (a clade)? Yes, they form a monophyletic group. We can also say that this group of prehistoric strains forms its own lineage. ::: -### Probabilistic methods: Maximum Likelihood and Bayesian inference +## Probabilistic methods: Maximum Likelihood and Bayesian inference These are the most commonly used approaches today. In general, probabilistic methods are statistical techniques that are based on models under which the observed data is generated through a stochastic process depending on a set of parameters which we want to estimate. The probability of the data given the model parameters is called the likelihood. 
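The idea of a likelihood can be illustrated with a deliberately minimal sketch (hypothetical numbers, not part of the chapter's exercises): suppose each alignment column independently differs from a reference with some unknown probability `p`, and we evaluate how probable the observed number of differing columns is for different candidate values of `p`:

```python
# Toy likelihood under a hypothetical model: each of n_sites alignment
# columns independently differs from the reference with probability p.

def likelihood(p, n_sites, n_diffs):
    """Probability of observing n_diffs differing columns given parameter p."""
    return (p ** n_diffs) * ((1 - p) ** (n_sites - n_diffs))

n_sites, n_diffs = 100, 12  # hypothetical alignment summary

# Evaluate the likelihood on a grid of candidate parameter values; the
# grid point with the highest likelihood is a (grid-based) maximum
# likelihood estimate of p.
grid = [i / 100 for i in range(1, 100)]
ml_estimate = max(grid, key=lambda p: likelihood(p, n_sites, n_diffs))

print(ml_estimate)  # 0.12, i.e. n_diffs / n_sites
```

Phylogenetic maximum likelihood methods generalise this idea to a far larger parameter space (tree topology, branch lengths, and substitution model parameters), which is why heuristic search algorithms are needed.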
@@ -199,7 +196,7 @@ In a phylogenetic probabilistic model, the data is the sequence alignment and th ![](assets/images/chapters/phylogenomics/20.png) ::: -#### Maximum likelihood estimation and bootstrapping +## Maximum likelihood estimation and bootstrapping One way we can make inferences from a probabilistic model is by finding the combination of parameters which maximises the likelihood. These parameter values are called maximum likelihood (ML) estimates. We are usually not able to compute the likelihood value for all possible combinations of parameters and have to rely on heuristic algorithms to find the maximum likelihood estimates. @@ -260,7 +257,7 @@ Lastly, we will export the rooted tree from figtree: File -> Export trees -> sel ![](assets/images/chapters/phylogenomics/export_tree.png) -#### Estimating a time-tree using Bayesian phylogenetics (_BEAST2_) +## Estimating a time-tree using Bayesian phylogenetics (_BEAST2_) Now, we will try to reconstruct a phylogeny in which the branch lengths do not represent a number of substitutions but instead represent the time of evolution. To do so, we will use the dates of ancient genomes (C14 dates) to calibrate the tree in time. This assumes a molecular clock hypothesis in which substitutions occur at a rate that is relatively constant in time so that the time of evolution can be estimated based on the number of substitutions. @@ -284,8 +281,6 @@ To characterize the full posterior distribution of each parameter, we would need ![](assets/images/chapters/phylogenomics/25.png) -##### Set up a _BEAST2_ analysis - ::: {.callout-tip} The ["taming the beast" website](https://taming-the-beast.org/tutorials/) has great tutorials to learn how to set up a _BEAST2_ analysis. In particular, the "Introduction to BEAST2", "Prior selection" and "Time-stamped data" are good starts. 
::: @@ -347,7 +342,7 @@ Once the analysis is running two files should have been created and are continuo While the analysis is running, you can start reading the next section! -##### Assessing _BEAST2_ results +### Assessing _BEAST2_ results ::: {.callout-note title="Reminder"} We are using an MCMC algorithm to sample the posterior distribution of parameters. If the MCMC has run long enough, we can use the sampled parameters to approximate the posterior distribution itself. Therefore, we have to check first that the MCMC chain has run long enough. @@ -407,7 +402,7 @@ What is your mean estimate of the clock rate (ucld mean)? ![](assets/images/chapters/phylogenomics/beast_results_ucldMean.png) ::: -##### MCC tree +### MCC tree Since we are working in a Bayesian framework, we do not obtain a single phylogenetic tree as with Maximum likelihood, but a large set of trees which should be representative of the posterior distribution. In contrast with mono-dimensional parameters, a tree distribution cannot be easily summarized with mean or median estimates. Instead, we need to use specific tree-summarizing techniques. One of the most popular is the maximum clade credibility (MCC) tree, which works as follows: @@ -445,7 +440,7 @@ What is your estimate for the age of the most recent common ancestor of all Y. p ~5800 years BP (HPD 95%: ~8000-4500 years BP) ::: -#### Bonus: Temporal signal assessment +### Bonus: Temporal signal assessment It is good practice to assess whether the genetic sequences that we analyse do indeed behave like molecular clocks before trying to estimate a time tree (i.e. we should have done this before the actual _BEAST2_ analysis). A classic way to assess the temporal signal of a dataset is the root-to-tip regression. The rationale of the root-to-tip regression is to verify that the older a sequence is, the closer it should be to the root in a (rooted) substitution tree, because there was less time for substitutions to accumulate. 
In other words, there should be a correlation between sample age and distance to the root, which we can assess using a linear regression (root-to-tip regression). This can be done using the program _TempEst_: diff --git a/python-pandas.qmd b/python-pandas.qmd index 75576bfe..8e45d35a 100644 --- a/python-pandas.qmd +++ b/python-pandas.qmd @@ -7,21 +7,21 @@ author: Robin Warner, Kevin Nota, and Maxime Borry This session is typically run in parallel to the Introduction to R and Tidyverse. Participants of the summer schools choose which to attend based on their prior experience. We recommend the [introduction to R session](r-tidyverse.qmd) if you have no experience with either R or Python. ::: -::: callout-tip -For this chapter's exercises, if not already performed, you will need to create the [conda environment](before-you-start.qmd#creating-a-conda-environment) from the `yml` file in the following [link](https://github.com/SPAAM-community/intro-to-ancient-metagenomics-book/raw/main/assets/envs/python-pandas.yml) (use `wget`, or right click and save as to download), and once created, activate the environment with: +::: {.callout-tip} +For this chapter's exercises, if not already performed, you will need to download the chapter's dataset, decompress the archive, and create and activate the conda environment. -``` bash -conda activate python-pandas -``` +To do this, use `wget` (or right click and save as) to download this Zenodo archive: [10.5281/zenodo.8413046](https://doi.org/10.5281/zenodo.8413046), and unpack it: -To download the data for this chapter, please download following archive with, extract the tar, and change into the directory. 
+```bash +tar xvf python-pandas.tar.gz +cd python-pandas/ +``` -For example +You can then create and subsequently activate the environment with: ```bash -wget -O python-pandas.tar.gz https://www.dropbox.com/scl/fi/38jrvyat9aaodix46zeuc/python-pandas.tar.gz?rlkey=u1wn7qwkqn288o0bnx5mqrgfp&dl=0 -P //// -tar -xzf python-pandas.tar.gz -cd python-pandas +conda env create -f python-pandas.yml +conda activate python-pandas ``` ::: diff --git a/r-tidyverse.qmd b/r-tidyverse.qmd index 01bdc87f..518ffe2b 100644 --- a/r-tidyverse.qmd +++ b/r-tidyverse.qmd @@ -14,21 +14,21 @@ bibliography: assets/references/chapters/r-tidyverse/references.bib This session is typically run in parallel to the Introduction to Python and Pandas. Participants of the summer schools choose which to attend based on their prior experience. We recommend this session if you have no experience with either R or Python. ::: -::: callout-tip -For this chapter's exercises, if not already performed, you will need to create the [conda environment](before-you-start.qmd#creating-a-conda-environment) from the `yml` file in the following [link](https://github.com/SPAAM-community/intro-to-ancient-metagenomics-book/raw/main/assets/envs/r-tidyverse.yml) (use `wget`, or right click and save as to download), and once created, activate the environment with: +::: {.callout-tip} +For this chapter's exercises, if not already performed, you will need to download the chapter's dataset, decompress the archive, and create and activate the conda environment. -``` bash -conda activate r-tidyverse -``` +To do this, use `wget` (or right click and save as) to download this Zenodo archive: [10.5281/zenodo.8413026](https://doi.org/10.5281/zenodo.8413026), and unpack it: -To download the data for this chapter, please download following archive with, extract the tar, and change into the directory. 
+ +```bash +tar xvf r-tidyverse.tar.gz +cd r-tidyverse/ +``` -For example +You can then create and subsequently activate the environment with: ```bash -wget https://www.dropbox.com/s/61zy5uadsatxz7y/r-tidyverse.tar.gz -P //// -tar -xzf r-tidyverse.tar.gz -cd r-tidyverse +conda env create -f r-tidyverse.yml +conda activate r-tidyverse ``` ::: diff --git a/resources.qmd b/resources.qmd index cc298924..0070862b 100644 --- a/resources.qmd +++ b/resources.qmd @@ -2,6 +2,11 @@ title: Resources --- +::: {.callout-important} +This page is still under construction. +::: + + ## Introduction to NGS Sequencing diff --git a/summary.qmd b/summary.qmd deleted file mode 100644 index ac23450c..00000000 --- a/summary.qmd +++ /dev/null @@ -1,7 +0,0 @@ -# Summary {.unnumbered} - -In summary, this book has no content whatsoever. - -```{r} -1 + 1 -``` diff --git a/taxonomic-profiling.qmd b/taxonomic-profiling.qmd index 75f96bd9..2b7a9a3d 100644 --- a/taxonomic-profiling.qmd +++ b/taxonomic-profiling.qmd @@ -5,21 +5,20 @@ bibliography: assets/references/chapters/taxonomic_profiling/references.bib --- ::: {.callout-tip} -For this chapter's exercises, if not already performed, you will need to create the [conda environment](before-you-start.qmd#creating-a-conda-environment) from the `yml` file in the following [link](https://github.com/SPAAM-community/intro-to-ancient-metagenomics-book/raw/main/assets/envs/taxonomic-profiling.yml) (right click and save as to download), and once created, activate the environment with: +For this chapter's exercises, if not already performed, you will need to download the chapter's dataset, decompress the archive, and create and activate the conda environment. 
+ +To do this, use `wget` (or right click and save as) to download this Zenodo archive: [10.5281/zenodo.8413138](https://doi.org/10.5281/zenodo.8413138), and unpack it: ```bash -conda activate taxonomic-profiling +tar xvf taxonomic-profiling.tar.gz +cd taxonomic-profiling/ ``` -To download the data for this chapter, please download following archive with, extract the tar, and change into the directory. - -For example +You can then create and subsequently activate the environment with: ```bash -wget -q -O taxonomic-profiling.tar.gz -P //// https://www.dropbox.com/scl/fi/pljjweplz534ut9gyl87m/taxonomic-profiling.tar.gz?rlkey=owmeonoupcattbwsy5til4d9i&dl=0 - -tar -xzf taxonomic-profiling.tar.gz -cd taxonomic-profiling +conda env create -f taxonomic-profiling.yml +conda activate taxonomic-profiling ``` ::: diff --git a/tools.qmd b/tools.qmd index 07f82df1..9e3d72c1 100644 --- a/tools.qmd +++ b/tools.qmd @@ -2,6 +2,10 @@ title: Tools --- +::: {.callout-important} +This page is still under construction. +::: + This page lists all the software, with links, used and referred to in the practical chapters of the book. ## Introduction to R and the Tidyverse