---
title: Authentication
author: "Nikolay Oskolkov, Giulia Zampirolo, Aleksandra Laura Pach"
bibliography: assets/references/authentication.bib
---
::: {.callout-note collapse="true" title="Self guided: chapter environment setup"}
For this chapter's exercises, if not already performed, you will need to download the chapter's dataset, decompress the archive, and create and activate the conda environment.
To do this, use `wget` or right-click and save to download this Zenodo archive: [10.5281/zenodo.13759228](https://doi.org/10.5281/zenodo.13759228), and unpack it with
```bash
tar xvf authentication.tar.gz
cd authentication/
```
You can then create and subsequently activate the environment with
```bash
conda env create -f authentication.yml
conda activate authentication
```
:::
::: {.callout-warning}
There are additional software requirements for this chapter.
Please see the relevant chapter section in [Before you start](/before-you-start.qmd) before continuing with this chapter.
:::
In ancient metagenomics we typically try to answer two questions: "Who is there?" and "How ancient?", meaning we would like to detect an organism and investigate whether this organism is ancient. There are three typical ways to identify the presence of an organism in a metagenomic sample:
- alignment of DNA fragments to a reference genome ([Bowtie](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3322381/), [BWA](https://pubmed.ncbi.nlm.nih.gov/19451168/), [Malt](https://www.biorxiv.org/content/10.1101/050559v1) etc.)
- taxonomic (kmer-based) classification of DNA fragments ([Kraken](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1891-0), [MetaPhlan](https://www.nature.com/articles/s41587-023-01688-w), [Centrifuge](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5131823/) etc.)
- *de-novo* genome assembly ([Megahit](https://academic.oup.com/bioinformatics/article/31/10/1674/177884), [metaSPAdes](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5411777/) etc.)
The first two are reference-based, i.e. they assume a similarity of a query ancient DNA fragment to a modern reference genome in a database. This is a strong assumption, which might not be true for very old or very diverged ancient organisms. This is the case when the reference-free *de-novo* assembly approach becomes powerful. However, *de-novo* assembly has its own computational challenges for low-coverage ancient metagenomic samples that typically contain very short DNA fragments.
![](assets/images/chapters/authentication/metagenomic_approaches.png)
While all three types of metagenomic analysis are suitable for exploring the composition of metagenomic samples, they do not directly validate the findings or provide information about the ancient or endogenous status of the detected organism. It can happen that the detected organism
1. was mis-identified (the DNA belongs to another organism than initially thought),
2. has a modern origin (for example, lab or sequencing contaminant)
3. is of exogenous origin (for example, an ancient microbe that entered the host *post-mortem*).
Therefore, additional analysis is needed to follow-up each hit and demonstrate its ancient origin. Below, we describe a few steps that can help ancient metagenomic researchers to verify their findings and put them into biological context.
In this chapter, we will cover:
- How to recognise that a detected organism was mis-identified based on breadth / evenness of coverage
- How to validate findings by breadth of coverage filters via k-mer based taxonomic classification with KrakenUniq
- How to validate findings using alignments and assess mapping quality, edit distance and evenness of coverage profile
- How to detect modern contaminants via deamination profile, DNA fragmentation and post-mortem damage (PMD) scores
- How negative (blank) controls can help disentangle ancient organisms from modern contaminants
- How microbial source tracking can facilitate separating endogenous and exogenous microbial communities
We will also cover
- Introduction to combining authentication with taxonomic profiling
- How to analyse deamination from a metagenomic dataset with metaDMG
- How to investigate the main metaDMG statistics with R
## Simulated ancient metagenomic data
In this chapter, we will use 10 ancient metagenomic samples pre-simulated with [gargammel](https://academic.oup.com/bioinformatics/article/33/4/577/2608651) from @Pochon2022-hj. \
![Screenshot of preprint of aMeta by Pochon et al. 2022](assets/images/chapters/authentication/aMeta.png){#fig-authenticationdecontamination-ameta}
::: {.callout-note title="Self guided: data preparation" collapse=true}
The raw simulated data can be accessed via [https://doi.org/10.17044/scilifelab.21261405](https://doi.org/10.17044/scilifelab.21261405)
To download the simulated ancient metagenomic data please use the following command lines:
```bash
mkdir ameta/ && cd ameta/
wget https://figshare.scilifelab.se/ndownloader/articles/21261405/versions/1 \
&& unzip 1 && rm 1
```
The DNA reads were simulated with damage, sequencing errors and Illumina adapters, therefore one will have to trim the adapters first:
```bash
for i in $(ls *.fastq.gz)
do
sample_name=$(basename $i .fastq.gz)
cutadapt -a AGATCGGAAGAG --minimum-length 30 -o ${sample_name}.trimmed.fastq.gz ${sample_name}.fastq.gz -j 4
done
```
Now, after the basic data pre-processing has been done, we can proceed with validation, authentication and decontamination analyses.
:::
Within the unpacked `authentication/` directory you will see a range of sub-directories, each representing a different part of this tutorial. One set of trimmed 'simulated' reads from @Pochon2022-hj is in `rawdata/`.
## Genomic hit confirmation
Once an organism has been detected in a sample (via alignment, classification or *de-novo* assembly), one needs to take a closer look at multiple quality metrics in order to reliably confirm that the organism is not a false-positive detection and is of ancient origin. The methods used for this purpose can be divided into modern validation and ancient-specific validation criteria. Below, we will cover both of them.
## Modern genomic hit validation criteria
The modern validation methods aim at confirming organism presence regardless of its ancient status. The main approaches include evenness / breadth of coverage computation, assessing alignment quality, and monitoring affinity of the DNA reads to the reference genome of the potential host.
### Depth vs breadth and evenness of coverage
Concluding that an organism is present based solely on the number of assigned sequenced reads (aka the depth of coverage metric) is suboptimal and too permissive, and may result in a large number of false-positive discoveries. For example, when using alignment to a reference genome, the mapped reads may demonstrate non-uniform coverage as visualised in the [Integrative Genomics Viewer (IGV)](https://software.broadinstitute.org/software/igv/) below.
![Screenshot of the IGV software. The reference genome bar is shown at the top, and in the main panel tall isolated 'towers' of reads with lots of coloured bases representing read 'stacking' i.e., un-uniform distribution of reads across the whole genome (as expected of the correct reference genome), but accumulation of all reads in the same isolated places on the reference genome, with the many variants on the reads suggesting they are from different species and aligning to conserved regions.](assets/images/chapters/authentication/IGV_uneven_coverage_Y.pestis.png){#fig-authenticationdecontamination-igv}
In this case, DNA reads originating from another microbe were (mis-)aligned to the *Yersinia pestis* reference genome. It can be observed that a large number of the reads align only to a few conserved genomic loci. Therefore, even if many thousands of DNA reads are capable of aligning to the reference genome, the overall uneven alignment pattern suggests no presence of *Yersinia pestis* in the metagenomic sample. Thus, not only the number of assigned reads (proportional to the depth of coverage metric) but also the **breadth and evenness of coverage** metrics become of particular importance for verification of metagenomic findings, i.e. hits with DNA reads uniformly aligned across the reference genome are more likely to be true-positive detections (@fig-authenticationdecontamination-igv).
![Schematic diagram of the differences between a) read stacking (all reads aligned at one position of the genome), indicating you've not correctly identified the organism, vs b) reads distributed across all the genome. A formula at the bottom of the image shows how both A and B have the same depth of coverage even though they have very different actual patterns on the genome.](assets/images/chapters/authentication/depth_vs_breadth_of_coverage.png){#fig-authenticationdecontamination-depthbreadth}
In the next sections, we will show how to practically compute the breadth and evenness of coverage via KrakenUniq and Samtools.
### Breadth of coverage via KrakenUniq
Here we are going to demonstrate that one can assess breadth of coverage information already at the taxonomic profiling step. Although taxonomic classifiers do not perform alignment, some of them, such as [KrakenUniq](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1568-0) and [Kraken2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1891-0), provide a way to infer breadth of coverage in addition to the number of reads assigned to a taxon. This allows many false-positive hits to be filtered out immediately. Since Kraken-family classifiers are typically faster and [less memory-demanding](https://www.biorxiv.org/content/10.1101/2022.06.01.494344v1.full.pdf) than genome aligners, i.e. they can work with very large reference databases, they provide a robust and fairly unbiased initial taxonomic profiling, which can still later be followed up with proper alignment and computation of evenness of coverage as described above.
::: {.callout-note title="Self guided: data preparation" collapse=true}
⚠️ This step requires large amounts of memory and CPUs! If running it yourself, please note that this step is better suited for a server, HPC cluster, or the cloud rather than a laptop.
To profile the data with KrakenUniq one needs a database, a pre-built complete microbial NCBI RefSeq database can be accessed via [https://doi.org/10.17044/scilifelab.21299541](https://doi.org/10.17044/scilifelab.21299541).
Please use the following command line to download the database:
```bash
cd krakenuniq/ ## If you've left it...
wget https://figshare.scilifelab.se/ndownloader/articles/21299541/versions/1 \
&& unzip 1 && rm 1
```
The following example command is how you would execute KrakenUniq.
```bash
for i in $(ls *.trimmed.fastq.gz)
do
krakenuniq --db KRAKENUNIQ_DB --fastq-input $i --threads 20 \
--classified-out ${i}.classified_sequences.krakenuniq \
--unclassified-out ${i}.unclassified_sequences.krakenuniq \
--output ${i}.sequences.krakenuniq --report-file ${i}.krakenuniq.output
done
```
:::
Taxonomic k-mer-based classification of the ancient metagenomic reads can be done via KrakenUniq. However, as this requires a very large database file, we provide the pre-computed results from running KrakenUniq on the 10 simulated samples in the `krakenuniq/` directory. Change into it with:
```bash
cd krakenuniq/
```
KrakenUniq by default delivers a proxy metric for breadth of coverage called the **number of unique kmers** (in the 4th column of its output table) assigned to a taxon. KrakenUniq output can be easily filtered with respect to both depth and breadth of coverage, which substantially reduces the number of false-positive hits.
![](assets/images/chapters/authentication/krakenuniq_filter.png)
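For orientation, such a filter can also be sketched directly on the command line with awk. This is only a minimal sketch, assuming the default KrakenUniq report layout (reads in column 2, *taxReads* in column 3, unique *kmers* in column 4, and taxonomic rank in column 8); the provided `filter_krakenuniq.py` script is what the chapter actually uses.
```bash
## keep species-level entries with >= 1000 unique k-mers and >= 200 taxon-assigned reads
awk -F'\t' '$8 == "species" && $4 >= 1000 && $3 >= 200' *.krakenuniq.output
```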
We can filter the KrakenUniq output with respect to both depth (*taxReads*) and breadth (*kmers*) of coverage with the following custom Python script, which selects only species with at least 200 assigned reads and 1000 unique k-mers. After the filtering, we can see a *Yersinia pestis* hit in *sample 10* that passes the filtering thresholds with respect to both depth and breadth of coverage.
Run this from within the `krakenuniq/` directory.
```bash
for i in $(ls *.krakenuniq.output)
do
../scripts/filter_krakenuniq.py $i 1000 200 ../scripts/pathogensfound.very_inclusive.tab
done
```
![](assets/images/chapters/authentication/filtered_krakenuniq_output.png)
We can also easily produce a KrakenUniq taxonomic abundance table *krakenuniq_abundance_matrix.txt* using the custom R script below, which takes as argument the contents of the `krakenuniq/` folder containing the KrakenUniq output files.
```bash
Rscript ../scripts/krakenuniq_abundance_matrix.R . krakenuniq_abundance_matrix/ 1000 200
```
From the _krakenuniq\_abundance\_matrix.txt_ table inside the resulting directory, it becomes clear that *Yersinia pestis* seems to be present in a few other samples in addition to sample 10.
![](assets/images/chapters/authentication/krakenuniq_abundance_matrix.png)
While KrakenUniq delivers information about breadth of coverage by default, you can also get this information from Kraken2.
For this, one has to use the special flag *--report-minimizer-data* when running Kraken2 in order to get its breadth of coverage proxy, which is called the **number of distinct minimizers** in the case of Kraken2. Below, we provide an example Kraken2 command line containing the distinct minimizer flag:
::: {.callout-warning title="Example only - do not run!"}
```bash
DBNAME=Kraken2_DB_directory
KRAKEN_INPUT=sample.fastq.gz
KRAKEN_OUTPUT=Kraken2_output_directory
kraken2 --db $DBNAME --fastq-input $KRAKEN_INPUT --threads 20 \
--classified-out $KRAKEN_OUTPUT/classified_sequences.kraken2 \
--unclassified-out $KRAKEN_OUTPUT/unclassified_sequences.kraken2 \
--output $KRAKEN_OUTPUT/sequences.kraken2 \
--report $KRAKEN_OUTPUT/kraken2.output \
--use-names --report-minimizer-data
```
:::
The filtering of Kraken2 output with respect to breadth and depth of coverage can then be done by analogy with the filtering of the KrakenUniq output table. In the case of *de-novo* assembly, the original DNA reads are typically aligned back to the assembled contigs, and the evenness / breadth of coverage can be computed from these alignments.
### Evenness of coverage via Samtools
Now, after we have detected an interesting *Y. pestis* hit, we would like to follow it up and compute multiple quality metrics (including proper breadth and evenness of coverage) from alignments of the DNA reads to the *Y. pestis* reference genome (the Bowtie2 aligner will be used in our case). Below, we download the *Yersinia pestis* reference genome from NCBI, build its Bowtie2 index, and align the trimmed reads against the *Yersinia pestis* reference genome with Bowtie2. Do not forget to sort and index the alignments, as this will be important for computing the evenness of coverage. It is also recommended to remove multi-mapping reads, i.e. the ones that have MAPQ = 0, at least for the Bowtie and BWA aligners that are commonly used in ancient metagenomics. Samtools with the *-q* flag can be used to extract reads with MAPQ >= 1.
::: {.callout-note title="Self guided: data preparation" collapse=true}
```bash
cd /<path>/<to>/authentication/bowtie2
## Download reference genome
NCBI=https://ftp.ncbi.nlm.nih.gov; ID=GCF_000222975.1_ASM22297v1
wget $NCBI/genomes/all/GCF/000/222/975/${ID}/${ID}_genomic.fna.gz
```
:::
```bash
cd /<path>/<to>/authentication/bowtie2
## Prepare reference genome and build Bowtie2 index
gunzip GCF_000222975.1_ASM22297v1_genomic.fna.gz; echo NC_017168.1 > region.bed
seqtk subseq GCF_000222975.1_ASM22297v1_genomic.fna region.bed > NC_017168.1.fasta
bowtie2-build --large-index NC_017168.1.fasta NC_017168.1.fasta --threads 10
## Align the trimmed reads against the reference genome
bowtie2 --large-index -x NC_017168.1.fasta --end-to-end --threads 10 \
--very-sensitive -U ../rawdata/sample10.trimmed.fastq.gz | samtools view -bS -h -q 1 \
-@ 20 - > Y.pestis_sample10.bam
## Sort and index BAM files for rapid access in downstream commands
samtools sort Y.pestis_sample10.bam -@ 10 > Y.pestis_sample10.sorted.bam
samtools index Y.pestis_sample10.sorted.bam
```
Next, the breadth / evenness of coverage can be computed from the BAM-alignments via *samtools depth* as follows:
```bash
samtools depth -a Y.pestis_sample10.sorted.bam > Y.pestis_sample10.sorted.boc
```
and visualised using for example the following R code snippet (alternatively [aDNA-BAMPlotter](https://github.com/MeriamGuellil/aDNA-BAMPlotter) can be used).
Load R by running `R` in your terminal
```bash
R
```
Note the following may take a minute or so to run.
```{r, eval = F}
# Read output of the samtools depth command
df <- read.delim("Y.pestis_sample10.sorted.boc", header = FALSE, sep = "\t")
names(df) <- c("Ref", "Pos", "N_reads")
# Split reference genome in tiles, compute breadth of coverage for each tile
N_tiles <- 500
step <- (max(df$Pos) - min(df$Pos)) / N_tiles
tiles <- c(0:N_tiles) * step; boc <- vector()
for(i in 1:length(tiles))
{
df_loc <- df[df$Pos >= tiles[i] & df$Pos < tiles[i+1], ]
boc <- append(boc, rep(sum(df_loc$N_reads > 0) / length(df_loc$N_reads),
dim(df_loc)[1]))
}
boc[is.na(boc)]<-0; df$boc <- boc
plot(df$boc ~ df$Pos, type = "s", xlab = "Genome position", ylab = "Coverage")
abline(h = 0, col = "red", lty = 2)
mtext(paste0(round((sum(df$N_reads > 0) / length(df$N_reads)) * 100, 2),
"% of genome covered"), cex = 0.8)
```
Once finished examining the plot you can quit R
```bash
## Press 'n' when asked if you want to save your workspace image.
quit()
```
![](assets/images/chapters/authentication/Evenness_of_coverage.png)
In the R script above, we simply split the reference genome into *N_tiles* tiles and compute the breadth of coverage (number of reference nucleotides covered by at least one read normalised by the total length) locally in each tile. By visualising how the local breadth of coverage changes from tile to tile, we can monitor the distribution of the reads across the reference genome. In the evenness of coverage figure above, the reads seem to cover all parts of the reference genome uniformly, which is a good evidence of true-positive detection, even though the total mean breadth of coverage is low due to the low total number of reads.
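If you only need the overall numbers rather than the per-tile profile, recent versions of Samtools can report breadth and mean depth of coverage directly. A quick check, assuming the `samtools coverage` subcommand is available in your environment (samtools >= 1.10):
```bash
## one line per reference sequence: covered bases, % of genome covered (breadth), mean depth, mean MAPQ
samtools coverage Y.pestis_sample10.sorted.bam
```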
### Alignment quality
In addition to evenness and breadth of coverage, it is very informative to monitor how well the metagenomic reads map to a reference genome. Here one can control for **mapping quality** ([MAPQ](https://samtools.github.io/hts-specs/SAMv1.pdf) field in the BAM-alignments) and the number of mismatches for each read, i.e. **edit distance**.
Mapping quality (MAPQ) can be extracted from the 5th column of BAM-alignments using Samtools and *cut* command in bash.
```bash
samtools view Y.pestis_sample10.sorted.bam | cut -f5 > mapq.txt
```
Then the extracted MAPQ values (the 5th column of the filtered BAM alignment) can be visualised via a simple histogram in R, as shown below for two random metagenomic samples.
Load R
```bash
R
```
And generate the histogram with.
```{r, eval = F}
hist(as.numeric(readLines("mapq.txt")), col = "darkred", breaks = 100)
```
![](assets/images/chapters/authentication/MAPQ.png)
Note that MAPQ scores are computed slightly differently by Bowtie and BWA, so they are not directly comparable; however, for both, MAPQ values of ~10-30, as in the histograms below, indicate good affinity of the DNA reads to the reference genome. Here we provide some examples of what typical MAPQ histograms for Bowtie2 and BWA alignments can look like:
![](assets/images/chapters/authentication/mapq.png)
Edit distance can be extracted by gathering information in the NM-tag inside BAM-alignments, which reports the number of mismatches for each aligned read. This can be done either in bash / awk, or using handy functions from *Rsamtools* R package:
```{r, eval = F}
library("Rsamtools")
param <- ScanBamParam(tag = "NM")
bam <- scanBam("Y.pestis_sample10.sorted.bam", param = param)
barplot(table(bam[[1]]$tag$NM), ylab="Number of reads", xlab="Number of mismatches")
```
Once finished examining the plot you can quit R.
```bash
## Press 'n' when asked if you want to save your workspace image.
quit()
```
![](assets/images/chapters/authentication/edit_distance.png)
In the barplot above we can see that the majority of reads align either without or with very few mismatches, which is evidence of high affinity of the aligned reads to the reference genome. For a true-positive finding, the edit distance barplot typically has a decreasing profile. However, for very degraded DNA, it can have a mode around 1 or 2, which can also be reasonable. A false-positive hit would have the mode of the edit distance barplot shifted towards higher numbers of mismatches.
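As mentioned above, the same edit-distance table can also be produced with bash / awk instead of Rsamtools. A minimal sketch that pulls the NM tag out of each alignment and tabulates it:
```bash
## count reads per number of mismatches (NM tag)
samtools view Y.pestis_sample10.sorted.bam \
  | grep -o 'NM:i:[0-9]*' \
  | cut -d':' -f3 \
  | sort -n | uniq -c
```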
### Affinity to reference
Closely related to edit distance is another alignment validation metric called **percent identity**. It is typically visualised as a barplot showing the number of reads that are 100% identical to the reference genome (i.e. map without a single mismatch), 99% identical, 98% identical, and so on. Misaligned reads originating from another, related organism typically have percent identities of 93-96%. In the figure below, the panels (c–e) demonstrate different percent identity distributions. In panel c, most reads show a high similarity to the reference, which indicates a correct assignment of the reads to the reference. In panel d, most reads are highly dissimilar to the reference, which suggests that they originate from different related species. In some cases, as in panel e, a mixture of correctly assigned and misassigned reads can be observed.
![](assets/images/chapters/authentication/multialleleicSNPs.png)
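Percent identity is not reported directly by the aligner, but a rough per-read value can be derived from the NM tag and the read length. A sketch only (it ignores indels and soft-clipping, so the values are approximate):
```bash
## approximate percent identity per read, tabulated into a histogram
samtools view Y.pestis_sample10.sorted.bam \
  | awk '{
      nm = 0
      for (i = 12; i <= NF; i++) if ($i ~ /^NM:i:/) { split($i, a, ":"); nm = a[3] }
      printf "%.0f\n", 100 * (1 - nm / length($10))
    }' \
  | sort -nr | uniq -c
```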
Another important way to detect reads that cross-map between related species is **haploidy**, or checking the amount of **multi-allelic SNPs**. Because bacteria are haploid organisms, only one allele is expected for each genomic position. Only a small number of multi-allelic sites are expected, which can result from a few mis-assigned or incorrectly aligned reads. In the figure above, panels (f–i) demonstrate histograms of SNP allele frequency distributions. Panel f demonstrates the situation where there are only a few multi-allelic sites, originating from a few misaligned reads. This is the preferred scenario, corresponding to correct assignment of the reads to the reference. Please also check the corresponding "Good alignments" IGV visualisation to the right in the figure above.
In contrast, a large number of multi-allelic sites indicates that the assigned reads originate from more than one species or strain, which can result in symmetric allele frequency distributions (e.g., if two species or strains are present in equal abundance) (panel g) or asymmetric distributions (e.g., if two species or strains are present in unequal abundance) (panel h). A large number of mis-assigned reads from closely related species can result in a large number of multi-allelic sites with low frequencies of the derived allele (panel i). The situations (g-i) correspond to incorrect assignment of the reads to the reference. Please also check the corresponding "Bad alignments" IGV visualisation to the right in the figure above.
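A quick way to inspect allele frequencies at variable positions is to call variants against the reference and look at the per-site allele depths. The sketch below uses bcftools, which is not part of this chapter's environment, so treat the exact commands as an assumption; many sites with intermediate alternate-allele fractions would point to cross-mapping reads:
```bash
## call variants on a haploid reference, keeping the per-sample allele depths (AD)
bcftools mpileup -f NC_017168.1.fasta -a AD Y.pestis_sample10.sorted.bam \
  | bcftools call -mv --ploidy 1 -Oz -o Y.pestis_sample10.vcf.gz
## report REF/ALT and the read counts supporting each allele at every variant site
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t[%AD]\n' Y.pestis_sample10.vcf.gz
```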
## Ancient-specific genomic hit validation criteria
In contrast to modern genomic hit validation criteria, the ancient-specific validation methods concentrate on DNA degradation and damage pattern as ultimate signs of ancient DNA. Below, we will discuss deamination profile, read length distribution and post mortem damage (PMD) scores metrics that provide good confirmation of ancient origin of the detected organism.
### Degradation patterns
Checking evenness of coverage and alignment quality can help us to make sure that the organism we are thinking about is really present in the metagenomic sample. However, we still need to address the question "How ancient?". For this purpose we need to compute the **deamination profile** and **read length distribution** of the aligned reads in order to show that they demonstrate a damage pattern and are sufficiently fragmented, which would be good evidence of the ancient origin of the detected organisms.
The deamination profile of damaged DNA demonstrates an enrichment of C-to-T substitutions at the ends of the reads compared to all other single-nucleotide substitutions. There are several tools for computing the deamination profile, but perhaps the most popular is [mapDamage](https://academic.oup.com/bioinformatics/article/29/13/1682/184965). The tool can be run using the following command line, still in the `authentication/bowtie2/` directory:
```bash
mapDamage -i Y.pestis_sample10.sorted.bam -r NC_017168.1.fasta -d mapDamage_results/ --merge-reference-sequences --no-stats
```
![](assets/images/chapters/authentication/deamination.png)
mapDamage delivers a number of useful statistics; among others, the read length distribution can be checked. A typical mode of ancient DNA read lengths lies within the range of 30-70 base pairs, which is good evidence of DNA fragmentation. Reads longer than 100 base pairs are more likely to originate from modern contamination.
![](assets/images/chapters/authentication/read_length.png)
Another useful tool that can be applied to assess how damaged the DNA is, is [PMDtools](https://github.com/pontussk/PMDtools), which uses a maximum-likelihood probabilistic model to calculate an ancientness score, the **PMD score**, for each read. The ability of PMDtools to infer ancient status at single-read resolution is quite unique and differs from mapDamage, which can only assess deamination based on a collection of reads. PMD scores can be computed using the following command line; please note that Python 2 is needed for this purpose.
```bash
samtools view -h Y.pestis_sample10.bam | pmdtools --printDS > PMDscores.txt
```
The distribution of PMD scores can be visualised via a histogram in R as follows.
Load R.
```bash
R
```
Then generate the histogram.
```{r, eval = F}
pmd_scores <- read.delim("PMDscores.txt", header = FALSE, sep = "\t")
hist(pmd_scores$V4, breaks = 1000, xlab = "PMDscores")
```
Once finished examining the plot you can quit R
```bash
## Press 'n' when asked if you want to save your workspace image.
quit()
```
![](assets/images/chapters/authentication/pmd_scores.png)
Typically, reads with PMD scores greater than 3 are considered to be reliably ancient, i.e. damaged, and can be extracted for a closer look. PMDtools is therefore great for separating ancient reads from modern contaminant reads.
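For example, the putatively damaged reads can be written to a new BAM file by filtering on a PMD score threshold. This is a sketch only, assuming the `--threshold` and `--header` options of PMDtools:
```bash
## keep only reads with PMD score >= 3 and write them to a new BAM file
samtools view -h Y.pestis_sample10.bam \
  | pmdtools --threshold 3 --header \
  | samtools view -Sb - > Y.pestis_sample10.pmds3.bam
```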
Like mapDamage, PMDtools can also compute a deamination profile. However, the advantage of PMDtools is that it can compute a deamination profile for UDG / USER-treated samples (with the flag *--CpG*). For this purpose, PMDtools uses only CpG sites, which escape the treatment, so deamination is not removed completely and there is a chance to authenticate treated samples. Computing the deamination pattern with PMDtools can be achieved with the following command line (please note that the scripts *pmdtools.0.60.py* and *plotPMD.v2.R* can be downloaded from the GitHub repository at [https://github.com/pontussk/PMDtools](https://github.com/pontussk/PMDtools)):
```bash
samtools view Y.pestis_sample10.bam | pmdtools --platypus > PMD_temp.txt
```
We can then run simple R commands directly from the terminal (without loading R itself) with the following.
```bash
R CMD BATCH plotPMD
```
![](assets/images/chapters/authentication/PMD_Skoglund_et_al_2015_Current_Biology.png)
When performing ancient status analysis on **de-novo** assembled contigs, it can be computationally challenging and time consuming to run mapDamage or PMDtools on all of them, as there can be hundreds of thousands of contigs. In addition, the outputs from mapDamage and PMDtools lack a clear numeric quantity or a statistical test that could provide an "ancient vs. non-ancient" decision for each **de-novo** assembled contig. To address these limitations, the [pyDamage](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8323603/pdf/peerj-09-11845.pdf) tool was recently developed. pyDamage evaluates the amount of aDNA damage and tests whether a model assuming the presence of aDNA damage explains the data better than a null model.
![](assets/images/chapters/authentication/pyDamage.png)
pyDamage can be run on sorted BAM alignments of the microbial reads to the **de-novo** assembled contigs using the following command line:
::: {.callout-warning title="Example command - do not run!"}
```bash
pydamage analyze -w 30 -p 14 filtered.sorted.bam
```
:::
## Authentication with taxonomic information
Different tools can be utilised to assess the level of deamination at the strand termini and the most popular are mapDamage2.0 ([https://academic.oup.com/bioinformatics/article/29/13/1682/184965](https://academic.oup.com/bioinformatics/article/29/13/1682/184965)) and PMDtools ([https://github.com/pontussk/PMDtools](https://github.com/pontussk/PMDtools)).
However, applying these to a metagenomic dataset can be computationally demanding and time-consuming due to the tens of thousands of different taxonomic entities it can include.
**metaDMG** is currently under development and it represents a fast, flexible, and efficient tool for performing **taxonomic profiling** (with the integrated ngsLCA algorithm) and quantifying **post-mortem DNA damage**.
It is specially optimised for ancient metagenomic datasets where raw fastq files have been mapped against large sets of reference genomes.
metaDMG should be run on a read coordinate sorted BAM/SAM-alignment file and can calculate the degree of damage from read data mapped against single and multiple genomes by analysing mismatches and deletions contained in the MD:Z tag of the input alignment file.
This reference-free approach allows for a faster processing of the deamination patterns (@fig-authentication-fig1).
![Damage plots of reads assigned to Homo sapiens with post-mortem damage pattern increasing at read termini [from metagenomic samples, adapted from [@Michelsen2022]]. The figure shows the damage rate f (x) = k(x)∕N(x) as a function of position x for both forward (C→T) and reverse (G→A).](assets/images/chapters/authentication/Fig_9_deamination.png){#fig-authentication-fig1}
In particular, three different settings can be run:
1. Single genome analysis with one overall global estimate of the damage patterns. Similar to mapDamage2.0 ([https://academic.oup.com/bioinformatics/article/29/13/1682/184965](https://academic.oup.com/bioinformatics/article/29/13/1682/184965)).
```bash
## Example, do not run
metaDMG-cpp getdamage --run_mode 0
```
2. Metagenomic (e.g. multiple genome alignments) analyses. This mode provides a damage estimate per reference, taxonomic name or accession number, including all alignments, without any taxonomic classification.
```bash
## Example, do not run
metaDMG-cpp getdamage --run_mode 1
```
3. Metagenomic analyses with the integration of the taxonomy: Least Common Ancestor algorithm (`ngsLCA`). This allows the computation of damage estimates for alignments classified to given taxonomic levels.
```bash
## Example, do not run
metaDMG-cpp lca
```
In this section we will utilise `metaDMG-cpp lca` since we are interested in a more comprehensive analysis that includes the taxonomic classification of our alignments.
For those interested in exploring other functionalities of metaDMG, I encourage you to visit the tool’s official GitHub ([https://github.com/metaDMG-dev/metaDMG-cpp](https://github.com/metaDMG-dev/metaDMG-cpp)) and the ngsLCA official GitHub ([https://github.com/miwipe/ngsLCA](https://github.com/miwipe/ngsLCA)).
### Taxonomic profiling: metaDMG-cpp lca
The `metaDMG-cpp lca` function is based on the `ngsLCA` (next-generation sequence Lowest Common Ancestor) algorithm to collect mismatch information for all reads and generate a taxonomic profile [@Wang2022-fr].
It counts substitutions between read and reference on internal nodes within a taxonomy (e.g. species, genus and family level).
It is built upon the NCBI taxonomy ([https://www.ncbi.nlm.nih.gov/taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy)) and requires three files:
- **nodes.dmp**: taxonomic nodes and the relationships between different taxa in the tree
- **access2taxID**: the "taxonomy file" linking sequence accession numbers to their corresponding taxonomic IDs.
- **names.dmp**: the scientific names associated with each taxon contained in the nodes.dmp file.
::: {.callout-tip}
For custom reference genomes not covered by NCBI, their accession IDs and the corresponding NCBI taxonomic IDs need to be manually attached to the NCBI access2taxID file.
:::
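For example, a single custom reference could be appended as follows. This is a sketch with a hypothetical accession `MYREF01` and taxonomic ID `1234567`, assuming the standard four-column accession2taxid layout (accession, accession.version, taxID, GI):
```bash
## decompress, append the custom mapping, and re-compress the taxonomy file
zcat small_accession2taxid.txt.gz > custom_accession2taxid.txt
printf 'MYREF01\tMYREF01.1\t1234567\t0\n' >> custom_accession2taxid.txt
gzip custom_accession2taxid.txt
```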
The `ngsLCA` program considers a chosen similarity interval between each read and its reference in the generated bam/sam file.
The similarity can be set as an edit distance `[-editdist[min/max]]`, i.e., number of mismatches between the read to reference genome, or as a similarity distance `[-simscore[low/high]]`, i.e., percentage of mismatches between the read to reference genome.
The main files produced by this command have the extensions `.bdamage.gz` and `lca.gz`.
The first consists of a nucleotide misincorporation matrix (also called **mismatch matrix**) which represents the nucleotide substitution counts across the reads (@tbl-authentication-examplecodetable2).
The lca file reports the sequence analysed and its taxonomic path, as well as other statistics (gc content, fragment length).
We report an example of the bdamage.gz file output printed using the command `metaDMG-cpp print`:
```bash
## Example, do not run
metaDMG-cpp print file.bdamage.gz -names names.dmp > bdamage_table.tsv
```
```{r}
#| label: tbl-authentication-examplecodetable2
#| echo: false
#| results: 'asis'
#| warning: false
#| tbl-cap: "Table 1. Example of the bdamage.gz mismatch matrix table for beech (Fagus sylvatica) of sample VM-14 provided in the exercise."
#| tbl-cap-location: top
library(tidyverse)
library(gt)
# Load the data from CSV
data <- read_csv("assets/images/chapters/authentication/bdamage_example.csv")
# Create the table with gt
data %>%
gt() %>%
tab_header(
title = md("Table 1. Example of the bdamage.gz mismatch matrix table for beech (_Fagus sylvatica_) of sample VM-14 provided in the exercise.")
) %>%
tab_options(
table.width = pct(100),
table.layout = "fixed",
container.overflow.x = "scroll"
)
```
### Deamination patterns
**metaDMG** can perform a numerical optimisation of the deamination frequencies (C→T, G→A) using the binomial or beta-binomial likelihood models, where the latter can deal with a large amount of variance (overdispersion).
This function is called the `metaDMG-cpp dfit` or damage estimates function.
```bash
## Example, do not run
metaDMG-cpp dfit
```
`metaDMG-cpp dfit` allows us to estimate the four fit parameters of the damage model (@fig-authentication-fig2):
- **A**: the amplitude of damage on position one;
- **q**: the constant deamination background;
- **c**: relative decrease of damage per position;
- **ϕ**: the uncertainty of the likelihood model used (binomial or beta-binomial).
Another important parameter is the **Zfit** or significance, which represents the number of standard deviations ("sigmas") away from zero, or in other words, the certainty of the damage being positive.
![The damage model. The figure shows the misincorporations as circles and the damage as a solid line. The fit parameters are reported: A,c,q,ϕ. The uncertainty for a binomial model is in dark grey, while the uncertainty for a beta-binomial model is in light grey [from @Michelsen2022]](assets/images/chapters/authentication/F1._MichelsenB.png){#fig-authentication-fig2}
Accurate damage estimates are crucial for the authentication of the metagenomic dataset.
The _number of reads_ and the _significance (Zfit)_ are additional parameters that can impact the accuracy and reliability of ancient DNA analysis.
Tests on a single-genome have shown how the accuracy of metaDMG dfit increases depending on the number of reads (@fig-authentication-fig3).
![The single-genome simulations on the *Homo sapiens* genome with a mean fragment length of 60 and 10% damage. The known damage (Dknown) is shown as a dashed line. A. Estimated damage and its standard deviation (error bars) for 20 replicates (100 reads each). B. Average damage as a function of the number of reads (with the average of the standard deviations as errors) [Adapted from @Michelsen2022].](assets/images/chapters/authentication/F3_Michelsen.png){#fig-authentication-fig3}
The simulation on the *Homo sapiens* genome, illustrated in @fig-authentication-fig3, demonstrates the individual metaDMG damage estimates for 20 selected replications (iterations 60 to 79).
When damage estimates are minimal, the distribution of Dfit is limited to positive values.
This limitation can result in error bars extending into negative damage values, thus producing unrealistic estimates.
As shown in @fig-authentication-fig3, the damage tends to converge towards the known values as the number of reads increases.
In addition, the simulation reports a relationship between the amount of damage in taxa and the number of reads (@fig-authentication-fig4).
As shown in @fig-authentication-fig4, low expected damage (~5 %) requires about 1000 reads to be 95% certain about its estimation, while higher levels of damage (~15-30%) require fewer reads (100-500) to reach the same level of certainty. When increasing the significance threshold (Zfit) more reads are also required.
![Relationship between the damage and the number of reads for the simulated single-genome data. The lines represent the number of reads needed to correctly infer the amount of damage: the solid line shows a certainty of 95%, and the dashed line, a certainty of 50%. Colors represent the significance (Zfit) cuts at 2 (blue), 3 (red), and 4 (green). [from @Michelsen2022]](assets/images/chapters/authentication/F4_Michelsen.png){#fig-authentication-fig4}
Simulations using metagenomic datasets have also evaluated the relationship between the amount of damage and its significance (@fig-authentication-fig5).
![The amount of damage as a function of significance (metagenomic simulations). A. damage of the ancient taxa; B. damage of the non-ancient taxa. A larger dot size indicates a higher number of reads. [Adapted from @Michelsen2022]](assets/images/chapters/authentication/F5._Michelsen.png){#fig-authentication-fig5}
As shown in @fig-authentication-fig5, there is a difference in the damage estimates between the ancient and the non-ancient taxa of simulated metagenomic datasets.
The non-ancient taxa (@fig-authentication-fig5) report significance values below 2, in contrast to the ancient taxa (@fig-authentication-fig5) which reach a significance of 20.
A relaxed significance threshold (Zfit > 2) and a minimum of 100 reads increase the accuracy of damage estimates to 90% of the dataset [@Michelsen2022].
We also observe that the oldest samples (Cave-100 and Cave-102), which are 100 and 102 thousand years BP, show the highest amount of damage of all the metagenomes, while the Pitch-6 and Cave-22 samples, which are 6 and 22 thousand years old and thus younger, show similar levels of damage to each other.
### Ancient metagenomic dataset
In this section, we will use 6 metagenomic libraries, downsampled to eukaryotic reads, from the study by [@Zampirolo2023.12.01.569562] (@fig-authentication-fig6).
The libraries originate from sediment samples of the Velký Mamut'ák rock shelter, located in Northern Bohemia (Czech Republic), and cover the period from the Late Neolithic (~6100-5300 cal. BP) to more recent times (800 cal. BP).
![Screenshot of preprint of the source dataset by [@Zampirolo2023.12.01.569562]](assets/images/chapters/authentication/BioxRiv_paper.png){#fig-authentication-fig6}
### Ancient metagenomics with metaDMG-cpp: the workflow
This section will cover the metaDMG analysis, which involves taxonomic classification of the reads starting from sorted alignment files, damage estimation, and compilation of the final metaDMG output.
To begin, the alignment files used as input to `metaDMG` for this exercise are stored in `metadmg/`.
We also need the taxonomy files, which are in the folder `metadmg/small_taxonomy/`; these include `names.dmp`, `nodes.dmp` and `small_accession2taxid.txt.gz`.
::: {.callout-warning}
**metaDMG** is currently under development and it is therefore important to keep it updated.
The best documentation is currently found via the `--help` option.
:::
We need to activate a dedicated environment for `metaDMG`, as it is still under development. We can deactivate the current one with
```bash
conda deactivate
```
And we will work with metaDMG by activating the environment with the following command.
```bash
conda activate metaDMG
```
:::{.callout-warning}
Before you continue, make sure you're within the `authentication/` directory!
:::
First we will run `metaDMG-cpp lca` to get the mismatch matrix file `bdamage.gz` that we need for the damage estimation (`dfit`).
```bash
metaDMG-cpp lca --names metadmg/small_taxonomy/names.dmp --nodes metadmg/small_taxonomy/nodes.dmp --acc2tax metadmg/small_taxonomy/small_accession2taxid.txt.gz --sim_score_low 0.95 --sim_score_high 1.0 --how_many 30 --weight_type 1 --threads 12 --bam metadmg/VM-3_800.merged.sort.bam --out_prefix metadmg/VM-3_800.merged.bam
metaDMG-cpp lca --names metadmg/small_taxonomy/names.dmp --nodes metadmg/small_taxonomy/nodes.dmp --acc2tax metadmg/small_taxonomy/small_accession2taxid.txt.gz --sim_score_low 0.95 --sim_score_high 1.0 --how_many 30 --weight_type 1 --threads 12 --bam metadmg/VM-11_3000.merged.sort.bam --out_prefix metadmg/VM-11_3000.merged.bam
metaDMG-cpp lca --names metadmg/small_taxonomy/names.dmp --nodes metadmg/small_taxonomy/nodes.dmp --acc2tax metadmg/small_taxonomy/small_accession2taxid.txt.gz --sim_score_low 0.95 --sim_score_high 1.0 --how_many 30 --weight_type 1 --threads 12 --bam metadmg/VM-14_3900.merged.sort.bam --out_prefix metadmg/VM-14_3900.merged.bam
metaDMG-cpp lca --names metadmg/small_taxonomy/names.dmp --nodes metadmg/small_taxonomy/nodes.dmp --acc2tax metadmg/small_taxonomy/small_accession2taxid.txt.gz --sim_score_low 0.95 --sim_score_high 1.0 --how_many 30 --weight_type 1 --threads 12 --bam metadmg/VM-15_4100.merged.sort.bam --out_prefix metadmg/VM-15_4100.merged.bam
metaDMG-cpp lca --names metadmg/small_taxonomy/names.dmp --nodes metadmg/small_taxonomy/nodes.dmp --acc2tax metadmg/small_taxonomy/small_accession2taxid.txt.gz --sim_score_low 0.95 --sim_score_high 1.0 --how_many 30 --weight_type 1 --threads 12 --bam metadmg/VM-17_5300.merged.sort.bam --out_prefix metadmg/VM-17_5300.merged.bam
metaDMG-cpp lca --names metadmg/small_taxonomy/names.dmp --nodes metadmg/small_taxonomy/nodes.dmp --acc2tax metadmg/small_taxonomy/small_accession2taxid.txt.gz --sim_score_low 0.95 --sim_score_high 1.0 --how_many 30 --weight_type 1 --threads 12 --bam metadmg/VM-19_6100.merged.sort.bam --out_prefix metadmg/VM-19_6100.merged.bam
```
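Since the six commands differ only in the sample name, the same step could equally be written as a loop over the BAM files. A sketch only, assuming the file naming used above (the same pattern applies to the `dfit` and `aggregate` steps below):
```bash
for bam in metadmg/VM-*.merged.sort.bam
do
  prefix=${bam%.sort.bam}.bam   ## e.g. metadmg/VM-3_800.merged.bam
  metaDMG-cpp lca --names metadmg/small_taxonomy/names.dmp \
    --nodes metadmg/small_taxonomy/nodes.dmp \
    --acc2tax metadmg/small_taxonomy/small_accession2taxid.txt.gz \
    --sim_score_low 0.95 --sim_score_high 1.0 --how_many 30 \
    --weight_type 1 --threads 12 --bam "$bam" --out_prefix "$prefix"
done
```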
We use the file generated by the previous command (`bdamage.gz`), containing the misincorporation matrix, to calculate the deamination pattern.
We use the `metaDMG-cpp dfit` function to obtain a quick computation of a beta-binomial model.
```bash
metaDMG-cpp dfit metadmg/VM-3_800.merged.bam.bdamage.gz --names metadmg/small_taxonomy/names.dmp --nodes metadmg/small_taxonomy/nodes.dmp --showfits 2 --lib ds --out metadmg/VM-3_800.merged.bam
metaDMG-cpp dfit metadmg/VM-11_3000.merged.bam.bdamage.gz --names metadmg/small_taxonomy/names.dmp --nodes metadmg/small_taxonomy/nodes.dmp --showfits 2 --lib ds --out metadmg/VM-11_3000.merged.bam
metaDMG-cpp dfit metadmg/VM-14_3900.merged.bam.bdamage.gz --names metadmg/small_taxonomy/names.dmp --nodes metadmg/small_taxonomy/nodes.dmp --showfits 2 --lib ds --out metadmg/VM-14_3900.merged.bam
metaDMG-cpp dfit metadmg/VM-15_4100.merged.bam.bdamage.gz --names metadmg/small_taxonomy/names.dmp --nodes metadmg/small_taxonomy/nodes.dmp --showfits 2 --lib ds --out metadmg/VM-15_4100.merged.bam
metaDMG-cpp dfit metadmg/VM-17_5300.merged.bam.bdamage.gz --names metadmg/small_taxonomy/names.dmp --nodes metadmg/small_taxonomy/nodes.dmp --showfits 2 --lib ds --out metadmg/VM-17_5300.merged.bam
metaDMG-cpp dfit metadmg/VM-19_6100.merged.bam.bdamage.gz --names metadmg/small_taxonomy/names.dmp --nodes metadmg/small_taxonomy/nodes.dmp --showfits 2 --lib ds --out metadmg/VM-19_6100.merged.bam
```
::: {.callout-warning title="Example only - do not run!"}
The `metaDMG-cpp dfit` function also allows for the computation of a binomial model, which includes additional statistics (such as bootstrap estimated parameters).
For the exercise, we only run the quick statistics, but we provide an example command for the full statistics:
```bash
metaDMG-cpp dfit metadmg/VM-11_3000.merged.bam.bdamage.gz --names metadmg/small_taxonomy/names.dmp --nodes metadmg/small_taxonomy/nodes.dmp --showfits 2 --nopt 10 --nbootstrap 20 --doboot 1 --seed 1234 --lib ds --out metadmg/VM-11_3000.merged.bam
```
:::
We run the `metaDMG-cpp aggregate` function to merge the statistics from the previous two steps and obtain a file for each sample.
```bash
metaDMG-cpp aggregate metadmg/VM-3_800.merged.bam.bdamage.gz -lcastat metadmg/VM-3_800.merged.bam.stat.gz --names metadmg/small_taxonomy/names.dmp --nodes metadmg/small_taxonomy/nodes.dmp --dfit metadmg/VM-3_800.merged.bam.dfit.gz --out metadmg/VM-3_aggregated_results
metaDMG-cpp aggregate metadmg/VM-11_3000.merged.bam.bdamage.gz -lcastat metadmg/VM-11_3000.merged.bam.stat.gz --names metadmg/small_taxonomy/names.dmp --nodes metadmg/small_taxonomy/nodes.dmp --dfit metadmg/VM-11_3000.merged.bam.dfit.gz --out metadmg/VM-11_aggregated_results
metaDMG-cpp aggregate metadmg/VM-14_3900.merged.bam.bdamage.gz -lcastat metadmg/VM-14_3900.merged.bam.stat.gz --names metadmg/small_taxonomy/names.dmp --nodes metadmg/small_taxonomy/nodes.dmp --dfit metadmg/VM-14_3900.merged.bam.dfit.gz --out metadmg/VM-14_aggregated_results
metaDMG-cpp aggregate metadmg/VM-15_4100.merged.bam.bdamage.gz -lcastat metadmg/VM-15_4100.merged.bam.stat.gz --names metadmg/small_taxonomy/names.dmp --nodes metadmg/small_taxonomy/nodes.dmp --dfit metadmg/VM-15_4100.merged.bam.dfit.gz --out metadmg/VM-15_aggregated_results
metaDMG-cpp aggregate metadmg/VM-17_5300.merged.bam.bdamage.gz -lcastat metadmg/VM-17_5300.merged.bam.stat.gz --names metadmg/small_taxonomy/names.dmp --nodes metadmg/small_taxonomy/nodes.dmp --dfit metadmg/VM-17_5300.merged.bam.dfit.gz --out metadmg/VM-17_aggregated_results
metaDMG-cpp aggregate metadmg/VM-19_6100.merged.bam.bdamage.gz -lcastat metadmg/VM-19_6100.merged.bam.stat.gz --names metadmg/small_taxonomy/names.dmp --nodes metadmg/small_taxonomy/nodes.dmp --dfit metadmg/VM-19_6100.merged.bam.dfit.gz --out metadmg/VM-19_aggregated_results
```
In the last step we merge the per-sample outputs, together with their filenames and a single header, into one tab-separated file (TSV).
We first change into the `metadmg/` results directory, and unzip the output files
```bash
cd metadmg
gunzip *_aggregated_results.stat.gz
```
Then we extract the header and concatenate the content of all the output files into a single TSV file.
```bash
#Define header for final output table
header_file="VM-11_aggregated_results.stat"
# Get the header
header=$(head -n 1 "$header_file")
# Define the output file
output_file="concatenated_metaDMGfinal.tsv"
# Add the header to the concatenated file
echo -e "filename\t$header" > "$output_file"
for file in VM-11_aggregated_results.stat \
VM-14_aggregated_results.stat \
VM-15_aggregated_results.stat \
VM-17_aggregated_results.stat \
VM-19_aggregated_results.stat \
VM-3_aggregated_results.stat
do
tail -n +2 "$file" | while read -r line; do
echo -e "$file\t$line" >> "$output_file"
done
done
```
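The same table could also be built in a single pass with awk. A sketch only, assuming each `.stat` file is a tab-separated table with one header line (as the loop above does):
```bash
## keep the header from the first file only, and prepend the source filename to every data row
awk 'BEGIN { OFS = "\t" }
     FNR == 1 { if (NR == 1) print "filename", $0; next }
     { print FILENAME, $0 }' VM-*_aggregated_results.stat > concatenated_metaDMGfinal.tsv
```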
### Investigating the final output with R
We first inspect the metaDMG output manually by navigating to the folder `metadmg/` with your file browser and clicking on "Open folder".
Open the TSV file `concatenated_metaDMGfinal.tsv` in a spreadsheet program and inspect the file.
We will now investigate the TSV table produced by metaDMG to authenticate damage patterns, visualise the relationship between the damage and the significance, and the degree of damage through depth and time.
The R packages for this exercise are located in our original conda environment `authentication`.
While still in the `authentication/metadmg/` folder, we deactivate the current conda environment and re-activate the environment `authentication`.
```bash
conda deactivate
conda activate authentication
```
We load R by running `R` in your terminal
```bash
R
```
We load the libraries
```{r eval=F}
library(tidyr)
library(dplyr)
library(forcats)
library(scales)
library(gridExtra)
library(ggplot2)
library(purrr)
library(ggpubr)
```
#### Deamination patterns
We run the damage plot to visualise the deamination patterns along the forward and reverse strands, and we save the results for each taxon detected in the samples.
We will use the function `get_dmg_decay_fit` to visualise the damage patterns (@fig-authentication-fagusovisdmg).
The function is saved in `metadmg/script/`, so we only need to run the following command to load it:
```{r eval=F}
source("script/get_dmg_decay_fit.R")
```
But if you are curious and want to know how it works, here is the function itself:
```{r eval=F}
get_dmg_decay_fit <- function(df, orient = "fwd", pos = 30, p_breaks = c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7), y_max = 0.7, y_min = -0.01) {
df_dx_fwd <- df %>%
select(taxid, name, label, starts_with("fwdx")) %>%
select(-starts_with("fwdxConf")) %>%
pivot_longer(names_to = "type", values_to = "Dx_fwd", c(-taxid, -name, -label)) %>%
mutate(x = gsub("fwdx", "", type)) %>%
select(-type)
df_dx_rev <- df %>%
select(taxid, name, label, starts_with("bwdx")) %>%
select(-starts_with("bwdxConf")) %>%
pivot_longer(names_to = "type", values_to = "Dx_rev", c(-taxid, -name, -label)) %>%
mutate(x = gsub("bwdx", "", type)) %>%
select(-type)
df_dx_std_fwd <- df %>%
select(taxid, name, label, starts_with("fwdxConf")) %>%
pivot_longer(names_to = "type", values_to = "Dx_std_fwd", c(-taxid, -name, -label)) %>%
mutate(x = gsub("fwdxConf", "", type)) %>%
select(-type)
df_dx_std_rev <- df %>%
select(taxid, name, label, starts_with("bwdxConf")) %>%
pivot_longer(names_to = "type", values_to = "Dx_std_rev", c(-taxid, -name, -label)) %>%
mutate(x = gsub("bwdxConf", "", type)) %>%
select(-type)
df_fit_fwd <- df %>%
select(taxid, name, label, starts_with("fwf")) %>%
pivot_longer(names_to = "type", values_to = "f_fwd", c(-taxid, -name, -label)) %>%
mutate(x = gsub("fwf", "", type)) %>%
select(-type)
df_fit_rev <- df %>%
select(taxid, name, label, starts_with("bwf")) %>%
pivot_longer(names_to = "type", values_to = "f_rev", c(-taxid, -name, -label)) %>%
mutate(x = gsub("bwf", "", type)) %>%
select(-type)
dat <- df_dx_fwd %>%
inner_join(df_dx_rev, by = c("taxid", "name", "label", "x")) %>%
inner_join(df_dx_std_fwd, by = c("taxid", "name", "label", "x")) %>%
inner_join(df_dx_std_rev, by = c("taxid", "name", "label", "x")) %>%
inner_join(df_fit_fwd, by = c("taxid", "name", "label", "x")) %>%
inner_join(df_fit_rev, by = c("taxid", "name", "label", "x")) %>%
mutate(x = as.numeric(x)) %>%
filter(x <= pos) %>%
rowwise() %>%
mutate(Dx_fwd_min = Dx_fwd - Dx_std_fwd,
Dx_fwd_max = Dx_fwd + Dx_std_fwd,
Dx_rev_min = Dx_rev - Dx_std_rev,
Dx_rev_max = Dx_rev + Dx_std_rev)
fwd_max <- dat %>%
group_by(as.character(x)) %>%
summarise(val = mean(Dx_std_fwd) + sd(Dx_std_fwd)) %>%
pull(val) %>%
max()
fwd_min <- dat %>%
group_by(as.character(x)) %>%
summarise(val = mean(Dx_std_fwd) - sd(Dx_std_fwd)) %>%
pull(val) %>%
min()
rev_max <- dat %>%
group_by(as.character(x)) %>%
summarise(val = mean(Dx_std_rev) + sd(Dx_std_rev)) %>%
pull(val) %>%
max()
rev_min <- dat %>%
group_by(as.character(x)) %>%
summarise(val = mean(Dx_std_rev) - sd(Dx_std_rev)) %>%
pull(val) %>%
min()
if (orient == "fwd") {
ggplot() +
geom_ribbon(data = dat, aes(x, ymin = Dx_fwd_min, ymax = Dx_fwd_max, group = interaction(name, taxid)), alpha = 0.6, fill = "darkcyan") +
geom_line(data = dat, aes(x, Dx_fwd, group = interaction(name, taxid)), color = "black") +
geom_point(data = dat, aes(x, f_fwd), alpha = .50, size = 2, fill = "black") +
theme_test() +
xlab("Position") +
ylab("Frequency") +
scale_y_continuous(limits = c(y_min, y_max), breaks = p_breaks) +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
facet_wrap(~label, ncol = 1)
} else {
ggplot() +
geom_ribbon(data = dat, aes(x, ymin = Dx_rev_min, ymax = Dx_rev_max, group = interaction(name, taxid)), alpha = 0.6, fill = "orange") +
geom_path(data = dat, aes(x, Dx_rev, group = interaction(name, taxid)), color = "black") +
geom_point(data = dat, aes(x, f_rev), alpha = .50, size = 2, fill = "black") +
theme_test() +
xlab("Position") +
ylab("Frequency") +
scale_x_continuous(trans = "reverse") +
scale_y_continuous(limits = c(y_min, y_max), position = "right", breaks = p_breaks) +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
facet_wrap(~label, ncol = 1)
}
}
```
We load our metaDMG output data (TSV file) and the metadata with information on the age of each sample. We then generate the damage plots, as seen in @fig-authentication-fagusovisdmg, using the function `get_dmg_decay_fit`.
```{r eval=F}
df <- read.csv("concatenated_metaDMGfinal.tsv", sep = "\t")
#Rename sample column
colnames(df)[colnames(df) == 'filename'] <- 'sample'
#Modify sample name with short names
df$sample[df$sample == "VM-11_aggregated_results.stat"] <- "VM-11"
df$sample[df$sample == "VM-14_aggregated_results.stat"] <- "VM-14"
df$sample[df$sample == "VM-15_aggregated_results.stat"] <- "VM-15"
df$sample[df$sample == "VM-17_aggregated_results.stat"] <- "VM-17"
df$sample[df$sample == "VM-19_aggregated_results.stat"] <- "VM-19"
df$sample[df$sample == "VM-3_aggregated_results.stat"] <- "VM-3"
#Import the metadata with dates BP
depth_data <- read.csv ("figures/depth_data.csv", header = TRUE)
View (depth_data)
#Merge context_data and depth_data with dataframe (adding new column for dates BP)
df$new <- depth_data$Date_BP[match(df$sample, depth_data$Sample_ID)]
names(df)[names(df) == 'new'] <- 'Date_BP'
# Convert Date_BP columns to factors (categorical variable)
df$Date_BP <- as.factor(df$Date_BP)
#Setting filtering thresholds for ancient reads
minDMG = 0.02 # filter criterion, plot only taxa above this value
zfit = 2 # minimum significance; the higher the better, 2 means we estimate the damage with 95% confidence
MinLength = 35 # minimum mean read length; a hard length filter was set during trimming, but we also want a mean read length of 35 or higher
reads = 200 # the number of reads required depends on the amount of damage and the significance
#Subsetting only animals and plants, at the genus level, number of reads > 200.
dt1 <- df %>% filter(A > minDMG, nreads >= reads, mean_rlen >= MinLength, Zfit > zfit, grepl("\\bgenus\\b", rank), !grepl("Bacteria",taxa_path))
#deamination plot with facet wrap per each taxon in a sample
tax_g_list <- unique(dt1$name)
nrank <- "rank" # Replace with the actual rank column name
X <- tax_g_list
purrr::map(tax_g_list, function(X, nrank) {
sel_tax <- dt1 %>%
rename(label = sample) %>%
filter(name == X) %>%
filter(rank == rank) %>%
select(name, label) %>%
distinct() %>%
arrange(name)
if (nrow(sel_tax) > 0) {
n_readsa <- dt1 %>%
inner_join(sel_tax) %>%
filter(rank == rank) %>%
pull(nreads) %>%
sum()
ggpubr::ggarrange(plotlist = list(
get_dmg_decay_fit(df = dt1 %>% rename(label = sample) %>% inner_join(sel_tax) %>% filter(rank == rank), orient = "fwd", y_max = 0.70) +
ggtitle(paste0(X, " nreads=", n_readsa, " Forward")),
get_dmg_decay_fit(df = dt1 %>% rename(label = sample) %>% inner_join(sel_tax) %>% filter(rank == rank), orient = "rev", y_max = 0.70) +
ggtitle(paste0(X, " nreads=", n_readsa, " Reverse"))
), align = "hv")
ggsave(paste0("figures/", X, "-dmg.pdf"), plot = last_plot(), width = 8, height = 4)
}
})
```
![Deamination patterns for sheep (*Ovis*) and beech (*Fagus*) reads.](assets/images/chapters/authentication/Fagus_Ovis-dmg.png){#fig-authentication-fagusovisdmg}
### Amplitude of damage vs Significance
We provide an R script to investigate the main statistics.
Here we visualise the amplitude of damage (A) and its significance (Zfit), for the full dataset but filtering it to a minimum of 100 reads and at the genus level (@fig-authentication-fig8).
```{r eval=F}
#Subset dataset animal and plants at the genus level
dt2 <- df %>% filter(nreads > 100, grepl("\\bgenus\\b", rank), grepl("Metazoa", taxa_path) | grepl("Viridiplant", taxa_path))
#Adding factor column for Kingdom
dt2 <- dt2 %>%
mutate(Kingdom = # creating our new column
case_when(grepl("Viridiplant", taxa_path) ~ "Viridiplantae",
grepl("Metazoa",taxa_path) ~ "Metazoa"))
#Plotting amplitude of damage vs its significance and saving as pdf file
pdf(file = "figures/p1.pdf", width = 8, height = 6)
ggplot(dt2, aes(y=A, x=Zfit)) +
geom_point(aes(size=nreads, col=Kingdom)) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust =1)) +
scale_color_manual(values = c("#8B1A1A", "#458B00"))+
scale_size_continuous(labels = function(x) format(x, scientific = FALSE)) +
xlab("significance") + ylab("damage") + theme_minimal()
dev.off()
```
![Amplitude of damage (A) vs significance (Zfit) for animals and plants.](assets/images/chapters/authentication/p2.png){#fig-authentication-fig8}
### Amplitude of damage and mean fragment length through time
Here we visualise the amplitude of damage (A) and the mean length of the fragments (mean_rlen) by date (BP) for the filtered dataset with a minimum of 100 reads and at the genus level (@fig-authentication-fig9).
```{r eval=F}
#Plotting damage (A) by period (dates BP)
p2a<- dt2 %>%
mutate(Date_BP = fct_relevel(Date_BP,
"6100","5300","4100","3900","3000", "800")) %>%
ggplot(aes(x=A, y=Date_BP))+
geom_boxplot(aes(x=A, y=Date_BP, fill = sample))+
geom_point(aes(fill = sample), size = 3, shape = 21, color = "black", stroke = .5) +
scale_x_continuous(limits = c(0, 0.20), breaks = seq(0, 0.20, by = 0.05)) +
theme_minimal() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
p2a
#Plotting mean length (mean_rlen) by period (dates BP)
p2b<- dt2 %>%
mutate(Date_BP = fct_relevel(Date_BP,
"6100","5300","4100","3900","3000", "800")) %>%
ggplot(aes(x=mean_rlen, y=Date_BP))+
geom_boxplot(aes(x=mean_rlen, y=Date_BP, fill = sample)) +
geom_point(aes(fill = sample), size = 3, shape = 21, color = "black", stroke = .5) +
scale_x_continuous(limits = c(30, 80), breaks = seq(30, 80, by = 10)) +
theme_minimal() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
p2b
#Combining the plots and saving as pdf file
pdf(file = "figures/p2.pdf", width = 8, height = 6)
p2 <- grid.arrange(p2a, p2b,
ncol = 2, nrow = 1)
dev.off()
```
![Amplitude of damage (A) and mean fragment length (mean_rlen) through time.](assets/images/chapters/authentication/p3.png){#fig-authentication-fig9}
::: {.callout-tip}
Once finished examining the plots you can quit R
```bash
## Press 'n' when asked if you want to save your workspace image.
quit()
```
:::
::: {.callout-tip}
You can manually navigate to the folder `metadmg/figures/`
And click "Open folder"
You can double-click on the pdf files to visualise them.
:::
## (Optional) clean-up
Let's clean up our working directory by removing all the data and output from this chapter.
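A minimal sketch, assuming you downloaded and unpacked the archive as described at the top of this chapter (double-check the paths before deleting anything):
```bash
conda deactivate
## run this from the directory *above* authentication/
rm -r authentication/ authentication.tar.gz
## optionally also remove the conda environment created for this chapter
conda env remove -n authentication
```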