-
Notifications
You must be signed in to change notification settings - Fork 10
/
taxonomic-profiling.qmd
1700 lines (1353 loc) · 85.4 KB
/
taxonomic-profiling.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: Taxonomic Profiling, OTU Tables and Visualisation
author: Maxime Borry
bibliography: assets/references/taxonomic-profiling.bib
---
::: {.callout-note collapse="true" title="Self guided: chapter environment setup"}
For this chapter's exercises, if not already performed, you will need to download the chapter's dataset, decompress the archive, and create and activate the conda environment.
Do this, use `wget` or right click and save to download this Zenodo archive: [10.5281/zenodo.13760277](https://doi.org/10.5281/zenodo.13760277), and unpack
```bash
tar xvf taxonomic-profiling.tar.gz
cd taxonomic-profiling/
```
You can then create the subsequently activate environment with
```bash
conda env create -f taxonomic-profiling.yml
conda activate taxonomic-profiling
```
:::
## Introduction
In this chapter, we're going to look at taxonomic profiling, or in other words, how to get the microbial composition of a sample from the DNA sequencing data.
Though there are many algorithms, and even more different tools available to perform taxonomic profiling, the general idea remains the same (@fig-taxonomicprofiling-overview).
After cleaning up the sequencing data, generally saved as FASTQ files, a **taxonomic profiler** is used to compare the sequenced DNA to a reference database of sequences from known organisms, in order to generate a taxonomic profile of all organisms identified in a sample (@fig-taxonomicprofiling-overview)
![General overview of a taxonomic profiling analysis workflow](assets/images/chapters/taxonomic-profiling/how_to_analyze.png){#fig-taxonomicprofiling-overview}
If you prefer text instead of pictograms, the workflow we're going to cover today is outlined in @fig-taxonomicprofiling-workflow, adapted from @sharptonIntroductionAnalysisShotgun2014
![A typical metagenomics analysis workflow, adapted from @sharptonIntroductionAnalysisShotgun2014](assets/images/chapters/taxonomic-profiling/analysis_workflow.png){#fig-taxonomicprofiling-workflow}
Because different organisms can possess the same DNA, especially when looking at shorter sequences, taxonomic profilers need to have a way to resolve the ambiguity in the taxonomic assignation (@fig-taxonomicprofiling-ambiguity).
![Different species can share the same DNA sequence](assets/images/chapters/taxonomic-profiling/ambiguity.png){#fig-taxonomicprofiling-ambiguity}
By leveraging an algorithm known as the **Lowest Common Ancestor (LCA)**, and the taxonomic tree of all known species, ambiguities are going to be resolved by assigning a higher, less precise, taxonomic rank to ambiguous matches (@fig-taxonomicprofiling-lca).
![A diagram of the LCA algorithm, a way to resolve these ambiguities](assets/images/chapters/taxonomic-profiling/lca.png){#fig-taxonomicprofiling-lca}
## Chapter Overview
Today, we're going to use the following tools:
- fastp [@chenFastpUltrafastAllinone2018] for sequencing data cleaning
- MetaPhlAn [@segata2012metagenomic], for taxonomic profiling
- Krona [@ondovInteractiveMetagenomicVisualization2011] and Pavian [@breitwieserPavianInteractiveAnalysis2016] for the interactive exploration of the taxonomic profiles
- curatedMetagenomicData [@pasolliAccessibleCuratedMetagenomic2017] for retrieving modern comparison data
- Python, pandas [@rebackPandasdevPandasPandas2022], plotnine ([https://plotnine.readthedocs.io/en/stable/](https://plotnine.readthedocs.io/en/stable/)), and scikit-bio [@scikit-bioScikitbioBioinformaticsLibrary2022] to perform exploratory data analysis and a bit of microbial ecology
to explore a toy dataset that has already been prepared for you.
::: {.callout-note collapse="true"}
## Preamble: what has been done to generate this toy dataset
### Download and Subsample
```python
import subprocess
import glob
from pathlib import Path
```
For this tutorial, we will be using the ERR5766177 library from the sample 2612 published by [@maixner2021hallstatt]
![](assets/images/chapters/taxonomic-profiling/1-s2.0-S0960982221012719-fx1.jpg)
![](assets/images/chapters/taxonomic-profiling/1-s2.0-S0960982221012719-gr1.jpg)`
### Subsampling the sequencing files to make the analysis quicker for this tutorial
This Python code defines a function called subsample that takes in a FASTQ file name, an output directory, and a depth value (defaulting to 1000000).
The function uses the seqtk command-line tool to subsample the input FASTQ file to the desired depth and saves the output to a new file in the specified output directory.
The function prints the constructed command string to the console for debugging purposes.
```python
def subsample(filename, outdir, depth=1000000):
basename = Path(filename).stem
cmd = f"seqtk sample -s42 {filename} {depth} > {outdir}/{basename}_subsample_{depth}.fastq"
print(cmd)
subprocess.check_output(cmd, shell=True)
```
This Python code uses a for loop to iterate over all the files in the `../data/raw/` directory that match the pattern `*`, and calls the `subsample` (defined above) function on each file in the directory `../data/subsampled`.
```python
for f in glob.glob("../data/raw/*"):
outdir = "../data/subsampled"
subsample(f, outdir)
```
::: {.callout-tip collapse="true" title='Expand to see command output'}
```{verbatim}
seqtk sample -s42 ../data/raw/ERR5766177_PE.mapped.hostremoved.fwd.fq.gz 1000000 >
../data/subsampled/ERR5766177_PE.mapped.hostremoved.fwd.fq_subsample_1000000.fastq
seqtk sample -s42 ../data/raw/ERR5766177_PE.mapped.hostremoved.rev.fq.gz 1000000 >
../data/subsampled/ERR5766177_PE.mapped.hostremoved.rev.fq_subsample_1000000.fastq
```
:::
Finally, we compress all files to gzip format
```bash
gzip -f ../data/subsampled/*.fastq
```
:::
## Working in a jupyter environment
This tutorial run-through is using a Jupyter Notebook ([https://jupyter.org](https://jupyter.org)) for writing & executing Python code and for annotating.
Jupyter notebooks are convenient and have two types of cells: _Markdown_ and _Code_.
The _markup cell_ syntax is very similar to _R markdown_.
The markdown cells are used for annotating, which is important for sharing code with collaborators, reproducibility, and documentation.
To load, please run the following command from within the chapter's directory.
```bash
cd notebooks/
jupyter notebook analysis.ipynb
```
You can then follow that notebook, which should mirror the contents of this chapter! Otherwise try making a new notebook within Jupyter `File > New > Notebook`!
::: {.callout-warning}
If you wish to run all commands manually (i.e., without the notebook), you must make sure you run all commands while within the `notebook` directory of this chapter.
:::
## Data pre-processing
Before starting to analyze our data, we will need to pre-process them to remove reads mapping to the host genome, here, _Homo sapiens_.
To do so, I've used the first steps of the nf-core/eager [@Fellows-Yates2020-iy] pipeline, more information of which can be found in the [Ancient Metagenomic Pipelines](ancient-metagenomic-pipelines.qmd) chapter.
I've already done some pre-processed the data, and the resulting cleaned files are available in the `data/eager_cleaned/`.
::: {.callout-note collapse="true"}
## Self-guided: onstructions for manual pre-preparation of data
If you wish to re-pre-prepare the data yourself, the basic eager command to do so is below, running on the output of the previous block in [chapter overview](#chapter-overview).
```bash
nextflow run nf-core/eager \
-r 2.4.7 \
-profile <docker/singularity/podman/conda/institute> \
--input '*_R{1,2}.fastq.gz' \
--fasta 'human_genome.fasta' \
--hostremoval_input_fastq
```
:::
## Adapter sequence trimming and low-quality bases trimming
Sequencing adapters are small DNA sequences adding prior to DNA sequencing to allow the DNA fragments to attach to the
sequencing flow cells (see [Introduction to NGS Sequencing](introduction-to-ngs-sequencing.qmd)).
Because these adapters could interfere with downstream analyses, we need to remove them before proceeding any further.
Furthermore, because the quality of the sequencing is not always optimal, we need to remove bases of lower sequencing quality to might lead to spurious results in downstream analyses.
To perform both of these tasks, we'll use the program fastp ([https://github.com/OpenGene/fastp](https://github.com/OpenGene/fastp)) by Chen et al. [-@chenFastpUltrafastAllinone2018].
The following command gets you the help of fastp (the `--help` option is a common option in command-line tools that displays a list of available options and their descriptions).
```bash
fastp -h
```
::: {.callout-tip collapse="true" title='Expand to see command output'}
option needs value: --html
usage: fastp [options] ...
options:
-i, --in1 read1 input file name (string [=])
-o, --out1 read1 output file name (string [=])
-I, --in2 read2 input file name (string [=])
-O, --out2 read2 output file name (string [=])
--unpaired1 for PE input, if read1 passed QC but read2 not, it will be written to unpaired1.
Default is to discard it. (string [=])
--unpaired2 for PE input, if read2 passed QC but read1 not, it will be written to unpaired2.
If --unpaired2 is same as --unpaired1 (default mode), both unpaired reads will be
written to this same file. (string [=])
--overlapped_out for each read pair, output the overlapped region if it has no any mismatched
base. (string [=])
--failed_out specify the file to store reads that cannot pass the filters. (string [=])
-m, --merge for paired-end input, merge each pair of reads into a single read if they are
overlapped. The merged reads will be written
to the file given by --merged_out, the unmerged reads will be written to the
files specified by --out1 and --out2. The merging mode is disabled by default.
--merged_out in the merging mode, specify the file name to store merged output, or specify
--stdout to stream the merged output (string [=])
--include_unmerged in the merging mode, write the unmerged or unpaired reads to the file specified
by --merge. Disabled by default.
-6, --phred64 indicate the input is using phred64 scoring (it'll be converted to phred33,
so the output will still be phred33)
-z, --compression compression level for gzip output (1 ~ 9). 1 is fastest, 9 is smallest, default is 4. (int [=4])
--stdin input from STDIN. If the STDIN is interleaved paired-end FASTQ, please also add --interleaved_in.
--stdout stream passing-filters reads to STDOUT. This option will result in interleaved
FASTQ output for paired-end output. Disabled by default.
--interleaved_in indicate that <in1> is an interleaved FASTQ which contains both read1 and read2.
Disabled by default.
--reads_to_process specify how many reads/pairs to be processed. Default 0 means process all reads. (int [=0])
--dont_overwrite don't overwrite existing files. Overwritting is allowed by default.
--fix_mgi_id the MGI FASTQ ID format is not compatible with many BAM operation tools, enable this option to fix it.
-V, --verbose output verbose log information (i.e. when every 1M reads are processed).
-A, --disable_adapter_trimming adapter trimming is enabled by default. If this option is specified, adapter trimming is disabled
-a, --adapter_sequence the adapter for read1. For SE data, if not specified, the adapter will be auto-detected.
For PE data, this is used if R1/R2 are found not overlapped. (string [=auto])
--adapter_sequence_r2 the adapter for read2 (PE data only). This is used if R1/R2 are found not overlapped.
If not specified, it will be the same as <adapter_sequence> (string [=auto])
--adapter_fasta specify a FASTA file to trim both read1 and read2 (if PE) by all the sequences in this FASTA file (string [=])
--detect_adapter_for_pe by default, the auto-detection for adapter is for SE data input only, turn on this
option to enable it for PE data.
-f, --trim_front1 trimming how many bases in front for read1, default is 0 (int [=0])
-t, --trim_tail1 trimming how many bases in tail for read1, default is 0 (int [=0])
-b, --max_len1 if read1 is longer than max_len1, then trim read1 at its tail to make it as
long as max_len1. Default 0 means no limitation (int [=0])
-F, --trim_front2 trimming how many bases in front for read2. If it's not specified, it will follow read1's settings (int [=0])
-T, --trim_tail2 trimming how many bases in tail for read2. If it's not specified, it will follow read1's settings (int [=0])
-B, --max_len2 if read2 is longer than max_len2, then trim read2 at its tail to make it as long as max_len2.
Default 0 means no limitation. If it's not specified, it will follow read1's settings (int [=0])
-D, --dedup enable deduplication to drop the duplicated reads/pairs
--dup_calc_accuracy accuracy level to calculate duplication (1~6), higher level uses more memory (1G, 2G, 4G, 8G, 16G, 24G).
Default 1 for no-dedup mode, and 3 for dedup mode. (int [=0])
--dont_eval_duplication don't evaluate duplication rate to save time and use less memory.
-g, --trim_poly_g force polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
--poly_g_min_len the minimum length to detect polyG in the read tail. 10 by default. (int [=10])
-G, --disable_trim_poly_g disable polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
-x, --trim_poly_x enable polyX trimming in 3' ends.
--poly_x_min_len the minimum length to detect polyX in the read tail. 10 by default. (int [=10])
-5, --cut_front move a sliding window from front (5') to tail, drop the bases in the window if
its mean quality < threshold, stop otherwise.
-3, --cut_tail move a sliding window from tail (3') to front, drop the bases in the window if
its mean quality < threshold, stop otherwise.
-r, --cut_right move a sliding window from front to tail, if meet one window with mean quality
< threshold, drop the bases in the window and the right part, and then stop.
-W, --cut_window_size the window size option shared by cut_front, cut_tail or cut_sliding. Range: 1~1000, default: 4 (int [=4])
-M, --cut_mean_quality the mean quality requirement option shared by cut_front, cut_tail or cut_sliding.
Range: 1~36 default: 20 (Q20) (int [=20])
--cut_front_window_size the window size option of cut_front, default to cut_window_size if not specified (int [=4])
--cut_front_mean_quality the mean quality requirement option for cut_front, default to cut_mean_quality if not specified (int [=20])
--cut_tail_window_size the window size option of cut_tail, default to cut_window_size if not specified (int [=4])
--cut_tail_mean_quality the mean quality requirement option for cut_tail, default to cut_mean_quality if not specified (int [=20])
--cut_right_window_size the window size option of cut_right, default to cut_window_size if not specified (int [=4])
--cut_right_mean_quality the mean quality requirement option for cut_right, default to cut_mean_quality if not specified (int [=20])
-Q, --disable_quality_filtering quality filtering is enabled by default. If this option is specified, quality filtering is disabled
-q, --qualified_quality_phred the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified. (int [=15])
-u, --unqualified_percent_limit how many percents of bases are allowed to be unqualified (0~100). Default 40 means 40% (int [=40])
-n, --n_base_limit if one read's number of N base is >n_base_limit, then this read/pair is discarded. Default is 5 (int [=5])
-e, --average_qual if one read's average quality score <avg_qual, then this read/pair is discarded.
Default 0 means no requirement (int [=0])
-L, --disable_length_filtering length filtering is enabled by default. If this option is specified, length filtering is disabled
-l, --length_required reads shorter than length_required will be discarded, default is 15. (int [=15])
--length_limit reads longer than length_limit will be discarded, default 0 means no limitation. (int [=0])
-y, --low_complexity_filter enable low complexity filter. The complexity is defined as the percentage of base
that is different from its next base (base[i] != base[i+1]).
-Y, --complexity_threshold the threshold for low complexity filter (0~100). Default is 30, which means 30% complexity is required. (int [=30])
--filter_by_index1 specify a file contains a list of barcodes of index1 to be filtered out, one barcode per line (string [=])
--filter_by_index2 specify a file contains a list of barcodes of index2 to be filtered out, one barcode per line (string [=])
--filter_by_index_threshold the allowed difference of index barcode for index filtering, default 0 means completely identical. (int [=0])
-c, --correction enable base correction in overlapped regions (only for PE data), default is disabled
--overlap_len_require the minimum length to detect overlapped region of PE reads. This will affect overlap analysis based PE merge,
adapter trimming and correction. 30 by default. (int [=30])
--overlap_diff_limit the maximum number of mismatched bases to detect overlapped region of PE reads.
This will affect overlap analysis based PE merge, adapter trimming and correction. 5 by default. (int [=5])
--overlap_diff_percent_limit the maximum percentage of mismatched bases to detect overlapped region of PE reads.
This will affect overlap analysis based PE merge, adapter trimming and correction. Default 20 means 20%. (int [=20])
-U, --umi enable unique molecular identifier (UMI) preprocessing
--umi_loc specify the location of UMI, can be (index1/index2/read1/read2/per_index/per_read, default is none (string [=])
--umi_len if the UMI is in read1/read2, its length should be provided (int [=0])
--umi_prefix if specified, an underline will be used to connect prefix and UMI (i.e.
prefix=UMI, UMI=AATTCG, final=UMI_AATTCG). No prefix by default (string [=])
--umi_skip if the UMI is in read1/read2, fastp can skip several bases following UMI, default is 0 (int [=0])
-p, --overrepresentation_analysis enable overrepresented sequence analysis.
-P, --overrepresentation_sampling one in (--overrepresentation_sampling) reads will be computed for overrepresentation
analysis (1~10000), smaller is slower, default is 20. (int [=20])
-j, --json the json format report file name (string [=fastp.json])
-h, --html the html format report file name (string [=fastp.html])
-R, --report_title should be quoted with ' or ", default is "fastp report" (string [=fastp report])
-w, --thread worker thread number, default is 3 (int [=3])
-s, --split split output by limiting total split file number with this option (2~999), a sequential number prefix
will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (int [=0])
-S, --split_by_lines split output by limiting lines of each file with this option(>=1000), a sequential number prefix will be
added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (long [=0])
-d, --split_prefix_digits the digits for the sequential number padding (1~10), default is 4, so the filename will be padded as
0001.xxx, 0 to disable padding (int [=4])
--cut_by_quality5 DEPRECATED, use --cut_front instead.
--cut_by_quality3 DEPRECATED, use --cut_tail instead.
--cut_by_quality_aggressive DEPRECATED, use --cut_right instead.
--discard_unmerged DEPRECATED, no effect now, see the introduction for merging.
-?, --help print this message
:::
Here we use fastp to preprocess a pair of FASTQ files.
The code specifies the input files, merges the paired-end reads on their overlaps, removes duplicate reads, and generates JSON and HTML reports.
The output files are saved in the `../results/fastp/ directory`.
```bash
fastp \
--in1 ../data/subsampled/ERR5766177_PE.mapped.hostremoved.fwd.fq_subsample_1000000.fastq.gz \
--in2 ../data/subsampled/ERR5766177_PE.mapped.hostremoved.fwd.fq_subsample_1000000.fastq.gz \
--merge \
--merged_out ../results/fastp/ERR5766177.merged.fastq.gz \
--include_unmerged \
--dedup \
--json ../results/fastp/ERR5766177.fastp.json \
--html ../results/fastp/ERR5766177.fastp.html \
```
::: {.callout-tip collapse="true" title='Expand to see command output'}
Read1 before filtering:
total reads: 1000000
total bases: 101000000
Q20 bases: 99440729(98.4562%)
Q30 bases: 94683150(93.7457%)
Read2 before filtering:
total reads: 1000000
total bases: 101000000
Q20 bases: 99440729(98.4562%)
Q30 bases: 94683150(93.7457%)
Merged and filtered:
total reads: 1994070
total bases: 201397311
Q20 bases: 198330392(98.4772%)
Q30 bases: 188843169(93.7665%)
Filtering result:
reads passed filter: 1999252
reads failed due to low quality: 728
reads failed due to too many N: 20
reads failed due to too short: 0
reads with adapter trimmed: 282
bases trimmed due to adapters: 18654
reads corrected by overlap analysis: 0
bases corrected by overlap analysis: 0
Duplication rate: 0.2479%
Insert size peak (evaluated by paired-end reads): 31
Read pairs merged: 228
% of original read pairs: 0.0228%
% in reads after filtering: 0.0114339%
JSON report: ../results/fastp/ERR5766177.fastp.json
HTML report: ../results/fastp/ERR5766177.fastp.html
fastp --in1 ../data/subsampled/ERR5766177_PE.mapped.hostremoved.fwd.fq_subsample_1000000.fastq.gz \
--in2 ../data/subsampled/ERR5766177_PE.mapped.hostremoved.fwd.fq_subsample_1000000.fastq.gz --merge \
--merged_out ../results/fastp/ERR5766177.merged.fastq.gz --include_unmerged --dedup \
--json ../results/fastp/ERR5766177.fastp.json --html ../results/fastp/ERR5766177.fastp.html
fastp v0.23.2, time used: 11 seconds
:::
::: {.callout-tip title="Question" appearance="simple"}
What do you think of the number of read pairs that were merged ?
:::
::: {.callout-note collapse="true" title="Answer"}
Here, only 228 read pairs were merged.
This is due to the length of the reads of 100bp, and length of the DNA fragments.
If you would use fewer cycles, and have shorter DNA fragments, you would expect this number to go up.
:::
## Taxonomic profiling with Metaphlan
MetaPhlAn is a computational tool for profiling the composition of microbial communities from metagenomic shotgun sequencing data.
```bash
metaphlan --help
```
::: {.callout-tip collapse="true" title='Expand to see command output'}
usage: metaphlan --input_type {fastq,fasta,bowtie2out,sam} [--force]
[--bowtie2db METAPHLAN_BOWTIE2_DB] [-x INDEX]
[--bt2_ps BowTie2 presets] [--bowtie2_exe BOWTIE2_EXE]
[--bowtie2_build BOWTIE2_BUILD] [--bowtie2out FILE_NAME]
[--min_mapq_val MIN_MAPQ_VAL] [--no_map] [--tmp_dir]
[--tax_lev TAXONOMIC_LEVEL] [--min_cu_len]
[--min_alignment_len] [--add_viruses] [--ignore_eukaryotes]
[--ignore_bacteria] [--ignore_archaea] [--stat_q]
[--perc_nonzero] [--ignore_markers IGNORE_MARKERS]
[--avoid_disqm] [--stat] [-t ANALYSIS TYPE]
[--nreads NUMBER_OF_READS] [--pres_th PRESENCE_THRESHOLD]
[--clade] [--min_ab] [-o output file] [--sample_id_key name]
[--use_group_representative] [--sample_id value]
[-s sam_output_file] [--legacy-output] [--CAMI_format_output]
[--unknown_estimation] [--biom biom_output] [--mdelim mdelim]
[--nproc N] [--install] [--force_download]
[--read_min_len READ_MIN_LEN] [-v] [-h]
[INPUT_FILE] [OUTPUT_FILE]
DESCRIPTION
MetaPhlAn version 3.1.0 (25 Jul 2022):
METAgenomic PHyLogenetic ANalysis for metagenomic taxonomic profiling.
AUTHORS: Francesco Beghini ([email protected]),Nicola Segata ([email protected]), Duy Tin Truong,
Francesco Asnicar ([email protected]), Aitor Blanco Miguez ([email protected])
COMMON COMMANDS
We assume here that MetaPhlAn is installed using the several options available (pip, conda, PyPi)
Also BowTie2 should be in the system path with execution and read permissions, and Perl should be installed)
========== MetaPhlAn clade-abundance estimation =================
The basic usage of MetaPhlAn consists in the identification of the clades (from phyla to species )
present in the metagenome obtained from a microbiome sample and their
relative abundance. This correspond to the default analysis type (-t rel_ab).
* Profiling a metagenome from raw reads:
$ metaphlan metagenome.fastq --input_type fastq -o profiled_metagenome.txt
* You can take advantage of multiple CPUs and save the intermediate BowTie2 output for re-running
MetaPhlAn extremely quickly:
$ metaphlan metagenome.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq -o profiled_metagenome.txt
* If you already mapped your metagenome against the marker DB (using a previous MetaPhlAn run), you
can obtain the results in few seconds by using the previously saved --bowtie2out file and
specifying the input (--input_type bowtie2out):
$ metaphlan metagenome.bowtie2.bz2 --nproc 5 --input_type bowtie2out -o profiled_metagenome.txt
* bowtie2out files generated with MetaPhlAn versions below 3 are not compatibile.
Starting from MetaPhlAn 3.0, the BowTie2 ouput now includes the size of the profiled metagenome and the average read length.
If you want to re-run MetaPhlAn using these file you should provide the metagenome s via --nreads:
$ metaphlan metagenome.bowtie2.bz2 --nproc 5 --input_type bowtie2out --nreads 520000 -o profiled_metagenome.txt
* You can also provide an externally BowTie2-mapped SAM if you specify this format with
--input_type. Two steps: first apply BowTie2 and then feed MetaPhlAn with the obtained sam:
$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x \
${mpa_dir}/metaphlan_databases/mpa_v30_CHOCOPhlAn_201901 -U metagenome.fastq
$ metaphlan metagenome.sam --input_type sam -o profiled_metagenome.txt
* We can also natively handle paired-end metagenomes, and, more generally, metagenomes stored in
multiple files (but you need to specify the --bowtie2out parameter):
$ metaphlan metagenome_1.fastq,metagenome_2.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq
-------------------------------------------------------------------
========== Marker level analysis ============================
MetaPhlAn introduces the capability of characterizing organisms at the strain level using non
aggregated marker information. Such capability comes with several slightly different flavours and
are a way to perform strain tracking and comparison across multiple samples.
Usually, MetaPhlAn is first ran with the default -t to profile the species present in
the community, and then a strain-level profiling can be performed to zoom-in into specific species
of interest. This operation can be performed quickly as it exploits the --bowtie2out intermediate
file saved during the execution of the default analysis type.
* The following command will output the abundance of each marker with a RPK (reads per kilo-base)
higher 0.0. (we are assuming that metagenome_outfmt.bz2 has been generated before as
shown above).
$ metaphlan -t marker_ab_table metagenome_outfmt.bz2 --input_type bowtie2out -o marker_abundance_table.txt
The obtained RPK can be optionally normald by the total number of reads in the metagenome
to guarantee fair comparisons of abundances across samples. The number of reads in the metagenome
needs to be passed with the '--nreads' argument
* The list of markers present in the sample can be obtained with '-t marker_pres_table'
$ metaphlan -t marker_pres_table metagenome_outfmt.bz2 --input_type bowtie2out -o marker_abundance_table.txt
The --pres_th argument (default 1.0) set the minimum RPK value to consider a marker present
* The list '-t clade_profiles' analysis type reports the same information of '-t marker_ab_table'
but the markers are reported on a clade-by-clade basis.
$ metaphlan -t clade_profiles metagenome_outfmt.bz2 --input_type bowtie2out -o marker_abundance_table.txt
* Finally, to obtain all markers present for a specific clade and all its subclades, the
'-t clade_specific_strain_tracker' should be used. For example, the following command
is reporting the presence/absence of the markers for the B. fragilis species and its strains
the optional argument --min_ab specifies the minimum clade abundance for reporting the markers
$ metaphlan -t clade_specific_strain_tracker --clade s__Bacteroides_fragilis metagenome_outfmt.bz2 --input_typ
bowtie2out -o marker_abundance_table.txt
-------------------------------------------------------------------
positional arguments:
INPUT_FILE the input file can be:
* a fastq file containing metagenomic reads
OR
* a BowTie2 produced SAM file.
OR
* an intermediary mapping file of the metagenome generated by a previous MetaPhlAn run
If the input file is missing, the script assumes that the input is provided using the standard
input, or named pipes.
IMPORTANT: the type of input needs to be specified with --input_type
OUTPUT_FILE the tab-separated output file of the predicted taxon relative abundances
[stdout if not present]
Required arguments:
--input_type {fastq,fasta,bowtie2out,sam}
set whether the input is the FASTA file of metagenomic reads or
the SAM file of the mapping of the reads against the MetaPhlAn db.
Mapping arguments:
--force Force profiling of the input file by removing the bowtie2out file
--bowtie2db METAPHLAN_BOWTIE2_DB
Folder containing the MetaPhlAn database. You can specify the location by exporting the
DEFAULT_DB_FOLDER variable in the shell.
[default /Users/maxime/mambaforge/envs/summer_school_microbiome/lib/python3.9/site-packages/metaphlan/metaphlan_databases]
-x INDEX, --index INDEX
Specify the id of the database version to use. If "latest", MetaPhlAn will get the latest version.
If an index name is provided, MetaPhlAn will try to use it, if available, and skip the online check.
If the database files are not found on the local MetaPhlAn installation they
will be automatically downloaded [default latest]
--bt2_ps BowTie2 presets
Presets options for BowTie2 (applied only when a FASTA file is provided)
The choices enabled in MetaPhlAn are:
* sensitive
* very-sensitive
* sensitive-local
* very-sensitive-local
[default very-sensitive]
--bowtie2_exe BOWTIE2_EXE
Full path and name of the BowTie2 executable. This option allowsMetaPhlAn to reach the
executable even when it is not in the system PATH or the system PATH is unreachable
--bowtie2_build BOWTIE2_BUILD
Full path to the bowtie2-build command to use, deafult assumes that 'bowtie2-build is present in the system path
--bowtie2out FILE_NAME
The file for saving the output of BowTie2
--min_mapq_val MIN_MAPQ_VAL
Minimum mapping quality value (MAPQ) [default 5]
--no_map Avoid storing the --bowtie2out map file
--tmp_dir The folder used to store temporary files [default is the OS dependent tmp dir]
Post-mapping arguments:
--tax_lev TAXONOMIC_LEVEL
The taxonomic level for the relative abundance output:
'a' : all taxonomic levels
'k' : kingdoms
'p' : phyla only
'c' : classes only
'o' : orders only
'f' : families only
'g' : genera only
's' : species only
[default 'a']
--min_cu_len minimum total nucleotide length for the markers in a clade for
estimating the abundance without considering sub-clade abundances
[default 2000]
--min_alignment_len The sam records for aligned reads with the longest subalignment
length smaller than this threshold will be discarded.
[default None]
--add_viruses Allow the profiling of viral organisms
--ignore_eukaryotes Do not profile eukaryotic organisms
--ignore_bacteria Do not profile bacterial organisms
--ignore_archaea Do not profile archeal organisms
--stat_q Quantile value for the robust average
[default 0.2]
--perc_nonzero Percentage of markers with a non zero relative abundance for misidentify a species
[default 0.33]
--ignore_markers IGNORE_MARKERS
File containing a list of markers to ignore.
--avoid_disqm Deactivate the procedure of disambiguating the quasi-markers based on the
marker abundance pattern found in the sample. It is generally recommended
to keep the disambiguation procedure in order to minimize false positives
--stat Statistical approach for converting marker abundances into clade abundances
'avg_g' : clade global (i.e. normalizing all markers together) average
'avg_l' : average of length-normalized marker counts
'tavg_g' : truncated clade global average at --stat_q quantile
'tavg_l' : truncated average of length-normalized marker counts (at --stat_q)
'wavg_g' : winsorized clade global average (at --stat_q)
'wavg_l' : winsorized average of length-normalized marker counts (at --stat_q)
'med' : median of length-normalized marker counts
[default tavg_g]
Additional analysis types and arguments:
-t ANALYSIS TYPE Type of analysis to perform:
* rel_ab: profiling a metagenomes in terms of relative abundances
* rel_ab_w_read_stats: profiling a metagenomes in terms of relative abundances and estimate
the number of reads coming from each clade.
* reads_map: mapping from reads to clades (only reads hitting a marker)
* clade_profiles: normalized marker counts for clades with at least a non-null marker
* marker_ab_table: normalized marker counts (only when > 0.0 and normalized by metagenome size if --nreads is specified)
* marker_counts: non-normalized marker counts [use with extreme caution]
* marker_pres_table: list of markers present in the sample (threshold at 1.0 if not differently specified with --pres_th
* clade_specific_strain_tracker: list of markers present for a specific clade, specified with --clade, and all its subclades
[default 'rel_ab']
--nreads NUMBER_OF_READS
The total number of reads in the original metagenome. It is used only when
-t marker_table is specified for normalizing the length-normalized counts
with the metagenome size as well. No normalization applied if --nreads is not
specified
--pres_th PRESENCE_THRESHOLD
Threshold for calling a marker present by the -t marker_pres_table option
--clade The clade for clade_specific_strain_tracker analysis
--min_ab The minimum percentage abundance for the clade in the clade_specific_strain_tracker analysis
Output arguments:
-o output file, --output_file output file
The output file (if not specified as positional argument)
--sample_id_key name Specify the sample ID key for this analysis. Defaults to 'SampleID'.
--use_group_representative
Use a species as representative for species groups.
--sample_id value Specify the sample ID for this analysis. Defaults to 'Metaphlan_Analysis'.
-s sam_output_file, --samout sam_output_file
The sam output file
--legacy-output Old MetaPhlAn2 two columns output
--CAMI_format_output Report the profiling using the CAMI output format
--unknown_estimation Scale relative abundances to the number of reads mapping to known clades in order to estimate unknowness
--biom biom_output, --biom_output_file biom_output
If requesting biom file output: The name of the output file in biom format
--mdelim mdelim, --metadata_delimiter_char mdelim
Delimiter for bug metadata: - defaults to pipe. e.g. the pipe in k__Bacteria|p__Proteobacteria
Other arguments:
--nproc N The number of CPUs to use for parallelizing the mapping [default 4]
--install Only checks if the MetaPhlAn DB is installed and installs it if not. All other parameters are ignored.
--force_download Force the re-download of the latest MetaPhlAn database.
--read_min_len READ_MIN_LEN
Specify the minimum length of the reads to be considered when parsing the input file with
'read_fastx.py' script, default value is 70
-v, --version Prints the current MetaPhlAn version and exit
-h, --help show this help message and exit
:::
The following command uses MetaPhlAn to profile the taxonomic composition of the ERR5766177 metagenomic sample.
The input file is specified as a merged FASTQ file, and the output is saved as a text file containing the taxonomic profile.
The `--bowtie2out` option is used to specify the output file for the Bowtie2 alignment, and the --nproc option is used to specify the number of CPUs to use for the analysis.
```bash
metaphlan ../results/fastp/ERR5766177.merged.fastq.gz \
--input_type fastq \
--bowtie2out ../results/metaphlan/ERR5766177.bt2.out \
--nproc 4 \
> ../results/metaphlan/ERR5766177.metaphlan_profile.txt
```
The main results files that we're interested in is located at
`../results/metaphlan/ERR5766177.metaphlan_profile.txt`
It's a tab separated file, with taxons in rows, with their relative abundance in the sample
```bash
head ../results/metaphlan/ERR5766177.metaphlan_profile.txt
```
::: {.callout-tip collapse="true" title='Expand to see command output'}
#mpa_v30_CHOCOPhlAn_201901
#/home/maxime_borry/.conda/envs/maxime/envs/summer_school_microbiome/bin/metaphlan ../results/fastp/ERR5766177.merged.fastq.gz \
--input_type fastq --bowtie2out ../results/metaphlan/ERR5766177.bt2.out --nproc 8
#SampleID Metaphlan_Analysis
#clade_name NCBI_tax_id relative_abundance additional_species
k__Bacteria 2 82.23198
k__Archaea 2157 17.76802
k__Bacteria|p__Firmicutes 2|1239 33.47957
k__Bacteria|p__Bacteroidetes 2|976 28.4209
k__Bacteria|p__Actinobacteria 2|201174 20.33151
k__Archaea|p__Euryarchaeota 2157|28890 17.76802
:::
::: {.callout-tip title="Question" appearance="simple"}
What is the letter preceding the `__` before every clade name ? Eg. `k__Bacteria`
:::
::: {.callout-note collapse="true" title="Answer"}
This letter corresponds to the rank of the clade.
- k for Kingdom
- p for Phylum
- [...]
- f for Family
- g for Genus
- s for Species
:::
## Visualizing the taxonomic profile
### Visualizing metaphlan taxonomic profile with Pavian
Pavian ([https://github.com/fbreitwieser/pavian](https://github.com/fbreitwieser/pavian)) by Breitweiser et al. [-@Breitwieser2020-zw] is a web-based tool for interactive visualization and analysis of metagenomics data.
It provides a user-friendly interface for exploring taxonomic and functional profiles of microbial communities, and allows users to generate interactive plots and tables that can be customised and shared (@fig-taxonomicprofiling-pavianinterface).
You can open Pavian in your browser by visiting [https://fbreitwieser.shinyapps.io/pavian](https://fbreitwieser.shinyapps.io/pavian/.).
![Screenshot of the pavian metagenomics visualisation interface, with menus on the left, a select sample and filter taxa search bar at the top, and a Sankey visualisation of the example metagenome sample](assets/images/chapters/taxonomic-profiling/pavian.png){#fig-taxonomicprofiling-pavianinterface}
::: {.callout-note collapse="true" title='Expand for instructions for running yourself'}
There are different ways to run it:
- If you have docker ([https://www.docker.com/](https://www.docker.com/)) installed
```bash
docker pull 'florianbw/pavian'
docker run --rm -p 5000:80 florianbw/pavian
```
Then open your browser and visit [localhost:5000](localhost:5000)
- If you are familiar with R ([https://www.r-project.org/](https://www.r-project.org/))
```{r eval=F}
if (!require(remotes)) {
install.packages("remotes")
}
remotes::install_github("fbreitwieser/pavian")
pavian::runApp(port = 5000)
```
Then open your browser and visit [localhost:5000](localhost:5000)
:::
::: {.callout-tip title="Question" appearance="simple"}
What is the relative abundance of the phylum Firmicutes ?
:::
::: {.callout-note collapse="true" title="Answer"}
In this example, the relative abundance of the Firmicutes (officially renamed Bacillota since 2021) is 33.5%.
You can verify this number either in the Sankey plot (section "Sample"), or in the comparison table (section "Comparison") by selecting the Phylum tab.
:::
### Visualizing metaphlan taxonomic profile with Krona
Krona ([https://github.com/marbl/Krona/wiki](https://github.com/marbl/Krona/wiki)) by Ondov et al. [@Ondov2011-se] is a software tool for interactive visualization of hierarchical data, such as taxonomic profiles generated by metagenomics tools like MetaPhlAn.
Krona allows users to explore the taxonomic composition of microbial communities in a hierarchical manner, from the highest taxonomic level down to the species level.
The `metaphlan2krona.py` script is used to convert the MetaPhlAn taxonomic profile output to a format that can be visualised by Krona.
The output of the script is a text file that contains the taxonomic profile in a hierarchical format that can be read by Krona.
The `ktImportText` command is then used to generate an interactive HTML file that displays the taxonomic profile in a hierarchical manner using Krona.
```bash
python ../scripts/metaphlan2krona.py -p ../results/metaphlan/ERR5766177.metaphlan_profile.txt -k ../results/krona/ERR5766177_krona.out
ktImportText -o ../results/krona/ERR5766177_krona.html ../results/krona/ERR5766177_krona.out
```
::: {.callout-tip collapse="true" title='Expand to see command output'}
Writing ../results/krona/ERR5766177_krona.html...
:::
::: {.callout-tip title="Question" appearance="simple"}
Which proportion of the Firmicutes is made of Clostridiales ?
:::
::: {.callout-note collapse="true" title="Answer"}
The Clostridiales represent 63% of the Firmicutes in this sample.
You can verify this by clicking in the Clostridiales and looking at the pie chart for Firmicutes on the top right of the screen.
:::
## Getting modern comparative reference data
In order to compare our sample with modern reference samples, I used the curatedMetagenomicsData package,
which provides both curated metadata, and pre-computed metaphlan taxonomic profiles for published modern human samples.
The full R code to get these data is available in `curatedMetagenomics/get_sources.Rmd`.
I pre-selected 200 gut microbiome samples from non-westernised (100) and westernised (100) from healthy, non-antibiotic users donors.
```{r eval=F}
# Load required packages
library(curatedMetagenomicData)
library(tidyverse)
# Filter samples based on specific criteria
sampleMetadata %>%
filter(body_site == "stool" & antibiotics_current_use == "no" & disease == "healthy") %>%
group_by(non_westernized) %>%
sample_n(100) %>%
ungroup() -> selected_samples
# Extract relative abundance data for selected samples
selected_samples %>%
returnSamples("relative_abundance") -> rel_ab
# Split relative abundance data by taxonomic rank and write to CSV files
data_ranks <- splitByRanks(rel_ab)
for (r in names(data_ranks)) {
# Print taxonomic rank and output file name
print(r)
output_file <- paste0("../../data/curated_metagenomics/modern_sources_", tolower(r), ".csv")
print(output_file)
# Write relative abundance data to CSV file
assay_rank <- as.data.frame(assay(data_ranks[[r]]))
write.csv(assay_rank, output_file)
}
```
- The resulting pre-computed metaphlan taxonomic profiles (split by taxonomic ranks) are available in `data/curated_metagenomics`
- The associated metadata is available at `data/metadata/curated_metagenomics_modern_sources.csv`
## Loading the ancient sample taxonomic profile
This is the moment where we will use the Pandas ([https://pandas.pydata.org](https://pandas.pydata.org)) Python library ([https://www.python.org/](https://www.python.org/)) to perform some data manipulation.
We will also use the Taxopy ([https://github.com/apcamargo/taxopy](https://github.com/apcamargo/taxopy)) library to work with taxonomic information.
In python we need to import necessary libraries, i.e. pandas and taxopy, and a couple of other utility libraries.
```python
import pandas as pd
import taxopy
import pickle
import gzip
```
And we then create an instance of the taxopy taxonomy database.
This will take a few seconds/minutes, as it needs to download the entire NCBI taxonomy before storing in a local database.
```python
taxdb = taxopy.TaxDb()
```
Let's read the metaphlan profile table with pandas (a python package with a similar concept to tidyverse dyplyr, tidyr packa).
It's a tab separated file, so we need to specify the delimiter as `\t`, and skip the comment lines of the files that start with `#`.
```python
ancient_data = pd.read_csv("../results/metaphlan/ERR5766177.metaphlan_profile.txt",
comment="#",
delimiter="\t",
names=['clade_name','NCBI_tax_id','relative_abundance','additional_species'])
```
To look at the head of a dataframe (@tbl-taxonomicprofiling-mp3raw) with pandas
```python
ancient_data.head()
```
| | clade_name | NCBI_tax_id | relative_abundance | additional_species |
|---|--------------------------------|-------------|--------------------|--------------------|
| 0 | k__Bacteria | 2 | 82.23198 | NaN |
| 1 | k__Archaea | 2157 | 17.76802 | NaN |
| 2 | k__Bacteria\|p__Firmicutes | 2\|1239 | 33.47957 | NaN |
| 3 | k__Bacteria\|p__Bacteroidetes | 2\|976 | 28.42090 | NaN |
| 4 | k__Bacteria\|p__Actinobacteria | 2\|201174 | 20.33151 | NaN |
: Top few lines of a MetaPhlAn taxonomic profile {#tbl-taxonomicprofiling-mp3raw}
We can also specify more rows by randomly picking 10 rows to display (@tbl-taxonomicprofiling-mp3sampled).
```python
ancient_data.sample(10)
```
| | clade_name | NCBI_tax_id | relative_abundance | additional_species |
|----|------------------------------------------------------|------------------------------------------------|--------------------|------------------------------------------------------|
| 1 | k__Archaea | 2157 | 17.76802 | NaN |
| 46 | k__Bacteria\|p__Bacteroidetes\|c__Bacteroidia\|o_... | 2\|976\|200643\|171549\|171552\|838\|165179 | 25.75544 | k__Bacteria\|p__Bacteroidetes\|c__Bacteroidia\|o_... |
| 55 | k__Bacteria\|p__Firmicutes\|c__Clostridia\|o__Clo... | 2\|1239\|186801\|186802\|186803\|189330\|88431 | 0.91178 | NaN |
| 18 | k__Archaea\|p__Euryarchaeota\|c__Halobacteria\|o_... | 2157\|28890\|183963\|2235 | 0.71177 | NaN |
| 36 | k__Bacteria\|p__Actinobacteria\|c__Actinobacteri... | 2\|201174\|1760\|85004\|31953\|1678 | 9.39377 | NaN |
| 65 | k__Bacteria\|p__Actinobacteria\|c__Actinobacteri... | 2\|201174\|1760\|85004\|31953\|1678\|216816 | 0.05447 | k__Bacteria\|p__Actinobacteria\|c__Actinobacteri... |
| 37 | k__Bacteria\|p__Firmicutes\|c__Clostridia\|o__Clo... | 2\|1239\|186801\|186802\|186803\| | 2.16125 | NaN |
| 38 | k__Bacteria\|p__Firmicutes\|c__Clostridia\|o__Clo... | 2\|1239\|186801\|186802\|541000\|216851 | 1.24537 | NaN |
| 26 | k__Bacteria\|p__Actinobacteria\|c__Actinobacteri... | 2\|201174\|1760\|85004\|31953 | 9.39377 | NaN |
| 48 | k__Bacteria\|p__Firmicutes\|c__Clostridia\|o__Clo... | 2\|1239\|186801\|186802\|541000\|1263\|40518 | 14.96816 | k__Bacteria\|p__Firmicutes\|c__Clostridia\|o__Clo... |
: Ten randomly selected lines of a MetaPhlAn taxonomic profile {#tbl-taxonomicprofiling-mp3sampled}
Because for this analysis, we're only going to look at the relative abundance, we'll only use this column, and
the Taxonomic ID (TAXID) ([https://www.ncbi.nlm.nih.gov/taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy)) information, so we can `drop` (get rid of) the unnecessary columns.
```python
ancient_data = (
ancient_data
.rename(columns={'NCBI_tax_id': 'TAXID'})
.drop(['clade_name','additional_species'], axis=1)
)
```
:::{.callout-important}
Always investigate your data at first !
```python
ancient_data.relative_abundance.sum()
```
700.00007
:::
::: {.callout-tip title="Question" appearance="simple"}
What do you think of a 700% relative abundance ?
:::
::: {.callout-tip collapse="true" title="Answer"}
Let's proceed further and try to understand what's happening (@tbl-taxonomicprofiling-taxidtable).
```python
ancient_data.head()
```
| | TAXID | relative_abundance |
|---|-----------|--------------------|
| 0 | 2 | 82.23198 |
| 1 | 2157 | 17.76802 |
| 2 | 2\|1239 | 33.47957 |
| 3 | 2\|976 | 28.42090 |
| 4 | 2\|201174 | 20.33151 |
: A two column table of TAXIDs and the organisms corresponding relative abundance {#tbl-taxonomicprofiling-taxidtable}
To make sense of the TAXID, we will use taxopy to get all the taxonomic related informations such as (@tbl-taxonomicprofiling-taxidtablewithpath):
- Name of the taxon
- Rank of the taxon
- Lineage of the taxon
```python
## This function is here to help us get the taxon information
## from the metaphlan taxonomic ID lineage, of the following form
## 2|976|200643|171549|171552|838|165179
def to_taxopy(taxid_entry, taxo_db):
"""Returns a taxopy taxon object
Args:
taxid_entry(str): metaphlan TAXID taxonomic lineage
taxo_db(taxopy database)
Returns:
(bool): Returns a taxopy taxon object
"""
taxid = taxid_entry.split("|")[-1] # get the last element
try:
if len(taxid) > 0:
return taxopy.Taxon(int(taxid), taxo_db) # if it's not empty, get the taxon corresponding to the taxid
else:
return taxopy.Taxon(12908, taxo_db) # otherwise, return the taxon associated with unclassified sequences
except taxopy.exceptions.TaxidError as e:
return taxopy.Taxon(12908, taxo_db)
```
```python
ancient_data['taxopy'] = ancient_data['TAXID'].apply(to_taxopy, taxo_db=taxo_db)
ancient_data.head()
```
| | TAXID | relative_abundance | taxopy |
|---|-----------|--------------------|---------------------------------------------------|
| 0 | 2 | 82.23198 | s__Bacteria |
| 1 | 2157 | 17.76802 | s__Archaea |
| 2 | 2\|1239 | 33.47957 | s__Bacteria;c__Terrabacteria group;p__Firmicutes |
| 3 | 2\|976 | 28.42090 | s__Bacteria;c__FCB group;p__Bacteroidetes |
| 4 | 2\|201174 | 20.33151 | s__Bacteria;c__Terrabacteria group;p__Actinoba... |
: A three column table of TAXIDs and the organisms corresponding relative abundance, and the attached taxonomic path associated with the TAXID {#tbl-taxonomicprofiling-taxidtablewithpath}
```python
ancient_data = ancient_data.assign(
rank = ancient_data.taxopy.apply(lambda x: x.rank),
name = ancient_data.taxopy.apply(lambda x: x.name),
lineage = ancient_data.taxopy.apply(lambda x: x.name_lineage),
)
ancient_data
```
```{r}
#| label: tbl-taxonomicprofiling-taxidtablewithpathsep
#| echo: false
#| warning: false
#| results: 'asis'
#| tbl-cap: A five column table of TAXIDs and the organisms corresponding relative abundance, and the attached taxonomic path associated with the TAXID, but also the rank and name of the particular taxonomic ID
#| tbl-cap-location: top
library(tidyverse)
library(gt)
# Load the data from CSV
data <- read_csv("assets/images/chapters/taxonomic-profiling/tbl-taxonomicprofiling-taxidtablewithpathsep.csv")
# Create the table with gt
data %>%
gt() %>%
tab_options(
table.width = pct(100),
table.layout = "fixed",
container.overflow.x = "scroll"
)
```
Because our modern data are split by ranks, we'll first split our ancient sample by rank
Which of the entries are at the `species rank` level?
```python
ancient_species = ancient_data.query("rank == 'species'")
ancient_species.head()
```
```{r}
#| label: tbl-taxonomicprofiling-taxidtablewithpathspecies
#| echo: false
#| warning: false
#| results: 'asis'
#| tbl-cap: A five column table of TAXIDs and the organisms corresponding relative abundance, and the attached taxonomic path associated with the TAXID, but also the rank and name of the particular taxonomic ID, filtered to only species
#| tbl-cap-location: top
library(tidyverse)
library(gt)
# Load the data from CSV