cgMLST schemas are defined as the set of loci that are present in all strains under analysis or, due to sequencing/assembly limitations, in >95% of the strains analyzed. To obtain a robust cgMLST schema for a given bacterial species, a set of strains representative of the diversity of that species should be selected. Furthermore, since cgMLST schema definition is based on pre-defined thresholds, the schema can only be considered stable once a sufficient number of strains have been analyzed. This number will always depend on the population structure and diversity of the species in question, with non-recombinant monomorphic species possibly requiring a smaller number of strains to define cgMLST schemas than panmictic, highly recombinogenic species that are prone to have large numbers of accessory genes and mobile genetic elements. It is also important to note that the same strategy described here can be used to define lineage-specific schemas for more detailed analysis within a given bacterial lineage. Also, by definition, all the loci that are not considered part of the core genome can be classified as being part of an accessory genome MLST (agMLST) schema.
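
The threshold-based definition above can be illustrated with a minimal sketch. It is not the ExtractCgMLST implementation: the function name ``core_loci``, the dictionary layout and the toy data are hypothetical, and a locus is assumed to be kept when its presence fraction is greater than or equal to the threshold.

```python
def core_loci(presence, threshold=0.95):
    """Return the loci present in at least `threshold` of the genomes.

    `presence` maps each locus to a list of 0/1 flags, one per genome
    (hypothetical layout, chosen only for this illustration).
    """
    core = []
    for locus, flags in presence.items():
        # Fraction of genomes in which the locus was found.
        if flags and sum(flags) / len(flags) >= threshold:
            core.append(locus)
    return core


# Toy data: 4 genomes, 3 hypothetical loci.
presence = {
    "locus1": [1, 1, 1, 1],
    "locus2": [1, 1, 1, 0],
    "locus3": [1, 0, 0, 1],
}

strict_core = core_loci(presence, threshold=1.0)    # present in every genome
relaxed_core = core_loci(presence, threshold=0.75)  # tolerates missing data
```

Lowering the threshold keeps loci that are absent from a few genomes, which is why the core genome size depends both on the threshold and on the strains included.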

Determine the loci that constitute the cgMLST
:::::::::::::::::::::::::::::::::::::::::::::
@@ -62,41 +50,39 @@ Outputs

The output folder contains the following files:

- ``Presence_Abscence.tsv`` - allele presence and absence matrix (1 or 0, respectively) for all the loci found in the ``-i`` file (excluding the loci and genomes that were flagged to be excluded).
- ``mdata_stats.tsv`` - total number and percentage of loci missing from each genome.
- ``cgMLST<threshold>.tsv`` - a file for each specified threshold that contains the matrix with the allelic profiles for the cgMLST (already excluding the list of loci and list of genomes passed to the ``--r`` and ``--g`` parameters, respectively).
- ``cgMLSTschema<threshold>.txt`` - a file for each specified threshold that contains the list of loci that constitute the cgMLST schema. This file can be passed to the ``--gl`` parameter of the *AlleleCall* module to perform allele calling only for the loci in the list.
- ``cgMLST.html`` - HTML file with a line plot for the number of loci in the cgMLST per threshold. Also includes a black line with the number of loci present in each genome that is added to the analysis.

.. important::
   The ExtractCgMLST module masks the allelic profiles: it removes all ``INF-`` prefixes and replaces every classification other than *EXC* and *INF* (the special classifications) with ``0``.
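
A minimal sketch of the masking rule, under stated assumptions: the helper ``mask_classification`` is hypothetical, each profile cell is treated as a string, and the non-integer labels in the toy profile (``LNF``, ``NIPH``, ``ASM``) only stand in for special classifications.

```python
def mask_classification(value):
    """Mask one cell of an allelic profile (illustrative helper)."""
    # Newly inferred alleles keep their identifier but lose the prefix.
    if value.startswith("INF-"):
        value = value[len("INF-"):]
    # Anything that is not a plain allele identifier becomes "0".
    return value if value.isdigit() else "0"


# Toy profile for one genome; the labels are assumptions for the example.
profile = ["12", "INF-7", "LNF", "NIPH", "ASM", "3"]
masked = [mask_classification(cell) for cell in profile]
```

After masking, every cell is an integer-like string, so ``0`` can be read uniformly as "locus not usable in this genome".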

Example of the plot created by the ExtractCgMLST module based on the allelic profiles for 680 *Streptococcus agalactiae* genomes:

.. image:: /_static/images/cgMLST_docs.png
   :width: 900px
   :align: center

.. note::
   The matrix with allelic profiles created by the *ExtractCgMLST* process can be imported into `PHYLOViZ <https://online.phyloviz.net/index>`_ to visualize and explore typing results.

Workflow of the ExtractCgMLST module
::::::::::::::::::::::::::::::::::::

.. image:: /_static/images/ExtractCgMLST.png
   :width: 1000px
   :align: center

The ExtractCgMLST module determines the set of core loci based on the allelic profiles determined by the AlleleCall module. Brief description of the workflow:

- The process starts by excluding loci and samples from the analysis based on lists of loci and samples provided by the user. This allows users to filter out low-quality samples and problematic loci that would affect the determination of the core genome.

- The filtered allelic profiles are masked to remove the *INF-* prefixes from newly inferred alleles and to replace special classifications with ``0``.

- The masked profiles are used to compute a loci presence-absence matrix and to count the number of special classifications per sample.

- The presence-absence matrix is also used to determine the set of core loci based on the default loci presence thresholds of 0.9, 0.95 and 1, or based on threshold values specified by the user.

- The process creates output files with the list of loci and allelic profiles per threshold and creates an HTML file with a scatter plot representing the core genome size variation for each threshold.
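
The exclusion and per-sample counting steps of the workflow can be sketched as follows. This is an illustration under assumptions, not the module's code: ``filter_profiles`` and ``missing_stats`` are hypothetical names, profiles are held as nested dictionaries of already masked values, and missing data is reported as a fraction rather than a percentage.

```python
def filter_profiles(profiles, exclude_loci=(), exclude_samples=()):
    """Drop user-listed samples and loci from {sample: {locus: value}}."""
    skip_loci = set(exclude_loci)
    skip_samples = set(exclude_samples)
    return {
        sample: {locus: value for locus, value in row.items()
                 if locus not in skip_loci}
        for sample, row in profiles.items()
        if sample not in skip_samples
    }


def missing_stats(masked_profiles):
    """Return {sample: (missing_count, missing_fraction)}.

    A masked value of "0" is counted as a missing locus.
    """
    stats = {}
    for sample, row in masked_profiles.items():
        missing = sum(1 for value in row.values() if value == "0")
        stats[sample] = (missing, missing / len(row) if row else 0.0)
    return stats


# Toy masked profiles: 3 hypothetical genomes and loci.
profiles = {
    "genomeA": {"locus1": "1", "locus2": "4", "locus3": "0"},
    "genomeB": {"locus1": "2", "locus2": "0", "locus3": "0"},
    "genomeC": {"locus1": "0", "locus2": "0", "locus3": "0"},
}

# Exclude a problematic locus and a low-quality sample before counting.
kept = filter_profiles(profiles, exclude_loci=["locus3"],
                       exclude_samples=["genomeC"])
stats = missing_stats(kept)
```

Filtering first matters: a locus that fails in most genomes, or a genome missing most loci, would otherwise shrink the core genome at every threshold.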