CHEWBBACA/docs/user/modules/CreateSchema.rst
+24-89
Original file line number
Diff line number
Diff line change
@@ -4,59 +4,39 @@ CreateSchema - Create a gene-by-gene schema
4
4
What is a Schema and how is it defined
5
5
::::::::::::::::::::::::::::::::::::::
6
6
7
-
A Schema is a pre-defined set of loci that is used in MLST analyses. Traditional MLST schemas
8
-
relied on 7 loci that were internal fragments of housekeeping genes and each locus was defined
9
-
by its amplification by a pair of primers yielding a fragment of a defined size.
7
+
A Schema is a pre-defined set of loci that is used in MLST analyses. Traditional MLST schemas relied on 7 loci that were internal fragments of housekeeping genes and each locus was defined by its amplification by a pair of primers yielding a fragment of a defined size.
10
8
11
9
In genomic analyses, schemas are a set of loci that are:
12
10
13
-
- Present in the majority of strains for core genome (cg) MLST schemas, typically a threshold
14
-
of presence in 95% of the strains is used in schema creation. The assumption is that in each
15
-
strain up to 5% of loci may not be identified due to sequencing coverage problems, assembly
16
-
problems or other issues related to the use of draft genome assemblies.
11
+
- Present in the majority of strains for core genome (cg) MLST schemas. Typically, a threshold of presence in 95% of the strains is used in schema creation. The assumption is that in each strain up to 5% of loci may not be identified due to sequencing coverage problems, assembly problems or other issues related to the use of draft genome assemblies.
17
12
18
-
- Present in at least one of the analyzed strains in the schema creation for pan genome/whole
19
-
genome (pg/wg) MLST schemas.
13
+
- Present in at least one of the analyzed strains in the schema creation for pan genome/whole genome (pg/wg) MLST schemas.
20
14
21
15
- Present in less than 95% of the strains for accessory genome (ag) MLST schemas.
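The operational definitions above can be illustrated with a minimal sketch. It is not part of chewBBACA: the function name, the input structure (a mapping from locus identifier to the number of strains in which the locus was identified) and the labels are assumptions made for illustration. Every locus found in at least one strain belongs to the pg/wg set, and the 95% threshold splits that set into core and accessory loci.

```python
def classify_loci(presence_counts, n_strains, core_threshold=0.95):
    """Label each locus as 'core' or 'accessory' (hypothetical helper).

    presence_counts maps a locus identifier to the number of strains in
    which the locus was identified. Loci absent from every strain are
    skipped because they are not part of the pan/whole genome set.
    """
    labels = {}
    for locus, count in presence_counts.items():
        if count == 0:
            continue  # not present in any analyzed strain
        fraction = count / n_strains
        labels[locus] = "core" if fraction >= core_threshold else "accessory"
    return labels


if __name__ == "__main__":
    counts = {"locus1": 100, "locus2": 60, "locus3": 0}
    print(classify_loci(counts, n_strains=100))
```

Changing ``core_threshold`` or the set of strains changes the labels, which is why these definitions are operational rather than absolute.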
22
16
23
-
It is important to consider that these definitions are always operational in nature, in the sense
24
-
that the analyses are performed on a limited number of strains representing part of the biological
25
-
diversity of a given species or genus and are always dependent on the definition of thresholds.
17
+
It is important to consider that these definitions are always operational in nature, in the sense that the analyses are performed on a limited number of strains representing part of the biological diversity of a given species or genus and are always dependent on the definition of thresholds.
26
18
27
-
In most cg/wg/pg/ag MLST schemas, contrary to MLST schemas, each locus corresponds to a coding sequence
28
-
(CDS). However, depending on the allele calling algorithm, the alleles called for a given locus can be
29
-
CDSs or best matches to existing CDSs without enforcing the need for the identified allele to be a CDS.
19
+
In most cg/wg/pg/ag MLST schemas, contrary to MLST schemas, each locus corresponds to a coding sequence (CDS). However, depending on the allele calling algorithm, the alleles called for a given locus can be CDSs or best matches to existing CDSs without enforcing the need for the identified allele to be a CDS.
30
20
31
-
In **chewBBACA**, schemas are composed of loci defined by CDSs and, by default, all the called alleles of a given
32
-
locus are CDSs as defined by `Prodigal <https://github.com/hyattpd/Prodigal>`_ (for chewBBACA<=3.2.0) or
33
-
`Pyrodigal <https://github.com/althonos/pyrodigal>`_ (for chewBBACA>=3.3.0) (it is also possible to provide
34
-
FASTA files with CDSs and the ``--cds`` parameter to skip the gene prediction step with Prodigal).
35
-
The use of Prodigal or Pyrodigal, instead of simply ensuring the presence of start and stop codons, adds an extra layer
36
-
of confidence in identifying the most probable CDS for each allele. Because of this approach there may
37
-
be variability in the size of the alleles identified by **chewBBACA** and by default a threshold of +/-20%
38
-
of the mode of the size of the alleles of a given locus is used to identify a locus as present.
21
+
In **chewBBACA**, schemas are composed of loci defined by CDSs and, by default, all the called alleles of a given locus are CDSs as defined by `Prodigal <https://github.com/hyattpd/Prodigal>`_ (for chewBBACA<=3.2.0) or `Pyrodigal <https://github.com/althonos/pyrodigal>`_ (for chewBBACA>=3.3.0) (it is also possible to provide FASTA files with CDSs and the ``--cds`` parameter to skip the gene prediction step). The use of Prodigal or Pyrodigal, instead of simply ensuring the presence of start and stop codons, adds an extra layer of confidence in identifying the most probable CDS for each allele. Because of this approach there may be variability in the size of the alleles identified by **chewBBACA** and by default a threshold of +/-20% of the mode of the size of the alleles of a given locus is used to identify a locus as present.
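The size threshold described above can be sketched as follows. This is an illustrative helper, not chewBBACA's implementation: the function name and arguments are hypothetical, and taking the smallest value when several lengths tie for the mode is an assumption.

```python
from statistics import multimode


def within_size_threshold(allele_length, locus_allele_lengths, size_threshold=0.2):
    """Check whether an allele length falls within +/-20% of the locus mode.

    locus_allele_lengths is the list of lengths of the alleles already
    known for the locus. Ties for the mode are broken by taking the
    smallest value (an assumption made for this sketch).
    """
    mode = min(multimode(locus_allele_lengths))
    lower = mode * (1 - size_threshold)
    upper = mode * (1 + size_threshold)
    return lower <= allele_length <= upper


if __name__ == "__main__":
    known_lengths = [300, 300, 303, 297]
    for length in (250, 200, 400):
        print(length, within_size_threshold(length, known_lengths))
```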
39
22
40
23
Create a wgMLST schema
41
24
::::::::::::::::::::::
42
25
43
-
Given a set of genome assemblies in FASTA format, chewBBACA offers the option to create a new schema by defining
44
-
the distinct loci present in the genomes.
26
+
Given a set of genome assemblies in FASTA format, chewBBACA offers the option to create a new schema by defining the distinct loci present in the genomes.
You should adjust the value passed to the ``--cpu`` parameter based on the specifications of
55
-
your machine. chewBBACA will automatically adjust the value if it matches or exceeds the number
56
-
of available CPU cores.
36
+
You should adjust the value passed to the ``--cpu`` parameter based on the specifications of your machine. chewBBACA will automatically adjust the value if it matches or exceeds the number of available CPU cores.
57
37
58
38
.. important::
59
-
The use of a prodigal training file for schema creation is highly recommended.
39
+
The use of a Prodigal training file for schema creation is highly recommended. The training file is included in the newly created schema and is used to predict genes during the allele calling process.
60
40
61
41
Parameters
62
42
----------
@@ -106,12 +86,7 @@ Parameters
106
86
are not deleted at the end (default: False).
107
87
108
88
.. important::
109
-
If you provide the ``--cds-input`` parameter, chewBBACA assumes that the input FASTA files contain
110
-
coding sequences and skips the gene prediction step with Prodigal. To avoid issues related with the
111
-
format of the sequence headers, chewBBACA renames the sequence headers based on the unique basename
112
-
prefix determined for each input file and on the order of the coding sequences (e.g.: coding sequences
113
-
inside a file named ``GCF_000007125.1_ASM712v1_cds_from_genomic.fna`` are renamed to
If you provide the ``--cds-input`` parameter, chewBBACA assumes that the input FASTA files contain coding sequences and skips the gene prediction step with Prodigal. To avoid issues related to the format of the sequence headers, chewBBACA renames the sequence headers based on the unique basename prefix determined for each input file and on the order of the coding sequences (e.g.: coding sequences inside a file named ``GCF_000007125.1_ASM712v1_cds_from_genomic.fna`` are renamed to ``GCF_000007125-protein1``, ``GCF_000007125-protein2``, ..., ``GCF_000007125-proteinN``).
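The renaming scheme described in the note above can be approximated with the sketch below. The function names are hypothetical, and deriving the prefix by splitting the file basename at the first ``.`` is an assumption that happens to reproduce the example given. The way chewBBACA determines a unique prefix for each input may differ.

```python
from pathlib import Path


def basename_prefix(file_name):
    """Return the part of the basename before the first '.' (assumed rule)."""
    return Path(file_name).name.split(".")[0]


def rename_headers(file_name, records):
    """Rename (header, sequence) records to '<prefix>-protein<N>'.

    N is the 1-based position of the coding sequence in the input file.
    """
    prefix = basename_prefix(file_name)
    return [
        (f"{prefix}-protein{index}", sequence)
        for index, (_, sequence) in enumerate(records, start=1)
    ]


if __name__ == "__main__":
    records = [("lcl|NC_003317.1_cds_1", "ATGAAATAA"), ("lcl|NC_003317.1_cds_2", "ATGCCCTAA")]
    for header, _ in rename_headers("GCF_000007125.1_ASM712v1_cds_from_genomic.fna", records):
        print(header)
```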
115
90
116
91
Outputs
117
92
-------
@@ -131,27 +106,15 @@ Outputs
131
106
├── invalid_cds.txt
132
107
└── cds_coordinates.tsv
133
108
134
-
- One FASTA file per distinct gene identified in the schema creation process in the
135
-
``OutputFolderName/SchemaName`` directory. The name attributed to each FASTA file in
136
-
the schema is based on the genome of origin of the representative allele chosen for that
137
-
gene and on the order of gene prediction (e.g.: ``GCA-000167715-protein12.fasta``,
138
-
first allele for the gene was identified in a genome assembly with the prefix ``GCA-000167715``
139
-
and the gene was the 12th gene predicted by Prodigal in that assembly).
109
+
- One FASTA file per distinct locus identified in the schema creation process in the ``OutputFolderName/SchemaName`` directory. The name attributed to each FASTA file in the schema is based on the genome of origin of the representative allele chosen for that locus and on the order of gene prediction (e.g.: ``GCA-000167715-protein12.fasta``, first allele for the locus was identified in a genome assembly with the prefix ``GCA-000167715`` and the locus was the 12th gene predicted in that assembly).
140
110
141
-
- The ``OutputFolderName/SchemaName`` directory also contains a directory named ``short`` that
142
-
includes FASTA files with the representative alleles for each locus.
111
+
- The ``OutputFolderName/SchemaName`` directory also contains a directory named ``short`` that includes FASTA files with the representative alleles for each locus.
143
112
144
-
- The training file passed to create the schema is also included in ``OutputFolderName/SchemaName``
145
-
and will be automatically detected during the allele calling process.
113
+
- The training file passed to create the schema is also included in ``OutputFolderName/SchemaName`` and will be automatically detected during the allele calling process.
146
114
147
-
- The ``cds_coordinates.tsv`` file contains the coordinates (genome unique identifier, contig
148
-
identifier, start position, stop position, protein identifier attributed by chewBBACA, and coding
149
-
strand (chewBBACA<=3.2.0 assigns 1 to the forward strand and 0 to the reverse strand and
150
-
chewBBACA>=3.3.0 assigns 1 and -1 to the forward and reverse strands, respectively)) of the CDSs
151
-
identified in each genome.
115
+
- The ``cds_coordinates.tsv`` file contains the coordinates (genome unique identifier, contig identifier, start position, stop position, protein identifier attributed by chewBBACA, and coding strand (chewBBACA<=3.2.0 assigns 1 to the forward strand and 0 to the reverse strand and chewBBACA>=3.3.0 assigns 1 and -1 to the forward and reverse strands, respectively)) of the CDSs identified in each genome.
152
116
153
-
- The ``invalid_cds.txt`` file contains the list of alleles predicted by Prodigal that were
154
-
excluded based on the minimum sequence size value and presence of ambiguous bases.
117
+
- The ``invalid_cds.txt`` file contains the list of predicted alleles that were excluded based on the minimum sequence size value and presence of ambiguous bases.
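Because the coding strand values in ``cds_coordinates.tsv`` differ between chewBBACA<=3.2.0 (1 and 0) and chewBBACA>=3.3.0 (1 and -1), downstream scripts that read files from both versions may want to normalize the value. The helper below is a hypothetical sketch and is not provided by chewBBACA. It only converts a single strand value and makes no assumption about the column layout of the file.

```python
def to_signed_strand(value):
    """Map a strand value to 1 (forward) or -1 (reverse).

    Accepts the encoding used by chewBBACA<=3.2.0 (1/0) and the encoding
    used by chewBBACA>=3.3.0 (1/-1), given as an int or a string.
    """
    strand = int(value)
    if strand == 1:
        return 1
    if strand in (0, -1):
        return -1
    raise ValueError(f"Unexpected strand value: {value!r}")


if __name__ == "__main__":
    print([to_signed_strand(v) for v in ("1", "0", "-1")])
```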
155
118
156
119
Workflow of the CreateSchema module
157
120
:::::::::::::::::::::::::::::::::::
@@ -160,46 +123,18 @@ Workflow of the CreateSchema module
160
123
:width: 1000px
161
124
:align: center
162
125
163
-
The schema creation algorithm has the following main steps:
126
+
The CreateSchema module creates a schema seed based on a set of FASTA files with genome assemblies or CDSs. Brief description of the workflow:
164
127
165
-
- Gene prediction with Prodigal followed by coding sequence (CDS) extraction to create FASTA files
166
-
that contain all CDSs extracted from the inputs. (there is also the option to provide FASTA files
167
-
with CDSs and the ``--cds`` parameter to skip the gene prediction step with Prodigal).
128
+
- If genome assemblies are given, the process starts by predicting CDSs for each genome using Pyrodigal. Alternatively, if FASTA files containing CDSs are provided (``--cds`` parameter), the process skips the gene prediction step.
168
129
169
-
- Identification of the distinct CDSs (chewBBACA stores information about the distinct CDSs and the
170
-
genomes that contain those CDSs in a hashtable with the mapping between CDS SHA-256 and list of unique
171
-
integer identifiers for the inputs that contain each CDS compressed with `polyline encoding <https://developers.google.com/maps/documentation/utilities/polylinealgorithm>`_
172
-
adapted from `numcompress <https://github.com/amit1rrr/numcompress>`_).
130
+
- The CDSs identified in the input files are deduplicated and translated (CDS translation identifies and excludes CDSs that contain ambiguous bases or are shorter than the threshold defined by the ``--l`` parameter), followed by a second deduplication step to determine the set of distinct translated CDSs. The schema creation process creates the same hash tables used in the allele calling process to store information about the distinct CDSs and distinct translated CDSs.
173
131
174
-
- Exclusion of the CDSs smaller than the value passed to the ``--l`` parameter (default: 201).
132
+
- The distinct translated CDSs are clustered based on the proportion of minimizers (minimizers selected based on lexicographic order, k=5, w=5) shared with representative CDSs. The largest or one of the largest CDSs is selected as the first representative CDS. New representative CDSs are selected when CDSs share a low proportion (<0.2) of minimizers with any of the chosen representative CDSs.
175
133
176
-
- Translation of distinct CDSs that were not an exact match in the previous step (This step identifies
177
-
and excludes CDSs that contain ambiguous bases).
134
+
- Non-representative CDSs that share a proportion of minimizers ≥ 0.9 with the cluster representative are considered to correspond to the same locus and are excluded from the analysis.
178
135
179
-
- Protein deduplication to identify the distinct set of proteins and keep information about the inputs that
180
-
contain CDSs that encode each distinct protein (hashtable with mapping between protein SHA-256 and list of
181
-
unique integer identifiers for the distinct CDSs encoded with polyline encoding).
136
+
- The proportion of shared minimizers between non-representative CDSs is determined to exclude CDSs sharing a proportion of minimizers ≥ 0.9 with larger CDSs.
182
137
183
-
- Minimizer-based clustering. The distinct proteins are sorted in order of decreasing length and
184
-
clustered based on the percentage of shared distinct minimizers (default >= 20%, interior minimizers
185
-
selected based on lexicographic order, k=5, w=5). The first protein is chosen as representative of
186
-
the first cluster and a new cluster is defined each time a protein cannot be added to any of the
187
-
previously defined clusters based on the percentage of minimizers shared with the cluster representatives.
138
+
- Intracluster and intercluster alignment with BLASTp enable identifying and excluding CDSs similar to representative or larger non-representative CDSs based on a BLAST Score Ratio (BSR) ≥ 0.6.
188
139
189
-
- Exclude proteins that share >=90% minimizers with cluster representatives (we assume that these
190
-
sequences represent alleles for the same gene and only keep one representative per gene).
191
-
192
-
- Exclude proteins that share >=90% minimizers with other proteins in the same cluster (a cluster
193
-
might include sequences from multiple genes and we want to keep only one representative sequence
194
-
per gene).
195
-
196
-
- Align proteins in each cluster with BLASTp to select a set of representative proteins per cluster
197
-
based on the BLAST Score Ratio (BSR) computed for each alignment.
198
-
199
-
- Align the selected representatives for all clusters with BLASTp to identify and exclude representative
200
-
proteins that are highly similar (default: BSR >= 0.6) to other representative proteins. The remaining
201
-
set of proteins is not considered highly similar based on the clustering or alignment approach and
202
-
constitutes the schema seed.
203
-
204
-
- Create the schema seed directory structure with one FASTA file per representative CDS (proteins are converted
205
-
back into DNA). The schema seed can be used to perform allele calling.
140
+
- Each remaining distinct CDS is considered to be an allele of a distinct locus. The process ends by creating a schema seed, which includes one FASTA file containing a single representative allele per distinct locus identified in the analysis.
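The minimizer-based clustering steps of the workflow can be illustrated with a simplified sketch. It is not chewBBACA's implementation: the function names are hypothetical, the proportion of shared minimizers is computed here as the fraction of a sequence's distinct minimizers that are also found in a representative (an assumption), and the comparison between non-representative sequences and the BLASTp/BSR steps are omitted.

```python
def minimizers(sequence, k=5, w=5):
    """Return the set of lexicographically smallest k-mers per window of w k-mers."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not kmers:
        return set()
    if len(kmers) <= w:
        return {min(kmers)}
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def cluster_proteins(proteins, min_shared=0.2, same_locus=0.9):
    """Greedy clustering of translated CDSs, longest sequences first.

    Returns (clusters, excluded): clusters maps each representative to its
    non-representative members, and excluded lists sequences sharing a
    proportion of minimizers >= same_locus with a representative.
    """
    representatives = {}  # representative id -> minimizer set
    clusters = {}
    excluded = []
    ordered = sorted(proteins.items(), key=lambda item: len(item[1]), reverse=True)
    for seq_id, sequence in ordered:
        seq_minimizers = minimizers(sequence)
        best_rep, best_shared = None, 0.0
        for rep_id, rep_minimizers in representatives.items():
            if not seq_minimizers:
                break
            shared = len(seq_minimizers & rep_minimizers) / len(seq_minimizers)
            if shared > best_shared:
                best_rep, best_shared = rep_id, shared
        if best_rep is None or best_shared < min_shared:
            # Too different from every representative: start a new cluster.
            representatives[seq_id] = seq_minimizers
            clusters[seq_id] = []
        elif best_shared >= same_locus:
            # Assumed to be an allele of the representative's locus.
            excluded.append(seq_id)
        else:
            clusters[best_rep].append(seq_id)
    return clusters, excluded


if __name__ == "__main__":
    reference = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"
    example = {
        "a": reference,
        "b": reference[:-1],
        "c": "WWWWWPPPPPGGGGGHHHHHCCCCCNNNNNDDDDD",
    }
    print(cluster_proteins(example))
```

In this toy example, sequence ``b`` shares all of its minimizers with the longer sequence ``a`` and is excluded, while ``c`` shares none and becomes a second representative.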