CHEWBBACA/docs/user/modules/CreateSchema.rst (+51 -46)
@@ -43,52 +43,6 @@ Create a wgMLST schema
 Given a set of genome assemblies in FASTA format, chewBBACA offers the option to create a new schema by defining
 the distinct loci present in the genomes.
 
-The schema creation algorithm has the following main steps:
-
-- Gene prediction with Prodigal followed by coding sequence (CDS) extraction to create FASTA files
-  that contain all CDSs extracted from the inputs (there is also the option to provide FASTA files
-  with CDSs and the ``--cds`` parameter to skip the gene prediction step with Prodigal).
-
-- Identification of the distinct CDSs (chewBBACA stores information about the distinct CDSs and the
-  genomes that contain those CDSs in a hashtable that maps each CDS SHA-256 hash to the list of unique
-  integer identifiers for the inputs that contain that CDS, compressed with `polyline encoding <https://developers.google.com/maps/documentation/utilities/polylinealgorithm>`_
-  adapted from `numcompress <https://github.com/amit1rrr/numcompress>`_).
-
-- Exclusion of the CDSs smaller than the value passed to the ``--l`` parameter (default: 201).
-
-- Translation of distinct CDSs that were not an exact match in the previous step (this step identifies
-  and excludes CDSs that contain ambiguous bases).
-
-- Protein deduplication to identify the distinct set of proteins and keep information about the inputs that
-  contain CDSs that encode each distinct protein (hashtable that maps each protein SHA-256 hash to the list of
-  unique integer identifiers for the distinct CDSs, encoded with polyline encoding).
-
-- Minimizer-based clustering. The distinct proteins are sorted in order of decreasing length and
-  clustered based on the percentage of shared distinct minimizers (default >= 20%, interior minimizers
-  selected based on lexicographic order, k=5, w=5). The first protein is chosen as representative of
-  the first cluster and a new cluster is defined each time a protein cannot be added to any of the
-  previously defined clusters based on the percentage of minimizers shared with the cluster representatives.
-
-- Exclude proteins that share >=90% minimizers with cluster representatives (we assume that these
-  sequences represent alleles for the same gene and only keep one representative per gene).
-
-- Exclude proteins that share >=90% minimizers with other proteins in the same cluster (a cluster
-  might include sequences from multiple genes and we want to keep only one representative sequence
-  per gene).
-
-- Align proteins in each cluster with BLASTp to select a set of representative proteins per cluster
-  based on the BLAST Score Ratio (BSR) computed for each alignment.
-
-- Align the selected representatives for all clusters with BLASTp to identify and exclude representative
-  proteins that are highly similar (default: BSR >= 0.6) to other representative proteins. The remaining
-  set of proteins is not considered highly similar based on the clustering or alignment approach and
-  constitutes the schema seed.
-
-- Create the schema seed directory structure with one FASTA file per representative CDS (proteins are converted
-  back into DNA). The schema seed can be used to perform allele calling.
-
-.. image::
-
 Basic Usage
 -----------
 
@@ -198,3 +152,54 @@ Outputs
 
 - The ``invalid_cds.txt`` file contains the list of alleles predicted by Prodigal that were
   excluded based on the minimum sequence size value and presence of ambiguous bases.
+
+Workflow of the CreateSchema module
+:::::::::::::::::::::::::::::::::::
+
+.. image:: /_static/images/CreateSchema.png
+   :width: 1200px
+   :align: center
+
+The schema creation algorithm has the following main steps:
+
+- Gene prediction with Prodigal followed by coding sequence (CDS) extraction to create FASTA files
+  that contain all CDSs extracted from the inputs (there is also the option to provide FASTA files
+  with CDSs and the ``--cds`` parameter to skip the gene prediction step with Prodigal).
+
+- Identification of the distinct CDSs (chewBBACA stores information about the distinct CDSs and the
+  genomes that contain those CDSs in a hashtable that maps each CDS SHA-256 hash to the list of unique
+  integer identifiers for the inputs that contain that CDS, compressed with `polyline encoding <https://developers.google.com/maps/documentation/utilities/polylinealgorithm>`_
+  adapted from `numcompress <https://github.com/amit1rrr/numcompress>`_).
+
+- Exclusion of the CDSs smaller than the value passed to the ``--l`` parameter (default: 201).
+
+- Translation of distinct CDSs that were not an exact match in the previous step (this step identifies
+  and excludes CDSs that contain ambiguous bases).
+
+- Protein deduplication to identify the distinct set of proteins and keep information about the inputs that
+  contain CDSs that encode each distinct protein (hashtable that maps each protein SHA-256 hash to the list of
+  unique integer identifiers for the distinct CDSs, encoded with polyline encoding).
+
+- Minimizer-based clustering. The distinct proteins are sorted in order of decreasing length and
+  clustered based on the percentage of shared distinct minimizers (default >= 20%, interior minimizers
+  selected based on lexicographic order, k=5, w=5). The first protein is chosen as representative of
+  the first cluster and a new cluster is defined each time a protein cannot be added to any of the
+  previously defined clusters based on the percentage of minimizers shared with the cluster representatives.
+
+- Exclude proteins that share >=90% minimizers with cluster representatives (we assume that these
+  sequences represent alleles for the same gene and only keep one representative per gene).
+
+- Exclude proteins that share >=90% minimizers with other proteins in the same cluster (a cluster
+  might include sequences from multiple genes and we want to keep only one representative sequence
+  per gene).
+
+- Align proteins in each cluster with BLASTp to select a set of representative proteins per cluster
+  based on the BLAST Score Ratio (BSR) computed for each alignment.
+
+- Align the selected representatives for all clusters with BLASTp to identify and exclude representative
+  proteins that are highly similar (default: BSR >= 0.6) to other representative proteins. The remaining
+  set of proteins is not considered highly similar based on the clustering or alignment approach and
+  constitutes the schema seed.
+
+- Create the schema seed directory structure with one FASTA file per representative CDS (proteins are converted
+  back into DNA). The schema seed can be used to perform allele calling.
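
Reviewer note: the deduplication, minimizer clustering and BSR steps described in the relocated list can be illustrated with a minimal Python sketch. This is not chewBBACA's code. All function names (``deduplicate``, ``minimizers``, ``cluster``, ``blast_score_ratio``) are hypothetical, and several details are assumptions made for illustration: the shared-minimizer fraction is computed relative to the query protein's own minimizer set, a protein joins the best-scoring cluster, and identifier lists are stored uncompressed (the polyline encoding, translation and BLASTp steps are omitted).

```python
import hashlib
from collections import defaultdict


def deduplicate(records):
    """Map the SHA-256 hash of each sequence to the sorted IDs of the inputs that contain it.

    `records` is an iterable of (input_id, sequence) pairs.
    """
    table = defaultdict(set)
    for input_id, seq in records:
        digest = hashlib.sha256(seq.encode()).hexdigest()
        table[digest].add(input_id)
    return {digest: sorted(ids) for digest, ids in table.items()}


def minimizers(seq, k=5, w=5):
    """Return the distinct interior minimizers of `seq`.

    Each window of `w` consecutive k-mers contributes its
    lexicographically smallest k-mer.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    if len(kmers) <= w:
        return {min(kmers)}
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def cluster(proteins, threshold=0.2, k=5, w=5):
    """Greedy clustering of `proteins` (dict: ID -> sequence).

    Proteins are visited in order of decreasing length. A protein joins the
    cluster whose representative shares the largest fraction of the protein's
    minimizers, provided that fraction is >= `threshold`; otherwise it becomes
    the representative of a new cluster.
    """
    order = sorted(proteins, key=lambda pid: len(proteins[pid]), reverse=True)
    rep_minimizers = {}  # representative ID -> minimizer set
    clusters = {}        # representative ID -> [(member ID, shared fraction)]
    for pid in order:
        mins = minimizers(proteins[pid], k, w)
        best_rep, best_score = None, 0.0
        for rep_id, rep_mins in rep_minimizers.items():
            score = len(mins & rep_mins) / len(mins) if mins else 0.0
            if score > best_score:
                best_rep, best_score = rep_id, score
        if best_rep is not None and best_score >= threshold:
            clusters[best_rep].append((pid, best_score))
        else:
            rep_minimizers[pid] = mins
            clusters[pid] = []
    return clusters


def blast_score_ratio(raw_score, self_score):
    """BSR: alignment raw score divided by the self-alignment raw score."""
    return raw_score / self_score
```

In this sketch, members stored with a shared fraction of 0.9 or higher would correspond to the proteins that the list describes as excluded because they are assumed to be alleles of the representative's gene.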