Commit 1413cdd

docs: add images with the workflows of the remaining modules.
1 parent a59ea12 commit 1413cdd

18 files changed (+144 -83 lines changed)

CHEWBBACA/docs/user/modules/AlleleCall.rst

+2-2
@@ -221,13 +221,13 @@ The column headers stand for:
 in the 5' end or 3' end (respectively) of the contig). This could be an artifact caused by
 genome fragmentation resulting in a shorter CDS prediction by Prodigal. To avoid locus
 misclassification, loci in such situations are classified as *PLOT*.
+- *LOTSC* - A locus is classified as *LOTSC* when the contig of the query genome is smaller
+than the matched allele.

 .. image:: /_static/images/PLOT5_PLOT3_LOTSC.png
 :width: 700px
 :align: center

-- *LOTSC* - A locus is classified as *LOTSC* when the contig of the query genome is smaller
-than the matched allele.
 - *NIPH* - Non-Informative Paralogous Hit (see image below). When ≥2 CDSs in the query
 genome match one locus in the schema with a BSR > 0.6, that locus is classified as *NIPH*.
 This suggests that such locus can have paralogous (or orthologous) loci in the query genome

CHEWBBACA/docs/user/modules/AlleleCallEvaluator.rst

+7
@@ -263,3 +263,10 @@ options ``--retree 1`` and ``--maxiterate 0``. The MSAs for all the core loci ar
 .. image:: /_static/images/allelecall_report_cgMLST_tree.png
 :width: 1400px
 :align: center
+
+Workflow of the AlleleCallEvaluator module
+::::::::::::::::::::::::::::::::::::::::::
+
+.. image:: /_static/images/AlleleCallEvaluator.png
+:width: 1200px
+:align: center

CHEWBBACA/docs/user/modules/CreateSchema.rst

+51-46
@@ -43,52 +43,6 @@ Create a wgMLST schema
 Given a set of genome assemblies in FASTA format, chewBBACA offers the option to create a new schema by defining
 the distinct loci present in the genomes.

-The schema creation algorithm has the following main steps:
-
-- Gene prediction with Prodigal followed by coding sequence (CDS) extraction to create FASTA files
-that contain all CDSs extracted from the inputs. (there is also the option to provide FASTA files
-with CDSs and the ``--cds`` parameter to skip the gene prediction step with Prodigal).
-
-- Identification of the distinct CDSs (chewBBACA stores information about the distinct CDSs and the
-genomes that contain those CDSs in a hashtable with the mapping between CDS SHA-256 and list of unique
-integer identifiers for the inputs that contain each CDS compressed with `polyline encoding <https://developers.google.com/maps/documentation/utilities/polylinealgorithm>`_
-adapted from `numcompress <https://github.com/amit1rrr/numcompress>`_).
-
-- Exclusion of the CDSs smaller than the value passed to the ``--l`` parameter (default: 201).
-
-- Translation of distinct CDSs that were not an exact match in the previous step (This step identifies
-and excludes CDSs that contain ambiguous bases).
-
-- Protein deduplication to identify the distinct set of proteins and keep information about the inputs that
-contain CDSs that encode each distinct protein (hashtable with mapping between protein SHA-256 and list of
-unique integer identifiers for the distinct CDSs encoded with polyline encoding).
-
-- Minimizer-based clustering. The distinct proteins are sorted in order of decreasing length and
-clustered based on the percentage of shared distinct minimizers (default >= 20%, interior minimizers
-selected based on lexicographic order, k=5, w=5). The first protein is chosen as representative of
-the first cluster and a new cluster is defined each time a protein cannot be added to any of the
-previously defined clusters based on the percentage of minimizers shared with the cluster representatives.
-
-- Exclude proteins that share >=90% minimizers with cluster representatives (we assume that these
-sequences represent alleles for the same gene and only keep one representative per gene).
-
-- Exclude proteins that share >=90% minimizers with other proteins in the same cluster (a cluster
-might include sequences from multiple genes and we want to keep only one representative sequence
-per gene).
-
-- Align proteins in each cluster with BLASTp to select a set of representative proteins per cluster
-based on the BLAST Score Ratio (BSR) computed for each alignment.
-
-- Align the selected representatives for all clusters with BLASTp to identify and exclude representative
-proteins that are highly similar (default: BSR >= 0.6) to other representative proteins. The remaining
-set of proteins is not considered highly similar based on the clustering or alignment approach and
-constitutes the schema seed.
-
-- Create the schema seed directory structure with one FASTA file per representative CDS (proteins are converted
-back into DNA). The schema seed can be used to perform allele calling.
-
-.. image::
-
 Basic Usage
 -----------

@@ -198,3 +152,54 @@ Outputs

 - The ``invalid_cds.txt`` file contains the list of alleles predicted by Prodigal that were
 excluded based on the minimum sequence size value and presence of ambiguous bases.
+
+Workflow of the CreateSchema module
+:::::::::::::::::::::::::::::::::::
+
+.. image:: /_static/images/CreateSchema.png
+:width: 1200px
+:align: center
+
+The schema creation algorithm has the following main steps:
+
+- Gene prediction with Prodigal followed by coding sequence (CDS) extraction to create FASTA files
+that contain all CDSs extracted from the inputs. (there is also the option to provide FASTA files
+with CDSs and the ``--cds`` parameter to skip the gene prediction step with Prodigal).
+
+- Identification of the distinct CDSs (chewBBACA stores information about the distinct CDSs and the
+genomes that contain those CDSs in a hashtable with the mapping between CDS SHA-256 and list of unique
+integer identifiers for the inputs that contain each CDS compressed with `polyline encoding <https://developers.google.com/maps/documentation/utilities/polylinealgorithm>`_
+adapted from `numcompress <https://github.com/amit1rrr/numcompress>`_).
+
+- Exclusion of the CDSs smaller than the value passed to the ``--l`` parameter (default: 201).
+
+- Translation of distinct CDSs that were not an exact match in the previous step (This step identifies
+and excludes CDSs that contain ambiguous bases).
+
+- Protein deduplication to identify the distinct set of proteins and keep information about the inputs that
+contain CDSs that encode each distinct protein (hashtable with mapping between protein SHA-256 and list of
+unique integer identifiers for the distinct CDSs encoded with polyline encoding).
+
+- Minimizer-based clustering. The distinct proteins are sorted in order of decreasing length and
+clustered based on the percentage of shared distinct minimizers (default >= 20%, interior minimizers
+selected based on lexicographic order, k=5, w=5). The first protein is chosen as representative of
+the first cluster and a new cluster is defined each time a protein cannot be added to any of the
+previously defined clusters based on the percentage of minimizers shared with the cluster representatives.
+
+- Exclude proteins that share >=90% minimizers with cluster representatives (we assume that these
+sequences represent alleles for the same gene and only keep one representative per gene).
+
+- Exclude proteins that share >=90% minimizers with other proteins in the same cluster (a cluster
+might include sequences from multiple genes and we want to keep only one representative sequence
+per gene).
+
+- Align proteins in each cluster with BLASTp to select a set of representative proteins per cluster
+based on the BLAST Score Ratio (BSR) computed for each alignment.
+
+- Align the selected representatives for all clusters with BLASTp to identify and exclude representative
+proteins that are highly similar (default: BSR >= 0.6) to other representative proteins. The remaining
+set of proteins is not considered highly similar based on the clustering or alignment approach and
+constitutes the schema seed.
+
+- Create the schema seed directory structure with one FASTA file per representative CDS (proteins are converted
+back into DNA). The schema seed can be used to perform allele calling.
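
Reviewer note: the minimizer-based clustering step described in the CreateSchema list can be illustrated with a short Python sketch. This is not chewBBACA's implementation. The function names ``minimizers`` and ``cluster_proteins`` are hypothetical, and the similarity score (shared distinct minimizers divided by the number of distinct minimizers of the query protein) is an assumption about how the shared percentage is computed. Only the parameters stated in the text (k=5, w=5, lexicographic selection, 20% threshold, decreasing length order) are taken from the documentation.

```python
def minimizers(seq, k=5, w=5):
    """Return the distinct interior minimizers of seq.

    For every window of w consecutive k-mers, the lexicographically
    smallest k-mer is selected (assumed selection rule).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    if len(kmers) <= w:
        # Sequence too short for a full window: use a single window.
        return {min(kmers)}
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def cluster_proteins(proteins, threshold=0.2, k=5, w=5):
    """Greedy clustering of proteins (dict: identifier -> sequence).

    Proteins are visited in order of decreasing length. A protein joins
    the cluster of the representative it shares the most minimizers with
    if the shared fraction is >= threshold; otherwise it becomes the
    representative of a new cluster.
    """
    ordered = sorted(proteins.items(), key=lambda kv: len(kv[1]), reverse=True)
    rep_minimizers = {}  # representative id -> set of minimizers
    clusters = {}        # representative id -> [(member id, shared fraction)]
    for pid, seq in ordered:
        mins = minimizers(seq, k, w)
        best_rep, best_score = None, 0.0
        for rid, rmins in rep_minimizers.items():
            # Assumed score: fraction of the query's minimizers found in the representative.
            score = len(mins & rmins) / len(mins) if mins else 0.0
            if score > best_score:
                best_rep, best_score = rid, score
        if best_rep is not None and best_score >= threshold:
            clusters[best_rep].append((pid, best_score))
        else:
            rep_minimizers[pid] = mins
            clusters[pid] = []
    return clusters
```

In this sketch, members whose shared fraction is 0.9 or higher would be the ones dropped by the subsequent exclusion step, since the text treats them as alleles of the representative's gene.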

CHEWBBACA/docs/user/modules/DownloadSchema.rst

+7
@@ -111,3 +111,10 @@ Parameters

 --latest (Optional) If the compressed version that is available is not the
 latest, downloads all loci and constructs schema locally (default: False).
+
+Workflow of the DownloadSchema module
+:::::::::::::::::::::::::::::::::::::
+
+.. image:: /_static/images/DownloadSchema.png
+:width: 1200px
+:align: center

CHEWBBACA/docs/user/modules/ExtractCgMLST.rst

+7
@@ -93,3 +93,10 @@ Example of the plot created by the ExtractCgMLST module based on the allelic pro
 .. note::
 The matrix with allelic profiles created by the *ExtractCgMLST* process can be imported
 into `PHYLOViZ <https://online.phyloviz.net/index>`_ to visualize and explore typing results.
+
+Workflow of the ExtractCgMLST module
+::::::::::::::::::::::::::::::::::::
+
+.. image:: /_static/images/ExtractCgMLST.png
+:width: 1200px
+:align: center

CHEWBBACA/docs/user/modules/LoadSchema.rst

+7
@@ -183,3 +183,10 @@ Parameters

 --continue_up (Optional) Check if the schema upload was interrupted and attempt
 to continue upload (default: False).
+
+Workflow of the LoadSchema module
+:::::::::::::::::::::::::::::::::
+
+.. image:: /_static/images/LoadSchema.png
+:width: 1200px
+:align: center

CHEWBBACA/docs/user/modules/PrepExternalSchema.rst

+42-35
@@ -19,41 +19,6 @@ Every sequence that is included in the final schema has to represent a complete
 multiple of 3 and cannot contain in-frame stop codons) and contain no invalid or ambiguous
 characters (sequences must be composed of ATCG only).

-By default, the process will adapt the external schema based on a BLAST Score Ratio (BSR) value of
-``0.6``, it will accept sequences of any length and will use the genetic code ``11`` (Bacteria and
-Archaea) to translate sequences. These options can be changed by passing different values to
-the ``--bsr``, ``--l`` and ``--t`` arguments. The process runs relatively fast with the default value
-for the ``--cpu`` argument, but it will complete considerably faster if it can use several CPU cores
-to evaluate several loci in parallel.
-
-For each gene in the external schema, and assuming the default BSR value, the process will:
-
-- Exclude sequences with invalid or ambiguous characters.
-- Exclude sequences with length value that is not a multiple of 3.
-- Try to translate sequences and exclude sequences that cannot be translated in any possible
-orientation due to invalid start and/or stop codons or in-frame stop codons.
-- Select the longest (or one of the longest) sequence as the first representative for that gene.
-- Use BLASTp to align the representative against all sequences that were not excluded.
-- Calculate the BSR value for each alignment.
-- If all BSR values are greater than 0.7, the current representative is considered appropriate
-to capture the gene sequence diversity when performing allele calling.
-- Otherwise, an additional representative has to be chosen in order to find a suitable set of
-representatives for the gene. The new representative will be the longest sequence from the
-set of non-representative sequences that had a BSR value in the interval [0.6,0.7] (in this
-BSR value interval, aligned sequences are still considered to be alleles of the same gene but
-display a degree of dissimilarity that can contribute to an increase of the sensitivity
-compared to the utilization of only one of those sequences as representative). If there is
-no alignment with a BSR value in the interval [0.6,0.7], the next representative will be the
-longest (or one of the longest) sequence from the set of sequences that had an alignment with
-a BSR<0.6.
-- The process will keep expanding the set of representatives until we have a set of
-representatives that when aligned against all alleles of the gene, guarantee that each allele
-has at least one alignment with a BSR>0.7.
-
-After determining the representative sequences, the process writes the FASTA file with all valid
-sequences to the adapted schema directory and the FASTA file with only the representatives to
-the *short* directory inside the adapted schema directory.
-
 Basic Usage
 -----------

@@ -126,3 +91,45 @@ Outputs
 highly similar sequences, and for genes that have inversions, deletions or insertions
 that can lead to several High-scoring Segment Pairs (HSPs), none of which have a score
 sufficiently high to identify both sequences as belonging to the same gene.
+
+Workflow of the PrepExternalSchema module
+:::::::::::::::::::::::::::::::::::::::::
+
+.. image:: /_static/images/PrepExternalSchema.png
+:width: 1200px
+:align: center
+
+By default, the process will adapt the external schema based on a BLAST Score Ratio (BSR) value of
+``0.6``, it will accept sequences of any length and will use the genetic code ``11`` (Bacteria and
+Archaea) to translate sequences. These options can be changed by passing different values to
+the ``--bsr``, ``--l`` and ``--t`` arguments. The process runs relatively fast with the default value
+for the ``--cpu`` argument, but it will complete considerably faster if it can use several CPU cores
+to evaluate several loci in parallel.
+
+For each gene in the external schema, and assuming the default BSR value, the process will:
+
+- Exclude sequences with invalid or ambiguous characters.
+- Exclude sequences with length value that is not a multiple of 3.
+- Try to translate sequences and exclude sequences that cannot be translated in any possible
+orientation due to invalid start and/or stop codons or in-frame stop codons.
+- Select the longest (or one of the longest) sequence as the first representative for that gene.
+- Use BLASTp to align the representative against all sequences that were not excluded.
+- Calculate the BSR value for each alignment.
+- If all BSR values are greater than 0.7, the current representative is considered appropriate
+to capture the gene sequence diversity when performing allele calling.
+- Otherwise, an additional representative has to be chosen in order to find a suitable set of
+representatives for the gene. The new representative will be the longest sequence from the
+set of non-representative sequences that had a BSR value in the interval [0.6,0.7] (in this
+BSR value interval, aligned sequences are still considered to be alleles of the same gene but
+display a degree of dissimilarity that can contribute to an increase of the sensitivity
+compared to the utilization of only one of those sequences as representative). If there is
+no alignment with a BSR value in the interval [0.6,0.7], the next representative will be the
+longest (or one of the longest) sequence from the set of sequences that had an alignment with
+a BSR<0.6.
+- The process will keep expanding the set of representatives until we have a set of
+representatives that when aligned against all alleles of the gene, guarantee that each allele
+has at least one alignment with a BSR>0.7.
+
+After determining the representative sequences, the process writes the FASTA file with all valid
+sequences to the adapted schema directory and the FASTA file with only the representatives to
+the *short* directory inside the adapted schema directory.
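
Reviewer note: the representative selection loop described in the PrepExternalSchema text can be sketched in Python as follows. This is a simplified illustration, not the module's code. ``toy_score`` is a hypothetical stand-in for the raw score that BLASTp would report (it only counts identical positions without gaps), ``select_representatives`` is a hypothetical name, and deriving the upper bound as ``bsr + 0.1`` is an assumption based on the [0.6,0.7] interval given for the default BSR. Sequences are assumed to be non-empty, already validated protein translations.

```python
def toy_score(query, subject):
    """Hypothetical stand-in for a BLASTp raw score: identical aligned positions."""
    return sum(q == s for q, s in zip(query, subject))


def select_representatives(alleles, score=toy_score, bsr=0.6):
    """Pick representatives for one gene (dict: allele id -> protein sequence).

    BSR = score(representative, allele) / score(representative, representative).
    Representatives are added until every allele has at least one alignment
    with a BSR above the upper bound (assumed to be bsr + 0.1).
    """
    upper = bsr + 0.1
    best = {aid: 0.0 for aid in alleles}  # highest BSR seen per allele
    representatives = []
    candidates = list(alleles)            # first pick: longest of all alleles
    while candidates:
        rep = max(candidates, key=lambda aid: len(alleles[aid]))
        representatives.append(rep)
        self_score = score(alleles[rep], alleles[rep])
        for aid, seq in alleles.items():
            best[aid] = max(best[aid], score(alleles[rep], seq) / self_score)
        # Alleles not yet captured by any representative.
        uncovered = [aid for aid in alleles if best[aid] <= upper]
        # Prefer alleles in the [bsr, upper] interval, else fall back to BSR < bsr.
        borderline = [aid for aid in uncovered if best[aid] >= bsr]
        candidates = borderline or uncovered
    return representatives
```

The loop terminates because every new representative aligns against itself with a BSR of 1.0, so at least one previously uncovered allele becomes covered in each iteration.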

CHEWBBACA/docs/user/modules/SchemaEvaluator.rst

+7
@@ -370,3 +370,10 @@ code editor is in readonly mode (possible to copy and search but not to edit the
 .. image:: /_static/images/loci_reports_protein_editor.png
 :width: 1400px
 :align: center
+
+Workflow of the SchemaEvaluator module
+::::::::::::::::::::::::::::::::::::::
+
+.. image:: /_static/images/SchemaEvaluator.png
+:width: 1200px
+:align: center

CHEWBBACA/docs/user/modules/SyncSchema.rst

+7
@@ -106,3 +106,10 @@ Parameters
 --submit If the process should identify new alleles in the local schema and
 send them to the Chewie-NS instance. (only authorized users can submit new alleles)
 (default: False).
+
+Workflow of the SyncSchema module
+:::::::::::::::::::::::::::::::::
+
+.. image:: /_static/images/SyncSchema.png
+:width: 1200px
+:align: center

CHEWBBACA/docs/user/modules/UniprotFinder.rst

+7
@@ -64,3 +64,10 @@ passed to the ``-t`` parameter, the output file will also include the loci coord
 the provided terms. The reference proteomes are downloaded and the process aligns the loci
 representative alleles against the reference proteomes to include the product and gene name
 in the reference proteomes.
+
+Workflow of the UniprotFinder module
+::::::::::::::::::::::::::::::::::::
+
+.. image:: /_static/images/UniprotFinder.png
+:width: 1200px
+:align: center
