B-UMMI
diff --git a/CHEWBBACA/docs/_static/images/AlleleCallEvaluator.png b/CHEWBBACA/docs/_static/images/AlleleCallEvaluator.png
342 KB
diff --git a/CHEWBBACA/docs/_static/images/DownloadSchema.png b/CHEWBBACA/docs/_static/images/DownloadSchema.png
117 KB
diff --git a/CHEWBBACA/docs/_static/images/ExtractCgMLST.png b/CHEWBBACA/docs/_static/images/ExtractCgMLST.png
142 KB
diff --git a/CHEWBBACA/docs/_static/images/LoadSchema.png b/CHEWBBACA/docs/_static/images/LoadSchema.png
224 KB
diff --git a/CHEWBBACA/docs/_static/images/PrepExternalSchema.png b/CHEWBBACA/docs/_static/images/PrepExternalSchema.png
124 KB
diff --git a/CHEWBBACA/docs/_static/images/SchemaEvaluator.png b/CHEWBBACA/docs/_static/images/SchemaEvaluator.png
173 KB
diff --git a/CHEWBBACA/docs/_static/images/SyncSchema.png b/CHEWBBACA/docs/_static/images/SyncSchema.png
243 KB
diff --git a/CHEWBBACA/docs/_static/images/UniprotFinder.png b/CHEWBBACA/docs/_static/images/UniprotFinder.png
209 KB
diff --git a/CHEWBBACA/docs/user/modules/AlleleCall.rst b/CHEWBBACA/docs/user/modules/AlleleCall.rst
+2 -2
diff --git a/CHEWBBACA/docs/user/modules/AlleleCallEvaluator.rst b/CHEWBBACA/docs/user/modules/AlleleCallEvaluator.rst
+7
diff --git a/CHEWBBACA/docs/user/modules/CreateSchema.rst b/CHEWBBACA/docs/user/modules/CreateSchema.rst
+51 -46
diff --git a/CHEWBBACA/docs/user/modules/DownloadSchema.rst b/CHEWBBACA/docs/user/modules/DownloadSchema.rst
+7
diff --git a/CHEWBBACA/docs/user/modules/ExtractCgMLST.rst b/CHEWBBACA/docs/user/modules/ExtractCgMLST.rst
+7
diff --git a/CHEWBBACA/docs/user/modules/LoadSchema.rst b/CHEWBBACA/docs/user/modules/LoadSchema.rst
+7
diff --git a/CHEWBBACA/docs/user/modules/PrepExternalSchema.rst b/CHEWBBACA/docs/user/modules/PrepExternalSchema.rst
+42 -35
diff --git a/CHEWBBACA/docs/user/modules/SchemaEvaluator.rst b/CHEWBBACA/docs/user/modules/SchemaEvaluator.rst
+7
diff --git a/CHEWBBACA/docs/user/modules/SyncSchema.rst b/CHEWBBACA/docs/user/modules/SyncSchema.rst
+7
diff --git a/CHEWBBACA/docs/user/modules/UniprotFinder.rst b/CHEWBBACA/docs/user/modules/UniprotFinder.rst
+7
@@ -221,13 +221,13 @@ The column headers stand for:
   in the 5' end or 3' end (respectively) of the contig). This could be an artifact caused by
   genome fragmentation resulting in a shorter CDS prediction by Prodigal. To avoid locus
   misclassification, loci in such situations are classified as *PLOT*.
+- *LOTSC* - A locus is classified as *LOTSC* when the contig of the query genome is smaller
+  than the matched allele.
 
 .. image:: /_static/images/PLOT5_PLOT3_LOTSC.png
    :width: 700px
    :align: center
 
-- *LOTSC* - A locus is classified as *LOTSC* when the contig of the query genome is smaller
-  than the matched allele.
 - *NIPH* - Non-Informative Paralogous Hit (see image below). When ≥2 CDSs in the query
   genome match one locus in the schema with a BSR > 0.6, that locus is classified as *NIPH*.
   This suggests that such locus can have paralogous (or orthologous) loci in the query genome
 
@@ -263,3 +263,10 @@ options ``--retree 1`` and ``--maxiterate 0``. The MSAs for all the core loci ar
 .. image:: /_static/images/allelecall_report_cgMLST_tree.png
    :width: 1400px
    :align: center
+
+Workflow of the AlleleCallEvaluator module
+::::::::::::::::::::::::::::::::::::::::::
+
+.. image:: /_static/images/AlleleCallEvaluator.png
+   :width: 1200px
+   :align: center
@@ -43,52 +43,6 @@ Create a wgMLST schema
 Given a set of genome assemblies in FASTA format, chewBBACA offers the option to create a new schema by defining
 the distinct loci present in the genomes.
 
-The schema creation algorithm has the following main steps:
-
-- Gene predictipon with Prodigal followed by coding sequence (CDS) extraction to create FASTA files
-  that contain all CDSs extracted from the inputs. (there is also the option to provide FASTA files
-  with CDSs and the ``--cds`` parameter to skip the gene prediction step with Prodigal).
-
-- Identification of the distinct CDSs (chewBBACA stores information about the distinct CDSs and the
-  genomes that contain those CDSs in a hashtable with the mapping between CDS SHA-256 and list of unique
-  integer identifiers for the inputs that contain each CDS compressed with `polyline encoding <https://developers.google.com/maps/documentation/utilities/polylinealgorithm>`_
-  adapted from `numcompress <https://github.com/amit1rrr/numcompress>`_).
-
-- Exclusion of the CDSs smaller than the value passed to the ``--l`` parameter (default: 201).
-
-- Translation of distinct CDSs that were not an exact match in the previous step (This step identifies
-  and excludes CDSs that contain ambiguous bases).
-
-- Protein deduplication to identify the distinct set of proteins and keep information about the inputs that
-  contain CDSs that encode each distinct protein (hashtable with mapping between protein SHA-256 and list of
-  unique integer identifiers for the distinct CDSs encoded with polyline encoding).
-
-- Minimizer-based clustering. The distinct proteins are sorted in order of decreasing length and
-  clustered based on the percentage of shared distinct minimizers (default >= 20%, interior minimizers
-  selected based on lexicographic order, k=5, w=5). The first protein is chosen as representative of
-  the first cluster and a new cluster is defined each time a protein cannot be added to any of the
-  previously defined clusters based on the percentage of minimizers shared with the cluster repsentatives.
-
-- Exclude proteins that share >=90% minimizers with cluster representatives (we assume that these
-  sequences represent alleles for the same gene and only keep one representative per gene).
-
-- Exclude proteins that share >=90% minimizers with other proteins in the same cluster (a cluster
-  might include sequences from multiple genes and we want to keep only one representative sequence
-  per gene).
-
-- Align proteins in each cluster with BLASTp to select a set of representative proteins per cluster
-  based on the BLAST Score Ratio (BSR) computed for each alignment.
-
-- Align the selected representatives for all clusters with BLASTp to identify and exclude representative
-  proteins that are highly similar (default: BSR >= 0.6) to other representative proteins. The remaining
-  set of proteins is not considered highly similar based on the clustering or alignment approach and
-  constitutes the schema seed.
-
-- Create the schema seed directory structure with one FASTA file per representative CDS (proteins are converted
-  back into DNA). The schema seed can be used to perform allele calling.
-
-.. image::
-
 Basic Usage
 -----------
 
@@ -198,3 +152,54 @@ Outputs
 
 - The ``invalid_cds.txt`` file contains the list of alleles predicted by Prodigal that were
   excluded based on the minimum sequence size value and presence of ambiguous bases.
+
+Workflow of the CreateSchema module
+:::::::::::::::::::::::::::::::::::
+
+.. image:: /_static/images/CreateSchema.png
+   :width: 1200px
+   :align: center
+
+The schema creation algorithm has the following main steps:
+
+- Gene prediction with Prodigal followed by coding sequence (CDS) extraction to create FASTA files
+  that contain all CDSs extracted from the inputs (there is also the option to provide FASTA files
+  with CDSs and the ``--cds`` parameter to skip the gene prediction step with Prodigal).
+
+- Identification of the distinct CDSs (chewBBACA stores information about the distinct CDSs and the
+  genomes that contain those CDSs in a hashtable with the mapping between CDS SHA-256 and list of unique
+  integer identifiers for the inputs that contain each CDS compressed with `polyline encoding <https://developers.google.com/maps/documentation/utilities/polylinealgorithm>`_
+  adapted from `numcompress <https://github.com/amit1rrr/numcompress>`_).
+
+- Exclusion of the CDSs smaller than the value passed to the ``--l`` parameter (default: 201).
+
+- Translation of distinct CDSs that were not an exact match in the previous step (this step identifies
+  and excludes CDSs that contain ambiguous bases).
+
+- Protein deduplication to identify the distinct set of proteins and keep information about the inputs that
+  contain CDSs that encode each distinct protein (hashtable with mapping between protein SHA-256 and list of
+  unique integer identifiers for the distinct CDSs encoded with polyline encoding).
+
+- Minimizer-based clustering. The distinct proteins are sorted in order of decreasing length and
+  clustered based on the percentage of shared distinct minimizers (default >= 20%, interior minimizers
+  selected based on lexicographic order, k=5, w=5). The first protein is chosen as representative of
+  the first cluster and a new cluster is defined each time a protein cannot be added to any of the
+  previously defined clusters based on the percentage of minimizers shared with the cluster representatives.
+
+- Exclude proteins that share >=90% minimizers with cluster representatives (we assume that these
+  sequences represent alleles for the same gene and only keep one representative per gene).
+
+- Exclude proteins that share >=90% minimizers with other proteins in the same cluster (a cluster
+  might include sequences from multiple genes and we want to keep only one representative sequence
+  per gene).
+
+- Align proteins in each cluster with BLASTp to select a set of representative proteins per cluster
+  based on the BLAST Score Ratio (BSR) computed for each alignment.
+
+- Align the selected representatives for all clusters with BLASTp to identify and exclude representative
+  proteins that are highly similar (default: BSR >= 0.6) to other representative proteins. The remaining
+  set of proteins is not considered highly similar based on the clustering or alignment approach and
+  constitutes the schema seed.
+
+- Create the schema seed directory structure with one FASTA file per representative CDS (proteins are converted
+  back into DNA). The schema seed can be used to perform allele calling.
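
The deduplication and clustering steps listed above can be illustrated with a minimal Python sketch. This is not chewBBACA's implementation: the function names (`deduplicate`, `minimizers`, `cluster`) are hypothetical, the polyline compression of the identifier lists is omitted, and two details are assumptions made for the example, namely that the shared fraction is computed over the query protein's own minimizers and that a protein joins the first cluster that passes the threshold.

```python
import hashlib
from collections import defaultdict


def deduplicate(records):
    """Map the SHA-256 of each distinct sequence to the input ids that contain it.

    `records` is an iterable of (input_id, sequence) pairs.
    """
    seen = defaultdict(list)
    for input_id, seq in records:
        digest = hashlib.sha256(seq.encode()).hexdigest()
        seen[digest].append(input_id)
    return dict(seen)


def minimizers(seq, k=5, w=5):
    """Return the distinct interior minimizers of `seq`.

    Every window of `w` consecutive k-mers contributes its
    lexicographically smallest k-mer.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    if len(kmers) <= w:
        return {min(kmers)}
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def cluster(proteins, threshold=0.2, k=5, w=5):
    """Greedy clustering of `proteins` (dict: id -> sequence).

    Proteins are visited in order of decreasing length. A protein is added
    to the first cluster whose representative shares at least `threshold`
    of the protein's minimizers (assumption); otherwise it becomes the
    representative of a new cluster.
    """
    ordered = sorted(proteins.items(), key=lambda kv: len(kv[1]), reverse=True)
    rep_minimizers = {}  # representative id -> minimizer set
    clusters = {}        # representative id -> [(member id, shared fraction)]
    for pid, seq in ordered:
        mins = minimizers(seq, k, w)
        for rid, rep_mins in rep_minimizers.items():
            shared = len(mins & rep_mins) / len(mins) if mins else 0.0
            if shared >= threshold:
                clusters[rid].append((pid, shared))
                break
        else:
            # No existing cluster accepted the protein: start a new one.
            rep_minimizers[pid] = mins
            clusters[pid] = []
    return clusters
```

In this sketch, cluster members whose shared fraction is at least 0.9 would be the ones dropped by the subsequent exclusion step, since they are assumed to be alleles of the representative's gene.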
@@ -111,3 +111,10 @@ Parameters
 
     --latest                     (Optional) If the compressed version that is available is not the
                                  latest, downloads all loci and constructs schema locally (default: False).
+
+Workflow of the DownloadSchema module
+:::::::::::::::::::::::::::::::::::::
+
+.. image:: /_static/images/DownloadSchema.png
+   :width: 1200px
+   :align: center
@@ -93,3 +93,10 @@ Example of the plot created by the ExtractCgMLST module based on the allelic pro
 .. note::
 	The matrix with allelic profiles created by the *ExtractCgMLST* process can be imported
 	into `PHYLOViZ <https://online.phyloviz.net/index>`_ to visualize and explore typing results.
+
+Workflow of the ExtractCgMLST module
+::::::::::::::::::::::::::::::::::::
+
+.. image:: /_static/images/ExtractCgMLST.png
+   :width: 1200px
+   :align: center
@@ -183,3 +183,10 @@ Parameters
 
     --continue_up               (Optional) Check if the schema upload was interrupted and attempt
                                 to continue upload (default: False).
+
+Workflow of the LoadSchema module
+:::::::::::::::::::::::::::::::::
+
+.. image:: /_static/images/LoadSchema.png
+   :width: 1200px
+   :align: center
@@ -19,41 +19,6 @@ Every sequence that is included in the final schema has to represent a complete
 multiple of 3 and cannot contain in-frame stop codons) and contain no invalid or ambiguous
 characters (sequences must be composed of ATCG only).
 
-By default, the process will adapt the external schema based on a BLAST Score Ratio (BSR) value of
-``0.6``, it will accept sequences of any length and will use the genetic code ``11`` (Bacteria and
-Archaea) to translate sequences. These options can be changed by passing different values to
-the ``--bsr``, ``--l`` and ``--t`` arguments. The process runs relatively fast with the default value
-for the ``--cpu`` argument, but it will complete considerably faster if it can use several CPU cores
-to evaluate several loci in parallel.
-
-For each gene in the external schema, and assuming the default BSR value, the process will:
-
-- Exclude sequences with invalid or ambiguous characters.
-- Exclude sequences with length value that is not a multiple of 3.
-- Try to translate sequences and exclude sequences that cannot be translated in any possible
-  orientation due to invalid start and/or stop codons or in-frame stop codons.
-- Select the longest (or one of the longest) sequence as the first representative for that gene;
-- Use BLASTp to align the representative against all sequences that were not excluded.
-- Calculate the BSR value for each alignment.
-- If all BSR values are greater than 0.7, the current representative is considered appropriate
-  to capture the gene sequence diversity when performing allele calling.
-- Otherwise, an additional representative has to be chosen in order to find a suitable set of
-  representatives for the gene. The new representative will be the longest sequence from the
-  set of non-representative sequences that had a BSR value in the interval [0.6,0.7] (in this
-  BSR value interval, aligned sequences are still considered to be alleles of the same gene but
-  display a degree of dissimilarity that can contribute to an increase of the sensitivity
-  compared to the utilization of only one of those sequences as representative). If there is
-  no alignment with a BSR value in the interval [0.6,0.7], the next representative will be the
-  longest (or one of the longest) sequence from the set of sequences that had an alignment with
-  a BSR<0.6.
-- The process will keep expanding the set of representatives until we have a set of
-  representatives that when aligned against all alleles of the gene, guarantee that each allele
-  has at least one alignment with a BSR>0.7.
-
-After determining the representative sequences, the process writes the FASTA file with all valid
-sequences to the adapted schema directory and the FASTA file with only the representatives to
-the *short* directory inside the adapted schema directory.
-
 Basic Usage
 -----------
 
@@ -126,3 +91,45 @@ Outputs
 	highly similar sequences, and for genes that have inversions, deletions or insertions
 	that can lead to several High-scoring Segment Pairs (HSPs), none of which have a score
 	sufficiently high to identify both sequences as belonging to the same gene.
+
+Workflow of the PrepExternalSchema module
+:::::::::::::::::::::::::::::::::::::::::
+
+.. image:: /_static/images/PrepExternalSchema.png
+   :width: 1200px
+   :align: center
+
+By default, the process adapts the external schema based on a BLAST Score Ratio (BSR) value of
+``0.6``, accepts sequences of any length and uses the genetic code ``11`` (Bacteria and
+Archaea) to translate sequences. These options can be changed by passing different values to
+the ``--bsr``, ``--l`` and ``--t`` arguments. The process runs relatively fast with the default value
+for the ``--cpu`` argument, but it completes considerably faster if it can use several CPU cores
+to evaluate several loci in parallel.
+
+For each gene in the external schema, and assuming the default BSR value, the process will:
+
+- Exclude sequences with invalid or ambiguous characters.
+- Exclude sequences with length value that is not a multiple of 3.
+- Try to translate sequences and exclude sequences that cannot be translated in any possible
+  orientation due to invalid start and/or stop codons or in-frame stop codons.
+- Select the longest (or one of the longest) sequence as the first representative for that gene.
+- Use BLASTp to align the representative against all sequences that were not excluded.
+- Calculate the BSR value for each alignment.
+- If all BSR values are greater than 0.7, the current representative is considered appropriate
+  to capture the gene sequence diversity when performing allele calling.
+- Otherwise, an additional representative has to be chosen in order to find a suitable set of
+  representatives for the gene. The new representative will be the longest sequence from the
+  set of non-representative sequences that had a BSR value in the interval [0.6,0.7] (in this
+  BSR value interval, aligned sequences are still considered to be alleles of the same gene but
+  display a degree of dissimilarity that can contribute to an increase of the sensitivity
+  compared to the utilization of only one of those sequences as representative). If there is
+  no alignment with a BSR value in the interval [0.6,0.7], the next representative will be the
+  longest (or one of the longest) sequence from the set of sequences that had an alignment with
+  a BSR<0.6.
+- The process keeps expanding the set of representatives until every allele of the gene has
+  at least one alignment against a representative with a BSR>0.7.
+
+After determining the representative sequences, the process writes the FASTA file with all valid
+sequences to the adapted schema directory and the FASTA file with only the representatives to
+the *short* directory inside the adapted schema directory.
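
The representative selection loop described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the module's code: `select_representatives` is a hypothetical name, the BSR values come from a caller-supplied function instead of BLASTp, and `toy_bsr` is only a stand-in similarity measure so that the example runs without BLAST. The thresholds `0.6` and `0.7` are the default values quoted in the text.

```python
def select_representatives(alleles, bsr, bsr_min=0.6, bsr_ok=0.7):
    """Pick representatives for one gene.

    alleles: dict mapping allele id -> valid sequence.
    bsr: callable (representative_seq, allele_seq) -> float; in the real
         process this value would be derived from BLASTp alignments.
    """
    # Longest sequences first; ties keep their input order.
    by_length = sorted(alleles, key=lambda a: len(alleles[a]), reverse=True)
    reps = [by_length[0]]
    best = {a: 0.0 for a in alleles}  # best BSR seen against any representative
    while True:
        rep = reps[-1]
        for a in alleles:
            best[a] = max(best[a], bsr(alleles[rep], alleles[a]))
        # Alleles that still lack an alignment with BSR > bsr_ok.
        uncovered = [a for a in by_length if a not in reps and best[a] <= bsr_ok]
        if not uncovered:
            return reps
        # Prefer the longest allele in [bsr_min, bsr_ok]; otherwise fall
        # back to the longest allele with BSR < bsr_min.
        in_interval = [a for a in uncovered if best[a] >= bsr_min]
        reps.append(in_interval[0] if in_interval else uncovered[0])


def toy_bsr(rep_seq, seq):
    """Stand-in for a BSR: fraction of identical positions (NOT a real BSR)."""
    matches = sum(x == y for x, y in zip(rep_seq, seq))
    return matches / max(len(rep_seq), len(seq))


alleles = {"a1": "MKTAYIAKQR", "a2": "MKTAYIAKQ", "a3": "MKTAYIWWWW"}
representatives = select_representatives(alleles, toy_bsr)
```

With the toy score, `a2` is covered by `a1` (0.9) while `a3` scores 0.6 against `a1`, so it is added as a second representative.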
@@ -370,3 +370,10 @@ code editor is in readonly mode (possible to copy and search but not to edit the
 .. image:: /_static/images/loci_reports_protein_editor.png
    :width: 1400px
    :align: center
+
+Workflow of the SchemaEvaluator module
+::::::::::::::::::::::::::::::::::::::
+
+.. image:: /_static/images/SchemaEvaluator.png
+   :width: 1200px
+   :align: center
@@ -106,3 +106,10 @@ Parameters
     --submit                    If the process should identify new alleles in the local schema and
                                 send them to the Chewie-NS instance. (only authorized users can submit new alleles)
                                 (default: False).
+
+Workflow of the SyncSchema module
+:::::::::::::::::::::::::::::::::
+
+.. image:: /_static/images/SyncSchema.png
+   :width: 1200px
+   :align: center
@@ -64,3 +64,10 @@ passed to the ``-t`` parameter, the output file will also include the loci coord
 the provided terms. The reference proteomes are downloaded and the process aligns the loci
 representative alleles against the reference proteomes to include the product and gene name
 in the reference proteomes.
+
+Workflow of the UniprotFinder module
+::::::::::::::::::::::::::::::::::::
+
+.. image:: /_static/images/UniprotFinder.png
+   :width: 1200px
+   :align: center