Commit 58c8046

docs: update CreateSchema page.
1 parent 15deb36 commit 58c8046

File tree

1 file changed

+24
-89
lines changed

CHEWBBACA/docs/user/modules/CreateSchema.rst

@@ -4,59 +4,39 @@ CreateSchema - Create a gene-by-gene schema
44
What is a Schema and how is it defined
55
::::::::::::::::::::::::::::::::::::::
66

7-
A Schema is a pre-defined set of loci that is used in MLST analyses. Traditional MLST schemas
8-
relied in 7 loci that were internal fragments of housekeeping genes and each locus was defined
9-
by its amplification by a pair of primers yielding a fragment of a defined size.
7+
A Schema is a pre-defined set of loci that is used in MLST analyses. Traditional MLST schemas relied on 7 loci that were internal fragments of housekeeping genes, and each locus was defined by its amplification by a pair of primers yielding a fragment of a defined size.
108

119
In genomic analyses, schemas are a set of loci that are:
1210

13-
- Present in the majority of strains for core genome (cg) MLST schemas, typically a threshold
14-
of presence in 95% of the strains is used in schema creation. The assumption is that in each
15-
strain up to 5% of loci may not be identified due to sequencing coverage problems, assembly
16-
problems or other issues related to the use of draft genome assemblies.
11+
- Present in the majority of strains for core genome (cg) MLST schemas, typically a threshold of presence in 95% of the strains is used in schema creation. The assumption is that in each strain up to 5% of loci may not be identified due to sequencing coverage problems, assembly problems or other issues related to the use of draft genome assemblies.
1712

18-
- Present in at least one of the analyzed strains in the schema creation for pan genome/whole
19-
genome (pg/wg) MLST schemas.
13+
- Present in at least one of the analyzed strains in the schema creation for pan genome/whole genome (pg/wg) MLST schemas.
2014

2115
- Present in less than 95% of the strains for accessory genome (ag) MLST schemas.
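
As a rough illustration of the operational definitions above, the following sketch labels loci by the fraction of strains in which they were found. The function name, the counts and the inclusive comparison at the 95% threshold are assumptions made for this example, not part of chewBBACA.

```python
def classify_locus(presence_fraction, core_threshold=0.95):
    """Label a locus from the fraction of strains in which it was identified.

    Every locus found in at least one strain belongs to the pan/whole genome
    schema; this function only separates core from accessory loci.
    """
    if not 0 < presence_fraction <= 1:
        raise ValueError("presence_fraction must be in (0, 1]")
    # Assumption: the threshold is inclusive (>= 95% counts as core).
    return "core" if presence_fraction >= core_threshold else "accessory"


# Hypothetical presence counts for three loci across 200 strains.
n_strains = 200
counts = {"locusA": 200, "locusB": 192, "locusC": 37}
labels = {locus: classify_locus(n / n_strains) for locus, n in counts.items()}
print(labels)
```

Changing the strain set or ``core_threshold`` changes the labels, which is why these definitions are operational rather than absolute.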
2216

23-
It is important to consider that these definitions are always operational in nature, in the sense
24-
that the analyses are performed on a limited number of strains representing part of the biological
25-
diversity of a given species or genus and are always dependent on the definition of thresholds.
17+
It is important to consider that these definitions are always operational in nature, in the sense that the analyses are performed on a limited number of strains representing part of the biological diversity of a given species or genus and are always dependent on the definition of thresholds.
2618

27-
In most cg/wg/pg/ag MLST schemas, contrary to MLST schemas, each locus corresponds to a coding sequence
28-
(CDS). However, depending on the allele calling algorithm, the alleles called for a given locus can be
29-
CDSs or best matches to existing CDSs without enforcing the need for the identified allele to be a CDS.
19+
In most cg/wg/pg/ag MLST schemas, contrary to MLST schemas, each locus corresponds to a coding sequence (CDS). However, depending on the allele calling algorithm, the alleles called for a given locus can be CDSs or best matches to existing CDSs without enforcing the need for the identified allele to be a CDS.
3020

31-
In **chewBBACA**, schemas are composed of loci defined by CDSs and, by default, all the called alleles of a given
32-
locus are CDSs as defined by `Prodigal <https://github.com/hyattpd/Prodigal>`_ (for chewBBACA<=3.2.0) or
33-
`Pyrodigal <https://github.com/althonos/pyrodigal>`_ (for chewBBACA>=3.3.0) (it is also possible to provide
34-
FASTA files with CDSs and the ``--cds`` parameter to skip the gene prediction step with Prodigal).
35-
The use of Prodigal or Pyrodigal, instead of simply ensuring the presence of start and stop codons, adds an extra layer
36-
of confidence in identifying the most probable CDS for each allele. Because of this approach there may
37-
be variability in the size of the alleles identified by **chewBBACA** and by default a threshold of +/-20%
38-
of the mode of the size of the alleles of a given locus is used to identify a locus as present.
21+
In **chewBBACA**, schemas are composed of loci defined by CDSs and, by default, all the called alleles of a given locus are CDSs as defined by `Prodigal <https://github.com/hyattpd/Prodigal>`_ (for chewBBACA<=3.2.0) or `Pyrodigal <https://github.com/althonos/pyrodigal>`_ (for chewBBACA>=3.3.0) (it is also possible to provide FASTA files with CDSs and the ``--cds`` parameter to skip the gene prediction step). The use of Prodigal or Pyrodigal, instead of simply ensuring the presence of start and stop codons, adds an extra layer of confidence in identifying the most probable CDS for each allele. Because of this approach there may be variability in the size of the alleles identified by **chewBBACA** and by default a threshold of +/-20% of the mode of the size of the alleles of a given locus is used to identify a locus as present.
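
The size threshold described above can be sketched as follows. This is a minimal illustration, assuming the bounds are inclusive and that ties for the mode are resolved by taking the first value encountered; the function name is hypothetical and chewBBACA's own implementation may differ.

```python
from collections import Counter


def size_within_threshold(allele_lengths, candidate_length, threshold=0.2):
    """Check whether a candidate length is within +/-threshold of the mode."""
    # Mode of the allele lengths already known for the locus.
    # Assumption: on ties, the first length encountered is used.
    mode_length = Counter(allele_lengths).most_common(1)[0][0]
    lower = mode_length * (1 - threshold)
    upper = mode_length * (1 + threshold)
    return lower <= candidate_length <= upper


# Hypothetical locus whose alleles are mostly 900 bp long.
known_lengths = [900, 900, 903, 897]
print(size_within_threshold(known_lengths, 1000))  # inside 720-1080
print(size_within_threshold(known_lengths, 700))   # below the lower bound
```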
3922

4023
Create a wgMLST schema
4124
::::::::::::::::::::::
4225

43-
Given a set of genome assemblies in FASTA format, chewBBACA offers the option to create a new schema by defining
44-
the distinct loci present in the genomes.
26+
Given a set of genome assemblies in FASTA format, chewBBACA offers the option to create a new schema by defining the distinct loci present in the genomes.
4527

4628
Basic Usage
4729
-----------
4830

4931
::
5032

51-
$ chewBBACA.py CreateSchema -i /path/to/InputAssembliesFolder -o /path/to/OutputFolder --n SchemaName --ptf /path/to/ProdigalTrainingFile --cpu 4
33+
$ chewBBACA.py CreateSchema -i /path/to/InputAssembliesFolder -o /path/to/OutputFolder --n SchemaName --ptf /path/to/ProdigalTrainingFile
5234

5335
.. important::
54-
You should adjust the value passed to the ``--cpu`` parameter based on the specifications of
55-
your machine. chewBBACA will automatically adjust the value if it matches or exceeds the number
56-
of available CPU cores.
36+
You should adjust the value passed to the ``--cpu`` parameter based on the specifications of your machine. chewBBACA will automatically adjust the value if it matches or exceeds the number of available CPU cores.
5737

5838
.. important::
59-
The use of a prodigal training file for schema creation is highly recommended.
39+
The use of a Prodigal training file for schema creation is highly recommended. The training file is included in the newly created schema and is used to predict genes during the allele calling process.
6040

6141
Parameters
6242
----------
@@ -106,12 +86,7 @@ Parameters
10686
are not deleted at the end (default: False).
10787

10888
.. important::
109-
If you provide the ``--cds-input`` parameter, chewBBACA assumes that the input FASTA files contain
110-
coding sequences and skips the gene prediction step with Prodigal. To avoid issues related with the
111-
format of the sequence headers, chewBBACA renames the sequence headers based on the unique basename
112-
prefix determined for each input file and on the order of the coding sequences (e.g.: coding sequences
113-
inside a file named ``GCF_000007125.1_ASM712v1_cds_from_genomic.fna`` are renamed to
114-
``GCF_000007125-protein1``, ``GCF_000007125-protein2``, ..., ``GCF_000007125-proteinN``).
89+
If you provide the ``--cds-input`` parameter, chewBBACA assumes that the input FASTA files contain coding sequences and skips the gene prediction step with Prodigal. To avoid issues related with the format of the sequence headers, chewBBACA renames the sequence headers based on the unique basename prefix determined for each input file and on the order of the coding sequences (e.g.: coding sequences inside a file named ``GCF_000007125.1_ASM712v1_cds_from_genomic.fna`` are renamed to ``GCF_000007125-protein1``, ``GCF_000007125-protein2``, ..., ``GCF_000007125-proteinN``).
11590

11691
Outputs
11792
-------
@@ -131,27 +106,15 @@ Outputs
131106
├── invalid_cds.txt
132107
└── cds_coordinates.tsv
133108

134-
- One FASTA file per distinct gene identified in the schema creation process in the
135-
``OutputFolderName/SchemaName`` directory. The name attributed to each FASTA file in
136-
the schema is based on the genome of origin of the representative allele chosen for that
137-
gene and on the order of gene prediction (e.g.: ``GCA-000167715-protein12.fasta``,
138-
first allele for the gene was identified in a genome assembly with the prefix ``GCA-000167715``
139-
and the gene was the 12th gene predicted by Prodigal in that assembly).
109+
- One FASTA file per distinct locus identified in the schema creation process in the ``OutputFolderName/SchemaName`` directory. The name attributed to each FASTA file in the schema is based on the genome of origin of the representative allele chosen for that locus and on the order of gene prediction (e.g.: ``GCA-000167715-protein12.fasta``, first allele for the locus was identified in a genome assembly with the prefix ``GCA-000167715`` and the locus was the 12th gene predicted in that assembly).
140110

141-
- The ``OutputFolderName/SchemaName`` directory also contains a directory named ``short`` that
142-
includes FASTA files with the representative alleles for each locus.
111+
- The ``OutputFolderName/SchemaName`` directory also contains a directory named ``short`` that includes FASTA files with the representative alleles for each locus.
143112

144-
- The training file passed to create the schema is also included in ``OutputFolderName/SchemaName``
145-
and will be automatically detected during the allele calling process.
113+
- The training file passed to create the schema is also included in ``OutputFolderName/SchemaName`` and will be automatically detected during the allele calling process.
146114

147-
- The ``cds_coordinates.tsv`` file contains the coordinates (genome unique identifier, contig
148-
identifier, start position, stop position, protein identifier attributed by chewBBACA, and coding
149-
strand (chewBBACA<=3.2.0 assigns 1 to the forward strand and 0 to the reverse strand and
150-
chewBBACA>=3.3.0 assigns 1 and -1 to the forward and reverse strands, respectively)) of the CDSs
151-
identified in each genome.
115+
- The ``cds_coordinates.tsv`` file contains the coordinates (genome unique identifier, contig identifier, start position, stop position, protein identifier attributed by chewBBACA, and coding strand (chewBBACA<=3.2.0 assigns 1 to the forward strand and 0 to the reverse strand and chewBBACA>=3.3.0 assigns 1 and -1 to the forward and reverse strands, respectively)) of the CDSs identified in each genome.
152116

153-
- The ``invalid_cds.txt`` file contains the list of alleles predicted by Prodigal that were
154-
excluded based on the minimum sequence size value and presence of ambiguous bases.
117+
- The ``invalid_cds.txt`` file contains the list of predicted alleles that were excluded based on the minimum sequence size value or the presence of ambiguous bases.
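
Because the coding strand values in ``cds_coordinates.tsv`` differ between chewBBACA<=3.2.0 (1/0) and chewBBACA>=3.3.0 (1/-1), downstream scripts that read files from both versions may want a single convention. The helper below is a hypothetical sketch of that normalization; it is not part of chewBBACA and makes no assumption about the column names in the file.

```python
def normalize_strand(value):
    """Map a strand value from either encoding to 1 (forward) or -1 (reverse)."""
    strand = int(value)
    if strand == 1:
        return 1
    # 0 is the reverse strand in chewBBACA<=3.2.0, -1 in chewBBACA>=3.3.0.
    if strand in (0, -1):
        return -1
    raise ValueError(f"unexpected strand value: {value!r}")


# Strand values as they might be read from the last column of the file.
print([normalize_strand(v) for v in ["1", "0", "-1"]])
```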
155118

156119
Workflow of the CreateSchema module
157120
:::::::::::::::::::::::::::::::::::
@@ -160,46 +123,18 @@ Workflow of the CreateSchema module
160123
:width: 1000px
161124
:align: center
162125

163-
The schema creation algorithm has the following main steps:
126+
The CreateSchema module creates a schema seed based on a set of FASTA files with genome assemblies or CDSs. Brief description of the workflow:
164127

165-
- Gene predictipon with Prodigal followed by coding sequence (CDS) extraction to create FASTA files
166-
that contain all CDSs extracted from the inputs. (there is also the option to provide FASTA files
167-
with CDSs and the ``--cds`` parameter to skip the gene prediction step with Prodigal).
128+
- If genome assemblies are given, the process starts by predicting CDSs for each genome using Pyrodigal. Alternatively, if FASTA files containing CDSs are provided (``--cds`` parameter), the process skips the gene prediction step.
168129

169-
- Identification of the distinct CDSs (chewBBACA stores information about the distinct CDSs and the
170-
genomes that contain those CDSs in a hashtable with the mapping between CDS SHA-256 and list of unique
171-
integer identifiers for the inputs that contain each CDS compressed with `polyline encoding <https://developers.google.com/maps/documentation/utilities/polylinealgorithm>`_
172-
adapted from `numcompress <https://github.com/amit1rrr/numcompress>`_).
130+
- The CDSs identified in the input files are deduplicated and translated (CDS translation identifies and excludes CDSs that contain ambiguous bases or are shorter than the threshold defined by the ``--l`` parameter), followed by a second deduplication step to determine the set of distinct translated CDSs. The schema creation process creates the same hash tables used in the allele calling process to store information about the distinct CDSs and distinct translated CDSs.
173131

174-
- Exclusion of the CDSs smaller than the value passed to the ``--l`` parameter (default: 201).
132+
- The distinct translated CDSs are clustered based on the proportion of minimizers (minimizers selected based on lexicographic order, k=5, w=5) shared with representative CDSs. The largest or one of the largest CDSs is selected as the first representative CDS. New representative CDSs are selected when CDSs share a low proportion (<0.2) of minimizers with any of the chosen representative CDSs.
175133

176-
- Translation of distinct CDSs that were not an exact match in the previous step (This step identifies
177-
and excludes CDSs that contain ambiguous bases).
134+
- Non-representative CDSs that share a proportion of minimizers ≥ 0.9 with the cluster representative are considered to correspond to the same locus and are excluded from the analysis.
178135

179-
- Protein deduplication to identify the distinct set of proteins and keep information about the inputs that
180-
contain CDSs that encode each distinct protein (hashtable with mapping between protein SHA-256 and list of
181-
unique integer identifiers for the distinct CDSs encoded with polyline encoding).
136+
- The proportion of shared minimizers between non-representative CDSs is determined to exclude CDSs sharing a proportion of minimizers ≥ 0.9 with larger CDSs.
182137

183-
- Minimizer-based clustering. The distinct proteins are sorted in order of decreasing length and
184-
clustered based on the percentage of shared distinct minimizers (default >= 20%, interior minimizers
185-
selected based on lexicographic order, k=5, w=5). The first protein is chosen as representative of
186-
the first cluster and a new cluster is defined each time a protein cannot be added to any of the
187-
previously defined clusters based on the percentage of minimizers shared with the cluster repsentatives.
138+
- Intracluster and intercluster alignments with BLASTp enable identifying and excluding CDSs similar to representative or larger non-representative CDSs based on a BLAST Score Ratio (BSR) ≥ 0.6.
188139

189-
- Exclude proteins that share >=90% minimizers with cluster representatives (we assume that these
190-
sequences represent alleles for the same gene and only keep one representative per gene).
191-
192-
- Exclude proteins that share >=90% minimizers with other proteins in the same cluster (a cluster
193-
might include sequences from multiple genes and we want to keep only one representative sequence
194-
per gene).
195-
196-
- Align proteins in each cluster with BLASTp to select a set of representative proteins per cluster
197-
based on the BLAST Score Ratio (BSR) computed for each alignment.
198-
199-
- Align the selected representatives for all clusters with BLASTp to identify and exclude representative
200-
proteins that are highly similar (default: BSR >= 0.6) to other representative proteins. The remaining
201-
set of proteins is not considered highly similar based on the clustering or alignment approach and
202-
constitutes the schema seed.
203-
204-
- Create the schema seed directory structure with one FASTA file per representative CDS (proteins are converted
205-
back into DNA). The schema seed can be used to perform allele calling.
140+
- Each remaining distinct CDS is considered to be an allele of a distinct locus. The process ends by creating a schema seed, which includes one FASTA file containing a single representative allele per distinct locus identified in the analysis.
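
The minimizer-based clustering steps listed above can be sketched in a few lines of Python. This is an illustrative simplification, not chewBBACA's implementation: the function names are hypothetical, the shared proportion is assumed to be computed relative to the query's distinct minimizers, and the exclusion of sequences sharing ≥ 0.9 of their minimizers with a representative is folded into the clustering loop instead of being run as a separate step.

```python
def minimizers(seq, k=5, w=5):
    """Distinct minimizers: the lexicographically smallest k-mer in each
    window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if len(kmers) <= w:
        return {min(kmers)} if kmers else set()
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def shared_proportion(query, reference):
    """Fraction of the query's minimizers also found in the reference."""
    return len(query & reference) / len(query) if query else 0.0


def cluster(proteins, new_rep_below=0.2, same_locus_at=0.9):
    """Greedy clustering of translated CDSs, longest sequence first."""
    ordered = sorted(proteins.items(), key=lambda kv: len(kv[1]), reverse=True)
    rep_minimizers = {}  # representative id -> set of minimizers
    clusters = {}        # representative id -> ids of retained members
    excluded = []        # ids treated as alleles of an existing locus
    for seqid, seq in ordered:
        mins = minimizers(seq)
        scores = {rep: shared_proportion(mins, rmins)
                  for rep, rmins in rep_minimizers.items()}
        best = max(scores, key=scores.get, default=None)
        if best is None or scores[best] < new_rep_below:
            # Too dissimilar to every representative: start a new cluster.
            rep_minimizers[seqid] = mins
            clusters[seqid] = []
        elif scores[best] >= same_locus_at:
            excluded.append(seqid)
        else:
            clusters[best].append(seqid)
    return clusters, excluded


# Hypothetical translated CDSs: p2 is a truncated copy of p1, p3 is unrelated.
p1 = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"
proteins = {"p1": p1, "p2": p1[:-3], "p3": "WWPPCCHHWWPPCCHHWWPPCCHHWWPP"}
clusters, excluded = cluster(proteins)
print(clusters, excluded)
```

In this toy input, ``p1`` and ``p3`` become representatives and ``p2`` is excluded because all of its minimizers are found in ``p1``. Sequences that remain as cluster members would then go through the BLASTp and BSR steps described above.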
