An implementation of the pairwise distance metric between groups of genetic variants, based on differences in fixed and non-fixed allele frequencies, described in Álvarez-Herrera & Sevilla et al. (2024) (see also CITATION.cff).
Briefly, we define the difference between two vectors of
Usage: afwdist [OPTIONS] --input <INPUT> --reference <REFERENCE> --output <OUTPUT>
Options:
-i, --input <INPUT> Input tree in CSV format (mandatory CSV columns are 'sample', 'position', 'sequence' and 'frequency')
-r, --reference <REFERENCE> Reference sequence in FASTA format
-o, --output <OUTPUT> Output CSV file with distances between each pair of samples
-s, --include-reference Include reference as a sample with 100% fixed alleles
-v, --verbose Enable debug messages
-h, --help Print help
-V, --version Print version
The program takes as input a table in CSV format (possibly derived from a VCF file) where each row represents a single genetic variant. The input table must contain four columns:
sample (a string): a unique identifier for the group of variants used in pairwise comparisons.
position (an integer): the site of the variant.
sequence (a string): the sequence of the variant (i.e. the alternate allele).
frequency (a real number from 0 to 1): the relative frequency of the variant within the sample.
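As an illustration of the input format described above, the following Python sketch parses a variant table and checks the four mandatory columns. It is not part of afwdist: the helper name read_variants and the example rows (sample names, positions and frequencies) are hypothetical and only show the expected shape of the data.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample", "position", "sequence", "frequency")

# Hypothetical example data: made-up samples and variants.
EXAMPLE_CSV = """sample,position,sequence,frequency
sampleA,241,T,1.0
sampleA,3037,T,0.35
sampleB,241,T,0.8
"""


def read_variants(handle):
    """Parse a variant table and return a list of typed rows.

    Raises ValueError if a mandatory column is missing or if a
    frequency falls outside the [0, 1] interval.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        frequency = float(row["frequency"])
        if not 0.0 <= frequency <= 1.0:
            raise ValueError(
                f"line {line_number}: frequency {frequency} outside [0, 1]"
            )
        rows.append(
            {
                "sample": row["sample"],
                "position": int(row["position"]),
                "sequence": row["sequence"],
                "frequency": frequency,
            }
        )
    return rows


variants = read_variants(io.StringIO(EXAMPLE_CSV))
```

Running a check like this before calling the program can catch malformed tables early, for instance after converting a VCF file by hand.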
In addition to the variant table, the program requires a reference sequence in FASTA format. The sequence should be the same one used for variant calling. This reference is used to infer the frequencies of reference alleles, assuming that any frequency not taken up by listed variants belongs to the reference allele at that site. In addition to the pairwise distances between samples, the distance between each sample and the reference sequence can also be calculated if requested (--include-reference). In that case, the reference is treated as a normal sample with no variant alleles (i.e. the reference allele is fixed, with a frequency of 1, at every site).
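The inference of reference allele frequencies can be sketched in Python as follows. This is a minimal illustration of the assumption stated above (the frequency not taken up by listed variants belongs to the reference allele), not the program's actual code. The function names and example values are hypothetical, and clamping at zero is an added safeguard against floating-point rounding rather than documented behavior.

```python
from collections import defaultdict


def reference_frequencies(variants):
    """Map (sample, position) to the inferred reference allele frequency.

    The reference allele is assumed to take whatever frequency is left
    over after summing all listed variants at that site in that sample.
    """
    taken = defaultdict(float)
    for variant in variants:
        taken[(variant["sample"], variant["position"])] += variant["frequency"]
    # Clamp at 0 (an assumption of this sketch) in case rounding pushes
    # the summed variant frequencies slightly above 1.
    return {key: max(0.0, 1.0 - total) for key, total in taken.items()}


def reference_frequency_at(inferred, sample, position):
    """Return the inferred frequency, or 1.0 for sites with no listed variant."""
    return inferred.get((sample, position), 1.0)


# Hypothetical variants: two minor alleles at one site, one fixed allele at another.
example_variants = [
    {"sample": "sampleA", "position": 100, "sequence": "T", "frequency": 0.25},
    {"sample": "sampleA", "position": 100, "sequence": "G", "frequency": 0.25},
    {"sample": "sampleA", "position": 200, "sequence": "C", "frequency": 1.0},
]

inferred = reference_frequencies(example_variants)
print(reference_frequency_at(inferred, "sampleA", 100))  # 0.5
print(reference_frequency_at(inferred, "sampleA", 200))  # 0.0
print(reference_frequency_at(inferred, "sampleA", 300))  # 1.0
```

Under the same assumption, the virtual reference sample corresponds to an empty variant list, so every site falls back to a reference allele frequency of 1.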
As a result, a table in CSV format is produced. This table contains three columns:
sample_m and sample_n (strings): the identifiers of the two samples being compared.
distance (a real number): the calculated pairwise distance between the two samples.
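For downstream analysis, the long-format output can be reshaped into a lookup table. The Python sketch below assumes the header names match the column names listed above and that the distance is symmetric, so each pair is stored in both directions. The function name load_distance_matrix and the example values are hypothetical and do not come from a real run.

```python
import csv
import io

# Hypothetical output table: made-up distances between three samples.
EXAMPLE_OUTPUT = """sample_m,sample_n,distance
sampleA,sampleB,0.25
sampleA,sampleC,0.5
sampleB,sampleC,0.75
"""


def load_distance_matrix(handle):
    """Read a pairwise distance table into a nested dictionary.

    Each pair is stored in both directions (assuming a symmetric
    distance), so matrix[m][n] and matrix[n][m] return the same value.
    """
    matrix = {}
    for row in csv.DictReader(handle):
        sample_m = row["sample_m"]
        sample_n = row["sample_n"]
        distance = float(row["distance"])
        matrix.setdefault(sample_m, {})[sample_n] = distance
        matrix.setdefault(sample_n, {})[sample_m] = distance
    return matrix


matrix = load_distance_matrix(io.StringIO(EXAMPLE_OUTPUT))
print(matrix["sampleC"]["sampleA"])  # 0.5
```

A nested dictionary keeps the sketch free of dependencies; the same rows could be pivoted into a data frame for clustering or plotting.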
Álvarez-Herrera, M. & Sevilla, J., Ruiz-Rodriguez, P., Vergara, A., Vila, J., Cano-Jiménez, P., González-Candelas, F., Comas, I., & Coscollá, M. (2024). VIPERA: Viral Intra-Patient Evolution Reporting and Analysis. Virus Evolution, 10(1), veae018. https://doi.org/10.1093/ve/veae018
Thanks go to these wonderful people (emoji key):
Miguel Álvarez Herrera 💻
Jordi Sevilla Fortuny 🐛 📓
This project follows the all-contributors specification.