Multiallelic Sites #11
alkaZeltser started this conversation in General
What are multiallelic sites?
A multiallelic site (as opposed to a biallelic site) is a locus in the genome where more than two alleles have been detected. Such sites are uncommon: the 1000 Genomes Project reported that only 0.3% of variant loci in its dataset were multiallelic SNPs (see 1KG supplementary information table 3). However, it is still possible to encounter them at polygenic score (PGS) loci, so it is important to understand how our tool handles them.
Multiallelic site formatting in VCF files
Sources of genetic data
Multiallelic sites can be detected in genetic data and written into a VCF from several types of sources.
Variant callers detect alleles by counting reads, and will typically report a multiallelic site as a single record in a VCF, with the alternate alleles listed together in the ALT column.
Microarray data can also contain multiallelic sites, depending on whether the analysis software used to interpret microarray intensity data is designed to call them.
Multiallelic sites can be found in the output of imputation algorithms designed to handle them. Typically the different alleles are imputed separately [still fact-checking this], so the output may record the site across two lines, one per alternate allele.
You may stumble on a multiallelic site that wasn't originally present in your data when merging your cohort with another cohort or reference panel that contains a third allele at that site. Tools like PLINK and bcftools that provide merge algorithms handle this scenario in different ways, which may result in either the single-record style, the split style, or the removal of the variant altogether.

Regardless of the data source, multiallelic sites can be formatted in either a one-line (merged) or a two-line (split) format, and tools such as bcftools norm can convert a file between the two. Merging multiallelic sites into one line is generally preferred because it more accurately reflects the biology: all the alleles belong to the same locus. This format also minimizes the possibility of edge cases that could lead to inaccurate dosage calculations relative to a risk allele from a PGS. Particularly tricky cases arise if the PGS risk allele happens to be the reference (REF) allele in a VCF, which can lead to double counting of risk dosage if the multiallelic site is not identified.
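To make the double-counting risk concrete, here is a minimal Python sketch. The site, the alleles, and the record layout are all made up for illustration, and the split encoding shown (the other alternate allele coded as 0 on each line) is an assumption about how a splitting tool may write genotypes, not a statement about any specific tool's output.

```python
# Illustrative records only: a hypothetical site with REF=A and ALT alleles G and T,
# and one sample that carries both alternate alleles (true genotype G/T).
merged = [
    {"chrom": "1", "pos": 1000, "ref": "A", "alt": ["G", "T"], "gt": (1, 2)},
]

# Assumed split encoding: each record keeps one ALT allele and codes the other as 0.
split = [
    {"chrom": "1", "pos": 1000, "ref": "A", "alt": ["G"], "gt": (1, 0)},
    {"chrom": "1", "pos": 1000, "ref": "A", "alt": ["T"], "gt": (0, 1)},
]


def allele_dosage(record, allele):
    """Count copies of `allele` in a single record's genotype."""
    alleles = [record["ref"]] + record["alt"]
    return sum(1 for index in record["gt"] if alleles[index] == allele)


def naive_site_dosage(records, allele):
    """Sum per-record dosages without checking whether records share a locus."""
    return sum(allele_dosage(record, allele) for record in records)


# The risk allele is the REF allele "A", which this sample does not carry.
merged_ref_dosage = naive_site_dosage(merged, "A")  # no copies of A counted
split_ref_dosage = naive_site_dosage(split, "A")    # one spurious A per split line
```

In the merged record the sample is correctly assigned zero copies of A, while the naive sum over the split records assigns two, because each split line reports the other alternate allele as if it were REF.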
Multiallelic site handling in GWAS
As discussed in #2, the model fitting stage during which individual SNP betas for a PGS are computed is typically a Genome-Wide Association Study (GWAS). Since multiallelic sites are rare, GWAS are typically underpowered to estimate the effect of the rarer third allele. It is also standard practice to restrict GWAS to biallelic sites, so it is highly unlikely that such a beta was computed in the first place and included in published PGS weight files.
Multiallelic site handling in ApplyPolygenicScore
PGS betas for multiple alleles
If the score you wish to apply provides betas for more than one allele at a locus, you can provide that information to our tool and an allele-specific weight × dosage calculation will be performed. Each additional beta must be provided as an additional row in the PGS weight file.
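As a rough illustration of an allele-specific weight × dosage calculation, the Python sketch below scores one sample at one site. It is not ApplyPolygenicScore's implementation; the field names (`chrom`, `pos`, `effect_allele`, `beta`), the site, and the beta values are hypothetical.

```python
# Hypothetical weight rows: two effect alleles with separate betas at the same locus,
# each supplied as its own row.
weights = [
    {"chrom": "1", "pos": 1000, "effect_allele": "G", "beta": 0.5},
    {"chrom": "1", "pos": 1000, "effect_allele": "T", "beta": -0.25},
]

# Hypothetical multiallelic site in merged form (REF=A, ALT=G,T).
site = {"chrom": "1", "pos": 1000, "ref": "A", "alt": ["G", "T"]}


def site_score(site, genotype, weights):
    """Sum beta * dosage over every weight row that matches this site."""
    alleles = [site["ref"]] + site["alt"]
    carried = [alleles[index] for index in genotype]  # e.g. (1, 2) -> ["G", "T"]
    total = 0.0
    for row in weights:
        if (row["chrom"], row["pos"]) != (site["chrom"], site["pos"]):
            continue  # weight row belongs to a different locus
        total += row["beta"] * carried.count(row["effect_allele"])
    return total


score_g_t = site_score(site, (1, 2), weights)  # one G and one T
score_g_g = site_score(site, (1, 1), weights)  # two copies of G
score_a_a = site_score(site, (0, 0), weights)  # no effect alleles carried
```

Because each allele's dosage is multiplied by its own beta, a G/T sample contributes both effects, while a sample homozygous for REF contributes nothing at this site.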
Multiple alleles in genetic input
If multiple alleles at a multiallelic site are present in the input VCF cohort, our tool will attempt to match them to a corresponding beta value in the PGS weight file. If a beta for the extra allele is not found, the extra allele will be treated as non-risk and will be given a dosage of "0".
Our tool requires that the VCF input file be formatted to encode multiallelic sites in merged format. If multiallelic sites are detected in split format, the tool will produce an error.
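The two rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's actual code: the record layout and function names are hypothetical, and split format is detected here simply as a repeated chromosome and position, which could also flag distinct overlapping variants in real data.

```python
from collections import Counter


def check_merged_format(records):
    """Raise if any (chrom, pos) appears on more than one record (assumed split format)."""
    counts = Counter((record["chrom"], record["pos"]) for record in records)
    split_sites = sorted(site for site, n in counts.items() if n > 1)
    if split_sites:
        raise ValueError(f"multiallelic sites in split format: {split_sites}")


def risk_dosages(record, betas_by_allele):
    """Per-allele dosage; alleles without a beta are treated as non-risk (dosage 0)."""
    alleles = [record["ref"]] + record["alt"]
    carried = [alleles[index] for index in record["gt"]]
    return {
        allele: (carried.count(allele) if allele in betas_by_allele else 0)
        for allele in alleles
    }


# Hypothetical merged record: REF=A, ALT=G,T, sample genotype G/T.
merged_record = {"chrom": "1", "pos": 1000, "ref": "A", "alt": ["G", "T"], "gt": (1, 2)}

check_merged_format([merged_record])  # passes silently: one record per locus

# Only G has a beta in this hypothetical weight file, so T gets dosage 0.
dosages = risk_dosages(merged_record, {"G": 0.5})
```

Passing two records that share the same chromosome and position to `check_merged_format` raises a `ValueError`, mirroring the error described above for split-format input.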