Multiallelic Sites #11

alkaZeltser · 2024-02-06T05:08:36Z

alkaZeltser
Feb 6, 2024
Maintainer

What are multiallelic sites?

A multiallelic site (as opposed to a biallelic site) is a locus in the genome where more than two alleles have been detected. Such sites are uncommon: the 1000 Genomes Project reported that only 0.3% of variant loci in its dataset were multiallelic SNPs (see 1KG supplementary information table 3). However, it is still possible to encounter such sites at polygenic score (PGS) loci, so it is important to understand how our tool handles them.

Multiallelic site formatting in VCF files.

Sources of genetic data

Multiallelic sites can be detected in genetic data and written into a VCF from several types of sources.

1. Variant callers from sequencing data
2. Genotype callers from microarray data
3. Imputed dosage callers, from data provided by either 1 or 2 together with a reference panel

Variant callers detect alleles by counting reads, and will typically report a multiallelic site as a single record in a VCF, e.g.

CHROM	POS	REF	ALT
chr1	1	A	T,C

Microarray data can also contain multiallelic sites, depending on whether the analysis software used to interpret microarray intensity data is designed to call them.

Multiallelic sites can also appear in imputation results when the imputation algorithm is designed to handle them. Typically the different alleles are imputed separately [still fact-checking this], so the output may record the site across two lines, e.g.

CHROM	POS	REF	ALT
chr1	1	A	T
chr1	1	A	C

You may stumble on a multiallelic site that wasn't originally present in your data when merging your cohort with another cohort or reference panel that contains a third allele at that site. Tools that provide merge algorithms, such as PLINK and bcftools, handle this scenario differently, which may result in either of the above styles or in the removal of the variant altogether.

Regardless of the data source, multiallelics can be formatted in either one-line (merged) or two-line (split) format, and tools such as bcftools norm can convert a file between the two. Merging multiallelic sites into one line is generally preferred since it more accurately mimics the biology: all the alleles belong to the same locus. This format also minimizes the possibility of edge cases that could lead to inaccurate dosage calculations relative to a risk allele from a PGS. Particularly tricky cases arise when the PGS risk allele happens to be the reference (REF) allele in a VCF, which can lead to double counting of the risk dosage if the multiallelic site is not identified.
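To make the double-counting risk concrete, here is a minimal Python sketch. It is not taken from any tool. It assumes a hypothetical individual who carries T and C at a site with REF A, and it assumes that the split records write the other ALT allele as 0 (a common convention, but check how your own files were split). The function name `allele_dosage` is made up for illustration.

```python
def allele_dosage(alleles, gt, target):
    """Count copies of `target` in one VCF record.

    alleles: [REF, ALT1, ALT2, ...]; gt: tuple of allele indices.
    """
    return sum(1 for idx in gt if alleles[idx] == target)


# Hypothetical individual carrying T and C; the PGS risk allele is REF (A).
# Merged format: one record, genotype 1/2.
merged = [(["A", "T", "C"], (1, 2))]

# Assumed split encoding: in each record the other ALT allele is written as 0.
split = [(["A", "T"], (1, 0)), (["A", "C"], (0, 1))]

merged_dosage = sum(allele_dosage(a, g, "A") for a, g in merged)
split_dosage = sum(allele_dosage(a, g, "A") for a, g in split)

print(merged_dosage)  # 0: the individual carries no copies of A
print(split_dosage)   # 2: each split record appears to contribute one A
```

Under these assumptions, naively summing REF dosage across the split records reports two copies of a risk allele that the individual does not carry at all.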

Multiallelic site handling in GWAS.

As discussed in #2, the model fitting stage during which individual SNP betas for a PGS are computed is typically a Genome-Wide Association Study (GWAS). Since multiallelic sites are rare, GWAS are typically underpowered to compute the effect of the rarer third allele. It is also standard practice to restrict GWAS to biallelic sites only, so it is highly unlikely that such a beta was computed in the first place and included in published PGS weight files.

Multiallelic site handling in ApplyPolygenicScore

PGS betas for multiple alleles

If the score you wish to apply provides betas for more than one allele at a locus, you can provide that information to our tool and an allele-specific weight × dosage calculation will be performed. Each additional beta must be provided as an additional row in the PGS weight file, e.g.

rsID	chr_name	chr_position	effect_allele	other_allele
rs1234	1	10	A	T
rs1234	1	10	C	T

Multiple alleles in genetic input

If multiple alleles at a multiallelic site are present in the input VCF cohort, our tool will attempt to match them to a corresponding beta value in the PGS weight file. If a beta for the extra allele is not found, the extra allele will be treated as non-risk and will be given a dosage of "0".
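The matching behavior described above can be sketched in Python as follows. This is an illustration of the idea only, not the ApplyPolygenicScore implementation. The function name `site_score` and the beta values are hypothetical.

```python
def site_score(alleles, gt, betas):
    """Sum beta * dosage over the alleles carried at one merged-format site.

    alleles: [REF, ALT1, ...]; gt: tuple of allele indices;
    betas: {effect_allele: beta} for this locus.
    Alleles with no beta are never looked up, so they contribute 0.
    """
    score = 0.0
    for allele, beta in betas.items():
        dosage = sum(1 for idx in gt if alleles[idx] == allele)
        score += beta * dosage
    return score


# Hypothetical betas for two effect alleles at one locus (values are made up).
betas = {"A": 0.5, "C": 0.25}
alleles = ["T", "A", "C", "G"]  # REF=T, ALT=A,C,G in merged format

print(site_score(alleles, (1, 2), betas))  # one A and one C
print(site_score(alleles, (1, 3), betas))  # one A and one G (G has no beta)
print(site_score(alleles, (2, 2), betas))  # two copies of C
```

In the second call the G allele has no matching beta, so only the single copy of A contributes to the score.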

Our tool requires that the VCF input file be formatted to encode multiallelic sites in merged format. If multiallelic sites are detected in split format, the tool will produce an error.
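If you want to check a file for split-format sites before running the tool, a simple heuristic is to look for positions that appear on more than one record. The Python sketch below assumes the records have already been parsed into (chrom, pos, ref, alt) tuples. The function name `find_split_multiallelics` is hypothetical, and the tool's own detection logic may differ.

```python
from collections import Counter


def find_split_multiallelics(records):
    """Return sorted (chrom, pos) keys that appear on more than one record.

    records: iterable of (chrom, pos, ref, alt) tuples from the VCF body.
    Keying on (chrom, pos) is a simplification: it will also flag
    duplicate records that are not multiallelic splits.
    """
    counts = Counter((chrom, pos) for chrom, pos, _ref, _alt in records)
    return sorted(key for key, n in counts.items() if n > 1)


merged = [("chr1", 1, "A", "T,C")]
split = [("chr1", 1, "A", "T"), ("chr1", 1, "A", "C")]

print(find_split_multiallelics(merged))  # no repeated positions
print(find_split_multiallelics(split))   # chr1 position 1 is repeated
```

Any sites this flags can then be converted to merged format with a tool such as bcftools norm before the file is passed to the tool.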
