Missing Genotype Data #17
alkaZeltser started this conversation in General
What are missing genotype data?
In VCF files, particularly ones that include an entire merged cohort, it is possible to encounter cases where some or all samples in a cohort will have missing genotype data for a specific variant site.
Missing genotype data in some but not all individuals
When some, but not all, individuals are missing genotypes at a locus, the missing genotypes are typically encoded as `./.` in the VCF, in contrast to, for example, a called heterozygous genotype `0/1`.
If the cohort represented in the file was jointly genotyped, missing genotypes occur when the variant calling tool is unable to confidently call a genotype at this location. This may happen for various reasons, such as insufficient coverage or poor read or base quality (for sequencing), technical problems with interpreting intensity (for arrays), or more complex quality filters that weigh many variables and quality metrics to identify false positives and false negatives.
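To make the encoding concrete, here is a minimal Python sketch that turns a VCF `GT` string into an allele count and returns `None` for a missing call. The function name `gt_to_dosage` is hypothetical, and the sketch assumes biallelic, diploid genotypes and counts non-reference alleles (which equals the PGS dosage only when the effect allele is the non-reference allele).

```python
import re
from typing import Optional


def gt_to_dosage(gt: str) -> Optional[int]:
    """Count non-reference alleles in a GT string; None if any allele is missing."""
    # GT alleles are separated by "/" (unphased) or "|" (phased).
    alleles = re.split(r"[/|]", gt)
    if any(allele == "." for allele in alleles):
        return None  # e.g. "./." : no genotype was called
    return sum(1 for allele in alleles if allele != "0")


for gt in ["0/0", "0/1", "1|1", "./."]:
    print(gt, gt_to_dosage(gt))
```

Keeping the missing call as `None` rather than 0 preserves the distinction between "homozygous reference" and "unknown", which the handling strategies below treat differently.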
If the cohort represented in the file was not jointly genotyped, but rather merged from a set of individual VCFs, missing genotypes can represent individuals who simply did not have a variant at this site. The most commonly used variant calling tools for sequencing data typically do not write a genotype to a final VCF file if the individual is homozygous for the reference allele. The individual is non-variant for that site, and thus the site is considered not informative as it is identical to the reference genome. To reduce the size of the resulting VCF file, only variant sites (heterozygous or homozygous for the non-reference allele) are written as a row in the VCF. When a VCF of an individual who is non-variant at a given site is merged with a VCF of an individual who is variant at that site, the merged VCF will have a row for this site, but will typically have the non-variant individual's genotype marked as missing, since there was no information about this site in their input file.
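The merging behaviour described above can be illustrated with a small Python sketch. It is not how any particular merging tool is implemented; `merge_calls` and the `(chromosome, position)` site key are hypothetical simplifications, and each input stands in for a single-sample VCF that lists only variant sites.

```python
from typing import Dict, Tuple

Site = Tuple[str, int]  # (chromosome, position); a simplified stand-in for a VCF row


def merge_calls(per_sample: Dict[str, Dict[Site, str]]) -> Dict[Site, Dict[str, str]]:
    """Merge variant-only call sets; a sample with no record at a site gets './.'."""
    all_sites = sorted(set().union(*per_sample.values()))
    return {
        site: {name: calls.get(site, "./.") for name, calls in per_sample.items()}
        for site in all_sites
    }


per_sample = {
    "sampleA": {("chr1", 1000): "0/1", ("chr1", 2000): "1/1"},
    "sampleB": {("chr1", 1000): "0/1"},  # non-variant at chr1:2000, so no row was written
}
merged = merge_calls(per_sample)
print(merged[("chr1", 2000)])
```

Here `sampleB` is marked missing at `chr1:2000` even though it may well be homozygous reference there; the merged file alone cannot tell the two cases apart.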
Missing genotype data in all individuals
When applying the weights of a polygenic score to a specific set of variants, genotype data at that specific set of genetic coordinates is required. You may encounter a scenario where, when looking through the rows of a VCF file, a required site is simply nowhere to be found. No row exists for its coordinates. If your genetic data was acquired via microarray, this likely means your variant of interest was not included in the array. We recommend looking into genotype imputation tools to impute the missing genotypes. If your data comes from a sequencing experiment, once again, the reason for this lies in the output settings of your variant caller. If not a single individual in your cohort had a variant called at this locus, then most variant callers are configured to exclude this site from the final output, as it is no different from the reference genome. It is important to consider that the absence of a site does not necessarily mean that all individuals in the cohort were homozygous for the reference allele. It is also possible that there was not sufficient information to call any genotype at this location due to all the potential quality issues mentioned above. Thus, it is not necessarily correct to assume that all individuals are homozygous for the reference allele at a site that is not listed in the VCF.
To resolve this ambiguity in a sequencing experiment, it is usually possible to modify the output setting of the variant caller to report all sites that pass quality checks, including non-variant ones. We advise you to be mindful of the increase in storage space that this will likely require, as the resulting files will be much larger.
Another strategy would be to simply increase the size of your cohort. The greater the number of individuals, the more likely it is that at least one will carry a common variant.
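Before choosing a handling strategy, it can help to sort the PGS variants by the kind of missingness described above. The following Python sketch assumes genotypes have already been converted to dosages, with `None` for missing calls; the function name and status labels are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def classify_pgs_variants(
    pgs_variants: Sequence[str],
    dosages: Dict[str, List[Optional[int]]],
) -> Dict[str, str]:
    """Label each PGS variant by how its genotypes are missing in the cohort."""
    status = {}
    for variant in pgs_variants:
        calls = dosages.get(variant)
        if calls is None:
            status[variant] = "absent"  # no row in the VCF at all
        elif all(d is None for d in calls):
            status[variant] = "missing_in_all"  # row exists, every genotype missing
        elif any(d is None for d in calls):
            status[variant] = "missing_in_some"
        else:
            status[variant] = "complete"
    return status


cohort_dosages = {
    "rs1": [0, 1, 2],
    "rs2": [1, None, 0],
    "rs3": [None, None, None],
}
print(classify_pgs_variants(["rs1", "rs2", "rs3", "rs4"], cohort_dosages))
```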
Handling missing genotype data for PGS variants in ApplyPolygenicScore
Collister, Liu, and Clifton (2022, Front Genet.) have published an excellent guide to PGS application, including recommendations for handling missing genotype data. Below we summarize the methods that are available in ApplyPolygenicScore. These methods may be specified in the main PGS application function through the parameter `missing.genotype.method`.
We include formulaic representations based on the following base model, as seen in #2:
$$PGS_i = \sum_{m=1}^{M} \beta_m \times dosage_{im}$$
where $i$ is an individual, $m$ is a PGS component variant out of a total $M$ variants, $\beta_m$ represents the effect size weight of the $m^{th}$ variant, and $dosage_{im}$ is the number of effect alleles carried by individual $i$ at variant $m$.
Omission
missing.genotype.method = 'none'
Missing genotypes are simply ignored and not included in score calculation. Internally, missing values are treated as `NA` and are not incorporated in the weighted sum. Since the dosage for homozygous reference genotypes is 0, this strategy is equivalent to assuming that all missing genotypes are homozygous reference.
Omission + Normalization
missing.genotype.method = 'normalize'
Missing genotypes are excluded from score calculation as in omission, but the final score for each sample is normalized by the number of non-missing alleles. The calculation assumes a diploid genome.
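$$PGS_i = \dfrac{\sum \left( \beta_m \times dosage_{im} \right)}{P \times M_{non-missing}}$$
where $P$ = ploidy = 2 and $M_{non-missing}$ is the number of PGS variants with a non-missing genotype in individual $i$.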
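The omission and normalization strategies can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not the ApplyPolygenicScore implementation (which is an R package); the function names `pgs_omit` and `pgs_normalize` are hypothetical, and missing dosages are represented as `None`.

```python
from typing import Dict, Optional

PLOIDY = 2  # diploid genome assumed, as in the text


def pgs_omit(weights: Dict[str, float], dosages: Dict[str, Optional[int]]) -> float:
    """Weighted sum over called genotypes only; missing ones contribute nothing."""
    return sum(
        beta * dosages[variant]
        for variant, beta in weights.items()
        if dosages.get(variant) is not None
    )


def pgs_normalize(
    weights: Dict[str, float],
    dosages: Dict[str, Optional[int]],
    ploidy: int = PLOIDY,
) -> Optional[float]:
    """Omission score divided by the number of non-missing alleles."""
    n_called = sum(1 for variant in weights if dosages.get(variant) is not None)
    if n_called == 0:
        return None  # no genotype available, so the score is undefined
    return pgs_omit(weights, dosages) / (ploidy * n_called)


weights = {"rs1": 0.5, "rs2": -0.2, "rs3": 1.0}
individual = {"rs1": 2, "rs2": None, "rs3": 1}  # rs2 is missing

print(pgs_omit(weights, individual))       # 0.5*2 + 1.0*1
print(pgs_normalize(weights, individual))  # the same sum divided by 2 * 2 called variants
```

Normalization makes scores more comparable between individuals with different numbers of called variants, at the cost of changing the scale of the score.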
Substitution with Mean Dosage
This strategy differentiates between variants that are missing in some individuals and variants that are missing in all individuals.
For variants missing in between 1 and n - 1 individuals, the missing genotype dosage is substituted with the mean population dosage, which can be calculated internally based on the effect allele frequency in the provided cohort, or based on an effect allele frequency provided externally by the user via a standardized column in the PGS weight file (`allelefrequency_effect`).
$$\overline{dosage_{k}} = EAF_k \times P$$
where $k$ is a PGS component variant that is missing in between 1 and n - 1 individuals, $EAF_k$ is the effect allele frequency of variant $k$, and $P$ = ploidy = 2.
This dosage calculation holds under assumptions of Hardy-Weinberg equilibrium.
For variants that are missing in every individual, the omission strategy is implemented, effectively assuming a dosage of 0 in every individual for this variant.
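$$PGS_i = \sum \left( \beta_m \times dosage_{im} \right) + \sum \left( \beta_k \times \overline{dosage_{k}} \right)$$
where $m$ is a PGS component variant that is present in all $n$ individuals and $k$ is a PGS component variant that is missing in 1 to (n - 1) individuals. For an individual with a called genotype at variant $k$, the observed dosage is used in place of $\overline{dosage_{k}}$.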
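The mean dosage strategy can be sketched in Python as follows. This illustrates the logic described above rather than the package's own R code; all function names are hypothetical, the cohort is a mapping from variant to a per-individual list of dosages with `None` for missing calls, and the cohort mean of called dosages stands in for effect allele frequency times ploidy.

```python
from typing import Dict, List, Optional


def mean_dosage_from_eaf(eaf: float, ploidy: int = 2) -> float:
    """Mean dosage from an externally supplied effect allele frequency."""
    return eaf * ploidy


def mean_dosage_from_cohort(calls: List[Optional[int]]) -> Optional[float]:
    """Mean of called dosages; None when the variant is missing in everyone."""
    called = [d for d in calls if d is not None]
    return sum(called) / len(called) if called else None


def pgs_mean_dosage(
    weights: Dict[str, float],
    cohort: Dict[str, List[Optional[int]]],
    n_individuals: int,
) -> List[float]:
    """Score every individual, filling missing genotypes with the cohort mean dosage."""
    scores = [0.0] * n_individuals
    for variant, beta in weights.items():
        calls = cohort.get(variant)
        if calls is None:
            continue  # variant absent from the data: omitted
        mean = mean_dosage_from_cohort(calls)
        if mean is None:
            continue  # missing in every individual: falls back to omission
        for i, dosage in enumerate(calls):
            scores[i] += beta * (mean if dosage is None else dosage)
    return scores


weights = {"rs1": 1.0, "rs2": 0.5, "rs3": 2.0}
cohort = {
    "rs1": [2, 0, 1],
    "rs2": [1, None, 1],           # missing in one individual
    "rs3": [None, None, None],     # missing in all individuals
}
print(pgs_mean_dosage(weights, cohort, n_individuals=3))
```

In this toy cohort the second individual receives the mean `rs2` dosage of 1.0, while `rs3` contributes nothing to any score.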
Alternative methods not supported by ApplyPolygenicScore
Imputation
Several genotype imputation tools are available for reference-panel population-based imputation of common variants. After recovering your missing genotypes with an imputation method, you may use the resulting VCFs with ApplyPolygenicScore.
Substitution with Proxy SNP via LD metric
Common SNPs are frequently in high Linkage Disequilibrium (LD) with other nearby SNPs in "LD blocks", meaning their genotype is highly correlated with the genotype of various other SNPs. If you are missing a specific PGS SNP but have a called genotype at a nearby SNP that is in LD with the original, you could use the genotype of this "proxy" SNP to represent the genotype of the missing SNP. Tools are available to identify proxy SNPs for any given SNP in different population groups. Once you have identified a proxy SNP, you may replace the original SNP by simply editing the coordinates of the original SNP in the PGS weight file to match the proxy instead. It is important to note that this proxy SNP is not a perfect replacement and may not have the same true biological effect as the replaced SNP. However, this method tends to produce more accurate scores than omission, especially when the original SNP has a larger effect size (Chagnon et al.).
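As a rough sketch of the coordinate edit described above, the following Python function swaps the coordinates of one SNP for those of its proxy in a list of weight rows. The row layout and the column names `chrom`, `pos`, and `beta` are hypothetical and do not reflect any particular weight file format; the sketch changes coordinates only and leaves every other column, including the effect weight, untouched.

```python
from typing import Dict, List, Tuple

Site = Tuple[str, int]  # (chromosome, position)


def substitute_proxy(
    weight_rows: List[Dict[str, object]],
    original: Site,
    proxy: Site,
) -> List[Dict[str, object]]:
    """Return a copy of the weight rows with the original SNP's coordinates replaced."""
    updated = []
    for row in weight_rows:
        new_row = dict(row)  # copy, so the input rows are not modified
        if (row["chrom"], row["pos"]) == original:
            new_row["chrom"], new_row["pos"] = proxy
        updated.append(new_row)
    return updated


weight_rows = [
    {"chrom": "chr1", "pos": 1000, "beta": 0.3},
    {"chrom": "chr2", "pos": 5000, "beta": -0.1},
]
edited = substitute_proxy(weight_rows, original=("chr1", 1000), proxy=("chr1", 1250))
print(edited[0])
```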