Feature request: Ability to use Genotype likelihoods #12
Comments
Oh yeah we can definitely do this. @cjbattey do you have your hands on a decent training set of low(er) coverage data that we can get geno_liks out of?
Yeah, I think we can do this pretty easily, but the details are TBD. In theory we can just flatten the GL matrix for each individual and pass that to the network instead of the allele count vector we've been using. I don't have a good test dataset for this though. Any ideas?
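For illustration, the flattening idea could be sketched roughly like this. This is not Locator code: the function name `flatten_genotype_likelihoods`, the `(n_sites, n_individuals, 3)` array layout, and the choice to normalize the likelihoods to per-site genotype probabilities are all assumptions made for the sketch. It also assumes biallelic sites with no missing values.

```python
import numpy as np


def flatten_genotype_likelihoods(gl):
    """Turn log10 genotype likelihoods into one feature vector per individual.

    gl: array-like of shape (n_sites, n_individuals, 3) holding log10
        likelihoods for the genotypes 0/0, 0/1 and 1/1 (hypothetical layout).
    Returns an array of shape (n_individuals, n_sites * 3).
    """
    gl = np.asarray(gl, dtype=float)
    # Subtract the per-site maximum before exponentiating for numerical stability.
    shifted = gl - gl.max(axis=-1, keepdims=True)
    lik = 10.0 ** shifted
    # Normalize so the three genotype values at each site sum to 1
    # (an assumption of this sketch, not something the thread specifies).
    probs = lik / lik.sum(axis=-1, keepdims=True)
    # Put individuals on the first axis, then flatten (site, genotype) per individual.
    n_individuals = probs.shape[1]
    return probs.transpose(1, 0, 2).reshape(n_individuals, -1)
```

Each row of the result would then take the place of the per-individual allele count vector, so the network's input width becomes three times the number of sites.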
I've looked into this a bit more over the week. While I'm used to using the Beagle genotype likelihood format with ANGSD, the VCF spec already includes fields for genotype likelihoods: https://github.com/samtools/hts-specs/blob/master/VCFv4.4.pdf. Maybe Locator could have an option when inputting a VCF to choose between the called genotype field (default behaviour), the genotype likelihood field (GL) if available, or the Phred-scaled genotype likelihoods (PL). The PL is more commonly output by variant callers like GATK and should be back-transformable to GLs, although the conversion is lossy due to integer rounding. I noticed that the Anopheles VCFs analysed in the Locator MS have the PL field present, so you could potentially use this to develop the functionality on a familiar dataset. If you want to test on some actual low coverage data, I've had a poke around the literature looking for datasets that may be appropriate:
Note I'm not super familiar with the VCF spec and genotype likelihoods, so forgive me if I'm misunderstanding something.
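To make the PL back-transformation concrete, here is a minimal sketch. It relies on the VCF definitions: GL is a log10-scaled likelihood and PL is the Phred-scaled equivalent rounded to an integer, so GL is approximately -PL / 10. The function names `pl_to_gl` and `pl_to_probs` are made up for this example. Because callers typically shift PL so the most likely genotype has value 0, the recovered GLs are relative to that genotype rather than absolute.

```python
import numpy as np


def pl_to_gl(pl):
    """Convert Phred-scaled likelihoods (PL) to log10 likelihoods (GL).

    The result is only approximate because PL values are rounded to integers.
    """
    pl = np.asarray(pl, dtype=float)
    return -pl / 10.0


def pl_to_probs(pl):
    """Convert PL values to genotype probabilities that sum to 1 over the last axis."""
    gl = pl_to_gl(pl)
    # Shift by the maximum before exponentiating to avoid underflow.
    lik = 10.0 ** (gl - gl.max(axis=-1, keepdims=True))
    return lik / lik.sum(axis=-1, keepdims=True)
```

For example, `pl_to_gl([0, 10, 100])` gives log10 likelihoods of 0, -1 and -10, meaning the second genotype is ten times less likely than the first.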
Hi everyone, I plan to use Locator on low-coverage (as low as ~0.5X) WGS datasets, so I should use GLs instead of allele counts. Was this feature added to Locator? If so, how is it implemented? Thanks!
Hi @fernandoh33, unfortunately this is still on the TODO list. We are hoping to turn our attention to this in the coming months.
While most human population genomics datasets now routinely achieve >30x sequencing coverage, for a lot of non-model organism studies it's becoming more popular to instead use low-coverage sequencing (1-5x coverage) and sample more individuals from the population. The most popular methods for analysing these datasets (ANGSD and related software) use the genotype likelihoods directly rather than the called variants in order to better take uncertainty into account.
From a brief Twitter discussion with CJ, I understand it may be possible to extend Locator to work with genotype likelihoods. I think this feature would be quite valuable to those of us working with low coverage data.
Cheers,
Alex