Feature request: Ability to use Genotype likelihoods #12

Open
alexpiper opened this issue Jul 4, 2020 · 5 comments
Labels
enhancement New feature or request

Comments

@alexpiper

While most human population genomics datasets can now routinely achieve >30x sequencing coverage, for a lot of non-model organism studies it's becoming more popular to instead use low-coverage sequencing (1-5x coverage) and sample more individuals from the population. The most popular methods for analysing these datasets (ANGSD and related software) use the genotype likelihoods directly rather than the called variants in order to better take uncertainty into account.

From a brief Twitter discussion with CJ, I understand it may be possible to extend Locator to work with genotype likelihoods. I think this feature would be quite valuable to those of us working with low-coverage data.

Cheers,
Alex

@andrewkern
Member

Oh yeah we can definitely do this. @cjbattey do you have your hands on a decent training set of low(er) coverage data that we can get geno_liks out of?

@cjbattey
Collaborator

cjbattey commented Jul 5, 2020

Yeah I think we can do this pretty easily but TBD. In theory we can just flatten the GL matrix for each individual and pass that to the network instead of the allele count vector we've been using. I don't have a good test dataset for this though. Any ideas?
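
A minimal sketch of the flattening idea, assuming biallelic sites with three log10 genotype likelihoods per site (0/0, 0/1, 1/1), as in the VCF GL field. The function name `flatten_gls`, the per-site normalisation, and the example values are illustrative assumptions, not part of Locator:

```python
def flatten_gls(gl_matrix):
    """Turn one individual's per-site log10 GLs into a flat feature vector.

    gl_matrix: list of (GL_00, GL_01, GL_11) tuples, one per biallelic site.
    Returns a list of length 3 * n_sites in which each site's likelihoods
    are rescaled to sum to 1.
    """
    features = []
    for site in gl_matrix:
        best = max(site)  # subtract the max before exponentiating to avoid underflow
        lik = [10.0 ** (g - best) for g in site]
        total = sum(lik)
        features.extend(x / total for x in lik)
    return features


# Two sites for one low-coverage individual (hypothetical values).
example = [(0.0, -0.3, -3.0), (-2.0, 0.0, -2.0)]
vector = flatten_gls(example)
```

With an array library the same step is a single reshape from (sites, 3) to (sites * 3,). The input layer would then be three times as wide as the allele count vector.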

@alexpiper
Author

I've looked into this a bit more over the week. While I'm used to using the Beagle genotype likelihood format with ANGSD, the VCF spec already includes columns for genotype likelihoods: https://github.com/samtools/hts-specs/blob/master/VCFv4.4.pdf
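
For illustration, in a VCF data line the likelihoods appear as keys (GL or PL) in the colon-separated FORMAT column, with matching values in each sample column. A rough sketch of pulling them out with plain string handling follows; the helper name `sample_fields` and the record are hypothetical, and a real implementation would more likely rely on an existing VCF parser:

```python
def sample_fields(vcf_line, sample_index=0):
    """Return {FORMAT key: value string} for one sample in a VCF data line."""
    cols = vcf_line.rstrip("\n").split("\t")
    keys = cols[8].split(":")                    # FORMAT column, e.g. GT:DP:PL
    values = cols[9 + sample_index].split(":")   # sample columns start at index 9
    return dict(zip(keys, values))


# Hypothetical record with a single sample.
line = "2L\t1000\t.\tA\tG\t50\tPASS\t.\tGT:DP:PL\t0/1:2:12,0,25"
fields = sample_fields(line)
pl = [int(x) for x in fields["PL"].split(",")]
```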

Maybe Locator could have an option when inputting a VCF to choose between the called genotype column (default behaviour), the genotype likelihood column (GL) if available, or the Phred-scaled genotype likelihoods (PL). The PL is more commonly output by variant callers like GATK and should be back-transformable to GLs, though the conversion is lossy due to integer rounding.
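
A small sketch of the back-transformation, assuming the usual definitions: GL is a log10 likelihood and PL is -10 times that, shifted so the most likely genotype is 0 and rounded to an integer. The function names and values are illustrative only:

```python
def gl_to_pl(gl):
    """log10 likelihoods -> Phred-scaled integers, best genotype set to 0."""
    raw = [-10.0 * g for g in gl]
    lowest = min(raw)
    return [int(round(r - lowest)) for r in raw]


def pl_to_gl(pl):
    """Phred-scaled likelihoods -> log10 likelihoods relative to the best genotype."""
    return [-p / 10.0 for p in pl]


# Hypothetical GLs for one site.
original = [-0.12, -1.49, -4.03]
pl = gl_to_pl(original)
recovered = pl_to_gl(pl)

# Compare against the original GLs expressed relative to the best genotype.
relative = [g - max(original) for g in original]
max_error = max(abs(a - b) for a, b in zip(relative, recovered))
```

Because each PL is rounded to the nearest integer, the recovered values can differ from the original relative GLs by up to about 0.05 log10 units, and the absolute scale of the likelihoods is not recoverable.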

I noticed that the Anopheles VCFs analysed in the Locator manuscript have the PL column present, so you could potentially use this to develop the functionality on a familiar dataset. If you want to test on some actual low-coverage data, I've had a poke around the literature looking for datasets that may be appropriate:

Note I'm not super familiar with the VCF spec and genotype likelihoods, so forgive me if I'm misunderstanding something.

@andrewkern added the enhancement label Oct 1, 2020

@fernandoh33

Hi everyone, I plan to use Locator on low-coverage (as low as ~0.5x) WGS datasets, so I should use GLs instead of allele counts. Was this feature added to Locator? If so, how is it implemented? Thanks!

@andrewkern
Member

Hi @fernandoh33, unfortunately this is still on the TODO list. We are hoping to turn our attention to this in the coming months.

cc: @silastittes @stsmall @nspope @clararehmann
