Feature request: Ability to use Genotype likelihoods #12

alexpiper · 2020-07-04T00:17:10Z

While most human population genomics datasets now routinely achieve >30x sequencing coverage, for a lot of non-model organism studies it's becoming more popular to instead use low-coverage sequencing (1-5x coverage) and sample more individuals from the population. The most popular methods for analysing these datasets (ANGSD and related software) use the genotype likelihoods directly rather than the called variants in order to better take uncertainty into account.

From a brief Twitter discussion with CJ, I understand it may be possible to extend Locator to work with genotype likelihoods. I think this feature would be quite valuable to those of us working with low-coverage data.

Cheers,
Alex

andrewkern · 2020-07-04T22:16:54Z

Oh yeah we can definitely do this. @cjbattey do you have your hands on a decent training set of low(er) coverage data that we can get geno_liks out of?

cjbattey · 2020-07-05T22:10:56Z

Yeah I think we can do this pretty easily but TBD. In theory we can just flatten the GL matrix for each individual and pass that to the network instead of the allele count vector we've been using. I don't have a good test dataset for this though. Any ideas?
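
A minimal sketch of the flattening idea described above, assuming biallelic sites with three genotype likelihoods per individual per site (hom-ref, het, hom-alt). The function name `flatten_gls` and the nested-list layout are hypothetical, chosen only for illustration; this is not Locator's API or an existing implementation.

```python
# Hypothetical sketch: names and data layout are illustrative assumptions,
# not part of Locator.

def flatten_gls(gl):
    """Build one flat feature vector per individual from per-site GLs.

    gl[site][individual] is a 3-tuple of normalized likelihoods for the
    genotypes (hom-ref, het, hom-alt) at a biallelic site.
    Returns a list with one vector of length 3 * n_sites per individual.
    """
    n_sites = len(gl)
    n_ind = len(gl[0]) if n_sites else 0
    features = []
    for i in range(n_ind):
        row = []
        for s in range(n_sites):
            # Append the three likelihoods for this site in a fixed order.
            row.extend(gl[s][i])
        features.append(row)
    return features


# Two sites, two individuals.
gl = [
    [(0.9, 0.1, 0.0), (0.2, 0.5, 0.3)],  # site 0
    [(0.0, 0.3, 0.7), (1.0, 0.0, 0.0)],  # site 1
]
features = flatten_gls(gl)
```

Each individual ends up with three input values per site instead of one allele count. With NumPy, the same layout would come from `gl.transpose(1, 0, 2).reshape(n_ind, -1)` on an array of shape `(n_sites, n_individuals, 3)`.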

alexpiper · 2020-07-10T06:27:39Z

I've looked into this a bit more over the week. While I'm used to using the Beagle genotype likelihood format with ANGSD, the VCF spec already includes FORMAT fields for genotype likelihoods: https://github.com/samtools/hts-specs/blob/master/VCFv4.4.pdf

Maybe Locator could have an option when inputting a VCF to choose between the called genotypes (GT, the default behaviour), the genotype likelihoods (GL) if available, or the Phred-scaled genotype likelihoods (PL). The PL is more commonly output by variant callers like GATK and should be back-transformable to GLs, but is lossy due to integer rounding.
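
To make the PL/GL relationship concrete, here is a small sketch based on the VCF spec's definitions: GL is a log10-scaled likelihood and PL is its Phred-scaled value rounded to an integer, so GL is approximately -PL / 10. The helper names are hypothetical, and rescaling the best genotype to PL = 0 is an assumption that mirrors what callers such as GATK typically emit.

```python
# Hypothetical helpers for illustration; not part of Locator or any VCF library.

def pl_to_gl(pl):
    """Convert Phred-scaled likelihoods to log10 likelihoods (GL = -PL / 10)."""
    return [-p / 10.0 for p in pl]


def gl_to_pl(gl):
    """Phred-scale and round log10 likelihoods, rescaled so the best genotype is 0."""
    best = max(gl)
    return [int(round(-10 * (g - best))) for g in gl]


def gl_to_probs(gl):
    """Normalize log10 likelihoods into values that sum to 1."""
    best = max(gl)
    lik = [10 ** (g - best) for g in gl]  # subtract the max for numerical stability
    total = sum(lik)
    return [x / total for x in lik]


# Round trip: GL -> PL -> GL loses precision because PL is an integer.
original_gl = [-0.04, -1.23, -7.86]
pl = gl_to_pl(original_gl)
recovered_gl = pl_to_gl(pl)
probs = gl_to_probs(recovered_gl)
```

The recovered values are only accurate to within 0.05 log10 units per genotype, which is the rounding loss mentioned above. The normalized values from `gl_to_probs` are one possible network input; whether raw or normalized likelihoods work better is an open question here.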

I noticed that the Anopheles VCFs analysed in the Locator MS have the PL field present, so you could potentially use this to develop the functionality on a familiar dataset. If you want to test on some actual low-coverage data, I've had a poke around the literature looking for datasets that may be appropriate:

- Human 1000 Genomes Phase 1 dataset as analysed in http://www.genome.org/cgi/doi/10.1101/gr.146084.112. This dataset contains ~1000 individuals from ~14 populations with an average coverage of 5×, and subsets of it have been used in a number of studies as a benchmark for performance on low-coverage data. There is a nice data portal, https://www.internationalgenome.org/data-portal/sample, where you can pick and choose appropriate populations.
- Waterbuck dataset analysed in https://doi.org/10.1534/genetics.118.301336. This dataset contains 73 samples from five different sites in Africa, with sequencing depth varying from 2.23x to 4.73x. The BAM files are available at https://www.ebi.ac.uk/ena/data/view/PRJEB28089
- Atlantic cod dataset analysed in https://doi.org/10.1111/eva.12861. 306 individuals with an average coverage of 0.67x. Only raw reads are available, here: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA560242/
- Hawaiian planthoppers analysed in https://doi.org/10.1111/mec.15231. 184 individuals at 5-15x coverage using exon capture. Again, only raw reads are available, from https://www.ncbi.nlm.nih.gov/bioproject/PRJNA341388

Note I'm not super familiar with the VCF spec and genotype likelihoods, so forgive me if I'm misunderstanding something.

fernandoh33 · 2024-10-20T20:43:06Z

Hi everyone, I plan to use Locator on low-coverage (as low as ~0.5x) WGS datasets, so I should use GLs instead of allele counts. Was this feature added to Locator? If so, how is it implemented? Thanks!

andrewkern · 2024-10-20T22:24:58Z

hi @fernandoh33 -- unfortunately this is still on the TODO list. We are hoping to turn our attention to this in the coming months.

cc: @silastittes @stsmall @nspope @clararehmann

andrewkern added the "enhancement" (New feature or request) label on Oct 1, 2020
