
Estimation of allele freq with metafounders #148

Open
XingerTang opened this issue Jan 4, 2024 · 23 comments

Comments

@XingerTang
Contributor

@gregorgorjanc @RosCraddock
Originally, while estimating the alternative allele frequency of the whole population, the algorithm looped over the genotypes of all the individuals to calculate the most likely alternative allele frequency.

Now, to estimate the alternative allele frequency of a metafounder, we need to loop over the genotypes of that metafounder's progeny, so we need a list of progeny for each metafounder.

One way to build it is to go through the pedigree and check the ancestors of each individual, but we might lose some information if we don't process individuals in the correct order. Is there any better way to do it?
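One ordering-safe way to build the lists is a single top-down pass in generation order, propagating metafounder membership from parents to offspring, so ancestors are always processed before descendants. A minimal sketch (the pedigree tuple layout, function name, and metafounder tags here are illustrative, not AlphaPeel's actual data structures):

```python
from collections import defaultdict

def progeny_by_metafounder(pedigree):
    """pedigree: iterable of (id, sire, dam, metafounder) tuples sorted
    so that parents precede their offspring; sire/dam/metafounder are
    None when unknown/absent."""
    membership = defaultdict(set)  # individual -> metafounders in its ancestry
    progeny = defaultdict(list)    # metafounder -> individuals descending from it
    for ind, sire, dam, mf in pedigree:
        if mf is not None:  # a founder tagged with a metafounder
            membership[ind].add(mf)
        for parent in (sire, dam):
            if parent is not None:
                membership[ind] |= membership[parent]
        for m in sorted(membership[ind]):
            progeny[m].append(ind)
    return progeny

# founders 1 and 2 come from different metafounders; 3 is their offspring
ped = [(1, None, None, "MF1"), (2, None, None, "MF2"), (3, 1, 2, None)]
print(dict(progeny_by_metafounder(ped)))  # {'MF1': [1, 3], 'MF2': [2, 3]}
```

Because each individual only looks at its parents' already-computed membership, a single pass suffices and no information is lost.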

@XingerTang
Contributor Author

In PeelingUpdates.getNewtonUpdate

@gregorgorjanc gregorgorjanc changed the title Lists of progeny for each metafounder Estimation of allele freq with metafounders Jan 4, 2024
@gregorgorjanc
Copy link
Member

As discussed today, there might be multiple ways to handle this, so let's get back to this once we have the other engineering done ;)

@gregorgorjanc
Member

A note. I recall that AW reported that estimating allele freq sometimes made genome-wide imputation worse. I find this surprising. After our look at the code yesterday, I think that this observation was/is due to the way AF is estimated: we collect information from any pedigree member and find the AF that best explains their observed genotype status (via Newton optimisation). This approach ignores that some of these individuals are related (imagine lots of data from one family and far less from other families; that family will dominate the result too much, so we would ideally weight by the relationships between individuals). More importantly, it is not estimating the right AF: it estimates the AF of the whole population spanning all generations, which is not what we need for peeling. For peeling we need the AF in the founders (or metafounder(s)). There might not be much difference between the two in some cases, but I know for sure that there can be quite a bit of difference in others, such as long pedigrees with selection!

@RosCraddock

RosCraddock commented Sep 20, 2024

Notes from meeting on estimating the allele frequencies in AlphaPeel:

Current approach:
Prior to peeling, AlphaPeel estimates the alternative allele frequency based on all available observed genotype data and the genotype penetrance (i.e. genotyping error). It does this through the Newton-Raphson method, an optimisation method.
No further update is made after each peeling cycle.
Alternative method:
In addition to the current approach, update the alternative allele frequency after each peeling cycle (peeling up and peeling down) using the inferred genotype probabilities of individuals in the base population.
With metafounders, this gives multiple alternative allele probabilities (one for each metafounder).
Linear method:
This will be a later consideration, if the initial estimate of the alternative allele frequency needs improvement.
Applied prior to peeling, it differs from the current approach by taking a weighted mean of the alternative allele frequency, with weights depending on the relationships between the individuals with genotype data.
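A sketch of what such a relationship-weighted ("linear") estimate could look like: a generalised-least-squares mean of observed gene contents with weights from the inverse of the numerator relationship matrix A, so closely related genotyped individuals contribute less than in a simple mean. This illustrates the idea only; the function and variable names are hypothetical, not AlphaPeel code:

```python
import numpy as np

def weighted_base_freq(g, A):
    """GLS estimate of base-population alt allele frequency from gene
    contents g (0/1/2) and relationship matrix A among the genotyped."""
    Ainv = np.linalg.inv(A)
    ones = np.ones(len(g))
    return (ones @ Ainv @ g) / (2 * ones @ Ainv @ ones)

# two full sibs (relationship 0.5) plus one unrelated individual
A = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
g = np.array([2.0, 2.0, 0.0])
print(weighted_base_freq(g, A))  # below the simple mean 4/6, since the
                                 # related sibs are down-weighted
```

With these numbers the weighted estimate is 8/14 (about 0.571), versus the unweighted mean of 0.667, showing how an over-represented family is discounted.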


Planned updates to code:

To include the alternative method in addition to the current method.

In tinypeel.py

  • Add new user parameter "-est_and_update_alt_allele_prob" to getArgs, peeling_control_parser.
  • Add est_and_update_alt_allele_prob to the est_alt_allele_prob conditional statement to call updateMaf() prior to peeling.
  • Add a conditional statement at the end of peelingCycle to call the new function updateAltAlleleProb() if the argument est_and_update_alt_allele_prob is used.

In PeelingUpdates.py

  • Add new function updateAltAlleleProb(). This will update pedigree.AAP for each metafounder using the inferred genotype probabilities of individuals in the base population (instead of only the recorded genotypes). Then it will update the anterior of the base individuals.

General updates

  • Update the functional and accuracy tests for est_and_update_alt_allele_prob.
  • Update the documentation for est_and_update_alt_allele_prob.

So est_alt_allele_prob will only estimate the alternative allele frequency using available observed genotype data prior to peeling. est_and_update_alt_allele_prob will estimate the alternative allele frequency prior to peeling with available observed genotype data and will update it after each peeling cycle using the inferred genotype probabilities of individuals in the base population (per metafounder). Hence, with this approach, est_alt_allele_prob will ignore the metafounders. This is something that we may need to address later. For example, by allocating observed genotype data to metafounders for initial estimation per metafounder.
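A rough sketch of what the proposed updateAltAlleleProb() could compute. The array shapes, the four-state genotype ordering (aa, aA, Aa, AA), and the argument names are assumptions for illustration, not AlphaPeel's actual internals:

```python
import numpy as np

def updateAltAlleleProb(founder_geno_probs, founder_metafounders, n_metafounders):
    """Re-estimate the alternative allele probability per metafounder
    from inferred genotype probabilities of the base individuals.

    founder_geno_probs: (nFounders, 4, nLoci) probabilities of the four
    genotype states; founder_metafounders: (nFounders,) metafounder
    index of each founder. Returns (n_metafounders, nLoci)."""
    n_loci = founder_geno_probs.shape[2]
    aap = np.full((n_metafounders, n_loci), 0.5)  # default when no members
    # expected alternative-allele dosage: each het state carries one
    # alt allele, the homozygous-alt state carries two
    dosage = (founder_geno_probs[:, 1, :]
              + founder_geno_probs[:, 2, :]
              + 2.0 * founder_geno_probs[:, 3, :])
    for mf in range(n_metafounders):
        members = founder_metafounders == mf
        if members.any():
            aap[mf] = dosage[members].mean(axis=0) / 2.0
    return aap
```

Averaging expected dosage over a metafounder's founders and halving it gives a per-metafounder, per-locus allele probability that could then feed the anterior term of those founders.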

Apologies, this is a little messy. I will review and update this comment later. Just adding the current ideas.

@XingerTang
Contributor Author


@gregorgorjanc @RosCraddock Thanks for the summary. There is a point I would like to stress.
For the alternative method that updates per peeling cycle, we had a conversation about this a long time ago (#142 (comment)). Re-estimating the alternative allele frequency and then re-estimating the anterior probabilities of the founders may not improve the accuracy of the genotype probabilities, and may instead introduce noise into them. Originally, the genotype probabilities were estimated as the best per individual per locus, but now we are grouping all the progeny of one metafounder together, which may lead to a loss in precision.

@RosCraddock

Thank you for this comment, @XingerTang. I can see how re-estimating the alternative allele frequency after each peeling cycle may not lead to any improvement, particularly in pedigrees that are sparsely genotyped and reliant on the original allele frequency estimate. However, I am not sure how this could introduce noise. Is this because accuracy is lowest in the earlier generations? Or a potential conflict after grouping and estimating per metafounder?

Part of the reason for this alternative method is to account for the relationships between the genotyped individuals and those in the base population in estimating the alternative allele frequency. This will be particularly important in pedigrees with many generations, selection, and/or high random genetic drift. Re-estimating after each peeling cycle would consider this. Alternatively, we could consider the relationship between the genotyped individual and the base population/metafounder in the original estimate prior to peeling (e.g., with the linear method). However, we thought it would be worth testing the alternative method first to see if that would be sufficient.

@XingerTang
Contributor Author


@RosCraddock

The reasons that the re-estimation may introduce noise are:

  • The genotype probabilities used to generate the alternative allele frequency differ for each individual and each locus, yet the alternative allele frequency is the same across all the progeny of a metafounder. For example, suppose a group of individuals sharing one metafounder have genotypes 1, 0, 0, 0, 0 at a locus, which corresponds to an alternative allele frequency of 0.2. The re-estimated genotype probabilities would be the original ones multiplied by the genotype probability distribution implied by that allele frequency, which would pull the genotype probability at that locus for individual 1 far below its original value of 1.

  • The genotype probabilities generated from the alternative allele frequency would be unphased.

  • Even for loci whose genotype probabilities match the alternative allele frequency, remember that in AlphaPeel the genotype probabilities are generated as anterior * penetrance * posterior. By default, the anterior probabilities are all equal, and the penetrance and posterior are updated by the peeling cycle. To implement the re-estimation, we would generate the alternative allele frequencies from the penetrance and posterior, and these would be used to generate new anterior probabilities. Any differences in the genotype probabilities would then effectively be squared by this re-estimation, which would tweak the genotype probability distribution.

Moreover, our original Baum-Welch algorithm already propagates information across generations.
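The "squaring" concern above can be illustrated with a toy single-locus calculation, assuming the anterior * penetrance * posterior decomposition described in the thread (a sketch, not AlphaPeel code):

```python
import numpy as np

# If the anterior is rebuilt from an allele frequency that was itself
# estimated from penetrance * posterior, the same evidence enters the
# product twice and the distribution is pushed toward its mode.
def normalise(v):
    return v / v.sum()

penetrance = np.array([0.9, 0.05, 0.05])  # genotype states 0, 1, 2
posterior = np.array([0.8, 0.1, 0.1])
flat_anterior = np.array([1 / 3, 1 / 3, 1 / 3])

evidence = normalise(penetrance * posterior)
p = (evidence[1] + 2 * evidence[2]) / 2          # alt allele freq from evidence
hw_anterior = np.array([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])

before = normalise(flat_anterior * penetrance * posterior)
after = normalise(hw_anterior * penetrance * posterior)
print(before.round(4), after.round(4))
# "after" is pulled further toward state 0 than "before": the same
# evidence has effectively been counted twice.
```

This is only a one-individual caricature; in practice the effect is diluted across the individuals contributing to the frequency estimate, which is exactly the point debated below.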

@gregorgorjanc
Member

  • The genotype probabilities used to generate the alternative allele frequency differ for each individual and each locus, yet the alternative allele frequency is the same across all the progeny of a metafounder. That is, suppose a group of individuals sharing one metafounder have genotypes 1, 0, 0, 0, 0, which corresponds to an alternative allele frequency of 0.2. The re-estimated genotype probabilities would be the original ones multiplied by the genotype probability distribution implied by that allele frequency, which would pull the genotype probability at that locus for individual 1 far below its original value of 1.

@XingerTang we should think of a generative model. In the base population, genotypes coded as 0, 1, 2 (skipping the distinction between the two hets) come from a binomial distribution with frequency p. If founders come from two backgrounds, we can have different ps, and we need to allow for this possibility to get better estimates of genotype probabilities down the pedigree.

The genotype probabilities generated from the alternative allele frequency would be unphased.

Hmm. My understanding is that we would use the metafounder ps to construct the anterior term for the founders - with two metafounders we would have two different kinds of anterior terms (due to two different ps), assuming all unknown parents of each individual come from just one of the metafounders. There might well be cases where one unknown parent is assigned to one metafounder and the other unknown parent to another metafounder - in that case we will have three different anterior terms.

Even for loci whose genotype probabilities match the alternative allele frequency, remember that in AlphaPeel the genotype probabilities are generated as anterior * penetrance * posterior. By default, the anterior probabilities are all equal, and the penetrance and posterior are updated by the peeling cycle.

Yes.

To implement the reestimation, we are going to generate the alternative allele frequencies from the penetrance and posterior, and this will be used to generate new anterior probabilities.

Yes.

Then any differences in the genotype probabilities would be squared after this reestimation, which will tweak the genotype probability distribution.

I don't follow this part. We already use p (at the moment just one), which is effectively an estimate of the allele frequency in the population and hence also in the founders. So, what you are trying to warn us against is already happening - not by updating p with founder genotype probs, but with non-founder observed genotypes - and the principle is the same. The only thing we are now saying is that we will change p based on an estimate in the founders (instead of the genotyped individuals at the moment), and later on we will break this estimation down by metafounders (to get multiple ps).

I think we are fine - we will only update p in the hope that this will improve the anterior term for the founders and hence their genotype probabilities, which will trickle down the pedigree, and then up, update p, etc.

Let's test this and see how it works. We can always go back if needed.

@gregorgorjanc
Member

@RosCraddock @XingerTang the discussion in #142 (comment) is relevant though - if we have metafounders as individuals in the pedigree and we get their genotype probabilities, then we can also get the allele probability easily - if that is the data structure we have, then we don't need to average over the founders ... Sorry, I am lagging behind the code. @RosCraddock maybe print out the state of the variables for a simple example when you dig into this and we take it from there.

@RosCraddock

@XingerTang @gregorgorjanc
For the peeling, we do not include metafounders as (dummy) individuals. Instead, we tag the founders to the allocated metafounder as inputted by the user through the pedigree input. This is then used to allocate the corresponding allele frequency to the founders. Why we did this is explained in issue #142. So I will still need to average over the founders after each peeling.

To briefly summarise the steps going forward for metafounders and allele frequency estimation:

  1. Test the implemented code for metafounders (with accuracy and functional tests).
  2. Conduct a true alternative allele frequency test (similar to issue "No improvement with true recombination rate" #174) - this will indicate the optimum level of accuracy achievable for the inferred genotype probabilities.
  3. Implement and test the "alternative method", i.e. re-estimation after each peeling cycle.
  4. Potentially implement a linear method to (try to) optimise the original estimate of the alternative allele frequency.

@gregorgorjanc
Member

Noted with thanks! Very good reasoning and a plan forward!!!

@XingerTang
Contributor Author

@gregorgorjanc

  • The genotype probabilities used to generate the alternative allele frequency differ for each individual and each locus, yet the alternative allele frequency is the same across all the progeny of a metafounder. That is, suppose a group of individuals sharing one metafounder have genotypes 1, 0, 0, 0, 0, which corresponds to an alternative allele frequency of 0.2. The re-estimated genotype probabilities would be the original ones multiplied by the genotype probability distribution implied by that allele frequency, which would pull the genotype probability at that locus for individual 1 far below its original value of 1.

@XingerTang we should think of a generative model. In the base population, genotypes coded as 0, 1, 2 (skipping the distinction between the two hets) come from a binomial distribution with frequency p. If founders come from two backgrounds, we can have different ps, and we need to allow for this possibility to get better estimates of genotype probabilities down the pedigree.

From my perspective, better estimates of genotype probabilities down the pedigree are already achieved by Baum-Welch. The only thing we can do is take the resulting genotype probabilities from Baum-Welch and calculate an alternative allele frequency, which would certainly lose information relative to the original Baum-Welch output. I don't see in which way we could use the alternative allele frequency to improve the genotype probabilities.

The genotype probabilities generated from the alternative allele frequency would be unphased.

Hmm. My understanding is that we would use the metafounder ps to construct the anterior term for the founders - with two metafounders we would have two different kinds of anterior terms (due to two different ps), assuming all unknown parents of each individual come from just one of the metafounders. There might well be cases where one unknown parent is assigned to one metafounder and the other unknown parent to another metafounder - in that case we will have three different anterior terms.

I just realized the founders' genotype probabilities don't have to be phased, my apologies.

Even for loci whose genotype probabilities match the alternative allele frequency, remember that in AlphaPeel the genotype probabilities are generated as anterior * penetrance * posterior. By default, the anterior probabilities are all equal, and the penetrance and posterior are updated by the peeling cycle.

Yes.

To implement the re-estimation, we would generate the alternative allele frequencies from the penetrance and posterior, and these would be used to generate new anterior probabilities.

Yes.

Any differences in the genotype probabilities would then effectively be squared by this re-estimation, which would tweak the genotype probability distribution.

I don't follow this part. We already use p (at the moment just one), which is effectively an estimate of the allele frequency in the population and hence also in the founders. So, what you are trying to warn us against is already happening - not by updating p with founder genotype probs, but with non-founder observed genotypes - and the principle is the same. The only thing we are now saying is that we will change p based on an estimate in the founders (instead of the genotyped individuals at the moment), and later on we will break this estimation down by metafounders (to get multiple ps).

I think we are fine - we will only update p in the hope that this will improve the anterior term for the founders and hence their genotype probabilities, which will trickle down the pedigree, and then up, update p, etc.

Let's test this and see how it works. We can always go back if needed.

The key point is that our current implementation re-estimates the genotype probabilities with the aaf generated by Newton's method before the peeling, when the first generation knows nothing about the later generations. In that case, new information is introduced through the anterior generated by the re-estimation with the aaf.

The squaring issue affects the alternative method much more because we would only use the genotype probabilities of the founder generation to generate the alternative allele frequencies, which are then used to generate new genotype probabilities to update that same founder generation. While the squaring issue also occurs in the original implementation, the squared portion is much smaller there because genotype probabilities from many generations are involved.

@gregorgorjanc
Member


We are going in circles convincing each other without any results. Let's try and see ;) We are doing this step because we want to accommodate multiple metafounders, where we will need a way to give different base allele frequencies.

The key point is that our current implementation re-estimates the genotype probabilities with the aaf generated by Newton's method before the peeling, when the first generation knows nothing about the later generations. In that case, new information is introduced through the anterior generated by the re-estimation with the aaf.

But, our current estimation of MAF is based on genotypes from across the pedigree, so there is a "loop" of reusing the same information in the current system as well.

Let's get some tests done and see;)

@RosCraddock

RosCraddock commented Oct 8, 2024

@gregorgorjanc @XingerTang I have started the work on estimating alternative allele frequency with metafounders and have a couple of comments/questions that I wanted to note.

  1. The initial alternative allele frequency used for the Newton-Raphson method is particularly important with a small sample size. Hence, I updated the code to allow updates from a user-inputted alternative allele frequency (see here). This may be particularly valuable where the alternative allele is known to be rare or at low frequency, rather than initially assuming the default of 0.5.

  2. The Newton-Raphson method is coded iteratively, but each iteration (i.e. repeat of getNewtonUpdate()) uses the same values as the first, since maf_old is never updated (see code below). I think it does this to keep a conservative estimate where only a small sample of genotype data is available, and to prevent a false increase in confidence from repeated use of the same genotype data. However, I wonder whether there might be a benefit in updating maf_old to maf after each iteration instead. Alternatively, this limitation could be mitigated by point 1 above.

  3. Finally, the Newton-Raphson estimate improves with more genotype data, but it does not consider the proportion of individuals with genotype data in the given pedigree. For example, say we have only two individuals in a pedigree, both genotyped as 0 at a biallelic locus. Here we have complete information for the given pedigree, and the alternative allele frequency should be ~0 (0.01, to allow for error). However, AlphaPeel currently estimates the alternative allele frequency as 0.33. This was improved with a lower initial alternative allele frequency (i.e. maf_old) of 0.01 rather than the default 0.5.

I am still going ahead with the previously mentioned steps, but wanted to note these comments down!

```
maf = 0.5
maf_old = 0.5
iters = 5
converged = False
while not converged:
    delta = getNewtonUpdate(maf_old, peelingInfo, index)
    # note: maf_old is never reassigned in the loop, so every
    # iteration repeats the same Newton step (see point 2 above)
    maf = maf_old + delta
    if maf < 0.01:
        maf = 0.01
    if maf > 0.99:
        maf = 0.99
    if abs(maf - maf_old) < 0.0001:
        converged = True
    iters -= 1
    if iters < 0:
        converged = True
```

@gregorgorjanc
Member

@RosCraddock yeah, well spotted on maf_old not being updated. Doh!? This does look odd! Good that we are checking code functionality step by step ;)

@gregorgorjanc
Member

@RosCraddock indeed, allele freq will matter for small datasets/pedigrees or when most of the data is away from the founders - the typical prior vs likelihood information setting.

@gregorgorjanc
Member

gregorgorjanc commented Oct 8, 2024

@RosCraddock I advise that you develop a small example where you have two families, each coming from a different background: say, male 1 has genotype 0 and female 2 has genotype 1 (hence their population allele frequency is (0+1)/(2*2) = 0.25); then male 3 has genotype 1 and female 4 has genotype 2 (hence their population allele frequency is (1+2)/(2*2) = 0.75). Then create two offspring from each pair of parents (genotype some) and cross the siblings at random to get a 3rd generation (genotype some).

You can then play around with estimating the joint base population allele frequency (which would be (0+1+1+2)/(2*4) = 0.5) or separate/metafounder allele frequencies (which would hopefully get closer to 0.25 and 0.75). I say "play" because, depending on who is genotyped (some of the 2nd and 3rd generation, or also some from the 1st/base generation), you will have more or less information to estimate the base population allele frequency/frequencies.
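The arithmetic of this toy example can be checked in a few lines (genotypes coded 0/1/2 as counts of the alternative allele, two alleles per individual):

```python
def alt_allele_freq(genotypes):
    """Alternative allele frequency from 0/1/2 genotype codes."""
    return sum(genotypes) / (2 * len(genotypes))

family_A = [0, 1]  # male 1 (genotype 0), female 2 (genotype 1)
family_B = [1, 2]  # male 3 (genotype 1), female 4 (genotype 2)

print(alt_allele_freq(family_A))             # 0.25
print(alt_allele_freq(family_B))             # 0.75
print(alt_allele_freq(family_A + family_B))  # 0.5
```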

@RosCraddock

RosCraddock commented Oct 22, 2024

I have implemented and started testing the updating of the alternative allele frequency after each peeling cycle, both with and without the Newton-Raphson method. I tested this on a small pedigree of 22 individuals (5 generations) with either one, two, or five metafounders under four genotype missing rates (see Table 1 below). I compared all against a true alternative allele frequency to show the optimal accuracies possible (correlation of true dosage to inferred; individual accuracy only, as this is single-locus). I have also added default_alt_allele, which uses an allele frequency of 0.5 for all metafounders without updating. The true alternative allele frequency for all founders is 0.45 (close to the default).

Table 1: Individual inference accuracy (via correlation)

| | true_alt_allele_prob_file | default_alt_allele | no_update_est | update_est_alt_allele_prob_with_newton | update_est_alt_allele_prob_without_newton |
| --- | --- | --- | --- | --- | --- |
| all genos, one MF | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 |
| all genos, two MF | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 |
| all genos, five MF | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 |
| 4 missing genos, one MF | 0.939 | 0.937 | 0.938 | 0.938 | 0.938 |
| 4 missing genos, two MF | 0.933 | 0.937 | 0.938 | 0.933 | 0.931 |
| 4 missing genos, five MF | 0.937 | 0.937 | 0.938 | 0.935 | 0.929 |
| 11 missing genos, one MF | 0.751 | 0.764 | 0.756 | 0.717 | 0.715 |
| 11 missing genos, two MF | 0.787 | 0.764 | 0.756 | 0.762 | 0.754 |
| 11 missing genos, five MF | 0.870 | 0.764 | 0.756 | 0.760 | 0.701 |
| 17 missing genos, one MF | 0.723 | 0.734 | 0.705 | 0.619 | 0.634 |
| 17 missing genos, two MF | 0.792 | 0.734 | 0.705 | 0.719 | 0.673 |
| 17 missing genos, five MF | 0.925 | 0.734 | 0.705 | 0.797 | 0.831 |

General takeaways so far:

  1. With 17 of 22 individuals lacking genotype data and the true alternative allele frequency known, over 0.9 inference accuracy is possible in these examples; this was achieved with five metafounders (MF).
  2. Re-estimation without MF (i.e. with one MF) consistently results in lower inference accuracy than no re-estimation. However, with two or more MF, re-estimation after peeling generally increases the inference accuracy, suggesting re-estimation should be restricted to cases with two or more MFs.
  3. The ranking of re-estimation with versus without the Newton-Raphson method is inconsistent. The highest accuracy was achieved without it (0.831), but this was not the case for all examples.

These observations are only from a small pedigree, so I will do some further testing on some different examples (including the one described by @gregorgorjanc above). Then, I will review the run_acc_test.py for metafounders with est_alt_allele. However, I am keen to try on the Kennel Club data soon to see if we observe any difference in cross-validation.
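For reference, the with-Newton variant tested above can be sketched as a single-locus Newton-Raphson update of the allele frequency under a Hardy-Weinberg binomial likelihood, weighting each individual by its posterior genotype probabilities. This is a minimal sketch only; the function name, signature, and array layout are assumptions, not the actual `PeelingUpdates.getNewtonUpdate` code:

```python
import numpy as np

def newton_allele_freq(geno_probs, p0=0.5, n_iter=10):
    """Newton-Raphson MLE of the alternative allele frequency at one locus.

    geno_probs: (n_individuals, 3) array with P(genotype = 0, 1, 2) per
    individual. Assumes log P(g | p) = g*log(p) + (2-g)*log(1-p) + const
    (Hardy-Weinberg), so the score and curvature only need the expected
    allele dose E[g].
    """
    e_g = geno_probs @ np.array([0.0, 1.0, 2.0])  # expected dose per individual
    p = p0
    for _ in range(n_iter):
        grad = np.sum((e_g - 2.0 * p) / (p * (1.0 - p)))         # d logL / dp
        hess = np.sum(-e_g / p**2 - (2.0 - e_g) / (1.0 - p)**2)  # d2 logL / dp2
        p = min(max(p - grad / hess, 0.01), 0.99)  # keep away from the 0/1 extremes
    return p
```

With fully observed genotypes this converges to the mean allele dose divided by 2; with uncertain genotypes, the posterior probabilities weight each individual's contribution.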

@gregorgorjanc
Copy link
Member

@RosCraddock, good job being very systematic! That's the path to learning how these methods work and what is possible in the best setting (your true_alt_allele_prob_file setting), and then how close we come in different settings that are increasingly different from the best setting!!!

Can I clarify that the without Newton estimation method involves the simple/naive empirical estimate of base population allele frequency (for one or more metafounders) by averaging individual allele frequencies of the founders (appropriately grouped by metafounders)?

@RosCraddock
Copy link

@gregorgorjanc Thank you! Yes, that is right!
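That simple empirical estimate can be sketched as follows (the function name, input layout, and grouping representation are assumptions for illustration, not the actual AlphaPeel internals):

```python
import numpy as np

def empirical_mf_freq(geno_probs, founder_mf):
    """Naive (without-Newton) metafounder allele frequency estimate.

    geno_probs: dict founder id -> (n_loci, 3) posterior genotype probabilities.
    founder_mf: dict founder id -> metafounder label.
    Returns dict metafounder -> (n_loci,) allele frequency: the average of
    the founders' expected allele doses, divided by 2 (diploid), grouped
    by metafounder.
    """
    dose = np.array([0.0, 1.0, 2.0])
    freqs = {}
    for mf in set(founder_mf.values()):
        ids = [i for i, m in founder_mf.items() if m == mf]
        doses = np.stack([geno_probs[i] @ dose for i in ids])  # (n_founders, n_loci)
        freqs[mf] = doses.mean(axis=0) / 2.0
    return freqs
```

For example, two founders in one metafounder with certain genotypes 0 and 1 at a single locus give an estimate of (0+1)/(2*2) = 0.25.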

@RosCraddock
Copy link


I repeated the above with the same pedigree, but with different true genotypes and thus a different alternative allele frequency. Without grouping of the founders, the true alternative allele frequency was 0.1 (as opposed to 0.45 in the previous test). Here, updating after each peeling cycle without Newton seems to be the optimum. Interestingly, in the "11 missing genos, five MF" setting, a higher accuracy (correlation) was achieved than with the true_alt_allele_prob_file, although I suspect that was by chance.

Table 2: Individual inference accuracy (via correlation) with true alternative allele freq of 0.1.

| | true_alt_allele_prob_file | default_alt_allele | no_update_est | update_est_alt_allele_prob_with_newton | update_est_alt_allele_prob_without_newton |
| --- | --- | --- | --- | --- | --- |
| all genos, one MF | 0.999 | 0.999 | 1.0 | 1.0 | 0.999 |
| all genos, two MF | 1.0 | 0.999 | 1.0 | 0.999 | 1.0 |
| all genos, five MF | 0.999 | 0.999 | 1.0 | 0.999 | 0.999 |
| 4 missing genos, one MF | 0.997 | 0.937 | 0.995 | 0.987 | 0.997 |
| 4 missing genos, two MF | 1.0 | 0.937 | 0.995 | 0.989 | 1.0 |
| 4 missing genos, five MF | 0.999 | 0.937 | 0.995 | 0.968 | 0.999 |
| 11 missing genos, one MF | 0.713 | 0.480 | 0.699 | 0.689 | 0.711 |
| 11 missing genos, two MF | 0.817 | 0.480 | 0.699 | 0.725 | 0.785 |
| 11 missing genos, five MF | 0.800 | 0.480 | 0.699 | 0.615 | 0.885 |
| 17 missing genos, one MF | 0.599 | 0.372 | 0.537 | 0.567 | 0.594 |
| 17 missing genos, two MF | 0.610 | 0.372 | 0.537 | 0.526 | 0.571 |
| 17 missing genos, five MF | 0.751 | 0.372 | 0.537 | 0.445 | 0.732 |

Afterwards, I developed another small pedigree, as described above by @gregorgorjanc. The true alternative allele frequency without any grouping of the founders was 0.5 for Table 3 and 0.25 for Table 4.

Table 3: Individual inference accuracy (via correlation) with true alternative allele freq of 0.5.

| | true_alt_allele_prob_file | default_alt_allele | no_update_est | update_est_alt_allele_prob_with_newton | update_est_alt_allele_prob_without_newton |
| --- | --- | --- | --- | --- | --- |
| all genos, one MF | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 |
| all genos, two MF | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 |
| 4 missing genos, one MF | 0.932 | 0.932 | 0.932 | 0.932 | 0.932 |
| 4 missing genos, two MF | 0.932 | 0.932 | 0.932 | 0.932 | 0.932 |
| 8 missing genos, one MF | 0.840 | 0.840 | 0.845 | 0.837 | 0.834 |
| 8 missing genos, two MF | 0.858 | 0.840 | 0.845 | 0.846 | 0.855 |
| 10 missing genos, one MF | 0.828 | 0.828 | 0.822 | 0.822 | 0.815 |
| 10 missing genos, two MF | 0.831 | 0.828 | 0.822 | 0.832 | 0.797 |

Table 4: Individual inference accuracy (via correlation) with true alternative allele freq of 0.25.

| | true_alt_allele_prob_file | default_alt_allele | no_update_est | update_est_alt_allele_prob_with_newton | update_est_alt_allele_prob_without_newton |
| --- | --- | --- | --- | --- | --- |
| all genos, one MF | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 |
| all genos, two MF | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 |
| 4 missing genos, one MF | 0.999 | 0.996 | 0.998 | 0.998 | 0.999 |
| 4 missing genos, two MF | 0.996 | 0.996 | 0.998 | 0.995 | 0.994 |
| 8 missing genos, one MF | 0.961 | 0.930 | 0.943 | 0.943 | 0.955 |
| 8 missing genos, two MF | 0.930 | 0.930 | 0.943 | 0.922 | 0.905 |
| 10 missing genos, one MF | 0.937 | 0.861 | 0.881 | 0.871 | 0.882 |
| 10 missing genos, two MF | 0.858 | 0.861 | 0.881 | 0.824 | 0.740 |

Then I tested on a larger pedigree of 1000 individuals, with 500 from one population and 500 from another. These have around a 50% missing rate in the observed genotypes.

Table 5: Marker accuracy

| | true_alt_allele_prob_file | default_alt_allele | no_update_est | update_est_alt_allele_prob_with_newton | update_est_alt_allele_prob_without_newton |
| --- | --- | --- | --- | --- | --- |
| MF_sep_one, 50% geno_missing (randomly) | 0.895 | 0.874 | 0.894 | 0.892 | 0.894 |
| MF_sep_two, 50% geno_missing (randomly) | 0.901 | 0.874 | 0.894 | 0.899 | 0.901 |

To summarise:
The results are slightly mixed, but update_est_alt_allele_prob_without_newton generally seems best (especially when considering runtime as well). So, I will implement update_est_alt_allele_prob_without_newton (i.e. the average of the founders' allele probabilities after each peeling cycle) with the user input -alt_allele_prob_method.
After that, I will review the functional tests and the accuracy test. This needs a little work, as the current simulations for testing the accuracy of metafounders that I did in May are not adequate.

@RosCraddock
Copy link

@gregorgorjanc @XingerTang

I just ran a functional test in which the user inputs the estimated allele frequencies as 0 for all loci, yet everyone in the pedigree (albeit a small example) has genotype 1 at all loci. I ran this with -alt_allele_prob_method set to update after peeling, which gave everyone a genotype of 0 while keeping the inputted alternative allele frequencies of 0. In this example, I am only using a single metafounder. I then used -est_alt_allele_prob, which re-estimates the alternative allele frequencies from the available observed genotypes, using the inputted alternative allele frequencies as the starting point before peeling. This gave the correct alternative allele frequencies of 0.5 and the correct genotypes/dosages for each individual. However, since -est_alt_allele_prob uses all observed genotypes regardless of any genetic grouping, this would not necessarily be the case with multiple metafounders.

Referring to the code for estimating the alternative allele frequencies, the values of maf (i.e., the alternative allele frequency) can range from 0.01 to 0.99 and never reach the extremes of 0 and 1 (unlike the user inputs). So, I redid the above, inputting alternative allele frequencies of 0.01 (instead of 0) with -alt_allele_prob_method, and it gave the correct alternative allele frequencies of 0.5 and inferred dosages of 1 (as inputted through the geno_file).

Question: Should we enforce this range (0.01 to 0.99) on the user-inputted alternative allele frequencies too, or only for estimations (est_alt_allele_prob and alt_allele_prob_method)? I appreciate this is an extreme example, and there may be cases where the user knows the frequency to be 1 or 0 in the founders.

@gregorgorjanc
Copy link
Member

@RosCraddock excellent testing! Yes, when we specify 0 or 1 as the allele frequency, we get an extreme case that many estimation algorithms will have a very hard time "escaping" from; how hard depends on the algorithm and the data that go in, as you can see from your testing. As you suggest, please do clamp the allele frequency to 0.01 if a lower value is provided and to 0.99 if a higher one is provided, and issue a warning. Time will tell if this is too restrictive; we could loosen the limits or make them an argument (with [0.01, 0.99] defaults that a user could change, say).
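A minimal sketch of that clamping, with the bounds hard-coded at the [0.01, 0.99] defaults discussed (the function and argument names are hypothetical, not the actual AlphaPeel code):

```python
import warnings
import numpy as np

def clamp_allele_freq(freq, lower=0.01, upper=0.99):
    """Clip user-provided allele frequencies into [lower, upper], warning
    if any value was outside the range."""
    freq = np.asarray(freq, dtype=float)
    if np.any(freq < lower) or np.any(freq > upper):
        warnings.warn(
            f"Allele frequencies outside [{lower}, {upper}] were clipped; "
            "values at the 0/1 extremes can trap the estimation algorithm."
        )
    return np.clip(freq, lower, upper)
```

Keeping `lower` and `upper` as arguments makes it straightforward to later expose them as user options if the defaults prove too restrictive.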
