Implement alt_allele_prob option #142

Open
AprilYUZhang opened this issue Dec 11, 2023 · 13 comments

@AprilYUZhang

So we can change the allele frequency in the founders (rather than setting it to 0.5 or estimating it from the data)

@RosCraddock

@XingerTang - assigned to this issue.

@XingerTang
Contributor

XingerTang commented Jan 4, 2024

@RosCraddock @gregorgorjanc

!!!This comment is DISCARDED

Coding tasks to do:

  • Modification in tinyhouse.pedigree to store the information of the metafounders while reading in the pedigree file
    • Add a MetaFounder flag/attribute to the Individual class. While reading in the pedigree data, set the flag to True for each individual whose name starts with MF_ (we can also check that metafounders are actually founders and raise an error if not)
    • Add automatically generated MF_1 individual to the individual list of the Pedigree object after the input pedigree data is read
    • Modify Pedigree.readInPedigree so that, for each individual in the pedigree:
    if both parents are None and the individual is not a MetaFounder:
      set MF_1 as both parents
    
  • Modification in alphapeel.peelinginfo to store the corresponding alternative allele frequency for each of the metafounders
    • Add jit_peelingInformation.nMF to store the number of metafounders in the population
    • Add jit_peelingInformation.MFList to store the list of the metafounder individuals (or their ids)
    • Initialize jit_peelingInformation.maf as an nMF $\times$ nLoci numpy matrix
  • Modification in alphapeel.peelinginfo to allow user-defined alternative allele frequency to be used in the calculation
    • Calculate the values based on the provided alternative allele frequency while initializing the anterior probabilities of the metafounders
  • Modification in alphapeel.tinypeel to add the option alt_allele_prob
    • Add option alt_allele_prob
    • Add preference for the case when both alt_allele_prob and est_alt_prob are used (GG: in that case we use alt_allele_prob as a starting value for est_alt_prob)

@gregorgorjanc
Member

I spoke with @AprilYUZhang today and she pointed out some bits that I want to clarify here.

Current state in AlphaPeel is the following:

  1. Estimate alternative allele probability based on observed genotyped data (using Newton method on genotype probabilities from observed genotypes (something like this), so accounting for genotyping error)
  2. Take 1. and set it for the rest of the program execution
  3. Use 1. to set anterior term for founders (since alternative allele probability is fixed, so are these anterior terms)
  4. Peel down and up a couple of times to propagate observed genotype information across pedigree
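
Step 3 can be illustrated with a minimal sketch, assuming Hardy-Weinberg proportions and four phased genotype states ordered aa, aA, Aa, AA. The function name `founder_anterior` and the state ordering are assumptions made for illustration; they are not taken from the AlphaPeel code.

```python
import numpy as np


def founder_anterior(alt_allele_prob):
    """Return a (4, nLoci) array of founder genotype probabilities.

    Assumes Hardy-Weinberg proportions; rows are the phased states
    aa, aA, Aa, AA (an assumed ordering).
    """
    p = np.asarray(alt_allele_prob, dtype=float)
    q = 1.0 - p
    return np.stack([q * q, q * p, p * q, p * p])


# Two loci with alternative allele probabilities 0.5 and 0.1.
anterior = founder_anterior([0.5, 0.1])
```

Because the alternative allele probability is fixed in the current state, this array would be computed once and reused for every founder and every peeling cycle.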

There are two issues with the above:
a) in 1. we are estimating alternative allele probability for an "undefined" population (we take any observed genotypes in pedigree), while we really need base population alternative allele probability - while the estimate based on the "undefined" population is not the base population estimate, it probably isn’t miles off, but see also c)
b) as we discussed in person, a) will not do what we need for metafounders, but see also c)
c) once we get the estimate, keeping it fixed might not be what we want. Even if the estimate from 1. is slightly off, we could use it as a starting value and then update the base population alternative allele probability by estimating it from the inferred individual genotype probabilities of just the founders; that way we could converge to a better solution. This might make the running time of AlphaPeel longer, or we might need more peeling runs. At the moment we effectively take a simple estimate, fix it, and then estimate individual genotype probabilities given that estimate. The starting-value-and-convergence approach could well work for more than one metafounder too, so there is hope for b) too

The above suggests that we would like to end up in this "correct" state:

  1. Estimate alternative allele probability based on observed genotyped data (using Newton method on genotype probabilities from observed genotypes (something like this), so accounting for genotyping error) --> test how the linear model method with genetic groups could serve us better, but note that even a starting value and updates in the founders could work well, so I suggest we do this linear model method last
  2. Take 1. and set it for the rest of the program execution --> I would like us to explore updating base population allele probability with every round of peeling (we start going down and then up, so when we come up, we have genotype probs for founders and we can estimate allele prob there, even separated by multiple metafounders)
  3. Use 1. to set anterior term for founders (since alternative allele probability is fixed, so are these anterior terms) --> implementing the change in 2. means we would update the anterior term for founders every iteration too
  4. Peel down and up a couple of times to propagate observed genotype information across pedigree --> hopefully the above changes would not make the algorithm/runtime much slower (as in, that we would need more iterations)

@XingerTang
Contributor

@gregorgorjanc Thank you for summarizing this! There is just one point I would like to clarify. In steps 2 and 3 of the "correct" state, you mentioned that we would update the estimation of alternative allele probability every peeling cycle, and use the updated allele probability to reestimate the anterior terms. But, we had a conversation about the information contained in the updated alternative allele probability, which is the same as the information contained in the anterior terms after each peeling cycle. If we reestimate anterior terms based on the updated alternative allele probability, it would be the same as the one before the reestimation. So we probably would only do the estimation at the very beginning of the whole peeling process for the peeling accuracy and the reestimation at the very end of the peeling process for the more accurate alternative allele probability output.

@gregorgorjanc
Member

@XingerTang right, I keep forgetting that with the addition of metafounders the founders of the new internal pedigree are the metafounders which are “parents” of all our actual founding individuals! Let’s see … so, these metafounders will have anterior, penetrance, and “posterior” terms. When we have a starting allele prob (passed by user or estimated from the data) we should use that for the anterior term of the metafounder(s). Then we peel down and up the pedigree. Once we come up, we will have estimated individual genotype probabilities for the metafounder(s) by combining the anterior and “posterior” terms (the “posterior” term will collect all the information from all descendants of each metafounder) while penetrance will always be unknown for metafounders (unless we have some prior information). These estimated individual genotype probabilities for the metafounder(s) are in fact estimated base population genotype probabilities and we can simply convert these to estimate the base population allele frequency (possibly for more than one metafounder). Having this estimate, we can update the anterior term of the metafounder(s) and repeat peeling down and up. There will be a cycle/loop of information flow so we will have to test how it works in terms of accuracy and runtime till convergence (we might need to add actual convergence metric!). How does this sound?
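
The update loop described above could look roughly like the following sketch. It is not the AlphaPeel implementation: the function names are hypothetical, the four states are assumed to be ordered aa, aA, Aa, AA, and the "posterior" terms are held fixed as a stand-in for what a real down-and-up peeling cycle would recompute each time.

```python
import numpy as np


def hwe_anterior(p):
    """Genotype probabilities (4, nLoci) for states aa, aA, Aa, AA under HWE."""
    q = 1.0 - p
    return np.stack([q * q, q * p, p * q, p * p])


def estimate_base_allele_prob(posterior_terms, start=0.5, max_cycles=100, tol=1e-8):
    """Iterate anterior -> genotype probabilities -> allele probability.

    posterior_terms: array (nFounders, 4, nLoci), a fixed stand-in for the
    information that peeling would collect from each founder's descendants.
    Returns the estimate per locus and the number of cycles used.
    """
    posterior_terms = np.asarray(posterior_terms, dtype=float)
    p = np.full(posterior_terms.shape[-1], start, dtype=float)
    for cycle in range(1, max_cycles + 1):
        # Combine the current anterior with each founder's posterior term.
        geno = hwe_anterior(p)[None, :, :] * posterior_terms
        geno /= geno.sum(axis=1, keepdims=True)
        # Allele probability = P(AA) + half of the heterozygote probability.
        per_founder = geno[:, 3] + 0.5 * (geno[:, 1] + geno[:, 2])
        new_p = per_founder.mean(axis=0)
        # Convergence metric: largest change across loci.
        converged = np.max(np.abs(new_p - p)) < tol
        p = new_p
        if converged:
            break
    return p, cycle


terms = np.array([
    [[1.0], [0.0], [0.0], [0.0]],  # founder known to be aa
    [[0.0], [0.0], [0.0], [1.0]],  # founder known to be AA
    [[1.0], [1.0], [1.0], [1.0]],  # founder with no information
])
p_hat, n_cycles = estimate_base_allele_prob(terms, start=0.2)
```

In this toy input the estimate moves from the starting value 0.2 towards 0.5. With more than one metafounder, the same update could be run separately on the founders tagged with each metafounder. A starting value of exactly 0 or 1 can make the normalisation divide by zero, so a real implementation would need to guard against that.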

@XingerTang
Contributor

@gregorgorjanc Sure, it sounds doable.

@XingerTang
Contributor

@gregorgorjanc @AprilYUZhang @RosCraddock

I just noticed a big issue behind our current implementation of the metafounder:

In the old AlphaPeel, for those 0s in the pedigree input file, each of them is used to generate a specific dummy individual, which has its own set of information stored. However, if we are going to replace these 0s with the main metafounder then there is only one dummy individual being created in this case.

The problem is that even though the individuals may share the same alternative allele frequency, it doesn't mean that they can share their parents. For example, it's possible for a locus of one individual to have genotype 0 and the same locus of another individual to have genotype 2, while both individuals have an alternative allele frequency of 0.5 at that locus. But in that case, the individuals cannot share the same parent.

One way to solve the problem is to generate a dummy individual for each case in which a metafounder is used. The only information then known about the dummy individuals from the same metafounder is that they share the same alternative allele frequency and nothing more.

@XingerTang
Contributor

A correction on the reason why individuals can't share their parents. The information of an individual may be passed to another unrelated individual through their fake common parent metafounder such that it would affect their genotype probability distributions during the peeling.
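
The information leak described in this correction can be shown with a toy single-locus calculation. This is only an illustrative sketch with assumptions that are not from AlphaPeel: unphased genotypes 0/1/2, Hardy-Weinberg proportions, and each individual's second parent drawn at random from the base population.

```python
import numpy as np


def hwe(p):
    """Genotype probabilities for alternative allele counts 0, 1, 2."""
    return np.array([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])


def child_given_parent(p):
    """Rows: shared parent genotype 0/1/2; columns: child genotype 0/1/2.

    The other allele comes from a random base population parent.
    """
    rows = []
    for g in range(3):
        t = g / 2.0  # chance the shared parent transmits the alternative allele
        rows.append([(1 - t) * (1 - p), t * (1 - p) + (1 - t) * p, t * p])
    return np.array(rows)


p = 0.5
trans = child_given_parent(p)

# Individual 1 is observed with genotype 2; update the shared dummy parent.
parent_post = hwe(p) * trans[:, 2]
parent_post /= parent_post.sum()

# Genotype distribution of an unrelated individual 2 in both setups.
child2_shared = parent_post @ trans  # shares the dummy parent with individual 1
child2_independent = hwe(p)          # only shares the allele frequency
```

With a shared dummy parent, the probability that individual 2 has genotype 2 rises from 0.25 to 0.375 purely because individual 1 was observed with genotype 2, even though the two individuals are unrelated.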

@XingerTang
Contributor

XingerTang commented Aug 22, 2024

@RosCraddock
The following are the steps to build basic input/output functionality of the alternative allele probability

  • Modification in alphapeel.tinypeel to add input options and output options
    • Add input option main_metafounder to input_parser with default MF_1
    • Add input option alt_allele_prob_file to input_parser
    • Add input option est_alt_allele_prob to peeling_control_parser
    • Add output option alt_allele_prob to output_parser
  • Modification in alphapeel.tinypeel to update alternative allele probability before peeling if the input of the option alt_allele_prob_file is not None
    • In runPeelingCycles(), before the peeling, iterate through all the individuals in the pedigree; if an individual is a founder, update its anterior based on the individual's MetaFounder attribute and the pedigree's AAF attribute.
  • Add function to write alternative allele frequency output in alphapeel.peelingIO.
  • Call the function from the above step in the main() function in alphapeel.tinypeel after the peeling has completed, if the output option alt_allele_prob is used.
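
A rough sketch of how the four options listed above might be declared with `argparse`. The option names come from the list; the grouping, the single-dash style, the flag types, and the help strings are assumptions for illustration and may differ from the actual parsers in alphapeel.tinypeel.

```python
import argparse


def build_parser():
    """Hypothetical parser holding only the options discussed above."""
    parser = argparse.ArgumentParser(description="Sketch of the new options")

    input_group = parser.add_argument_group("input")
    input_group.add_argument(
        "-main_metafounder", default="MF_1", type=str,
        help="Metafounder assigned to founders without an explicit one.")
    input_group.add_argument(
        "-alt_allele_prob_file", default=None, type=str,
        help="File with alternative allele probabilities per metafounder.")

    peeling_group = parser.add_argument_group("peeling control")
    peeling_group.add_argument(
        "-est_alt_allele_prob", action="store_true",
        help="Estimate the alternative allele probabilities (assumed flag).")

    output_group = parser.add_argument_group("output")
    output_group.add_argument(
        "-alt_allele_prob", action="store_true",
        help="Write out the alternative allele probabilities (assumed flag).")
    return parser


# "aaf.txt" is a placeholder file name.
args = build_parser().parse_args(["-alt_allele_prob_file", "aaf.txt", "-alt_allele_prob"])
```

Here `args.alt_allele_prob_file` would trigger the anterior update before peeling, and `args.alt_allele_prob` would trigger the output function after peeling.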

@XingerTang
Contributor

@RosCraddock

Thank you, Evie. I will make those changes over the next few days. For completeness of this issue, I will summarise the change in the approach and the reasoning for it (i.e., our meeting minutes from yesterday):

  • Firstly, as @XingerTang noted above, by creating dummy individuals for all metafounders, we may inadvertently introduce assumed sibling relationships that impact the distribution of the genotypes, particularly when phased.
  • Secondly, questions arose about how AlphaPeel would handle the selfing from metafounder assignment, especially with phasing. One solution was to create maternal and paternal metafounder dummies; however, this still left us with the first issue.

Solution: Hence, we decided not to create dummy individuals for the metafounders but to use the user-assigned metafounders as a tag for assigning the matching alternative allele frequency to the base (founder) individuals. This addresses the above two issues, ensuring that the metafounders only inform the alternative allele frequency and imply no assumed sibling relationships. Additionally, this solution will minimise storage use compared to the others, hopefully maintaining AlphaPeel's runtimes.
This will first be applied using the standard alternative allele frequency of 0.5 or the user-defined allele frequencies for each metafounder. We will then review how to estimate the alternative allele frequencies.
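
The tag-based approach might be sketched as follows. The dictionaries and function names are hypothetical stand-ins for the pedigree attributes, and the four-state Hardy-Weinberg anterior (states aa, aA, Aa, AA) is an assumption for illustration.

```python
import numpy as np


def hwe_anterior(p):
    """Genotype probabilities (4, nLoci) for states aa, aA, Aa, AA under HWE."""
    q = 1.0 - p
    return np.stack([q * q, q * p, p * q, p * p])


def founder_anteriors(founder_metafounder, alt_allele_prob, n_loci, default=0.5):
    """Look up each founder's metafounder tag and build its anterior term.

    No dummy parent is created: the tag only selects which alternative
    allele probabilities to use, falling back to `default` when the
    metafounder has no user-defined values.
    """
    anteriors = {}
    for individual, metafounder in founder_metafounder.items():
        p = alt_allele_prob.get(metafounder, np.full(n_loci, default))
        anteriors[individual] = hwe_anterior(np.asarray(p, dtype=float))
    return anteriors


# Hypothetical founders tagged with metafounders, two loci.
tags = {"A": "MF_1", "B": "MF_2", "C": "MF_1"}
user_defined = {"MF_2": np.array([0.2, 0.9])}  # MF_1 falls back to 0.5
anteriors = founder_anteriors(tags, user_defined, n_loci=2)
```

Founders A and C receive identical anterior terms because they carry the same tag, but they are not linked through any shared parent, so no information flows between them during peeling.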

@gregorgorjanc
Member

@RosCraddock @XingerTang with the pr #148 now merged in, please revise the tasks listed above. Are any tasks still outstanding?

@RosCraddock

Thank you for merging, @gregorgorjanc! All tasks relating to pr #175 have been completed. However, some comments on this issue are still relevant for the second stage (i.e., the estimation of the alternative allele frequency).
