Skip to content

Commit 67edd8e

Browse files
committed
Document algorithms
1 parent c600a48 commit 67edd8e

File tree

2 files changed

+68
-16
lines changed

2 files changed

+68
-16
lines changed

README.rst

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -117,8 +117,8 @@ Loading SNPs('resources/663.23andme.305.txt.gz')
117117

118118
Now we can perform some analysis between the ``User662`` and ``User663`` datasets.
119119

120-
Find Discordant SNPs
121-
''''''''''''''''''''
120+
`Find Discordant SNPs <https://lineage.readthedocs.io/en/latest/lineage.html#lineage.Lineage.find_discordant_snps>`_
121+
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
122122
First, let's find discordant SNPs (i.e., SNP data that is not consistent with Mendelian
123123
inheritance):
124124

@@ -136,8 +136,8 @@ the prompt, although the same output is available in the CSV file.
136136

137137
Not counting mtDNA SNPs, there are 37 discordant SNPs between these two datasets.
138138

139-
Find Shared DNA
140-
'''''''''''''''
139+
`Find Shared DNA <https://lineage.readthedocs.io/en/latest/lineage.html#lineage.Lineage.find_shared_dna>`_
140+
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
141141
``lineage`` uses the probabilistic recombination rates throughout the human genome from the
142142
`International HapMap Project <https://www.genome.gov/10001688/international-hapmap-project/>`_
143143
and the `1000 Genomes Project <https://www.internationalgenome.org>`_ to compute the shared DNA
@@ -180,8 +180,8 @@ details the shared segments of DNA on one chromosome and a plot that illustrates
180180

181181
.. image:: https://raw.githubusercontent.com/apriha/lineage/master/docs/images/shared_dna_User662_User663_HapMap2.png
182182

183-
Find Shared Genes
184-
'''''''''''''''''
183+
`Find Shared Genes <https://lineage.readthedocs.io/en/latest/lineage.html#lineage.Lineage.find_shared_dna>`_
184+
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
185185
The `Central Dogma of Molecular Biology <https://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology>`_
186186
states that genetic information flows from DNA to mRNA to proteins: DNA is transcribed into
187187
mRNA, and mRNA is translated into a protein. It's more complicated than this (it's biology

src/lineage/__init__.py

Lines changed: 62 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -290,6 +290,36 @@ def find_shared_dna(
290290
291291
All output is saved to the output directory as `CSV` or `PNG` files.
292292
293+
Notes
294+
-----
295+
The code is commented throughout to help describe the algorithm and its operation.
296+
297+
To summarize, the algorithm first computes the genetic distance in cMs between SNPs
298+
common to all individuals using the specified genetic map.
299+
300+
Then, individuals are compared for whether they share one or two alleles for each SNP in
301+
common; in this manner, where all individuals share one chromosome, for example, there
302+
will be several SNPs in a row where at least one allele is shared between individuals for
303+
each SNP. The ``cM_threshold`` is then applied to each of these "matching segments" to
304+
determine whether the segment could be a potential shared DNA segment (i.e., whether each
305+
segment has a cM value greater than the threshold).
306+
307+
The matching segments that passed the ``cM_threshold`` are then checked to see if they
308+
are adjacent to another matching segment, and if so, the segments are stitched together,
309+
and the single SNP separating the segments is flagged as potentially discrepant. (This
310+
means that multiple smaller matching segments passing the ``cM_threshold`` could be
311+
stitched, identifying the SNP between each segment as discrepant.)
312+
313+
Next, the ``snp_threshold`` is applied to each segment to ensure there are enough SNPs in
314+
the segment and the segment is not only a few SNPs in a region with a high recombination
315+
rate; for each segment that passes this test, we have a segment of shared DNA, and the
316+
total cMs for this segment are computed.
317+
318+
Finally, discrepant SNPs are checked to ensure that only SNPs internal to a shared DNA
319+
segment are reported as discrepant (i.e., don't report a discrepant SNP if it was part of a
320+
segment that didn't pass the ``snp_threshold``). Currently, no action other than reporting
321+
is taken on discrepant SNPs.
322+
293323
Parameters
294324
----------
295325
individuals : iterable of Individuals
@@ -330,15 +360,18 @@ def find_shared_dna(
330360
two_chrom_discrepant_snps (pandas.Index)
331361
discrepant SNPs discovered while finding shared DNA on two chromosomes
332362
"""
363+
# initialize all objects to be returned to be empty to start
333364
one_chrom_shared_dna = pd.DataFrame()
334365
two_chrom_shared_dna = pd.DataFrame()
335366
one_chrom_shared_genes = pd.DataFrame()
336367
two_chrom_shared_genes = pd.DataFrame()
337368
one_chrom_discrepant_snps = pd.Index([])
338369
two_chrom_discrepant_snps = pd.Index([])
339370

371+
# ensure that all individuals have SNPs that are mapped relative to Build 37
340372
self._remap_snps_to_GRCh37(individuals)
341373

374+
# return if there aren't enough individuals to compare
342375
if len(individuals) < 2:
343376
logger.warning("find_shared_dna requires two or more individuals...")
344377
return self._find_shared_dna_return_helper(
@@ -350,6 +383,7 @@ def find_shared_dna(
350383
two_chrom_discrepant_snps,
351384
)
352385

386+
# load the specified genetic map (one genetic map for each chromosome)
353387
genetic_map_dfs = self._resources.get_genetic_map(genetic_map)
354388

355389
if len(genetic_map_dfs) == 0:
@@ -362,26 +396,34 @@ def find_shared_dna(
362396
two_chrom_discrepant_snps,
363397
)
364398

365-
cols = ["genotype{}".format(str(i)) for i in range(len(individuals))]
399+
# generate a list of dynamically named columns for each individual's genotype
400+
# (e.g., genotype0, genotype1, etc).
401+
cols = [f"genotype{str(i)}" for i in range(len(individuals))]
366402

403+
# set the reference SNPs to compare to be that of the first individual
367404
df = individuals[0].snps
368405
df = df.rename(columns={"genotype": cols[0]})
369406

407+
# build-up a dataframe of SNPs that are common to all individuals
370408
for i, ind in enumerate(individuals[1:]):
371409
# join SNPs for all individuals
372410
df = df.join(ind.snps["genotype"], how="inner")
373411
df = df.rename(columns={"genotype": cols[i + 1]})
374412

413+
# set a flag for if one individuals is male (i.e., only one chromosome match on the X
414+
# chromosome is possible in the non-PAR region)
375415
one_x_chrom = self._is_one_individual_male(individuals)
376416

417+
# create tasks to compute the genetic distances (cMs) between each SNP on each chromosome
377418
tasks = []
378-
379419
chroms_to_drop = []
380420
for chrom in df["chrom"].unique():
381421
if chrom not in genetic_map_dfs.keys():
382422
chroms_to_drop.append(chrom)
383423
continue
384424

425+
# each task requires the genetic map for the chromosome and the positions of all SNPs
426+
# in common on that chromosome
385427
tasks.append(
386428
{
387429
"genetic_map": genetic_map_dfs[chrom],
@@ -390,19 +432,24 @@ def find_shared_dna(
390432
}
391433
)
392434

393-
# drop chromosomes without genetic distance data
435+
# drop chromosomes without genetic distance data (e.g., chroms MT, PAR, etc.)
394436
for chrom in chroms_to_drop:
395437
df = df.drop(df.loc[df["chrom"] == chrom].index)
396438

397-
# determine the genetic distance between each SNP using the HapMap Phase II genetic map
439+
# determine the genetic distance between each SNP using the specified genetic map
398440
snp_distances = map(self._compute_snp_distances, tasks)
399441
snp_distances = pd.concat(snp_distances)
442+
443+
# extract the column "cM_from_prev_snp" from the result and add that to the dataframe
444+
# of SNPs common to all individuals; now we have the genetic distance between each SNP
400445
df["cM_from_prev_snp"] = snp_distances["cM_from_prev_snp"]
401446

447+
# now we apply a mask for whether all individuals match on one or two chromosomes...
448+
# first, set all rows for these columns to True
402449
df["one_chrom_match"] = True
403450
df["two_chrom_match"] = True
404-
405-
# determine where individuals share an allele on one chromosome
451+
# determine where individuals share an allele on one chromosome (i.e., set to False when
452+
# at least one allele doesn't match for all individuals)
406453
for genotype1, genotype2 in combinations(cols, 2):
407454
df.loc[
408455
~df[genotype1].isnull()
@@ -414,7 +461,8 @@ def find_shared_dna(
414461
"one_chrom_match",
415462
] = False
416463

417-
# determine where individuals share alleles on both chromosomes
464+
# determine where individuals share alleles on two chromosomes (i.e., set to False when
465+
# two alleles don't match for all individuals)
418466
for genotype1, genotype2 in combinations(cols, 2):
419467
df.loc[
420468
~df[genotype1].isnull()
@@ -430,12 +478,14 @@ def find_shared_dna(
430478
# genotype columns are no longer required for calculation
431479
df = df.drop(cols, axis=1)
432480

481+
# find shared DNA on one chrom
433482
one_chrom_shared_dna, one_chrom_discrepant_snps = self._find_shared_dna_helper(
434483
df[["chrom", "pos", "cM_from_prev_snp", "one_chrom_match"]],
435484
cM_threshold,
436485
snp_threshold,
437486
one_x_chrom,
438487
)
488+
# find shared DNA on two chroms
439489
two_chrom_shared_dna, two_chrom_discrepant_snps = self._find_shared_dna_helper(
440490
df[["chrom", "pos", "cM_from_prev_snp", "two_chrom_match"]],
441491
cM_threshold,
@@ -646,7 +696,7 @@ def _is_one_individual_male(self, individuals):
646696
return False
647697

648698
def _compute_snp_distances(self, task):
649-
""" Compute genetic distance between SNPs.
699+
""" Compute genetic distance in cMs between SNPs.
650700
651701
Parameters
652702
----------
@@ -732,8 +782,8 @@ def _compute_shared_dna(self, task):
732782
"two_chrom_match",
733783
] = False
734784

735-
# get consecutive strings of trues
736-
# http://stackoverflow.com/a/17151327
785+
# get consecutive strings of Trues, for where there's a one or two chrom match between
786+
# individuals, depending on the task; http://stackoverflow.com/a/17151327
737787
a = df.loc[(df["chrom"] == chrom)][match_col].values
738788
a = np.r_[a, False]
739789
a_rshifted = np.roll(a, 1)
@@ -744,8 +794,10 @@ def _compute_shared_dna(self, task):
744794
a_ends = np.nonzero(ends)[0]
745795
a_ends = np.reshape(a_ends, (len(a_ends), 1))
746796

797+
# get the matching segments
747798
matches = np.hstack((a_starts, a_ends))
748799

800+
# compute total cMs for each matching segment
749801
c = np.r_[0, df.loc[(df["chrom"] == chrom)]["cM_from_prev_snp"].cumsum()][
750802
matches
751803
]

0 commit comments

Comments
 (0)