Skip to content

Commit e58bd2b

Browse files
authored
Merge pull request #94 from apriha/develop
v4.1.1
2 parents 5a47dc0 + 67edd8e commit e58bd2b

File tree

4 files changed

+83
-32
lines changed

4 files changed

+83
-32
lines changed

README.rst

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -117,8 +117,8 @@ Loading SNPs('resources/663.23andme.305.txt.gz')
117117

118118
Now we can perform some analysis between the ``User662`` and ``User663`` datasets.
119119

120-
Find Discordant SNPs
121-
''''''''''''''''''''
120+
`Find Discordant SNPs <https://lineage.readthedocs.io/en/latest/lineage.html#lineage.Lineage.find_discordant_snps>`_
121+
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
122122
First, let's find discordant SNPs (i.e., SNP data that is not consistent with Mendelian
123123
inheritance):
124124

@@ -136,8 +136,8 @@ the prompt, although the same output is available in the CSV file.
136136

137137
Not counting mtDNA SNPs, there are 37 discordant SNPs between these two datasets.
138138

139-
Find Shared DNA
140-
'''''''''''''''
139+
`Find Shared DNA <https://lineage.readthedocs.io/en/latest/lineage.html#lineage.Lineage.find_shared_dna>`_
140+
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
141141
``lineage`` uses the probabilistic recombination rates throughout the human genome from the
142142
`International HapMap Project <https://www.genome.gov/10001688/international-hapmap-project/>`_
143143
and the `1000 Genomes Project <https://www.internationalgenome.org>`_ to compute the shared DNA
@@ -180,8 +180,8 @@ details the shared segments of DNA on one chromosome and a plot that illustrates
180180

181181
.. image:: https://raw.githubusercontent.com/apriha/lineage/master/docs/images/shared_dna_User662_User663_HapMap2.png
182182

183-
Find Shared Genes
184-
'''''''''''''''''
183+
`Find Shared Genes <https://lineage.readthedocs.io/en/latest/lineage.html#lineage.Lineage.find_shared_dna>`_
184+
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
185185
The `Central Dogma of Molecular Biology <https://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology>`_
186186
states that genetic information flows from DNA to mRNA to proteins: DNA is transcribed into
187187
mRNA, and mRNA is translated into a protein. It's more complicated than this (it's biology

src/lineage/__init__.py

Lines changed: 62 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -290,6 +290,36 @@ def find_shared_dna(
290290
291291
All output is saved to the output directory as `CSV` or `PNG` files.
292292
293+
Notes
294+
-----
295+
The code is commented throughout to help describe the algorithm and its operation.
296+
297+
To summarize, the algorithm first computes the genetic distance in cMs between SNPs
298+
common to all individuals using the specified genetic map.
299+
300+
Then, individuals are compared for whether they share one or two alleles for each SNP in
301+
common; in this manner, where all individuals share one chromosome, for example, there
302+
will be several SNPs in a row where at least one allele is shared between individuals for
303+
each SNP. The ``cM_threshold`` is then applied to each of these "matching segments" to
304+
determine whether the segment could be a potential shared DNA segment (i.e., whether each
305+
segment has a cM value greater than the threshold).
306+
307+
The matching segments that passed the ``cM_threshold`` are then checked to see if they
308+
are adjacent to another matching segment, and if so, the segments are stitched together,
309+
and the single SNP separating the segments is flagged as potentially discrepant. (This
310+
means that multiple smaller matching segments passing the ``cM_threshold`` could be
311+
stitched, identifying the SNP between each segment as discrepant.)
312+
313+
Next, the ``snp_threshold`` is applied to each segment to ensure there are enough SNPs in
314+
the segment and the segment is not only a few SNPs in a region with a high recombination
315+
rate; for each segment that passes this test, we have a segment of shared DNA, and the
316+
total cMs for this segment are computed.
317+
318+
Finally, discrepant SNPs are checked to ensure that only SNPs internal to a shared DNA
319+
segment are reported as discrepant (i.e., don't report a discrepant SNP if it was part of a
320+
segment that didn't pass the ``snp_threshold``). Currently, no action other than reporting
321+
is taken on discrepant SNPs.
322+
293323
Parameters
294324
----------
295325
individuals : iterable of Individuals
@@ -330,15 +360,18 @@ def find_shared_dna(
330360
two_chrom_discrepant_snps (pandas.Index)
331361
discrepant SNPs discovered while finding shared DNA on two chromosomes
332362
"""
363+
# initialize all objects to be returned to be empty to start
333364
one_chrom_shared_dna = pd.DataFrame()
334365
two_chrom_shared_dna = pd.DataFrame()
335366
one_chrom_shared_genes = pd.DataFrame()
336367
two_chrom_shared_genes = pd.DataFrame()
337368
one_chrom_discrepant_snps = pd.Index([])
338369
two_chrom_discrepant_snps = pd.Index([])
339370

371+
# ensure that all individuals have SNPs that are mapped relative to Build 37
340372
self._remap_snps_to_GRCh37(individuals)
341373

374+
# return if there aren't enough individuals to compare
342375
if len(individuals) < 2:
343376
logger.warning("find_shared_dna requires two or more individuals...")
344377
return self._find_shared_dna_return_helper(
@@ -350,6 +383,7 @@ def find_shared_dna(
350383
two_chrom_discrepant_snps,
351384
)
352385

386+
# load the specified genetic map (one genetic map for each chromosome)
353387
genetic_map_dfs = self._resources.get_genetic_map(genetic_map)
354388

355389
if len(genetic_map_dfs) == 0:
@@ -362,26 +396,34 @@ def find_shared_dna(
362396
two_chrom_discrepant_snps,
363397
)
364398

365-
cols = ["genotype{}".format(str(i)) for i in range(len(individuals))]
399+
# generate a list of dynamically named columns for each individual's genotype
400+
# (e.g., genotype0, genotype1, etc).
401+
cols = [f"genotype{str(i)}" for i in range(len(individuals))]
366402

403+
# set the reference SNPs to compare to be that of the first individual
367404
df = individuals[0].snps
368405
df = df.rename(columns={"genotype": cols[0]})
369406

407+
# build-up a dataframe of SNPs that are common to all individuals
370408
for i, ind in enumerate(individuals[1:]):
371409
# join SNPs for all individuals
372410
df = df.join(ind.snps["genotype"], how="inner")
373411
df = df.rename(columns={"genotype": cols[i + 1]})
374412

413+
# set a flag for if one individuals is male (i.e., only one chromosome match on the X
414+
# chromosome is possible in the non-PAR region)
375415
one_x_chrom = self._is_one_individual_male(individuals)
376416

417+
# create tasks to compute the genetic distances (cMs) between each SNP on each chromosome
377418
tasks = []
378-
379419
chroms_to_drop = []
380420
for chrom in df["chrom"].unique():
381421
if chrom not in genetic_map_dfs.keys():
382422
chroms_to_drop.append(chrom)
383423
continue
384424

425+
# each task requires the genetic map for the chromosome and the positions of all SNPs
426+
# in common on that chromosome
385427
tasks.append(
386428
{
387429
"genetic_map": genetic_map_dfs[chrom],
@@ -390,19 +432,24 @@ def find_shared_dna(
390432
}
391433
)
392434

393-
# drop chromosomes without genetic distance data
435+
# drop chromosomes without genetic distance data (e.g., chroms MT, PAR, etc.)
394436
for chrom in chroms_to_drop:
395437
df = df.drop(df.loc[df["chrom"] == chrom].index)
396438

397-
# determine the genetic distance between each SNP using the HapMap Phase II genetic map
439+
# determine the genetic distance between each SNP using the specified genetic map
398440
snp_distances = map(self._compute_snp_distances, tasks)
399441
snp_distances = pd.concat(snp_distances)
442+
443+
# extract the column "cM_from_prev_snp" from the result and add that to the dataframe
444+
# of SNPs common to all individuals; now we have the genetic distance between each SNP
400445
df["cM_from_prev_snp"] = snp_distances["cM_from_prev_snp"]
401446

447+
# now we apply a mask for whether all individuals match on one or two chromosomes...
448+
# first, set all rows for these columns to True
402449
df["one_chrom_match"] = True
403450
df["two_chrom_match"] = True
404-
405-
# determine where individuals share an allele on one chromosome
451+
# determine where individuals share an allele on one chromosome (i.e., set to False when
452+
# at least one allele doesn't match for all individuals)
406453
for genotype1, genotype2 in combinations(cols, 2):
407454
df.loc[
408455
~df[genotype1].isnull()
@@ -414,7 +461,8 @@ def find_shared_dna(
414461
"one_chrom_match",
415462
] = False
416463

417-
# determine where individuals share alleles on both chromosomes
464+
# determine where individuals share alleles on two chromosomes (i.e., set to False when
465+
# two alleles don't match for all individuals)
418466
for genotype1, genotype2 in combinations(cols, 2):
419467
df.loc[
420468
~df[genotype1].isnull()
@@ -430,12 +478,14 @@ def find_shared_dna(
430478
# genotype columns are no longer required for calculation
431479
df = df.drop(cols, axis=1)
432480

481+
# find shared DNA on one chrom
433482
one_chrom_shared_dna, one_chrom_discrepant_snps = self._find_shared_dna_helper(
434483
df[["chrom", "pos", "cM_from_prev_snp", "one_chrom_match"]],
435484
cM_threshold,
436485
snp_threshold,
437486
one_x_chrom,
438487
)
488+
# find shared DNA on two chroms
439489
two_chrom_shared_dna, two_chrom_discrepant_snps = self._find_shared_dna_helper(
440490
df[["chrom", "pos", "cM_from_prev_snp", "two_chrom_match"]],
441491
cM_threshold,
@@ -646,7 +696,7 @@ def _is_one_individual_male(self, individuals):
646696
return False
647697

648698
def _compute_snp_distances(self, task):
649-
""" Compute genetic distance between SNPs.
699+
""" Compute genetic distance in cMs between SNPs.
650700
651701
Parameters
652702
----------
@@ -732,8 +782,8 @@ def _compute_shared_dna(self, task):
732782
"two_chrom_match",
733783
] = False
734784

735-
# get consecutive strings of trues
736-
# http://stackoverflow.com/a/17151327
785+
# get consecutive strings of Trues, for where there's a one or two chrom match between
786+
# individuals, depending on the task; http://stackoverflow.com/a/17151327
737787
a = df.loc[(df["chrom"] == chrom)][match_col].values
738788
a = np.r_[a, False]
739789
a_rshifted = np.roll(a, 1)
@@ -744,8 +794,10 @@ def _compute_shared_dna(self, task):
744794
a_ends = np.nonzero(ends)[0]
745795
a_ends = np.reshape(a_ends, (len(a_ends), 1))
746796

797+
# get the matching segments
747798
matches = np.hstack((a_starts, a_ends))
748799

800+
# compute total cMs for each matching segment
749801
c = np.r_[0, df.loc[(df["chrom"] == chrom)]["cM_from_prev_snp"].cumsum()][
750802
matches
751803
]

src/lineage/resources.py

Lines changed: 13 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -133,6 +133,7 @@ def get_genetic_map(self, genetic_map):
133133

134134
def get_genetic_map_HapMapII_GRCh37(self):
135135
""" Get International HapMap Consortium HapMap Phase II genetic map for Build 37.
136+
[#InternationalHapMapConsortium]_ [#Auton2010]_
136137
137138
Returns
138139
-------
@@ -141,10 +142,10 @@ def get_genetic_map_HapMapII_GRCh37(self):
141142
142143
References
143144
----------
144-
1. "The International HapMap Consortium (2007). A second generation human haplotype
145-
map of over 3.1 million SNPs. Nature 449: 851-861."
146-
2. "The map was generated by lifting the HapMap Phase II genetic map from build 35 to
147-
GRCh37. The original map was generated using LDhat as described in the 2007 HapMap
145+
.. [#InternationalHapMapConsortium] "The International HapMap Consortium (2007). A second generation human
146+
haplotype map of over 3.1 million SNPs. Nature 449: 851-861."
147+
.. [#Auton2010] "The map was generated by lifting the HapMap Phase II genetic map from build
148+
35 to GRCh37. The original map was generated using LDhat as described in the 2007 HapMap
148149
paper (Nature, 18th Sept 2007). The conversion from b35 to GRCh37 was achieved using
149150
the UCSC liftOver tool. Adam Auton, 08/12/2010"
150151
"""
@@ -157,11 +158,12 @@ def get_genetic_map_HapMapII_GRCh37(self):
157158
return self._genetic_map
158159

159160
def get_genetic_map_1000G_GRCh37(self, pop):
160-
""" Get population-specific 1000 Genomes Project genetic map.
161+
""" Get population-specific 1000 Genomes Project genetic map. [#Auton2013]_ [#Auton2015]_
161162
162163
Notes
163164
-----
164-
From `README_omni_recombination_20130507 <ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130507_omni_recombination_rates/README_omni_recombination_20130507>`_:
165+
From `README_omni_recombination_20130507
166+
<ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130507_omni_recombination_rates/README_omni_recombination_20130507>`_ [#Auton2013]_ :
165167
166168
Genetic maps generated from the 1000G phased OMNI data.
167169
@@ -185,9 +187,6 @@ def get_genetic_map_1000G_GRCh37(self, pop):
185187
estimate of 4Ne, allowing the population based rates to be converted to
186188
per-generation rates.
187189
188-
Adam Auton ([email protected])
189-
May 7th, 2013
190-
191190
Returns
192191
-------
193192
dict
@@ -196,11 +195,11 @@ def get_genetic_map_1000G_GRCh37(self, pop):
196195
197196
References
198197
----------
199-
1. Adam Auton ([email protected]), May 7th, 2013
200-
2. The 1000 Genomes Project Consortium., Corresponding authors., Auton, A. et al. A
201-
global reference for human genetic variation. Nature 526, 68–74 (2015).
198+
.. [#Auton2013] Adam Auton, May 7th, 2013
199+
.. [#Auton2015] The 1000 Genomes Project Consortium., Corresponding authors.,
200+
Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
202201
https://doi.org/10.1038/nature15393
203-
3. https://github.com/joepickrell/1000-genomes-genetic-maps
202+
.. [#Pickrell] https://github.com/joepickrell/1000-genomes-genetic-maps
204203
"""
205204
if self._genetic_map_name != pop:
206205
self._genetic_map = self._load_genetic_map_1000G_GRCh37(
@@ -415,7 +414,7 @@ def _load_cytoBand(filename):
415414
df : pandas.DataFrame
416415
cytoBand table
417416
"""
418-
# adapted from chromosome plotting code (see [1]_)
417+
# adapted from chromosome plotting code (see get_cytoBand_hg19 Ref. 1.)
419418
df = pd.read_csv(
420419
filename, names=["chrom", "start", "end", "name", "gie_stain"], sep="\t"
421420
)

src/lineage/visualization.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2,11 +2,11 @@
22
33
Notes
44
-----
5-
Adapted from Ryan Dale's GitHub Gist for plotting chromosome features. [1]_
5+
Adapted from Ryan Dale's GitHub Gist for plotting chromosome features. [#Dale]_
66
77
References
88
----------
9-
.. [1] Ryan Dale, GitHub Gist,
9+
.. [#Dale] Ryan Dale, GitHub Gist,
1010
https://gist.github.com/daler/c98fc410282d7570efc3#file-ideograms-py
1111
1212
"""

0 commit comments

Comments
 (0)