Commit f6ad1d3

Merge pull request #116 from sbslee/0.22.0-dev
0.22.0 dev
2 parents 112fbe8 + 7ff4301 commit f6ad1d3

16 files changed: +147 -60 lines changed

.readthedocs.yml

+6 -1
@@ -5,6 +5,12 @@
 # Required
 version: 2

+# Set the OS, Python version and other tools you might need
+build:
+  os: ubuntu-22.04
+  tools:
+    python: "3.7"
+
 # Build documentation in the docs/ directory with Sphinx
 sphinx:
   configuration: docs/conf.py
@@ -19,6 +25,5 @@ sphinx:

 # Optionally set the version of Python and requirements required to build your docs
 python:
-  version: 3.7
   install:
     - requirements: docs/requirements.txt

CHANGELOG.rst

+7
@@ -1,6 +1,13 @@
 Changelog
 *********

+0.22.0 (2023-12-11)
+-------------------
+
+* :issue:`100`: Add new method :meth:`sdk.utils.get_bundle_path` to enable customization of the ``pypgx-bundle`` directory's location instead of the user's home directory.
+* :issue:`114`: Fix bug in :meth:`api.core.get_recommendation` method where string ``'None'`` was treated as missing value by ``pandas.read_csv`` version 2.0 or higher.
+* :issue:`113`: Fix bug in :meth:`api.utils.estimate_phase_beagle` method where Beagle's expectation-maximization algorithm estimated a parameter value that was outside the permitted range.
+
 0.21.0 (2023-08-25)
 -------------------

README.rst

+13 -3
@@ -229,19 +229,29 @@ structural variant classifier files in PyPGx are moved to the
 (only those files are moved; other files such as ``allele-table.csv`` and
 ``variant-table.csv`` are intact). Therefore, the user must clone the
 ``pypgx-bundle`` repository with matching PyPGx version to their home
-directory in order for PyPGx to correctly access the moved files:
+directory in order for PyPGx to correctly access the moved files (i.e. replace
+``x.x.x`` with the version number of PyPGx you're using, such as ``0.18.0``):

 .. code-block:: text

    $ cd ~
-   $ git clone --branch 0.12.0 --depth 1 https://github.com/sbslee/pypgx-bundle
+   $ git clone --branch x.x.x --depth 1 https://github.com/sbslee/pypgx-bundle

 This is undoubtedly annoying, but absolutely necessary for portability
 reasons because PyPGx has been growing exponentially in file size due to the
 increasing number of genes supported and their variation complexity, to the
 point where it now exceeds upload size limit for PyPI (100 Mb). After removal
 of those files, the size of PyPGx has reduced from >100 Mb to <1 Mb.

+Starting with version 0.22.0, you can now specify a custom location for the
+``pypgx-bundle`` directory instead of using the home directory. This can be
+achieved by setting the bundle location using the ``PYPGX_BUNDLE`` environment
+variable:
+
+.. code-block:: text
+
+   $ export PYPGX_BUNDLE=/path/to/pypgx-bundle
+
 Structural variation detection
 ==============================

@@ -756,7 +766,7 @@ For getting help on the CLI:
        test-cnv-caller     Test CNV caller for target gene.
        train-cnv-caller    Train CNV caller for target gene.

-   optional arguments:
+   options:
      -h, --help            Show this help message and exit.
      -v, --version         Show the version number and exit.
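
The lookup described by the new README paragraph can be sketched in Python. This is only an illustration, not PyPGx's actual implementation: the body of ``sdk.get_bundle_path`` is not part of this diff, so the helper name ``resolve_bundle_path`` and the fallback rule are assumptions drawn from the README text (``PYPGX_BUNDLE`` takes precedence, the home directory remains the default). The ``1kgp/<assembly>/<gene>.vcf.gz`` layout is taken from the ``pypgx/api/utils.py`` changes in this commit.

```python
import os


def resolve_bundle_path(environ=None):
    """Return the assumed pypgx-bundle location.

    Hypothetical helper: prefers the PYPGX_BUNDLE environment variable and
    falls back to ~/pypgx-bundle when the variable is unset or empty.
    """
    environ = os.environ if environ is None else environ
    custom = environ.get("PYPGX_BUNDLE")
    if custom:
        return custom
    return os.path.join(os.path.expanduser("~"), "pypgx-bundle")


def default_panel(assembly, gene, environ=None):
    # Same path layout as the diff: <bundle>/1kgp/<assembly>/<gene>.vcf.gz
    return f"{resolve_bundle_path(environ)}/1kgp/{assembly}/{gene}.vcf.gz"
```

Passing the environment as an argument keeps the sketch easy to exercise without modifying the real process environment.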

docs/cli.rst

+6 -6
@@ -65,7 +65,7 @@ For getting help on the CLI:
        test-cnv-caller     Test CNV caller for target gene.
        train-cnv-caller    Train CNV caller for target gene.

-   optional arguments:
+   options:
      -h, --help            Show this help message and exit.
      -v, --version         Show the version number and exit.

@@ -409,7 +409,7 @@ estimate-phase-beagle
      -h, --help            Show this help message and exit.
      --panel PATH          VCF file (compressed or uncompressed) corresponding to a
                            reference haplotype panel. By default, the 1KGP panel in
-                           the ~/pypgx-bundle directory will be used.
+                           the pypgx-bundle directory will be used.
      --impute              Perform imputation of missing genotypes.

 filter-samples
@@ -700,7 +700,7 @@ predict-cnv
    Optional arguments:
      -h, --help            Show this help message and exit.
      --cnv-caller PATH     Archive file with the semantic type Model[CNV]. By
-                           default, a pre-trained CNV caller in the ~/pypgx-bundle
+                           default, a pre-trained CNV caller in the pypgx-bundle
                            directory will be used.

 prepare-depth-of-coverage
@@ -813,7 +813,7 @@ run-chip-pipeline
                            (choices: 'GRCh37', 'GRCh38').
      --panel PATH          VCF file corresponding to a reference haplotype panel
                            (compressed or uncompressed). By default, the 1KGP
-                           panel in the ~/pypgx-bundle directory will be used.
+                           panel in the pypgx-bundle directory will be used.
      --impute              Perform imputation of missing genotypes.
      --force               Overwrite output directory if it already exists.
      --samples TEXT [TEXT ...]
@@ -911,7 +911,7 @@ run-ngs-pipeline
                            (choices: 'GRCh37', 'GRCh38').
      --panel PATH          VCF file corresponding to a reference haplotype panel
                            (compressed or uncompressed). By default, the 1KGP panel
-                           in the ~/pypgx-bundle directory will be used.
+                           in the pypgx-bundle directory will be used.
      --force               Overwrite output directory if it already exists.
      --samples TEXT [TEXT ...]
                            Specify which samples should be included for analysis
@@ -926,7 +926,7 @@ run-ngs-pipeline
      --do-not-plot-allele-fraction
                            Do not plot allele fraction profile.
      --cnv-caller PATH     Archive file with the semantic type Model[CNV]. By
-                           default, a pre-trained CNV caller in the ~/pypgx-bundle
+                           default, a pre-trained CNV caller in the pypgx-bundle
                            directory will be used.

 [Example] To genotype the CYP3A5 gene, which does not have SV, from WGS data:

docs/create.py

+12 -2
@@ -256,19 +256,29 @@
 (only those files are moved; other files such as ``allele-table.csv`` and
 ``variant-table.csv`` are intact). Therefore, the user must clone the
 ``pypgx-bundle`` repository with matching PyPGx version to their home
-directory in order for PyPGx to correctly access the moved files:
+directory in order for PyPGx to correctly access the moved files (i.e. replace
+``x.x.x`` with the version number of PyPGx you're using, such as ``0.18.0``):

 .. code-block:: text

    $ cd ~
-   $ git clone --branch 0.12.0 --depth 1 https://github.com/sbslee/pypgx-bundle
+   $ git clone --branch x.x.x --depth 1 https://github.com/sbslee/pypgx-bundle

 This is undoubtedly annoying, but absolutely necessary for portability
 reasons because PyPGx has been growing exponentially in file size due to the
 increasing number of genes supported and their variation complexity, to the
 point where it now exceeds upload size limit for PyPI (100 Mb). After removal
 of those files, the size of PyPGx has reduced from >100 Mb to <1 Mb.

+Starting with version 0.22.0, you can now specify a custom location for the
+``pypgx-bundle`` directory instead of using the home directory. This can be
+achieved by setting the bundle location using the ``PYPGX_BUNDLE`` environment
+variable:
+
+.. code-block:: text
+
+   $ export PYPGX_BUNDLE=/path/to/pypgx-bundle
+
 Structural variation detection
 ==============================

docs/faq.rst

+21
@@ -143,3 +143,24 @@ CYP2D6*21, PyPGx will first check which of the two haplotypes contains
 2851C>T and 4181G>C and then assign 2580_2581insC to that haplotype. Note
 that the phase-by-extension algorithm can handle multiallelic sites in
 addition to biallelic sites.
+
+Genotyping multiple genes
+=========================
+
+Many users have asked if it's possible to genotype multiple genes
+simultaneously using a pipeline command (e.g. :command:`run-ngs-pipeline`).
+The short answer is no; all the genotyping pipelines are designed to
+investigate a single gene at a time. However, one can easily loop through the
+target genes to achieve the same results:
+
+.. code-block:: text
+
+   for gene in `pypgx create-regions-bed --target-genes | awk '{print $4}'`
+   do
+       pypgx run-ngs-pipeline \
+       $gene \
+       grch37-$gene-pipeline \
+       --variants grch37-variants.vcf.gz \
+       --depth-of-coverage grch37-depth-of-coverage.zip \
+       --control-statistics grch37-control-statistics-VDR.zip
+   done
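
The same per-gene loop can also be driven from Python. The sketch below is illustrative only: ``build_ngs_command`` and ``genotype_genes`` are hypothetical helper names, the file names are copied from the FAQ example, and the gene list must be supplied by the caller (the FAQ derives it from ``pypgx create-regions-bed --target-genes``).

```python
import subprocess


def build_ngs_command(gene, assembly="grch37"):
    """Assemble the run-ngs-pipeline call from the FAQ example for one gene."""
    return [
        "pypgx", "run-ngs-pipeline",
        gene,
        f"{assembly}-{gene}-pipeline",
        "--variants", f"{assembly}-variants.vcf.gz",
        "--depth-of-coverage", f"{assembly}-depth-of-coverage.zip",
        "--control-statistics", f"{assembly}-control-statistics-VDR.zip",
    ]


def genotype_genes(genes, run=subprocess.run):
    # One pipeline invocation per gene; check=True stops at the first failure.
    for gene in genes:
        run(build_ngs_command(gene), check=True)


# Example call with an illustrative subset of genes (requires pypgx on PATH):
# genotype_genes(["CYP2D6", "CYP3A5"])
```

Building each command as a list avoids shell quoting issues, and the ``run`` parameter lets the loop be exercised without launching the pipeline.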

pypgx/api/core.py

+5 -1
@@ -504,7 +504,11 @@ def get_paralog(gene):
     >>> import pypgx
     >>> pypgx.get_paralog('CYP2D6')
     'CYP2D7'
+    >>> pypgx.get_paralog('CYP2D7')
+    'CYP2D6'
     >>> pypgx.get_paralog('CYP2B6')
+    'CYP2B7'
+    >>> pypgx.get_paralog('CYP2E1')
     ''
     """
     df = load_gene_table()
@@ -1286,7 +1290,7 @@ def load_recommendation_table():
     4 tacrolimus CYP3A5 Indeterminate None None None
     """
     b = BytesIO(pkgutil.get_data(__name__, 'data/recommendation-table.csv'))
-    return pd.read_csv(b)
+    return pd.read_csv(b, na_filter=False)

 def load_variant_table():
     """
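
The effect of ``na_filter=False`` can be shown on a toy table. This is a minimal sketch: the column names and the row are illustrative and not the real contents of ``recommendation-table.csv``. With NA filtering disabled, pandas performs no missing-value detection, so a literal ``None`` string stays a string.

```python
from io import StringIO

import pandas as pd

# Toy CSV in which the literal string 'None' is a real value (illustrative data).
csv_text = (
    "Drug,Gene,Phenotype,Recommendation\n"
    "tacrolimus,CYP3A5,Indeterminate,None\n"
)

# With NA filtering disabled, no value is converted to NaN.
kept = pd.read_csv(StringIO(csv_text), na_filter=False)

# keep_default_na=False is a narrower alternative: the built-in NA strings are
# ignored, but values passed explicitly via na_values would still be converted.
also_kept = pd.read_csv(StringIO(csv_text), keep_default_na=False)

print(kept.loc[0, "Recommendation"])  # prints: None
```

Disabling the filter entirely suits a table that never encodes missing data; ``keep_default_na=False`` may be preferable when some explicit marker should still count as missing.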

pypgx/api/pipeline.py

+3 -3
@@ -32,7 +32,7 @@ def run_chip_pipeline(
         Reference genome assembly.
     panel : str, optional
         VCF file corresponding to a reference haplotype panel (compressed or
-        uncompressed). By default, the 1KGP panel in the ``~/pypgx-bundle``
+        uncompressed). By default, the 1KGP panel in the ``pypgx-bundle``
         directory will be used.
     impute : bool, default: False
         If True, perform imputation of missing genotypes.
@@ -166,7 +166,7 @@ def run_ngs_pipeline(
         Reference genome assembly.
     panel : str, optional
         VCF file corresponding to a reference haplotype panel (compressed or
-        uncompressed). By default, the 1KGP panel in the ``~/pypgx-bundle``
+        uncompressed). By default, the 1KGP panel in the ``pypgx-bundle``
         directory will be used.
     force : bool, default : False
         Overwrite output directory if it already exists.
@@ -184,7 +184,7 @@ def run_ngs_pipeline(
         Do not plot allele fraction profile.
     cnv_caller : str or pypgx.Archive, optional
         Archive file or object with the semantic type Model[CNV]. By default,
-        a pre-trained CNV caller in the ``~/pypgx-bundle`` directory will be
+        a pre-trained CNV caller in the ``pypgx-bundle`` directory will be
         used.
     """
     if not core.is_target_gene(gene):

pypgx/api/utils.py

+46 -36
@@ -794,7 +794,7 @@ def estimate_phase_beagle(
         VCF's contig names.
     panel : str, optional
         VCF file corresponding to a reference haplotype panel (compressed or
-        uncompressed). By default, the 1KGP panel in the ``~/pypgx-bundle``
+        uncompressed). By default, the 1KGP panel in the ``pypgx-bundle``
         directory will be used.
     impute : bool, default: False
         If True, perform imputation of missing genotypes.
@@ -819,8 +819,7 @@ def estimate_phase_beagle(
     metadata['Program'] = 'Beagle'

     if panel is None:
-        home = os.path.expanduser('~')
-        panel = f'{home}/pypgx-bundle/1kgp/{assembly}/{gene}.vcf.gz'
+        panel = f'{sdk.get_bundle_path()}/1kgp/{assembly}/{gene}.vcf.gz'

     has_chr_prefix = pyvcf.has_chr_prefix(panel)

@@ -839,6 +838,31 @@ def estimate_phase_beagle(
     if metadata['Platform'] == 'Chip':
         vf1 = vf1.filter_gsa()

+    def run_beagle(vf1, em):
+        with tempfile.TemporaryDirectory() as t:
+            vf1.to_file(f'{t}/input.vcf')
+            command = [
+                'java', '-Xmx2g', '-jar', beagle,
+                f'gt={t}/input.vcf',
+                f'chrom={region}',
+                f'ref={panel}',
+                f'out={t}/output',
+                f'impute={str(impute).lower()}',
+                f'em={em}'
+            ]
+            subprocess.run(
+                command,
+                check=True,
+                stdout=subprocess.DEVNULL,
+                stderr=subprocess.PIPE
+            )
+            vf3 = pyvcf.VcfFrame.from_file(f'{t}/output.vcf.gz')
+        if common_samples:
+            vf3 = vf3.rename({f'{x}_TEMP': x for x in common_samples})
+        if has_chr_prefix:
+            vf3 = vf3.update_chr_prefix('remove')
+        return vf3
+
     # Beagle will throw an error if there is only one marker overlapping with
     # the reference panel in a given window. This typically occurs when the
     # input VCF has very few markers or only one marker. Therefore, these
@@ -863,39 +887,26 @@ def estimate_phase_beagle(
     common_samples = list(set(vf1.samples) & set(vf2.samples))
     if common_samples:
         vf1 = vf1.rename({x: f'{x}_TEMP' for x in common_samples})
-    with tempfile.TemporaryDirectory() as t:
-        vf1.to_file(f'{t}/input.vcf')
-        command = [
-            'java', '-Xmx2g', '-jar', beagle,
-            f'gt={t}/input.vcf',
-            f'chrom={region}',
-            f'ref={panel}',
-            f'out={t}/output',
-            f'impute={str(impute).lower()}'
-        ]
-        try:
-            subprocess.run(
-                command,
-                check=True,
-                stdout=subprocess.DEVNULL,
-                stderr=subprocess.PIPE
-            )
-            vf3 = pyvcf.VcfFrame.from_file(f'{t}/output.vcf.gz')
-            if common_samples:
-                vf3 = vf3.rename({f'{x}_TEMP': x for x in common_samples})
-            if has_chr_prefix:
-                vf3 = vf3.update_chr_prefix('remove')
+
+    try:
+        vf3 = run_beagle(vf1, em='true')
+    except subprocess.CalledProcessError as e:
+        message = e.stderr.decode()
         # Beagle may throw an error even when multiple overlapping markers
         # exist because they are too distant from each other -- that is,
         # located in separate haplotype windows.
-        except subprocess.CalledProcessError as e:
-            message = e.stderr.decode()
-            if "Window has only one position" in message:
-                warnings.warn("Beagle: Window has only one position")
-                vf3 = pyvcf.VcfFrame([], vf1.df[0:0])
-            else:
-                print(message)
-                raise e
+        if "Window has only one position" in message:
+            warnings.warn("Beagle: Window has only one position")
+            vf3 = pyvcf.VcfFrame([], vf1.df[0:0])
+        # Beagle will throw an error if the expectation-maximization
+        # algorithm estimates a parameter value outside the permitted
+        # range. When this happens, we skip the expectation-maximization.
+        elif "IllegalArgumentException: 1.0" in message:
+            warnings.warn("Beagle: Expectation-maximization skipped")
+            vf3 = run_beagle(vf1, em='false')
+        else:
+            print(message)
+            raise e

     return sdk.Archive(metadata, vf3)

@@ -1203,7 +1214,7 @@ def predict_cnv(copy_number, cnv_caller=None):
         Archive file or object with the semantic type CovFrame[CopyNumber].
     cnv_caller : str or pypgx.Archive, optional
         Archive file or object with the semantic type Model[CNV]. By default,
-        a pre-trained CNV caller in the ``~/pypgx-bundle`` directory will be
+        a pre-trained CNV caller in the ``pypgx-bundle`` directory will be
         used.

     Returns
@@ -1218,8 +1229,7 @@ def predict_cnv(copy_number, cnv_caller=None):

     gene = copy_number.metadata['Gene']
     assembly = copy_number.metadata['Assembly']
-    home = os.path.expanduser('~')
-    model_file = f'{home}/pypgx-bundle/cnv/{assembly}/{gene}.zip'
+    model_file = f'{sdk.get_bundle_path()}/cnv/{assembly}/{gene}.zip'

     if cnv_caller is None:
         cnv_caller = sdk.Archive.from_file(model_file)
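
The retry logic added to ``estimate_phase_beagle`` follows a general pattern: run once with expectation-maximization enabled, and rerun with it disabled only when stderr carries the specific error marker. The sketch below isolates that pattern. ``run_with_em_fallback`` is a hypothetical name, and ``run_beagle`` stands for any callable that raises ``subprocess.CalledProcessError`` with captured stderr, so Beagle and a reference panel are not needed to try it. The "Window has only one position" branch from the diff is left out for brevity.

```python
import subprocess
import warnings

# stderr marker checked in the diff for the out-of-range EM parameter
EM_ERROR = "IllegalArgumentException: 1.0"


def run_with_em_fallback(run_beagle):
    """Call run_beagle(em='true'); retry with em='false' on the known EM error.

    run_beagle is a stand-in for the real helper: any callable that raises
    subprocess.CalledProcessError with stderr bytes when the run fails.
    """
    try:
        return run_beagle(em="true")
    except subprocess.CalledProcessError as e:
        message = (e.stderr or b"").decode()
        if EM_ERROR in message:
            warnings.warn("Beagle: Expectation-maximization skipped")
            return run_beagle(em="false")
        # Unrecognized failures propagate unchanged.
        raise
```

Matching on the stderr text keeps the retry narrow: any other failure still surfaces to the caller instead of being retried silently.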

pypgx/cli/estimate_phase_beagle.py

+1 -1
@@ -41,7 +41,7 @@ def create_parser(subparsers):
         help=
 """VCF file (compressed or uncompressed) corresponding to a
 reference haplotype panel. By default, the 1KGP panel in
-the ~/pypgx-bundle directory will be used."""
+the pypgx-bundle directory will be used."""
     )
     parser.add_argument(
         '--impute',

pypgx/cli/predict_cnv.py

+1 -1
@@ -39,7 +39,7 @@ def create_parser(subparsers):
         metavar='PATH',
         help=
 """Archive file with the semantic type Model[CNV]. By
-default, a pre-trained CNV caller in the ~/pypgx-bundle
+default, a pre-trained CNV caller in the pypgx-bundle
 directory will be used."""
     )

pypgx/cli/run_chip_pipeline.py

+1 -1
@@ -57,7 +57,7 @@ def create_parser(subparsers):
         help=
 """VCF file corresponding to a reference haplotype panel
 (compressed or uncompressed). By default, the 1KGP
-panel in the ~/pypgx-bundle directory will be used."""
+panel in the pypgx-bundle directory will be used."""
     )
     parser.add_argument(
         '--impute',
