
Commit 68fac2e

Merge pull request #85 from qiyunlab/2.0b3
2.0b3
2 parents 49a4894 + 31e28a2 commit 68fac2e

File tree

13 files changed: +458 −269 lines


CHANGELOG.md

Lines changed: 12 additions & 1 deletion
@@ -1,16 +1,27 @@
  Change Log
  ==========

- ## Version 2.0b3 (12/27/2020)
+ ## Version 2.0b3 (2021, ongoing)

  ### Added
  - Predicted HGT list now includes potential donors. Users can optionally specify a taxonomic rank at which they will be reported.
  - A quick-start guide added to the homepage.
+ - The database construction task can now resume from an interrupted run and reuse already downloaded files, while detecting and skipping corrupt files.
+ - Added an option `--above` to sample genomes at all ranks from the designated one up to phylum during database construction.
+ - Added an option `--typemater` to include NCBI-defined type materials in genome sampling during database construction.
+ - Added an option `--manual` to export the URLs of sampled genomes during database construction, so that the user can download them manually.
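For illustration, a sketch of a database-construction command that exercises these new options together (flag spellings follow doc/database.md in this commit; the output directory name is a placeholder):

```bash
# sample one genome per Latinate species, fill in higher ranks, keep NCBI
# reference/representative/type-material genomes, and only export URLs
hgtector database -o db_dir --cats microbe --sample 1 --rank species_latin \
    --above --reference --represent --typemater --manual
```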

  ### Changed
+ - Updated the pre-built default database to 2021-11-21 (after NCBI RefSeq release 209).
  - Repository transferred from [DittmarLab](https://github.com/DittmarLab) to [qiyunlab](https://github.com/qiyunlab).
+ - Updated recommended dependency versions; the program should remain compatible with previous versions.
  - Minor tweaks with no visible impact on program behavior.

+ ### Fixed
+ - Fixed an issue with the NCBI FTP server connection during database construction. NCBI now recommends rsync over FTP, so the download protocol has been updated accordingly.
+ - Fixed compatibility with the latest scikit-learn (1.0.1).
+ - Fixed compatibility with the latest DIAMOND (2.0.13).


  ## Version 2.0b2 (5/3/2020)

README.md

Lines changed: 6 additions & 4 deletions
@@ -30,7 +30,7 @@ References
  Set up a Conda environment and install dependencies:

  ```bash
- conda create -n hgtector python=3 pyyaml pandas matplotlib scikit-learn bioconda::diamond
+ conda create -n hgtector -c conda-forge python=3 pyyaml pandas matplotlib scikit-learn bioconda::diamond
  conda activate hgtector
  ```

@@ -40,13 +40,15 @@ Install HGTector2:
  pip install git+https://github.com/qiyunlab/HGTector.git
  ```

- Build a reference database using the default protocol:
+ Then you will be able to type `hgtector` to run the program. See [installation](doc/install.md) for more details.
+
+ Build a reference [database](doc/database.md) using the default protocol:

  ```bash
  hgtector database -o db_dir --default
  ```

- This will retrieve the latest genomic data from NCBI. If this does not work (e.g., due to network issues), or you need some customization, please read the [database](doc/database.md) page.
+ Or [download](https://www.dropbox.com/s/tszxy9etp52id3u/hgtdb_20211121.tar.xz?dl=0) a pre-built database as of 2021-11-21, and [compile](doc/database.md#Manual-compiling) it.
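As a rough sketch of that pre-built-database route (the `dl=1` query string, which asks Dropbox for a direct download, and the local file name are assumptions):

```bash
# fetch the 2021-11-21 pre-built database and unpack it
wget -O hgtdb_20211121.tar.xz "https://www.dropbox.com/s/tszxy9etp52id3u/hgtdb_20211121.tar.xz?dl=1"
tar xJf hgtdb_20211121.tar.xz
```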

  Prepare input file(s). They should be multi-Fasta files of amino acid sequences (faa). Each file represents the whole protein set of a complete or partial genome.

@@ -69,7 +71,7 @@ It is recommended that you read the [first run](doc/1strun.md), [second run](doc/2ndrun.md)

  ## License

- Copyright (c) 2013-2020, [Qiyun Zhu](mailto:[email protected]) and [Katharina Dittmar](mailto:[email protected]). Licensed under [BSD 3-clause](http://opensource.org/licenses/BSD-3-Clause). See full license [statement](LICENSE).
+ Copyright (c) 2013-2021, [Qiyun Zhu](mailto:[email protected]) and [Katharina Dittmar](mailto:[email protected]). Licensed under [BSD 3-clause](http://opensource.org/licenses/BSD-3-Clause). See full license [statement](LICENSE).


  ## Citation

doc/1strun.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ A small example is provided in the subdirectory [example](../example). The input

  Let's analyze this small example using HGTector.

- **Note**: It has been increasingly infeasible as of 2020 to run remote search through the NCBI BLAST server. If you experience a very slow run when going through this tutorial, please skip and move on to [second run](2ndrun.md).
+ **Note**: Automatic remote BLAST search using the URL API is inefficient as of 2021 and has been [deprecated](https://ncbi.github.io/blast-cloud/dev/api.html) by NCBI. Therefore this tutorial is for reference only. Unless you want to wait for long hours, please skip this tutorial and move on to the [second run](2ndrun.md).


  ## Search

doc/2ndrun.md

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ Here we will analyze a single genome of [_Escherichia coli_ O55:H7](https://www.
  Download the whole protein set of _E. coli_ O55:H7 (5,278 protein sequences in total) from the NCBI server:

  ```bash
- wget -O o55h7.faa.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/025/165/GCF_000025165.1_ASM2516v1/GCF_000025165.1_ASM2516v1_protein.faa.gz
+ wget -O o55h7.faa.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/025/165/GCF_000025165.1_ASM2516v1/GCF_000025165.1_ASM2516v1_protein.faa.gz
  ```

  You don't need to unzip the file, as HGTector can automatically parse compressed files in the format of gzip, bzip2 and lzma.
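As a quick sanity check of the download (a sketch; `zcat` assumes GNU gzip, and the expected count comes from the sentence above):

```bash
# count FASTA records in the gzipped protein set; should print 5278
zcat o55h7.faa.gz | grep -c '^>'
```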

doc/database.md

Lines changed: 53 additions & 24 deletions
@@ -1,6 +1,9 @@
  Database
  ========

+ ## Index
+ - [Overview](#overview) | [Default protocol](#default-protocol) | [Pre-built database](#pre-built-database) | [Database files](#database-files) | [Considerations](#considerations) | [Command-line reference](#command-line-reference)
+
  ## Overview

  The `database` command is an automated workflow for sampling reference genomes, downloading non-redundant protein sequences, and building local databases for sequence homology search. It provides various options for flexible customization of the database, to address specific research goals including HGT prediction or other general purposes.
@@ -9,33 +12,36 @@ The `database` command is an automated workflow for sampling reference genomes,
  hgtector database -o <output_dir> <parameters...>
  ```

- ### Default protocol
+ The workflow consists of the following steps:

- HGTector provides a default protocol for database building.
+ 1. Download NCBI taxonomy database (taxdump).
+ 2. Download NCBI RefSeq assembly summary.
+ 3. Sample genomes based on various properties and taxonomic information.
+ 4. Download protein sequences associated with sampled genomes.
+ 5. Compile local databases using DIAMOND and/or BLAST.

- ```bash
- hgtector database -o <output_dir> --default
- ```

- This will download all protein sequences of NCBI RefSeq genomes of bacteria, archaea, fungi and protozoa, keep one genome per species, plus all NCBI-defined reference and representative genomes. Finally it will attempt to compile the database using DIAMOND, if available. The command is equivalent to:
+ ## Default protocol
+
+ This will download all protein sequences of NCBI RefSeq genomes of **bacteria**, **archaea**, **fungi** and **protozoa**, keep _one genome per species_ that has a Latinate name, plus one genome per taxonomic group at higher ranks, regardless of whether that genome has a Latinate species name, plus all NCBI-defined **reference**, **representative** and **type material** genomes (prioritized during taxonomy-based sampling, and added afterwards if not sampled). Finally it will attempt to compile the database using DIAMOND, if available. The command is equivalent to:

  ```bash
- hgtector database --output <output_dir> --cats microbe --sample 1 --rank species --reference --representative --compile diamond
+ hgtector database --output <output_dir> --cats microbe --sample 1 --rank species_latin --above --reference --represent --typemater --compile diamond
  ```

- A pre-built default database as of 2019-10-21 is available for [download](https://www.dropbox.com/s/qdnfgzdcjadlm4i/hgtdb_20191021.tar.xz?dl=0). It needs to be [compiled](#Manual-compiling) using choice of aligner.

- ### Procedures
+ ## Pre-built database

- The workflow consists of the following steps:
+ A database built using the default protocol on 2021-11-21 is available for [download](https://www.dropbox.com/s/tszxy9etp52id3u/hgtdb_20211121.tar.xz?dl=0) \([MD5](https://www.dropbox.com/s/kdopz946pk088wr/hgtdb_20211121.tar.xz.md5?dl=0)\). It needs to be [compiled](#Manual-compiling) using a choice of aligner.

- 1. Download NCBI taxonomy database (taxdump).
- 2. Download NCBI RefSeq assembly summary.
- 3. Sample genomes based on various properties and taxonomic information.
- 4. Download protein sequences associated with sampled genomes.
- 5. Compile local databases using DIAMOND and/or BLAST.
+ This database, sampled from NCBI RefSeq after release 209, contains 68,977,351 unique protein sequences from 21,754 microbial genomes, representing 3 domains, 74 phyla, 145 classes, 337 orders, 783 families, 3,753 genera and 15,932 species.

- ### Database files
+ Building this database used a maximum of 63 GB memory. Searching this database using DIAMOND v2.0.13 requires ~7 GB memory.
+
+ A previous version of the database built on 2019-10-21 is available [here](https://www.dropbox.com/s/qdnfgzdcjadlm4i/hgtdb_20191021.tar.xz?dl=0).
+
+
+ ## Database files

  File or directory | Description
  --- | ---
@@ -60,6 +66,9 @@ The protein-to-TaxID map is already integrated into the compiled databases, so o

  Feel free to delete (e.g., `download/`) or compress the intermediate files (e.g., `db.faa`) to save disk space.

+
+ ## Considerations
+
  ### More examples

  ```bash
@@ -80,12 +89,31 @@ hgtector database -g gids.txt -o .

  This will only download genomes specified in the file `gids.txt`. Useful for controlled tests.

+ ### Clean up
+
+ After the database is successfully built, you may consider compressing `db.faa` and deleting `download/` (or just `download/faa/`) to save disk space. HGTector won't do this automatically.
+
  ### Break and resume

  Should any of the download steps be interrupted by, e.g., a network failure, one can resume the downloading process by re-executing the same command. The program will skip the already downloaded files in this new run. In some instances, one may need to manually remove the last file from the failed run (because that file may be corrupt) before re-running the program.

  If one wants to overwrite downloaded files (e.g., upgrading), add `--overwrite` to the command.

+ ### Manual downloading
+
+ One may want to download genomes manually in a more controlled manner, instead of letting HGTector run for hours to days to retrieve them one after another before moving to the next step. In this case, add `--manual` to the command, and the program will generate `urls.txt`, a list of URLs of the sampled genomes, and quit.
+
+ Then one can choose the most appropriate method to download them. For example, one may use the [rsync protocol](https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#protocols), as recommended by NCBI:
+
+ ```bash
+ while read url
+ do
+   rsync -Ltq rsync${url#https}/${url##*/}_protein.faa.gz download/faa/
+ done < urls.txt
+ ```
+
+ After all genomes (protein sequences) are downloaded to `download/faa/`, one may restart the program without `--manual`, and the program will take the downloaded files and move to the next step.
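If sequential fetching is too slow, a parallelized variant of the same loop is one option (a sketch; GNU `xargs` and the four-job count are assumptions, and the URL rewriting mirrors the loop above):

```bash
# run the documented rsync fetch over 4 concurrent jobs, one URL per job
xargs -a urls.txt -P 4 -I{} bash -c \
  'url="{}"; rsync -Ltq "rsync${url#https}/${url##*/}_protein.faa.gz" download/faa/'
```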

  ### Manual compiling

  The sampling & downloading steps (1-4) require extensive network traffic (usually several hours, if the bottleneck is not on the recipient side) but little local computing load, whereas the compiling step (5) requires extensive local computing power but no networking.
@@ -95,12 +123,11 @@ Therefore, it is a reasonable plan to only download database files without compiling
  For DIAMOND:

  ```bash
- echo $'accession\taccession.version\ttaxid' > prot.accession2taxid
- zcat taxon.map.gz | awk -v OFS='\t' '{split($1, a, "."); print a[1], $1, $2}' >> prot.accession2taxid
+ echo $'accession.version\ttaxid' | cat - <(zcat taxon.map.gz) > prot.accession2taxid.FULL

- diamond makedb --threads 16 --in db.faa --taxonmap prot.accession2taxid --taxonnodes taxdump/nodes.dmp --taxonnames taxdump/names.dmp --db diamond/db
+ diamond makedb --threads 16 --in db.faa --taxonmap prot.accession2taxid.FULL --taxonnodes taxdump/nodes.dmp --taxonnames taxdump/names.dmp --db diamond/db

- rm prot.accession2taxid
+ rm prot.accession2taxid.FULL
  ```

  For BLAST:
@@ -174,17 +201,19 @@ Option | Default | Description

  Option | Default | Description
  --- | --- | ---
- `-s`, `--sample` | 0 | Sample up to this number of genomes per taxonomic group at the given rank. "0" is for all (disable sampling).
- `-r`, `--rank` | species | Taxonomic rank at which subsampling will be performed.
+ `-s`, `--sample` | 0 | Sample up to this number of genomes per taxonomic group at the given rank. "0" is for all (disable sampling). Prior to sampling, genomes will be sorted by NCBI genome category: reference > representative > type material, then by assembly level: complete genome or chromosome > scaffolds > contigs. Sampling will start from the top of the list.
+ `-r`, `--rank` | species | Taxonomic rank at which subsampling will be performed. Can be any taxonomic rank defined in the NCBI taxonomy database. A special case is "species_latin", which will sample from species that have Latinate names.
+ `--above` | - | Sampling will also be performed on ranks from the one given by `-r` up to phylum (low to high). They will not overlap the already sampled ones. For example, if two _E. coli_ genomes are already sampled, no more genomes will be added when sampling within the genus _Escherichia_. This flag is useful in the case of `-r species_latin`, because some ranks above species may be undersampled.

  ### Genome sampling

  Option | Default | Description
  --- | --- | ---
  `--genbank` | - | By default the program only downloads RefSeq genomes (`GCF`). This flag will let the program also download GenBank genomes (`GCA`). But RefSeq has higher priority than GenBank if the same genome is hosted by both catalogs.
  `--complete` | - | Only include complete genomes, i.e., `assembly_level` is `Complete Genome` or `Chromosome`.
- `--reference` | - | Add NCBI-defined reference genomes to selection (after taxon sampling).
- `--representative` | - | Add NCBI-defined representative genomes to selection (after taxon sampling).
+ `--reference` | - | Include NCBI-defined reference genomes.
+ `--represent` | - | Include NCBI-defined representative genomes.
+ `--typemater` | - | Include NCBI-defined type material genomes.

  ### Taxonomic filter

doc/install.md

Lines changed: 4 additions & 4 deletions
@@ -18,7 +18,7 @@ HGTector is written in Python 3. One needs at least Python 3.6 to run the program
  ### Option 1: Through Conda (recommended)

  ```bash
- conda create -n hgtector python=3 pyyaml pandas matplotlib scikit-learn
+ conda create -n hgtector -c conda-forge python=3 pyyaml pandas matplotlib scikit-learn
  conda activate hgtector
  pip install git+https://github.com/qiyunlab/HGTector.git
  ```
@@ -49,7 +49,7 @@ conda install -c bioconda diamond blast

  HGTector has a command `database` for automated database construction. It defaults to the **NCBI** RefSeq microbial genomes and taxonomy. Meanwhile, we also provide instructions for using **GTDB** and custom databases. See [details](database.md).

- A standard database built using the default protocol on 2019-10-21 is available for [download](https://www.dropbox.com/s/qdnfgzdcjadlm4i/hgtdb_20191021.tar.xz?dl=0), together with [instruction](database.md#Manual-compiling) for compiling.
+ A standard database built using the default protocol on 2021-11-21 is available for [download](https://www.dropbox.com/s/tszxy9etp52id3u/hgtdb_20211121.tar.xz?dl=0) \([MD5](https://www.dropbox.com/s/kdopz946pk088wr/hgtdb_20211121.tar.xz.md5?dl=0)\), together with [instruction](database.md#Manual-compiling) for compiling.

  A small, pre-compiled test database is also available for [download](https://www.dropbox.com/s/46v3uc708rvc5rc/ref107.tar.xz?dl=0).

@@ -80,8 +80,8 @@ conda env remove -n hgtector --all

  ## Compatibility

- If in the future some dependencies have changes that are not compatible with the current release of HGTector, the following "safe" command can be used to install the current versions of dependencies (note: DIAMOND version is too tricky to specify).
+ If in the future some dependencies have changes that are not compatible with the current release of HGTector, the following "safe" command can be used to install the current versions of dependencies.

  ```bash
- conda create -n hgtector -c bioconda python=3.7.7 pyyaml=5.3.1 pandas=1.0.3 matplotlib=3.1.3 scikit-learn=0.22.1 diamond blast=2.9.0
+ conda create -n hgtector-dev -c conda-forge python=3.10.0 pyyaml=6.0 pandas=1.3.4 matplotlib=3.5.0 scikit-learn=1.0.1 bioconda::diamond=2.0.13 bioconda::blast=2.12.0
  ```
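To confirm the pinned environment resolves as intended, a quick check (a sketch; the environment name matches the command above, and `diamond version` prints the installed DIAMOND release):

```bash
conda activate hgtector-dev
# report the installed dependency versions
python -c "import sklearn, pandas, matplotlib, yaml; print(sklearn.__version__, pandas.__version__, matplotlib.__version__, yaml.__version__)"
diamond version
```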

hgtector/analyze.py

Lines changed: 8 additions & 8 deletions
@@ -887,9 +887,8 @@ def perform_kde(self, data):
  https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/
  """
  bw = self.bandwidth
- data_ = data[:, np.newaxis]
  scaler = StandardScaler()
- data_ = scaler.fit_transform(data[:, np.newaxis])
+ data_ = scaler.fit_transform(data.reshape(-1, 1))
  estimator = KernelDensity(kernel='gaussian')

  # grid search optimization
@@ -914,8 +913,8 @@ def perform_kde(self, data):

  # get density function
  x, y = self.density_func(data_, kde)
- x = scaler.inverse_transform(x)
- y = scaler.inverse_transform(y)
+ x = scaler.inverse_transform(x.reshape(-1, 1)).reshape(-1)
+ y = scaler.inverse_transform(y.reshape(-1, 1)).reshape(-1)
  return x, y, bw

  @staticmethod
@@ -1127,7 +1126,7 @@ def smart_kde(self, group):
  """
  data = self.df[group].values
  scaler = StandardScaler()
- data_ = scaler.fit_transform(data[:, np.newaxis])
+ data_ = scaler.fit_transform(data.reshape(-1, 1))
  estimator = KernelDensity(kernel='gaussian')
  bwspace = np.logspace(0, -1, self.bw_steps)

@@ -1137,8 +1136,8 @@ def smart_kde(self, group):
  setattr(estimator, 'bandwidth', bw)
  kde = estimator.fit(data_)
  x, y = self.density_func(data_, kde)
- x = scaler.inverse_transform(x)
- y = scaler.inverse_transform(y)
+ x = scaler.inverse_transform(x.reshape(-1, 1)).reshape(-1)
+ y = scaler.inverse_transform(y.reshape(-1, 1)).reshape(-1)
  try:
      peak, valley = self.first_hill(x, y)
  except ValueError:
@@ -1177,7 +1176,8 @@ def calc_cluster_props(self):
  # calculate centroid
  clf = NearestCentroid()
  clf.fit(data_, self.df['hgt'])
- cent = scaler.inverse_transform(clf.centroids_[1])
+ cent = scaler.inverse_transform(clf.centroids_[1].reshape(
+     -1, data_.shape[1])).reshape(-1)
  return cent

  def refine_cluster(self, cent):
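These `reshape` edits track the stricter input validation in recent scikit-learn; a standalone sketch of the behavior (assuming numpy and scikit-learn ≥ 1.0 are installed; the values are arbitrary):

```bash
python - <<'EOF'
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([1.0, 2.0, 3.0, 4.0])
scaler = StandardScaler()
# fit_transform expects a 2-D array: reshape(-1, 1) makes a single-feature column
data_ = scaler.fit_transform(data.reshape(-1, 1))
scaled = data_.reshape(-1)  # a 1-D vector, like the grids handled in analyze.py
# inverse_transform likewise validates 2-D input in scikit-learn 1.0+,
# hence the reshape before and the flatten after
x = scaler.inverse_transform(scaled.reshape(-1, 1)).reshape(-1)
print(x)  # ~[1. 2. 3. 4.]
EOF
```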

hgtector/config.yml

Lines changed: 4 additions & 4 deletions
@@ -101,8 +101,8 @@ blast:
  remote:

    # remote search database
-   # options: nr, refseq_protein, swissprot, pdb, etc.
-   db: nr
+   # options: nr, refseq_select_prot, refseq_protein, swissprot, pdb, etc.
+   db: refseq_select_prot

    # remote search algorithm
    # options: blastp psiBlast deltaBlast kmerBlastp phiBlast, etc.
@@ -114,13 +114,13 @@ remote:
    # typically cannot exceed 8,000 characters)
    retries: 5    # maximum number of retries per search
    delay: 60     # seconds between two search requests
-   timeout: 900  # seconds before program gives up waiting
+   timeout: 1800  # seconds before program gives up waiting

    # entrez query text
    entrez: all [filter] NOT(environmental samples[filter] OR metagenomes[orgn]) txid131567[orgn]

    # extra URL arguments for remote search
-   extrargs: "&FILTER=m%20S"
+   extrargs: "&WORD_SIZE=6&FILTER=m%20S"


  ## Fetch information from remote server
