- Repository transferred from [DittmarLab](https://github.com/DittmarLab) to [qiyunlab](https://github.com/qiyunlab).
17
+
- Updated recommended dependency versions. The program should remain compatible with previous versions.
12
18
- Minor tweaks with no visible impact on program behavior.
13
19
20
+
### Fixed
21
+
- Fixed an issue with the NCBI FTP server connection during database construction. NCBI now recommends rsync over FTP, so the download protocol has been updated accordingly.
22
+
- Fixed compatibility with latest scikit-learn (1.0.1).
23
+
- Fixed compatibility with latest DIAMOND (2.0.13).
Build a reference database using the default protocol:
43
+
Then you will be able to type `hgtector` to run the program. Here are more details of [installation](doc/install.md).
44
+
45
+
Build a reference [database](doc/database.md) using the default protocol:
44
46
45
47
```bash
46
48
hgtector database -o db_dir --default
47
49
```
48
50
49
-
This will retrieve the latest genomic data from NCBI. If this does not work (e.g., due to network issues), or you need some customization, please read the [database](doc/database.md) page.
51
+
Or [download](https://www.dropbox.com/s/tszxy9etp52id3u/hgtdb_20211121.tar.xz?dl=0) a pre-built database as of 2021-11-21, and [compile](doc/database.md#Manual-compiling) it.
50
52
51
53
Prepare input file(s). They should be multi-FASTA files of amino acid sequences (faa). Each file represents the whole protein set of a complete or partial genome.
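As a quick sanity check on the input, one could count the protein records in each file before running the program. This is only a minimal sketch: the helper name `count_proteins` and the `input/` directory are hypothetical and not part of HGTector.

```bash
# Count FASTA records (header lines starting with ">") in a protein file.
# Hypothetical helper for illustration; not provided by HGTector.
count_proteins() {
    grep -c '^>' "$1"
}

# Example: report the number of proteins in each input file.
# for f in input/*.faa; do
#     printf '%s\t%s\n' "$f" "$(count_proteins "$f")"
# done
```

A count of zero, or one far below the expected gene count of the genome, usually means the file is not a protein multi-FASTA file.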
52
54
@@ -69,7 +71,7 @@ It is recommended that you read the [first run](doc/1strun.md), [second run](doc
69
71
70
72
## License
71
73
72
-
Copyright (c) 2013-2020, [Qiyun Zhu](mailto:[email protected]) and [Katharina Dittmar](mailto:[email protected]). Licensed under [BSD 3-clause](http://opensource.org/licenses/BSD-3-Clause). See full license [statement](LICENSE).
74
+
Copyright (c) 2013-2021, [Qiyun Zhu](mailto:[email protected]) and [Katharina Dittmar](mailto:[email protected]). Licensed under [BSD 3-clause](http://opensource.org/licenses/BSD-3-Clause). See full license [statement](LICENSE).
doc/1strun.md (1 addition, 1 deletion)
@@ -11,7 +11,7 @@ A small example is provided in the subdirectory [example](../example). The input
11
11
12
12
Let's analyze this small example using HGTector.
13
13
14
-
**Note**: It has been increasingly infeasible as of 2020 to run remote search through the NCBI BLAST server. If you experience a very slow run when going through this tutorial, please skip and move on to [second run](2ndrun.md).
14
+
**Note**: Automatic remote BLAST search using URL API is inefficient as of 2021 and has been [deprecated](https://ncbi.github.io/blast-cloud/dev/api.html) by NCBI. Therefore, this tutorial is for reference only. Unless you want to wait for long hours, please skip this tutorial and move on to the [second run](2ndrun.md).
The `database` command is an automated workflow for sampling reference genomes, downloading non-redundant protein sequences, and building local databases for sequence homology search. It provides various options for flexible customization of the database, to address specific research goals including HGT prediction or other general purposes.
@@ -9,33 +12,36 @@ The `database` command is an automated workflow for sampling reference genomes,
9
12
hgtector database -o <output_dir> <parameters...>
10
13
```
11
14
12
-
### Default protocol
15
+
The workflow consists of the following steps:
13
16
14
-
HGTector provides a default protocol for database building.
17
+
1. Download NCBI taxonomy database (taxdump).
18
+
2. Download NCBI RefSeq assembly summary.
19
+
3. Sample genomes based on various properties and taxonomic information.
20
+
4. Download protein sequences associated with sampled genomes.
21
+
5. Compile local databases using DIAMOND and/or BLAST.
15
22
16
-
```bash
17
-
hgtector database -o <output_dir> --default
18
-
```
19
23
20
-
This will download all protein sequences of NCBI RefSeq genomes of bacteria, archaea, fungi and protozoa, keep one genome per species, plus all NCBI-defined reference and representative genomes. Finally it will attempt to compile the database using DIAMOND, if available. The command is equivalent to:
24
+
## Default protocol
25
+
Database files
26
+
This will download all protein sequences of NCBI RefSeq genomes of **bacteria**, **archaea**, **fungi** and **protozoa**, keep _one genome per species_ that has a Latinate name, plus one genome per taxonomic group at higher ranks, regardless of whether that genome has a Latinate species name, plus all NCBI-defined **reference**, **representative** and **type material** genomes (prioritized during taxonomy-based sampling, and added afterwards if not sampled). Finally it will attempt to compile the database using DIAMOND, if available. The command is equivalent to:
A pre-built default database as of 2019-10-21 is available for [download](https://www.dropbox.com/s/qdnfgzdcjadlm4i/hgtdb_20191021.tar.xz?dl=0). It needs to be [compiled](#Manual-compiling) using the aligner of your choice.
27
32
28
-
### Procedures
33
+
## Pre-built database
29
34
30
-
The workflow consists of the following steps:
35
+
A database built using the default protocol on 2021-11-21 is available for [download](https://www.dropbox.com/s/tszxy9etp52id3u/hgtdb_20211121.tar.xz?dl=0) \([MD5](https://www.dropbox.com/s/kdopz946pk088wr/hgtdb_20211121.tar.xz.md5?dl=0)\). It needs to be [compiled](#Manual-compiling) using the aligner of your choice.
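After downloading, the archive can be checked against the MD5 file before extraction. The following is a minimal sketch, assuming GNU `md5sum` is installed and that the `.md5` file uses the common "digest, then filename" layout (the layout is not confirmed here). The helper name `verify_md5` is hypothetical.

```bash
# Compare a file's MD5 digest with an expected hex string.
# Returns 0 on match and non-zero on mismatch. Requires md5sum (GNU coreutils).
# Hypothetical helper for illustration; not provided by HGTector.
verify_md5() {
    local actual
    actual=$(md5sum "$1" | cut -d ' ' -f 1)
    [ "$actual" = "$2" ]
}

# Example (assumes the digest is the first field of the .md5 file):
# expected=$(cut -d ' ' -f 1 hgtdb_20211121.tar.xz.md5)
# verify_md5 hgtdb_20211121.tar.xz "$expected" && tar -xJf hgtdb_20211121.tar.xz
```

Extraction only proceeds when the digests match, so a truncated download is caught before any compiling time is spent on it.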
31
36
32
-
1. Download NCBI taxonomy database (taxdump).
33
-
2. Download NCBI RefSeq assembly summary.
34
-
3. Sample genomes based on various properties and taxonomic information.
35
-
4. Download protein sequences associated with sampled genomes.
36
-
5. Compile local databases using DIAMOND and/or BLAST.
37
+
This database, sampled from NCBI RefSeq after release 209, contains 68,977,351 unique protein sequences from 21,754 microbial genomes, representing 3 domains, 74 phyla, 145 classes, 337 orders, 783 families, 3,753 genera and 15,932 species.
37
38
38
-
### Database files
39
+
Building this database used a maximum of 63 GB memory. Searching this database using DIAMOND v2.0.13 requires ~7 GB memory.
40
+
41
+
A previous version of the database built on 2019-10-21 is available [here](https://www.dropbox.com/s/qdnfgzdcjadlm4i/hgtdb_20191021.tar.xz?dl=0).
42
+
43
+
44
+
## Database files
39
45
40
46
File or directory | Description
41
47
--- | ---
@@ -60,6 +66,9 @@ The protein-to-TaxID map is already integrated into the compiled databases, so o
60
66
61
67
Feel free to delete (e.g., `download/`) or compress the intermediate files (e.g., `db.faa`) to save disk space.
This will only download genomes specified in the file `gids.txt`. Useful for controlled tests.
82
91
92
+
### Clean up
93
+
94
+
After the database is successfully built, you may consider compressing `db.faa` and deleting `download/` (or just `download/faa/`) to save disk space. HGTector won't do this automatically.
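The clean-up described here can be scripted. This is a minimal sketch, assuming `db.faa` and `download/faa/` sit directly under the database directory and that `gzip` is an acceptable compressor. The function name `cleanup_db` is hypothetical.

```bash
# Compress db.faa and delete downloaded per-genome files in a database directory.
# Hypothetical helper for illustration; HGTector does not do this automatically.
cleanup_db() {
    local db_dir=${1:?usage: cleanup_db <db_dir>}
    # Compress the combined protein file if it is still uncompressed.
    if [ -f "$db_dir/db.faa" ]; then
        gzip "$db_dir/db.faa"
    fi
    # Remove only download/faa/; other content of download/ is kept.
    rm -rf "$db_dir/download/faa"
}

# Example:
# cleanup_db db_dir
```

Only `download/faa/` is removed here, which is the more conservative of the two choices mentioned. Run it only after confirming that the compiled database works.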
95
+
83
96
### Break and resume
84
97
85
98
Should any of the download steps be interrupted (e.g., by a network failure), one can resume the downloading process by re-executing the same command. The program will skip the already downloaded files in this new run. In some instances, one may need to manually remove the last file from the failed run (because that file may be corrupt) before re-running the program.
86
99
87
100
If one wants to overwrite downloaded files (e.g., upgrading), add `--overwrite` to the command.
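Removing the possibly corrupt last file can be done with a small helper before re-running. This is a minimal sketch that assumes the "last file" is the most recently modified one in the download directory and that file names contain no newlines. The function name `remove_newest_file` is hypothetical.

```bash
# Delete the most recently modified file in a directory and print its name.
# Assumption: the newest file is the one left incomplete by the failed run.
# Hypothetical helper for illustration; not provided by HGTector.
remove_newest_file() {
    local dir=${1:?usage: remove_newest_file <dir>} newest
    newest=$(ls -t "$dir" | head -n 1)
    if [ -n "$newest" ]; then
        rm -f "$dir/$newest"
        echo "$newest"
    fi
}

# Example: drop the newest download, then resume with the same command.
# remove_newest_file db_dir/download/faa
# hgtector database -o db_dir --default
```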
88
101
102
+
### Manual downloading
103
+
104
+
One may want to download genomes manually in a more controlled manner, instead of letting HGTector run for hours to days to retrieve them one after another before moving to the next step. In this case, add `--manual` to the command, and the program will generate `urls.txt`, a list of URLs of the sampled genomes, and quit.
105
+
106
+
Then one can choose the most appropriate method to download them. For example, one may use the [rsync protocol](https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#protocols), as recommended by NCBI:
After all genomes (protein sequences) are downloaded to `download/faa/`, one may restart the program without `--manual`, and the program will take the downloaded files and move to the next step.
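A manual download over rsync could be scripted along the following lines. This is a minimal sketch, assuming `urls.txt` holds one HTTPS or FTP URL per line and that the same paths are served over rsync when the protocol prefix is replaced. The function names and the location of `urls.txt` are hypothetical.

```bash
# Rewrite an HTTPS or FTP URL to use the rsync protocol prefix.
# Hypothetical helper for illustration; not provided by HGTector.
to_rsync_url() {
    printf '%s\n' "$1" | sed -E 's#^(https|ftp)://#rsync://#'
}

# Download every URL listed in a file (one per line) into a directory.
download_all() {
    local list=$1 dest=$2 url
    mkdir -p "$dest"
    while IFS= read -r url; do
        [ -n "$url" ] || continue   # skip blank lines
        rsync --copy-links --times --quiet "$(to_rsync_url "$url")" "$dest/"
    done < "$list"
}

# Example (assumes urls.txt was written to the output directory):
# download_all db_dir/urls.txt db_dir/download/faa
```

Because `rsync --times` preserves modification times and skips files that are already identical, re-running `download_all` after an interruption should transfer only what is missing.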
116
+
89
117
### Manual compiling
90
118
91
119
The sampling and downloading steps (1-4) require extensive network traffic (usually several hours, if the bottleneck is not on the recipient side) but little local computing load, whereas the compiling step (5) requires extensive local computing power but no networking.
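The compiling step can then be run separately on a machine with more computing power. The following is a minimal sketch of a DIAMOND compile, not the project's own recipe. The file layout (`taxon.map`, `taxdump/nodes.dmp`, `taxdump/names.dmp`, `diamond/db`) is an assumption, and the `DIAMOND` variable is a hypothetical override for the executable path.

```bash
# Compile a DIAMOND database with taxonomy from files in a database directory.
# Assumed layout: db.faa, taxon.map, taxdump/{nodes,names}.dmp under <db_dir>.
# Hypothetical helper for illustration; not provided by HGTector.
compile_diamond() {
    local db_dir=${1:?usage: compile_diamond <db_dir> [threads]}
    local threads=${2:-1}
    mkdir -p "$db_dir/diamond"
    "${DIAMOND:-diamond}" makedb \
        --threads "$threads" \
        --in "$db_dir/db.faa" \
        --taxonmap "$db_dir/taxon.map" \
        --taxonnodes "$db_dir/taxdump/nodes.dmp" \
        --taxonnames "$db_dir/taxdump/names.dmp" \
        --db "$db_dir/diamond/db"
}

# Example:
# compile_diamond db_dir 16
```

Setting `DIAMOND=echo` prints the command instead of running it, which is a cheap way to check the paths before committing to a long compile.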
@@ -95,12 +123,11 @@ Therefore, it is a reasonable plan to only download database files without compi
`-s`, `--sample` | 0 | Sample up to this number of genomes per taxonomic group at the given rank. "0" is for all (disable sampling).
178
-
`-r`, `--rank` | species | Taxonomic rank at which subsampling will be performed.
204
+
`-s`, `--sample` | 0 | Sample up to this number of genomes per taxonomic group at the given rank. "0" is for all (disable sampling). Prior to sampling, genomes will be sorted by NCBI genome category: reference > representative > type material, then by assembly level: complete genome or chromosome > scaffolds > contigs. Sampling will start from the top of the list.
205
+
`-r`, `--rank` | species | Taxonomic rank at which subsampling will be performed. Can be any taxonomic rank defined in the NCBI taxonomy database. A special case is "species_latin", which will sample from species that have Latinate names.
206
+
`--above` | - | Sampling will also be performed on ranks from the one given by `-r` to phylum (low to high). They will not overlap the already sampled ones. For example, if two _E. coli_ genomes are already sampled, no more genomes will be added when sampling in genus _Escherichia_. This flag is useful in the case of `-r species_latin`, because some ranks above species may be undersampled.
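The sort-then-sample behavior described for `--sample` can be illustrated outside the program. This is a minimal sketch, not HGTector's implementation. It assumes a simplified, hypothetical four-column TSV (genome ID, taxonomic group, category, assembly level) with pre-normalized labels rather than the actual NCBI assembly summary format.

```bash
# Sample up to N genomes per taxonomic group, preferring better genomes.
# Input (stdin), hypothetical simplified TSV:
#   genome_id <TAB> group <TAB> category <TAB> level
#   category: reference | representative | type | (anything else)
#   level:    complete | chromosome | scaffold | contig
# Output: sampled genome IDs, one per line. N = 0 disables sampling.
sample_genomes() {
    awk -F'\t' -v OFS='\t' '
        BEGIN {
            cat["reference"] = 0; cat["representative"] = 1; cat["type"] = 2
            lev["complete"] = 0; lev["chromosome"] = 0
            lev["scaffold"] = 1; lev["contig"] = 2
        }
        {
            c = ($3 in cat) ? cat[$3] : 3   # unknown categories sort last
            l = ($4 in lev) ? lev[$4] : 3   # unknown levels sort last
            print $2, c, l, $1
        }' |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n -k3,3n -k4,4 |
    awk -F'\t' -v n="$1" 'n == 0 || ++seen[$1] <= n { print $4 }'
}

# Example:
# sample_genomes 1 < genomes.tsv
```

Ties within a group are broken by genome ID only to keep the output deterministic. That tie-break is a choice made for this sketch.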
179
207
180
208
### Genome sampling
181
209
182
210
Option | Default | Description
183
211
--- | --- | ---
184
212
`--genbank` | - | By default the program only downloads RefSeq genomes (`GCF`). This flag will let the program also download GenBank genomes (`GCA`). But RefSeq has higher priority than GenBank if the same genome is hosted by both catalogs.
185
213
`--complete` | - | Only include complete genomes, i.e., `assembly_level` is `Complete Genome` or `Chromosome`.
HGTector has a command `database` for automated database construction. It defaults to the **NCBI** RefSeq microbial genomes and taxonomy. Meanwhile, we also provide instructions for using **GTDB** and custom databases. See [details](database.md).
51
51
52
-
A standard database built using the default protocol on 2019-10-21 is available for [download](https://www.dropbox.com/s/qdnfgzdcjadlm4i/hgtdb_20191021.tar.xz?dl=0), together with [instruction](database.md#Manual-compiling) for compiling.
52
+
A standard database built using the default protocol on 2021-11-21 is available for [download](https://www.dropbox.com/s/tszxy9etp52id3u/hgtdb_20211121.tar.xz?dl=0) \([MD5](https://www.dropbox.com/s/kdopz946pk088wr/hgtdb_20211121.tar.xz.md5?dl=0)\), together with [instructions](database.md#Manual-compiling) for compiling.
53
53
54
54
A small, pre-compiled test database is also available for [download](https://www.dropbox.com/s/46v3uc708rvc5rc/ref107.tar.xz?dl=0).
If future versions of some dependencies become incompatible with the current release of HGTector, the following "safe" command can be used to install the current versions of dependencies (note: DIAMOND version is too tricky to specify).
83
+
If future versions of some dependencies become incompatible with the current release of HGTector, the following "safe" command can be used to install the current versions of dependencies.