v0.11.0-alpha
Pre-release
Changes
- new command taxonkit create-taxdump: Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB. #56
Usage:
Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB
Input format:
0. For GTDB taxonomy file, just use --gtdb
1. The input file should be tab-delimited
2. At least one column is needed, please specify the field index:
1) Kingdom/Superkingdom/Domain, -K/--field-kingdom
2) Phylum, -P/--field-phylum
3) Class, -C/--field-class
4) Order, -O/--field-order
5) Family, -F/--field-family
6) Genus, -G/--field-genus
7) Species (needed), -S/--field-species
8) Subspecies, -T/--field-subspecies
For GTDB, we use the assembly accession (without version number).
3. Specifying the column containing the genome/assembly accession is recommended,
so that a TaxId mapping file (taxid.map, id -> taxid) can be generated.
-A/--field-accession, field containing genome/assembly accession
--field-accession-re, regular expression to extract the accession
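The accession handling can be illustrated with a small Python sketch. It applies the two default patterns listed for --field-accession-re and --gtdb-re-subs to a GTDB-style identifier. The helper name `extract` is hypothetical, and Python's `re` module stands in for the regular expression engine taxonkit actually uses, so edge-case behavior may differ.

```python
import re

# Default patterns from the flag list, with the doubled backslashes unescaped.
ACCESSION_RE = re.compile(r"^\w\w_(.+)$")               # --field-accession-re
SUBSPECIES_RE = re.compile(r"^\w\w_GC[AF]_(.+)\.\d+$")  # --gtdb-re-subs


def extract(pattern, value):
    """Return the first capture group, or None when the pattern does not match."""
    match = pattern.match(value)
    return match.group(1) if match else None


genome_id = "GB_GCA_018897955.1"
print(extract(ACCESSION_RE, genome_id))   # GCA_018897955.1
print(extract(SUBSPECIES_RE, genome_id))  # 018897955
```

The second pattern drops both the GCA_/GCF_ prefix and the version suffix, which matches the note that the assembly accession is used without its version number.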
Attention:
1. Names should be distinct in taxa of different rank.
But for lineages missing some taxon nodes, using the names of parent nodes is allowed:
GB_GCA_018897955.1 d__Archaea;p__EX4484-52;c__EX4484-52;o__EX4484-52;f__LFW-46;g__LFW-46;s__LFW-46 sp018897155
It can also detect duplicate names with different ranks, e.g.,
The Class and Genus have the same name B47-G6, and the Order and Family between them have different names.
In this case, we assign a new TaxId by increasing the TaxId until it is distinct.
GB_GCA_003663585.1 d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;f__47-G6;g__B47-G6;s__B47-G6 sp003663585
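The duplicate-name rule can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `parse_lineage` and `assign_taxids` are hypothetical, CRC32 is a stand-in for whatever hash taxonkit really uses to derive TaxIds, and the special handling of names inherited from a parent node is not modeled. The sketch only shows the "increase until distinct" step for a name that appears at two ranks.

```python
import zlib

# GTDB rank prefixes mapped to the first seven default --rank-names.
RANKS = {"d": "superkingdom", "p": "phylum", "c": "class", "o": "order",
         "f": "family", "g": "genus", "s": "species"}


def parse_lineage(lineage):
    """Split 'd__Archaea;p__...' into a list of (rank, name) pairs."""
    pairs = []
    for item in lineage.split(";"):
        prefix, name = item.split("__", 1)
        pairs.append((RANKS[prefix], name))
    return pairs


def assign_taxids(pairs, used=None):
    """Give every (rank, name) pair its own TaxId, probing upward on collisions."""
    used = {} if used is None else used  # taxid -> (rank, name)
    taxids = {}
    for rank, name in pairs:
        # Stand-in hash (assumption): CRC32 of the lower-cased name.
        taxid = zlib.crc32(name.lower().encode("utf-8"))
        while taxid in used and used[taxid] != (rank, name):
            taxid += 1  # increase until the TaxId is distinct
        used[taxid] = (rank, name)
        taxids[rank] = taxid
    return taxids


lineage = ("d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;"
           "f__47-G6;g__B47-G6;s__B47-G6 sp003663585")
taxids = assign_taxids(parse_lineage(lineage))
print(taxids["class"], taxids["genus"])  # two different TaxIds for the same name
```

Here the class and the genus share the name B47-G6, so both hash to the same starting value and the genus ends up with a larger, distinct TaxId.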
Usage:
taxonkit create-taxdump [flags]
Flags:
-A, --field-accession int field index of assembly accession (genome ID), for outputting taxid.map
--field-accession-re string regular expression to extract assembly accession (default
"^\\w\\w_(.+)$")
-C, --field-class int field index of class
-F, --field-family int field index of family
-G, --field-genus int field index of genus
-K, --field-kingdom int field index of kingdom
-O, --field-order int field index of order
-P, --field-phylum int field index of phylum
-S, --field-species int field index of species (needed)
-T, --field-subspecies int field index of subspecies
--force overwrite existing output directory
--gtdb input files are GTDB taxonomy files
--gtdb-re-subs string regular expression to extract assembly accession as the subspecies
(default "^\\w\\w_GC[AF]_(.+)\\.\\d+$")
-h, --help help for create-taxdump
--line-chunk-size int number of lines to process for each thread, and 4 threads is fast
enough. (default 5000)
--null strings null value of taxa (default [,NULL,NA])
-x, --old-taxdump-dir string taxdump directory of older version
--out-dir string output directory
--rank-names strings names of the 8 ranks, order matters (default
[superkingdom,phylum,class,order,family,genus,species,no rank])
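As a rough illustration of how the flags combine, the Python sketch below assembles a command line for a GTDB taxonomy file. The helper `create_taxdump_argv` and the file names are hypothetical, and passing the input file as a positional argument is an assumption, since the usage line above only shows [flags]. The sketch builds and prints the command without running it.

```python
import shlex


def create_taxdump_argv(input_file, out_dir, gtdb=True, force=False,
                        old_taxdump_dir=None):
    """Build an argument list for taxonkit create-taxdump (not executed here)."""
    argv = ["taxonkit", "create-taxdump"]
    if gtdb:
        argv.append("--gtdb")
    if force:
        argv.append("--force")  # overwrite an existing output directory
    if old_taxdump_dir is not None:
        argv += ["--old-taxdump-dir", old_taxdump_dir]
    # Assumption: the input file is given as a positional argument.
    argv += ["--out-dir", out_dir, input_file]
    return argv


argv = create_taxdump_argv("gtdb_taxonomy.tsv", "taxdump", force=True)
print(shlex.join(argv))
# taxonkit create-taxdump --gtdb --force --out-dir taxdump gtdb_taxonomy.tsv
```

With taxonkit installed, the list could be passed to `subprocess.run(argv, check=True)`.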