v0.11.0-alpha

Pre-release
@shenwei356 shenwei356 released this 21 Apr 05:53
· 93 commits to master since this release

Changes

  • new command taxonkit create-taxdump: Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB. #56

Usage:

Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB

Input format: 
  0. For GTDB taxonomy file, just use --gtdb
  1. The input file should be tab-delimited
  2. At least one column is needed; please specify the field index:
     1) Kingdom/Superkingdom/Domain,     -K/--field-kingdom
     2) Phylum,                          -P/--field-phylum
     3) Class,                           -C/--field-class
     4) Order,                           -O/--field-order
     5) Family,                          -F/--field-family
     6) Genus,                           -G/--field-genus
     7) Species (needed),                -S/--field-species
     8) Subspecies,                      -T/--field-subspecies
        For GTDB, we use the assembly accession (without version number).
  3. The column containing the genome/assembly accession is recommended to
     generate TaxId mapping file (taxid.map, id -> taxid).
     -A/--field-accession,    field containing genome/assembly accession
     --field-accession-re,    regular expression to extract the accession

Attention:
  1. Names should be distinct among taxa of different ranks.
     But for lineages missing some taxon nodes, using the names of parent nodes is allowed:

       GB_GCA_018897955.1      d__Archaea;p__EX4484-52;c__EX4484-52;o__EX4484-52;f__LFW-46;g__LFW-46;s__LFW-46 sp018897155

     It can also detect duplicate names at different ranks, e.g.,
     the Class and Genus have the same name B47-G6, while the Order and Family between them have different names.
     In this case, we assign a new TaxId by increasing the TaxId until it is distinct.

       GB_GCA_003663585.1      d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;f__47-G6;g__B47-G6;s__B47-G6 sp003663585
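
The two cases above can be told apart with a short Python sketch. This is only an illustration of the rule as described here, not how taxonkit implements it: the function names and the prefix-to-rank table are assumptions made for the example, and TaxId assignment itself is not shown.

```python
# Hypothetical helper table: GTDB rank prefixes mapped to rank names.
PREFIX_TO_RANK = {
    "d": "superkingdom", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}

def parse_gtdb_lineage(lineage):
    """Split 'd__X;p__Y;...' into an ordered list of (rank, name) pairs."""
    pairs = []
    for item in lineage.split(";"):
        prefix, _, name = item.partition("__")
        pairs.append((PREFIX_TO_RANK[prefix], name))
    return pairs

def classify_repeats(pairs):
    """Return (inherited, conflicting) lists of (rank, name).

    inherited:   the name equals the immediate parent's name
                 (a missing node filled with the parent name).
    conflicting: the name was used at a higher rank, but the immediate
                 parent has a different name.
    """
    inherited, conflicting = [], []
    seen = set()
    parent_name = None
    for rank, name in pairs:
        if name == parent_name:
            inherited.append((rank, name))
        elif name in seen:
            conflicting.append((rank, name))
        seen.add(name)
        parent_name = name
    return inherited, conflicting

allowed = parse_gtdb_lineage(
    "d__Archaea;p__EX4484-52;c__EX4484-52;o__EX4484-52;"
    "f__LFW-46;g__LFW-46;s__LFW-46 sp018897155"
)
duplicated = parse_gtdb_lineage(
    "d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;"
    "f__47-G6;g__B47-G6;s__B47-G6 sp003663585"
)
print(classify_repeats(allowed))
print(classify_repeats(duplicated))
```

In the first lineage every repeated name directly follows its parent, so nothing is flagged as conflicting. In the second, the genus B47-G6 repeats the class name with differently named nodes in between, which is the situation where a new TaxId is needed.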

Usage:
  taxonkit create-taxdump [flags] 

Flags:
  -A, --field-accession int         field index of assembly accession (genome ID), for outputting taxid.map
      --field-accession-re string   regular expression to extract assembly accession (default
                                    "^\\w\\w_(.+)$")
  -C, --field-class int             field index of class
  -F, --field-family int            field index of family
  -G, --field-genus int             field index of genus
  -K, --field-kingdom int           field index of kingdom
  -O, --field-order int             field index of order
  -P, --field-phylum int            field index of phylum
  -S, --field-species int           field index of species (needed)
  -T, --field-subspecies int        field index of subspecies
      --force                       overwrite existing output directory
      --gtdb                        input files are GTDB taxonomy files
      --gtdb-re-subs string         regular expression to extract assembly accession as the subspecies
                                    (default "^\\w\\w_GC[AF]_(.+)\\.\\d+$")
  -h, --help                        help for create-taxdump
      --line-chunk-size int         number of lines to process per thread; 4 threads are fast
                                    enough (default 5000)
      --null strings                null value of taxa (default [,NULL,NA])
  -x, --old-taxdump-dir string      taxdump directory of older version
      --out-dir string              output directory
      --rank-names strings          names of the 8 ranks, order matters (default
                                    [superkingdom,phylum,class,order,family,genus,species,no rank])
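
To see what the two default regular expressions capture, here is a small Python sketch applied to one of the genome IDs shown above. It assumes the first capture group is the value that gets used, and it runs the patterns with Python's `re` module rather than taxonkit's own regex engine, so treat it as an approximation; the helper name `first_group` is made up for the example.

```python
import re

# Defaults copied from the help text, with the doubled backslashes
# unescaped into ordinary regex syntax.
FIELD_ACCESSION_RE = re.compile(r"^\w\w_(.+)$")        # --field-accession-re
GTDB_RE_SUBS = re.compile(r"^\w\w_GC[AF]_(.+)\.\d+$")  # --gtdb-re-subs

def first_group(pattern, text):
    """Return capture group 1 if the pattern matches, else None."""
    match = pattern.match(text)
    return match.group(1) if match else None

genome_id = "GB_GCA_018897955.1"
accession = first_group(FIELD_ACCESSION_RE, genome_id)
subspecies = first_group(GTDB_RE_SUBS, genome_id)
print(accession)   # GCA_018897955.1
print(subspecies)  # 018897955
```

The first pattern only strips the two-character source prefix, while the second also drops the GCA_/GCF_ part and the version suffix, which matches the note that the subspecies uses the assembly accession without the version number.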