CDSKIT (/sidieskit/) is a Python program that processes DNA sequences, especially protein-coding sequences. Many of its functions handle DNA sequences using codons (sets of three nucleotides) as the unit and therefore edit coding sequences without causing frameshifts. All sequence formats supported by Biopython are available for both input and output.
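To illustrate what treating the codon as the unit means, here is a minimal pure-Python sketch. It is not CDSKIT's code; the function names `split_codons` and `drop_codons` are hypothetical and exist only for this example. Removing whole codons, rather than single nucleotides, keeps the downstream reading frame unchanged.

```python
def split_codons(seq: str) -> list[str]:
    """Split an in-frame nucleotide sequence into codons (3-nt chunks)."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def drop_codons(seq: str, drop: set[int]) -> str:
    """Remove whole codons by 0-based codon index, keeping the frame intact."""
    codons = split_codons(seq)
    return "".join(c for i, c in enumerate(codons) if i not in drop)


cds = "ATGGCCNNNTTTTAA"
print(split_codons(cds))      # ['ATG', 'GCC', 'NNN', 'TTT', 'TAA']
print(drop_codons(cds, {2}))  # ATGGCCTTTTAA
```

Because every edit removes a multiple of three nucleotides, the result is still in-frame.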
# Installation with pip
pip install git+https://github.com/kfuku52/cdskit
# This should show complete options if installation is successful
cdskit -h
See Wiki for detailed descriptions.
- accession2fasta: Retrieving FASTA sequences from a list of GenBank accessions
- aggregate: Extracting the longest sequences combined with a sequence name regex
- backtrim: Back-translating a trimmed protein alignment
- gapjust: Adjusting consecutive Ns to the fixed length
- hammer: Removing less-occupied codon columns from a gappy alignment
- intersection: Dropping non-overlapping sequence labels between two sequence files or between a sequence file and a GFF file
- label: Modifying sequence labels
- mask: Masking ambiguous and/or stop codons
- pad: Making nucleotide sequences in-frame by head and tail paddings
- parsegb: Converting the GenBank format
- printseq: Printing a subset of sequences with a regex
- rmseq: Removing a subset of sequences by using a sequence name regex and by detecting problematic sequence characters
- split: Splitting 1st, 2nd, and 3rd codon positions
- stats: Printing sequence statistics
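The ideas behind a few of the subcommands above (pad, mask, and split) can be sketched in plain Python. This is a simplified illustration, not CDSKIT's implementation: the function names are hypothetical, padding is added only at the tail, every stop codon (including a terminal one) is masked, and only the standard genetic code is assumed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code only


def pad_tail(seq: str, pad_char: str = "N") -> str:
    """Append pad_char until the length is a multiple of three (tail only)."""
    return seq + pad_char * ((3 - len(seq) % 3) % 3)


def mask_stops(seq: str, mask: str = "NNN") -> str:
    """Replace every stop codon with the mask, codon by codon."""
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return "".join(mask if c.upper() in STOP_CODONS else c for c in codons)


def split_positions(seq: str) -> tuple[str, str, str]:
    """Return the 1st, 2nd, and 3rd codon positions as separate strings."""
    return seq[0::3], seq[1::3], seq[2::3]


raw = "ATGTAAGCCTT"                # 11 nt, not in-frame
padded = pad_tail(raw)             # ATGTAAGCCTTN
masked = mask_stops(padded)        # ATGNNNGCCTTN
print(split_positions(masked))     # ('ANGT', 'TNCT', 'GNCN')
```

The actual subcommands offer more behavior than this sketch; see the Wiki for their options.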
CDSKIT is designed for data flow through standard input and output. Its subcommands can be chained with one another and with other sequence processing tools, such as SeqKit, using pipes (|).
# Example
seqkit seq input.fasta.gz | cdskit pad | cdskit mask | seqkit translate | cdskit aggregate -x ":.*" > output.fasta
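Because each step reads from standard input and writes to standard output, a small custom filter can sit in the same kind of pipeline. The sketch below is a hypothetical example, not part of CDSKIT: it assumes plain FASTA text, and `read_fasta` and `keep_in_frame` are names invented here. It keeps only records whose length is a multiple of three.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_in_frame(in_handle, out_handle) -> int:
    """Copy records whose length is a multiple of three; return how many were kept."""
    kept = 0
    for header, seq in read_fasta(in_handle):
        if seq and len(seq) % 3 == 0:
            out_handle.write(f">{header}\n{seq}\n")
            kept += 1
    return kept


# Demonstration on in-memory streams.
# In a pipe, one would instead call: keep_in_frame(sys.stdin, sys.stdout)
demo_in = io.StringIO(">a\nATGGCC\n>b\nATGG\n>c\nATG\nTAA\n")
demo_out = io.StringIO()
print(keep_in_frame(demo_in, demo_out))  # 2
print(demo_out.getvalue(), end="")
```

Saved as a script (for example under the hypothetical name keep_in_frame.py) and wired to sys.stdin and sys.stdout, it could be placed between steps, as in `seqkit seq input.fasta.gz | python keep_in_frame.py | cdskit mask`.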
There is no published paper on CDSKIT itself, but we used and cited CDSKIT in several papers including Fukushima & Pollock (2023, Nat Ecol Evol 7: 155-170).
This program is BSD-licensed (3 clause). See LICENSE for details.
