GitHub - kamome1201/cdskit: Processing protein-coding DNA sequences in frame

Overview

CDSKIT (/sidieskit/) is a Python program that processes DNA sequences, especially protein-coding sequences. Many of its functions handle DNA sequences using codons (sets of three nucleotides) as the unit, and therefore edit coding sequences without causing a frameshift. All sequence formats supported by Biopython are available for both input and output.
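To illustrate why working in codon units avoids frameshifts, here is a minimal sketch in plain Python. It is not CDSKIT's own code; the function names `split_codons` and `drop_codons` are hypothetical and exist only for this example.

```python
def split_codons(seq):
    """Split an in-frame nucleotide sequence into a list of codons."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def drop_codons(seq, codon_indices):
    """Remove whole codons (0-based codon indices) from a sequence.

    Because only complete triplets are removed, the downstream
    reading frame is unchanged.
    """
    drop = set(codon_indices)
    kept = [c for i, c in enumerate(split_codons(seq)) if i not in drop]
    return "".join(kept)


cds = "ATGGCCTTTTAA"
print(split_codons(cds))      # ['ATG', 'GCC', 'TTT', 'TAA']
print(drop_codons(cds, [1]))  # ATGTTTTAA
```

Removing a single nucleotide instead of a whole codon would shift every downstream codon, which is the kind of edit a codon-aware tool is meant to avoid.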

Installation

# Installation with pip
pip install git+https://github.com/kfuku52/cdskit

# This should show complete options if installation is successful
cdskit -h

Subcommands

See Wiki for detailed descriptions.

accession2fasta: Retrieving fasta sequences from a list of GenBank accessions
aggregate: Extracting the longest sequences combined with a sequence name regex
backtrim: Back-translating a trimmed protein alignment
gapjust: Adjusting consecutive Ns to a fixed length
hammer: Removing less-occupied codon columns from a gappy alignment
intersection: Dropping non-overlapping sequence labels between two sequence files or between a sequence file and a gff file
label: Modifying sequence labels
mask: Masking ambiguous and/or stop codons
pad: Making nucleotide sequences in-frame by head and tail paddings
parsegb: Converting the GenBank format
printseq: Printing a subset of sequences with a regex
rmseq: Removing a subset of sequences by using a sequence name regex and by detecting problematic sequence characters
split: Splitting 1st, 2nd, and 3rd codon positions
stats: Printing sequence statistics
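As a rough idea of what codon-level operations such as masking stop codons or splitting codon positions involve, the sketch below uses plain Python. It is an illustration only and not how CDSKIT implements `mask` or `split`; the function names, the `N` mask character, and the restriction to the standard genetic code are assumptions made for this example.

```python
# Assumption: standard genetic code only; other codon tables differ.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def mask_stop_codons(seq, mask_char="N"):
    """Replace each in-frame stop codon with three mask characters."""
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    masked = [mask_char * 3 if c.upper() in STOP_CODONS else c for c in codons]
    return "".join(masked)


def split_codon_positions(seq):
    """Return the 1st, 2nd, and 3rd codon positions as three strings."""
    return seq[0::3], seq[1::3], seq[2::3]


print(mask_stop_codons("ATGTAAGCC"))    # ATGNNNGCC
print(split_codon_positions("ATGGCC"))  # ('AG', 'TC', 'GC')
```

Both functions keep the sequence length a multiple of three when the input is in frame, so their output can be passed on to further codon-based steps.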

Streamlined analysis

CDSKIT is designed for data flow through standard input and output, so its subcommands can be chained with pipes (|) and combined with other sequence processing tools such as SeqKit.

# Example
seqkit seq input.fasta.gz | cdskit pad | cdskit mask | seqkit translate | cdskit aggregate -x ":.*" > output.fasta
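The same stream-oriented design can be sketched for a custom step written in Python. The example below is an assumption-laden illustration rather than part of CDSKIT: `read_fasta` and `uppercase_filter` are hypothetical helpers that read FASTA text from one stream and write it to another, which is the behavior a tool needs to sit between two pipes.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def uppercase_filter(in_handle, out_handle):
    """Write every record from in_handle to out_handle in upper case."""
    for header, seq in read_fasta(in_handle):
        out_handle.write(f">{header}\n{seq.upper()}\n")


# Demonstration with in-memory streams standing in for a pipe.
demo_in = io.StringIO(">seq1\natgg\ncc\n>seq2\nttt\n")
demo_out = io.StringIO()
uppercase_filter(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

In an actual pipeline, passing `sys.stdin` and `sys.stdout` to `uppercase_filter` in place of the in-memory streams would let such a script be placed between `seqkit` and `cdskit` commands.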

Citation

There is no published paper on CDSKIT itself, but we used and cited CDSKIT in several papers including Fukushima & Pollock (2023, Nat Ecol Evol 7: 155-170).

Licensing

This program is BSD-licensed (3 clause). See LICENSE for details.
