cdskit rmseq

Kenji Fukushima edited this page Mar 6, 2023 · 2 revisions

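cdskit rmseq removes a subset of sequences from a FASTA file. A sequence is dropped if its name matches a regular expression (--seqname) or if too large a percentage of its characters are problematic, such as N (--problematic_percent).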

Example

Command

cdskit rmseq -s input.fasta --seqname "Arabidopsis_thaliana.*" --problematic_percent 50 -o output.fasta

input.fasta

>Aquilegia_coerulea_1
AGAGTTCAATATGCTTTGAGTCGAATTCGTAACAATGCTAGAAATCTTCTTACTCTTGAT
>Aquilegia_coerulea_2
AGAGTTCAATATGCTTTAAGTCGAATTCGAAACAATGCTAGAAATCTTCTCACTCTGGAT
>Aquilegia_coerulea_3
AGAGTTCAATATGCTTTAAGTCGAATTCGTAACAATGCAAGAAATCTTCTTACACTTGAT
>Hylocereus_undatus_1
AGGGTCCAATATGTTCTGAGCCGTATCCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>Hylocereus_undatus_2
AGGGTTCAATACGTTCTGAGCCGTATCCGTAATGCTGCAAGGCATCTTCTTACCCTGGAT
>Hylocereus_undatus_3
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTGCGGCAAGGCACCTTCTCACTCTGGAT
>Arabidopsis_thaliana_1
AGAGTTCAATATACACTTAGCAGAATCCGTAATGCTGCAAGAGAACTCTTAACTCTTGAT
>Arabidopsis_thaliana_2
AGAGTGCAGTACTCTCTTAGCCGTATCCGTAATGCTGCTAGAGATCTTTTGACTCTTGAT

output.fasta

>Aquilegia_coerulea_1
AGAGTTCAATATGCTTTGAGTCGAATTCGTAACAATGCTAGAAATCTTCTTACTCTTGAT
>Aquilegia_coerulea_2
AGAGTTCAATATGCTTTAAGTCGAATTCGAAACAATGCTAGAAATCTTCTCACTCTGGAT
>Aquilegia_coerulea_3
AGAGTTCAATATGCTTTAAGTCGAATTCGTAACAATGCAAGAAATCTTCTTACACTTGAT
>Hylocereus_undatus_2
AGGGTTCAATACGTTCTGAGCCGTATCCGTAATGCTGCAAGGCATCTTCTTACCCTGGAT
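
In the example above, the two Arabidopsis_thaliana sequences are removed by the name regex, while Hylocereus_undatus_1 and Hylocereus_undatus_3 are removed because 32 of their 60 characters (about 53%) are N, which exceeds the 50% threshold. The Python sketch below reproduces that filtering logic for illustration only; it is not cdskit's implementation. The function names, the use of a full-name regex match, the choice of N as the only problematic character, and the "greater than or equal to" comparison are all assumptions.

```python
import re


def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name = line[1:]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def filter_records(records, seqname_regex, problematic_percent,
                   problematic_chars=("N",)):
    """Drop records whose name matches the regex or whose share of
    problematic characters is at or above the threshold.

    Assumptions (not confirmed by the cdskit documentation):
    full-name matching, N as the only problematic character,
    and a >= comparison against the threshold.
    """
    pattern = re.compile(seqname_regex)
    kept = []
    for name, seq in records:
        if pattern.fullmatch(name):
            continue  # removed by name
        n_bad = sum(seq.upper().count(c) for c in problematic_chars)
        percent = 100.0 * n_bad / len(seq) if seq else 0.0
        if percent >= problematic_percent:
            continue  # removed for too many problematic characters
        kept.append((name, seq))
    return kept


# Three records taken from input.fasta above.
sample = """\
>Hylocereus_undatus_1
AGGGTCCAATATGTTCTGAGCCGTATCCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>Hylocereus_undatus_2
AGGGTTCAATACGTTCTGAGCCGTATCCGTAATGCTGCAAGGCATCTTCTTACCCTGGAT
>Arabidopsis_thaliana_1
AGAGTTCAATATACACTTAGCAGAATCCGTAATGCTGCAAGAGAACTCTTAACTCTTGAT
"""

kept = filter_records(parse_fasta(sample), "Arabidopsis_thaliana.*", 50)
print([name for name, _ in kept])  # ['Hylocereus_undatus_2']
```

Running the same two functions on the full input.fasta with the same regex and threshold keeps the three Aquilegia_coerulea sequences and Hylocereus_undatus_2, matching output.fasta.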