NanoTRF: a software tool to de novo search high-copy tandem repeats in Oxford Nanopore Technologies (ONT) plant DNA sequencing data
Download the latest release:
wget https://github.com/Kirovez/nanoTRF/archive/refs/tags/nanoTRF.tar.gz
tar -zxvf nanoTRF.tar.gz
- Introduction
- Installation
- Getting Started
- Usage
- Commands and options
- Input
- Output
- Authors
- Acknowledgement
- License
NanoTRF is a software tool to de novo search high-copy tandem repeats designed for raw long-read sequences. It works with Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) sequencing data.
On Linux/Unix, NanoTRF can be installed via creating an environment from an environment.yml file:
conda env create -f nanoTRF.yml
For running NanoTRF, please activate the conda environment:
conda activate nanoTRF
Your environment is ready to be used!
To generate consensus sequences in FASTA format file (with usage default optional arguments):
python3 ./nanoTRF.py -r ./test_seq/test_4th_Linum.fasta -pTH TideHunter -o ./test/ -mad 0.01
If TideHunter output table (run with -f option) was generated before then you can pass this file via -T option and NanoTRF will skip TideHunter step
python3 ./nanoTRF.py -r ./test_seq/test_4th_Linum.fasta -pTH TideHunter -o ./test/ -mad 0.01 -T TH.tab
usage: nanoTRF.py [-h] [-r READS] [-pTH PATH_TH] [-T RUN_TH] [-cap CAP3] [-diamond DIAMOND] [-o OUT_DIRECTORY]
[-b BLAST] [-mb MAKEDB] [-w WORDSIZE] [-w_f WORDSIZE_F] [-ev EVALUE] [-mid MIN_ID]
[-bld QUERY_SBJ_LENGTH_DIFFERENCES_ALLOWED] [-mad MIN_ABUNDANCY_TO_DRAW] [-m MIN_COPY]
[-nano NANO_TRF] [-tab NANO_TAB] [-rexdb_fasta REXDB_FASTA] [-rexdb_tab REXDB_TAB] [-th THREADS]
[-lg LOG_FILE] [-mOVe MIN_OVERLAP] [-ca PERC_ABUND] [-c] [-maskws MASK_BLAST_WORD_SIZE]
[-maskcov MASK_BLAST_QUERY_COVERAGE] [-maskiden MASK_BLAST_IDENTITY]
A tool to clustering sequences in fasta file and searching consensus among the many sequences for each cluster
optional arguments:
-h, --help show this help message and exit
-r READS, --reads READS
Path to FastQ or Fasta file
-pTH PATH_TH, --path_TH PATH_TH
Path to the location of the TideHunter
-T RUN_TH, --run_th RUN_TH
If you do not want to run TideHunter again and you have table file (-f 2 option in Tide
Hunter), type the path to this file here
-cap CAP3, --cap3 CAP3
Path to the location of the Cap3
-diamond DIAMOND, --diamond DIAMOND
Path to the location of DIAMOND
-o OUT_DIRECTORY, --out_directory OUT_DIRECTORY
Path to work directory for output files where will be saved
-b BLAST, --blast BLAST
Path to blastn executabled
-mb MAKEDB, --makedb MAKEDB
Path to makeblastdb executable
-w WORDSIZE, --wordsize WORDSIZE
Word size for wordfinder algorithm (length of best perfect match)
-w_f WORDSIZE_F, --wordsize_f WORDSIZE_F
Word size for Reblusting(length of best perfect match)
-ev EVALUE, --evalue EVALUE
Expectation value (E) threshold for saving hits
-mid MIN_ID, --min_id MIN_ID
minimum identity between monomers to be selected for clustering
-bld QUERY_SBJ_LENGTH_DIFFERENCES_ALLOWED, --query_sbj_length_differences_allowed QUERY_SBJ_LENGTH_DIFFERENCES_ALLOWED
maximum differences in length between query and subject
-mad MIN_ABUNDANCY_TO_DRAW, --min_abundancy_to_draw MIN_ABUNDANCY_TO_DRAW
Minimum genome abundancy for cluster of repeats to be drawn
-m MIN_COPY, --min_copy MIN_COPY
The minimum number of TRs copy in the data
-nano NANO_TRF, --nano_trf NANO_TRF
File name with consensus sequences, default name - nanoTRF.fasta
-tab NANO_TAB, --nano_tab NANO_TAB
Table file with the TRs abundancy
-rexdb_fasta REXDB_FASTA, --rexdb_fasta REXDB_FASTA
Fasta file with the RExDB protein sequences
-rexdb_tab REXDB_TAB, --rexdb_tab REXDB_TAB
Table file with the RExDB annotation
-th THREADS, --threads THREADS
Number of threads for running the module Blast and TideHunter
-lg LOG_FILE, ---log_file LOG_FILE
This file list analysis parameters, modules and files, contains messages generated on the
various stages of the NanoTRF work. It allows tracking events that happens when NanoTRF runs.
Default =loging.log
-mOVe MIN_OVERLAP, --min_Overlap MIN_OVERLAP
Number of overlapping nucleotides between repeats in one cluster
-ca PERC_ABUND, --perc_abund PERC_ABUND
Minimum value of the TR cluster abundancy
-c, --cleanup Remove unncessary large files and directories from working directory
-maskws MASK_BLAST_WORD_SIZE, --mask_blast_word_size MASK_BLAST_WORD_SIZE
word size of blastn masking of raw reads by cluster contig sequences
-maskcov MASK_BLAST_QUERY_COVERAGE, --mask_blast_query_coverage MASK_BLAST_QUERY_COVERAGE
query (contig sequence) coverage in blastn masking of raw reads by cluster contig sequences
-maskiden MASK_BLAST_IDENTITY, --mask_blast_identity MASK_BLAST_IDENTITY
minimum identity between query (contig sequence) and raw reads in blastn masking of raw reads
by cluster contig sequences
NanoTRF works with FASTA and FASTQ formats.
NanoTRF generates output in tabular format:
№ | Column name | Description |
---|---|---|
1 | Cluster | Name and cluster number |
2 | min.Contig.Cap3.Length | Min length of the contigs assembled by Cap3 |
3 | max.Contig.Cap3.Length | Max length of the contigs assembled by Cap3 |
4 | Genome.portion | Cluster abundancy in the genome (%) |
5 | Contig1.sequence | Sequence of Contig1 from consensus.fasta |
6 | Subrepeats.seq | Sequences of any detected subrepeats in Contig 1 by second run of TideHunter |
7 | Subrepeats.len | Length of subrepeats sequences |
8 | Annotation | Transposon domains and number of reads with similarity to |
NanoTRF generates 'consensus.fasta' file which contains TRs consensus sequences assembled by Cap3.
This file containes the information from tabular file and some pictures including graph layout, read coverage histogram and read coverage pie chart
This folder contains information for each cluster including consensus contig files, reads from each cluster and figures used for html generation
Ilya Kirov [email protected]
Elizaveta Kolganova [email protected]
The project was financially supported by Russian Foundation for Basic Research (RFBR project № 17-00-00336)
This project is licensed under the MIT License