This is a sequencing data processing pipeline, mainly built for single-tube Long Fragment Read (stLFR) technology.
Benchmark repository: benchmark4stcLFR
Note: biogit is an internal website and is currently accessible only from the intranet.
Prerequisites
- python >= 3.6
- perl >= 5
- metabbq ( dev repo ) - "METAgenome Bead Barcode Quantification", a launcher that initializes the working directory and calls sub-functions.
- cOMG ( dev repo ) (replaced by fastp)
- fastp ( dev repo ) - dev version mandatory, since I've added a new module to fastp to handle the barcode-splitting process
- Mash ( dev repo ) - dev version mandatory, since I've modified it to fit stLFR data
- Community ( dev repo ) - Louvain method: Finding communities in large networks
- Snakemake - a pythonic workflow system.
- blast - The classic alignment tool for finding regions of similarity between biological sequences.
- Assembly tools (MEGAHIT, IDBA, or SPAdes, depending on the assembler you select)
I recommend installing the above tools in a virtual environment via conda:
- Create the environment and install some of them:
conda create -n metaseq -c bioconda -c conda-forge snakemake pigz megahit blast
source activate metaseq
- According to the corresponding documentation, install fastp, SPAdes, community, etc. under the metaseq env.
Make sure the above commands (executables) can be found in the PATH.
Get the launcher: metabbq
- Install the metaSeq pipeline to get metabbq:
cd /path/to/your/dir
git clone https://github.com/ZeweiSong/metaSeq.git
export PATH="/path/to/your/dir/metaSeq":$PATH
I haven't yet written a testing module to check the above prerequisites. For now, you may need to test them yourself.
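Until such a module exists, a small shell check can at least confirm that the executables are visible in the PATH. The following is a minimal sketch, not part of metaSeq: the function name `check_tools` is hypothetical, and the executable names in the example call (`python3`, `blastn`, `mash`, and so on) are assumptions that may differ in your installation. It only tests for presence, not for the minimum versions listed above.

```shell
# Report which of the given executables are missing from PATH.
# Prints "missing: <name>" for each one not found and returns
# non-zero if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Executable names below are assumptions; adjust them to your setup.
check_tools python3 perl snakemake pigz megahit blastn fastp mash metabbq \
    || echo "Some prerequisites were not found in PATH."
```

Run it after activating the metaseq env so that the conda-installed tools are on the PATH.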
Prepare configs
cd instance
metabbq cfg
This command will create a default.cfg in your current directory.
You should modify it to tell the launcher where the required files are and which parameters to use.
Initiating a project
Prepare an input.list file that describes the sample names and the input sequence file paths.
metabbq -i input.list -c default.cfg -V
By default, metabbq will create a directory named {sample} with a sub-directory named input under it.
metabbq smk -j -np {sample}/clean/BB.stat
# -j runs the jobs in parallel on the available cores/threads
# -n means dry-run: a preview of "what needs to be run". Remove it to actually run the pipeline.
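The difference between the preview and the real run can be sketched as a small helper that only prints the command line to use. This is an illustration under assumptions: `bbstat_cmd` and the sample name `sampleA` are hypothetical, and the "run" form is simply the command above with `-n` removed from `-np`.

```shell
# Print the metabbq command for a sample's BB.stat target.
# mode "preview" keeps -n (dry-run); mode "run" drops it.
bbstat_cmd() {
    sample=$1
    mode=${2:-preview}
    if [ "$mode" = "run" ]; then
        flags="-j -p"
    else
        flags="-j -np"
    fi
    printf 'metabbq smk %s %s/clean/BB.stat\n' "$flags" "$sample"
}

# "sampleA" is a placeholder sample name.
bbstat_cmd sampleA preview
bbstat_cmd sampleA run
```

Checking the dry-run output first is a cheap way to confirm that the target path and config are what you expect before committing compute time.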
You need to select an assembly tool in the config file and use the corresponding output file name from the following:
metabbq smk -j -np {sample}/summary.BC.megahit.contig.fasta
metabbq smk -j -np {sample}/summary.BC.idba.contig.fasta
metabbq smk -j -np {sample}/summary.BC.spades.contig.fasta
As above, select an assembly tool in the config file and use the corresponding output file name from the following:
metabbq smk -j -np {sample}/summary.BI.megahit.contig.fasta
metabbq smk -j -np {sample}/summary.BI.idba.contig.fasta
metabbq smk -j -np {sample}/summary.BI.spades.contig.fasta
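The target names above follow one pattern, {sample}/summary.{BC|BI}.{tool}.contig.fasta, so the matching file name can be composed rather than typed by hand. The sketch below assumes only that pattern; `summary_target` and the sample name `sampleA` are hypothetical, and the tool passed in still has to match the assembler selected in the config file.

```shell
# Compose a summary target path from sample, mode (BC or BI),
# and assembler (megahit, idba, or spades).
summary_target() {
    sample=$1
    mode=$2
    tool=$3
    case "$mode" in
        BC|BI) ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    case "$tool" in
        megahit|idba|spades) ;;
        *) echo "unknown tool: $tool" >&2; return 1 ;;
    esac
    printf '%s/summary.%s.%s.contig.fasta\n' "$sample" "$mode" "$tool"
}

# Print the dry-run command for a placeholder sample.
target=$(summary_target sampleA BI spades) && echo "metabbq smk -j -np $target"
```

Rejecting unknown modes and tools up front avoids asking the workflow for a target that none of the listed commands would produce.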
Feedback is welcome; please submit it on the issue page.