This is an integrated pipeline for eukaryotic genome assembly and gene annotation. It currently supports PacBio HiFi reads and RNA-seq reads, both of which are required as inputs. See this page for details on the expected outputs.
Before running the workflow, make sure the following software is installed:
Follow the steps below to set up and run the workflow:
Clone this repository to your local machine:
git clone https://github.com/mkrg01/genome_assembly_pipeline.git
cd genome_assembly_pipeline
See config/README.md for details on preparing input files and adjusting configuration parameters.
Run the workflow from the repository root directory. Replace /path/to/repo with the actual path to your local repository:
cd /path/to/repo
snakemake --sdm conda apptainer --singularity-args "--bind $(pwd)" --cores 64 all
Tip
You can run the pipeline in stages. Replace all in the command above with one of the targets below.
assembly_all: Runs rules up to the generation of the Hifiasm assembly and its associated metrics.
remove_organelle_all: Runs rules up to the organelle removal step and its associated metrics.
remove_contamination_all: Runs rules up to the contamination removal step by FCS and its associated metrics.
softmask_all: Runs rules up to softmasking by RepeatMasker.
gene_prediction_all: Runs rules up to gene prediction and related metrics (equivalent to all).
You do not need to start from the first stage. For example, if you run remove_contamination_all first, the rules belonging to assembly_all and remove_organelle_all are executed automatically because the later stage depends on their outputs.
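A staged run can be sketched in the shell as follows. The function name build_cmd is hypothetical and not part of this repository. It only checks the requested target against the names listed above and prints the corresponding command, with the same options as the main command, so that it can be reviewed before it is run.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository): print the Snakemake
# command for one of the stage targets listed in the tip above.
build_cmd() {
    local target="${1:-all}"   # stage target; defaults to the full pipeline
    local cores="${2:-64}"     # core count; defaults to the value used above
    case "$target" in
        all|assembly_all|remove_organelle_all|remove_contamination_all|softmask_all|gene_prediction_all)
            ;;
        *)
            echo "unknown target: $target" >&2
            return 1
            ;;
    esac
    # Same options as the main command; only the core count and target vary.
    printf 'snakemake --sdm conda apptainer --singularity-args "--bind %s" --cores %s %s\n' \
        "$(pwd)" "$cores" "$target"
}

# Print the command for the contamination-removal stage.
build_cmd remove_contamination_all 64
```

Appending Snakemake's --dry-run option (short form -n) to the printed command should list the rules that would be executed without running them, which is a convenient way to confirm that the earlier stages are pulled in automatically.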
Note
Adjust the --cores value based on your available computational resources.
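One way to choose the value is to cap the requested core count at the number of CPUs the machine reports. The sketch below assumes a Linux host with GNU coreutils (for nproc). The function name pick_cores is hypothetical and not part of this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository): return the requested
# core count, capped at the number of CPUs available to this process.
pick_cores() {
    local requested="$1"
    local available
    available="$(nproc)"   # GNU coreutils; assumes a Linux host
    if [ "$requested" -gt "$available" ]; then
        echo "$available"
    else
        echo "$requested"
    fi
}

# Ask for 64 cores but never more than the machine provides,
# then print the resulting command for review.
cores="$(pick_cores 64)"
echo "snakemake --sdm conda apptainer --singularity-args \"--bind $(pwd)\" --cores $cores all"
```

Snakemake also accepts --cores all to use every available core, which may be simpler when the machine is dedicated to this run.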
Note
All rules run in containers except those that invoke the FCS wrapper scripts (fcs.py, run_fcsadaptor.sh). The wrapper scripts themselves are launched outside a container, but they internally call the main FCS functions, which are executed inside containers.
The output will be generated in the results directory.