MoGAAAP (Modular Genome Assembly, Annotation and Assessment Pipeline)

This repository contains a Snakemake pipeline for the assembly, annotation and quality assessment of HiFi-based assemblies. Although developed for a project in lettuce, the pipeline is designed to work with any eukaryotic organism. The pipeline will work with HiFi, ONT data and Hi-C, although only HiFi is required.

Additionally, MoGAAAP is set up in a modular way, allowing for any combination of assembly, annotation and quality assessment steps. Therefore, MoGAAAP can also be used to e.g. only assess the quality of already assembled genomes, or only to annotate an unannoated genome assembly.

A test dataset is provided in the test_data/ directory, including instructions.

Setup

To install MoGAAAP, you'll need a Linux machine with conda installed.

Conda

If not installed already, conda can be installed by following these instructions:

# install miniforge
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh
# follow instructions and let it run `conda init`

# set default channels
source ~/.bashrc
conda config --set auto_activate_base false
source ~/.bashrc
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict

Install MoGAAAP

MoGAAAP can be installed using bioconda as follows:

conda create -c conda-forge -c bioconda -n mogaaap mogaaap

Click here for manual installation

If you prefer to install MoGAAAP manually, you can do so by following these instructions:

conda create -c conda-forge -c bioconda -n mogaaap snakemake=8 apptainer=1.3 poetry
conda activate mogaaap

Now that the dependencies are installed, the pipeline can be installed via:

git clone https://github.com/dirkjanvw/MoGAAAP.git
cd MoGAAAP/
poetry install

Then make sure to activate the environment before running the pipeline:

conda activate mogaaap

To check if MoGAAAP is installed correctly, run:

MoGAAAP --help

Note

MoGAAAP uses apptainer for handling some software dependencies. Although apptainer has been installed as part of the conda environment, there are some environment variables that need to be set for it to work correctly. This has to be set in your .bashrc (don't forget to source the file after changing):

APPTAINER_BIND: To bind the paths inside the container to the paths on your system; make sure all relevant paths are included (working directory, database directory, etc.).

Optionally, you can also set these:

APPTAINER_NV: To use the GPU inside the container; only required if you have a GPU.
APPTAINER_CACHEDIR: To store the cache of the container outside of your home directory.

More information can be found here.

Download databases

Next, download the databases that are required for the pipeline to run. Please be aware that these databases are large and require a lot of storage space. At the time of writing (February 2025), the total size of the databases is around 900GB.

Database	Download instructions
GXDB database	Follow "Download the database" instructions on FCS GitHub wiki (I only tested the Cloud instructions)
Kraken2 nt database	Download `core_nt` from this list
OMA database	Download `LUCA.h5` from this list

A helper script is provided to automate the download of the databases. As the locations of the databases are hardcoded, please leave an issue on the GitHub page if any of the locations no longer works.

MoGAAAP download_databases -d databases/ all

Initialise the pipeline

Before running the pipeline, the pipeline has to be initialised. This will create a working directory for MoGAAAP to run in.

MoGAAAP init -d working_directory

In the rest of this README, we will use working_directory as the working directory.

Configure the pipeline

Sample sheet

Before configuring the pipeline, a sample sheet (in TSV format) has to be created. In this TSV file, each row represents a sample. An example of this file can be found in working_directory/config/example.tsv (and more examples are available in working_directory/config/examples/) but it is recommendable to create a new TSV file with the same header. The sample TSV sheet has the following columns to fill in (one row per sample):

Column name	Required?	Description
`accessionId`	Required	The accession ID of the sample. This name has to be unique.
`assemblyLocation`	`*`	The path to a scaffolded assembly in FASTA format. Pipeline will skip assembly if this is provided.
`annotationLocation`	Optional	The path to an annotation in GFF3 format. Only allowed if `assemblyLocation` is provided. Pipeline will skip annotation if this is provided.
`hifi`	`*`	The path to the HiFi reads in FASTQ or FASTA format. Multiple libraries can be provided by separating them with a semicolon.
`ont`	Optional	The path to the ONT reads in FASTQ or FASTA format. Multiple libraries can be provided by separating them with a semicolon.
`illumina_1`	Optional	The path to the forward Illumina reads in FASTQ format.
`illumina_2`	Optional	The path to the reverse Illumina reads in FASTQ format.
`hic_1`	Optional	The path to the forward Hi-C reads in FASTQ format.
`hic_2`	Optional	The path to the reverse Hi-C reads in FASTQ format.
`haplotypes`	Required	The expected number of haplotypes in the assembly. Use 1 for (near) homozygous accessions and 2 for heterozygous accessions. NB: currently only 1 or 2 is supported.
`speciesName`	Required	A name for the species that is used by Helixer to name novel genes.
`taxId`	Required	The NCBI taxonomy ID of the species.
`referenceId`	Required	A unique identifier for the reference genome for which genome (FASTA), annotation (GFF3) and chromosome names are provided in the `config/config.yaml` file.

*: At least one of these fields is required.

NB: If the assemblyLocation is provided, no assembly will be performed and the provided assembly will be assumed to be a scaffolded assembly. Therefore, hifi and illumina_1/illumina_2 will only be used for the quality assessment module, and ont and hic_1/hic_2 are not allowed.

Configuration YAML

We can now create the configuration YAML file for MoGAAAP. An example of this file is provided in working_directory/config/example.yaml and also see working_directory/config/examples/ for more examples of filled-in configuration files. There are two ways of generating a configuration YAML file:

Copy the example file and fill in the fields.
Run the following command to generate a configuration YAML file:

MoGAAAP configure  # see --help for all options that are required

While the command can be a great help, it is recommended to always double-check the generated configuration YAML file and compare it to the example file.

Run the pipeline

Available modules

Several modules are available in this pipeline (will be referred to later as ${MODULE}):

assemble: This module will assemble the reads into contigs and subsequently scaffold the contigs using ntJoin against a provided reference genome.
- contig: This is the first step of the assemble module and will only assemble the reads into contigs. Only use this if you want to inspect the contig assembly before scaffolding.
annotate: This module will generate a provisional annotation of the assembly using liftoff and helixer, and optionally analyse user-defined queries.
- annotate_genes: This only performs the gene annotation step of the annotate module without performing custom analysis.
- annotate_custom: This only performs the custom analysis step of the annotate module without performing gene annotation.
qa: This module will perform quality assessment of the scaffolded assembly and the provisional annotation.
all: This module will run all the above modules (DEFAULT).

It is advisable to run the pipeline module by module for a new set of assemblies and critically look at the results of each module before continuing. All modules except for annotate have visual output that can be inspected in an HTML report file (see at Reporting). For more information about these modules, see Explaining the pipeline.

Running the pipeline

Before running the pipeline, it is good practice to do a dry-run to check if everything is set up correctly:

MoGAAAP run -d working_directory ${MODULE} -n

If no errors are shown, MoGAAAP can be run using the following command:

MoGAAAP run -d working_directory ${MODULE}

It is generally recommendable to set the --cores and --memory flags to what is available on your system. It is important to realise that Snakemake cannot enforce these limits, so it is up to the user to make sure that the limits are not exceeded. In general, the number of cores will generally not exceed the limit, but for memory it is advisable to set the limit to a value that is lower than the available memory on your system (about half). Please also check out the other options available by running MoGAAAP run --help.

Note

Please note that the pipeline can also be run manually by running the Snakemake command directly within the working directory. If you're familiar with the Snakemake CLI, this is a good way to use MoGAAAP after the initial setup. A good starting point is to run snakemake -n to see what the pipeline would do without actually running it.

Output

Next to the report generated by Snakemake, the most important outputs of the pipeline are the genome assembly and annotation. These can be found in the directory working_directory/final_output. (All temporary files are stored in the working_directory/results directory.)

The working_directory/final_output directory contains the following files:

File name	Description
`${accessionId}.contigs.fa`	The contigs produced by the `assemble` module.
`${accessionId}.full.fa`	The scaffolded assembly produced by the `assemble` module.
`${accessionId}.nuclear.fa`	The scaffolded assembly produced by the `assemble` module, but only nuclear contigs as obtained from the `annotate` module.
`${accessionId}.${organelle}.fa`	The scaffolded assembly produced by the `assemble` module, but only organellar contigs as obtained from the `annotate` module.
`${accessionId}.full.gff`	The provisional annotation produced by the `annotate` module; belongs to `${accessionId}.full.fa`.
`${accessionId}.full.coding.gff`	The provisional annotation produced by the `annotate` module, but only coding genes; belongs to `${accessionId}.full.fa`.

Explaining the pipeline

Assembling a genome from raw data to a final usable resource is a process that is hard to automate. We believe that this process always necessesitates human curation. However, large parts can easily be automated, which is why we created this pipeline. This pipeline performs the assembly, scaffolding and renaming of genomic data as well as an initial provisional structural annotation. Importantly, both genome assembly and annotation are subjected to quality assessment, providing a direct starting point for the curation of the assembly. Furthermore, each part of the process (module) can be run separately after which its output can be inspected before continuing to the next step.

graph TD;
    assemble-->annotate;
    assemble-->qa;
    annotate-->qa;

Assemble module

Contigging

In the assemble module, HiFi reads are assembled using hifiasm. If ONT reads are given, these are used in the assembly process using the --ul parameter of hifiasm. Also, if Hi-C reads are given, these are used in the assembly process. Since the output of hifiasm is a GFA file, we next convert the (consensus) primary contigs GFA to a FASTA file. In case the user has indicated that the accession is heterozygous, the two haplotype assemblies as outputted by hifiasm are converted to FASTA files instead. Finally, we produce an alignment of the (contig) assembly against the provided reference genome using nucmer. To prevent spurious alignments, we slightly increased the -l and -g parameter of nucmer.

As alternative to hifiasm, we also implemented verkko as it is known to work well with HiFi (and ONT and Hi-C) data. For heterozygous accessions, hapdup is used to try separating the haplotypes based on HiFi alignment to the Verkko assembly. Please bear in mind that the pipeline was developed with hifiasm in mind, so although these other assemblers will technically work, the pipeline may not be optimally set up for them. In some preliminary tests, we found that verkko doesn't work well with heterozygous accessions, resulting in a partly phased assembly.

Next steps

Since the next step after assembly is the scaffolding process, there has to be collinearity between the assembly and the reference genome. This can be checked in the dotplot created from the nucmer alignment. If there is no sign of collinearity between the two, reference-guided scaffolding will be hard/impossible. The only solution in that case would be to choose another (more closely related) reference genome.

Scaffolding

Scaffolding is performed using ntJoin, which uses a minimizer-based reference-guided scaffolding method. If by visual inspection collinearity between the assembly and reference genome was found, the scaffolding module generally runs without issues. Should any error occur, please read the corresponding log file of the step that produced the error. In most cases, the error may be resolved by choosing different values for the ntjoin_k and ntjoin_w in the configuration YAML file. In our experience, increasing the value for ntjoin_w resolves most issues when no correct scaffolding is produced. Alternatively, scaffolding can be done using ragtag by changing the scaffolder parameter in the configuration YAML file.

After scaffolding, the sequences in the scaffolded assembly are renamed to reflect their actual chromosome names according to the reference genome. Finally, nucmer is run again to produce an alignment plot for visual inspection of the scaffolding process. If Hi-C reads were provided, the Hi-C contact map is also produced for visual inspection of the ntJoin scaffolding process.

NB: It's important to stress that Hi-C reads are not used in the scaffolding process itself, but only for visual inspection. The reason for this is that none of the currently available algorithms for Hi-C scaffolding can guarantee a correct assembly, and we believe that the reference-guided scaffolding is more reliable for automated pipelines.

Next steps

As the assembly as outputted by this module is used as starting point for the annotate and qa modules, it is crucial it matches the expectations in terms of size and chromosome number. Please carefully look at the nucmer alignment plot to check that the assembly looks as expected before continuing to a next module.

Annotate module

Gene annotation

Proper evidence-based structural gene annotation would take too long and is a problem that is not solved for fast automation yet. Therefore, we implemented a "quick-and-dirty" provisional annotation in this pipeline by combing the results of liftoff and helixer. helixer will run on the GPU if it's available, otherwise it will run on CPU (which is known to be a lot slower). In case of overlap in features between liftoff and helixer, we take the liftoff annotation.

Next steps

Although this module generally runs for the longest time, no visual output is produced; only GFF3 file (one unfiltered and one with only coding genes) belonging to the scaffolded assembly. This GFF3 file, together with the FASTA file from the scaffold module are the only inputs for the final module: QA.

Custom annotation

The purpose of the analyse module is to create a genome-wide overview of the newly created assembly. It does this by running several user-defined queries (including organelle search) against the assembly using BLAST as well as a search for the telomere repeat sequence (please see this note in the FAQ for more information on telomere identification). The results of these queries are visualised in an HTML report file. Next to this, the user-defined queries are used to create a template circos configuration and plot.

Next steps

Typically, user-defined queries are used to either identify unwanted sequences (known contaminants or organelles) or to identify sequences that are expected to be present in the assembly (e.g. resistance genes). The resulting tables in the report may be used to filter out unwanted sequences from the assembly, and to check the quality of the assembly based on the expected sequences (such as making sure the telomere repeat sequence is present at the tails of the chromosomes).

Circos plots are notoriously hard to automate and that is no different for this pipeline. Although a circos plot should always be produced, it typically doesn't look quite right yet. Feel free to copy the config files produced by this pipeline and adjust to your own plotting needs.

QA module

Overview

This final quality assessment module is the most important for human curation of the genome. The quality assessment steps in this module can be roughly divided into two categories: individual and grouped. Individual quality assessment steps include k-mer completeness (merqury), k-mer contamination (kraken2), NCBI contamination (fcs-gx), adapter contamination (fcs-adaptor) and read mapping (bwa-mem2). Grouped quality assessment steps include BUSCO completeness (busco), OMA completeness (omark), k-mer distances (kmer-db), mash distances (mash), minimizer collinearity (ntsynt), k-mer phylogeny (SANS), k-mer pangenome growth (pangrowth), gene pangenome growth (pantools) and general statistics. These groups are meant to give a comparative overview of the assembly and annotation. Any groups can be defined in the configuration YAML file and a genome may occur in multiple groups.

Next steps

The report (see Reporting) produced by this module is the most useful output of the pipeline for human curation. It contains visual output for each of the quality assessment steps performed in this module including a description on how to interpret the results. It also calculates various statistics that are reported on in the report. Importantly, the qa module does not do any filtering of the assembly or annotation, only reporting. Next steps could include (but are not limited to) removal of contaminants, discovery of sample swaps, cleaning the input data, etc.

Citation

If you use MoGAAAP in your work, please cite this work as:

in prep.

FAQ

Q: Where can I see what my pipeline is doing?

A: The pipeline will print the commands it is running to the terminal. Alternatively, you could look at htop or top to see what processes are running.

Q: Pipeline crashes at renaming the chromosomes

A: This issue typically arises when the assembly and reference genome are not collinear because of an evolutionary distance that is too large. In this case, the pipeline is not able to accurately discern which reference chromosome corresponds to which assembly scaffold. This lack of collinearity should also be visible in the contig MUMmerplots created by the assemble module. The only solution within MoGAAAP in this case would be to choose another (more closely related) reference genome and re-run the assemble module; or use ragtag as scaffolder instead of ntJoin.

Q: Pipeline creates scaffolds that are obviously wrong

A: This issue can have multiple causes, but the most common one is that the ntJoin parameters are not set correctly. From our own experience, increasing the value for ntjoin_w resolves most issues when no correct scaffolding is produced. Alternatively, scaffolding can be done using ragtag by changing the scaffolder parameter in the configuration YAML file. Also, it is important to keep in mind that this pipeline is not meant to create a perfect assembly, but to provide a starting point for human curation. So feel free to adjust the pipeline or the assembly to your own needs!

Q: Pipeline keeps crashing at kraken2

A: This is likely due to not enough available memory. Kraken2 is a memory-heavy process that needs to fit its entire database in memory. This requires up to 500 GB of RAM that has to be available to the process.

Q: Pipeline crashes mid-run

A: This is intended Snakemake behaviour: if any job fails, the pipeline will only finish current running jobs and then exit. As for the reason of stopping, please check the log file of the job that failed. The name of the log file will be printed in the terminal output of the pipeline. If the error is not clear, please open an issue on this GitHub page.

Q: The pipeline cannot find software X

A: This would indicate an error on our side of determining the correct dependencies for the pipeline. If the error persists, please report it as an issue on this GitHub page.

Q: A job that uses singularity fails for no apparent reason

A: This is likely due to missing environment variables for Singularity/Apptainer. See the note under Install MoGAAAP for more information on how to set these environment variables.

Q: Report HTML cuts off the top of the page

A: This is a known issue of the Snakemake report HTML that should only happen when running MoGAAAP manually with snakemake. The wrapper MoGAAAP command should not have this issue. The current workaround is to run:

sed -E 's/([^l]) h-screen/\1/g' report.html > report_fixed.html

Q: My chromosomes are not named correctly

A: Please double check the names of the chromosomes in the reference genome you provided and the names of the chromosomes in the configuration YAML file. We use strict matching to rename the chromosomes, so the names have to be exactly the same.

Q: Should I use BLASTN or seqtk for the telomere search?

A: While seqtk is more accurate in the boundaries of the telomere search, it cannot identify telomeres that are not at the ends of the chromosomes. Therefore, we recommend to also run BLASTN with a fasta file containing 100x the telomere repeat sequence for identification of telomeres that are not at the ends of the chromosomes.

Contact

If the above information does not answer your question or solve your issue, feel free to open an issue on this GitHub page.

Name		Name	Last commit message	Last commit date
Latest commit History 726 Commits
.github/workflows		.github/workflows
.idea		.idea
MoGAAAP		MoGAAAP
test_data		test_data
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

License

dirkjanvw/MoGAAAP

Folders and files

Latest commit

History

Repository files navigation

MoGAAAP (Modular Genome Assembly, Annotation and Assessment Pipeline)

Index

Setup

Conda

Install MoGAAAP

Download databases

Initialise the pipeline

Configure the pipeline

Sample sheet

Configuration YAML

Run the pipeline

Available modules

Running the pipeline

Output

Explaining the pipeline

Assemble module

Contigging

Next steps

Scaffolding

Next steps

Annotate module

Gene annotation

Next steps

Custom annotation

Next steps

QA module

Overview

Next steps

Citation

FAQ

Q: Where can I see what my pipeline is doing?

Q: Pipeline crashes at renaming the chromosomes

Q: Pipeline creates scaffolds that are obviously wrong

Q: Pipeline keeps crashing at kraken2

Q: Pipeline crashes mid-run

Q: The pipeline cannot find software X

Q: A job that uses singularity fails for no apparent reason

Q: Report HTML cuts off the top of the page

Q: My chromosomes are not named correctly

Q: Should I use BLASTN or seqtk for the telomere search?

Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 13

Packages 0

Uh oh!

Uh oh!

Languages

Packages