
๐Ÿน MARMOT

Metagenomic Alignment and Reporting for Monitoring Of Threats MARMOT is a modular pipeline for detecting potential pathogens from Nanopore sequencing data. It performs quality filtering, alignment, and summary reporting โ€” making it easy to go from raw reads to actionable microbial insights.

Updating the Reference Database

The emergent pathogen database should be updated regularly; the latest update was on 08/05/24. The reference database can be updated with the build_reference_database.sh script (Conda environment: pathogen_database), usually on a local machine with internet access. These scripts can be found at https://github.com/Mia-FGB/MARMOT_ref_db/tree/main

Usage

To use the pipeline, ensure the config_template.sh script is configured correctly with the following parameters:

Config file

  • sample: Name of the sample being processed; used for labelling and organisation. Example: CF_2023
  • location: Path to raw / rebasecalled reads. Should be the level above fastq. Example: /ei/projects/9/9742f7cc-c169-405d-bf27-cd520e26f0be/data/raw/RL_24hCubTests_21092023/RL_24hCubTests_21092023/20230921_1707_X4_FAW72641_f261fc8c
  • filter_length: Reads shorter than this length will not be included in the analysis. Example: 300
  • reference_database: Path to the reference database generated by build_reference_database.sh. Example: /ei/projects/9/9742f7cc-c169-405d-bf27-cd520e26f0be/data/results/phibase/pathogen_database_080524.fa
  • scratch_dir: A directory for storing large temporary files; can be the same as output_dir. Example: /ei/projects/9/9742f7cc-c169-405d-bf27-cd520e26f0be/scratch/pipeline_test
  • output_dir: Directory for output files. Example: /ei/projects/9/9742f7cc-c169-405d-bf27-cd520e26f0be/scratch/pipeline_test
  • barcode_list: List of barcodes to process; can be generated with seq or given as an explicit list. Example: $(seq -w 01 88) or ("11" "02" "05" "08")
  • concatenated: Set to "yes" if the reads are already concatenated into one file, else "no". Example: yes
  • contig_stats: Set to "yes" to calculate contig stats for the barcode, else "no". Example: yes
  • genome_lengths_file: File with TaxaIDs and genome lengths, generated by build_reference_database.sh. Example: /ei/projects/9/9742f7cc-c169-405d-bf27-cd520e26f0be/data/results/phibase/080524_genome_lengths.tsv
  • risk_table_file: File with DEFRA risk information, generated by generate_risk_table.py after creating the database. Example: /ei/projects/9/9742f7cc-c169-405d-bf27-cd520e26f0be/data/results/Pathogen_Database/Pathogen_Database_042025_v2/risk_table.csv
  • barcode_labels: Tab-delimited file with barcode number and SampleID, used as labels in graph creation. Example: /ei/projects/9/9742f7cc-c169-405d-bf27-cd520e26f0be/data/results/Experiment/barcode_labels.tsv
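Filled in, a config might look like the following minimal sketch. All paths and values here are placeholders, not real project paths:

```shell
# config_template.sh -- example values only; replace every path with your own
sample="CF_2023"
location="/path/to/run_dir"                        # directory above fastq_pass/
filter_length=300
reference_database="/path/to/pathogen_database.fa" # from build_reference_database.sh
scratch_dir="/path/to/scratch"
output_dir="/path/to/output"
barcode_list=($(seq -w 01 88))                     # or explicit: ("02" "05" "08" "11")
concatenated="no"
contig_stats="yes"
genome_lengths_file="/path/to/genome_lengths.tsv"
risk_table_file="/path/to/risk_table.csv"
barcode_labels="/path/to/barcode_labels.tsv"
```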

Submitting the Job Array

After configuring the file, submit your batch using the wrapper script:

./submit_with_config.sh config_template.sh

This wrapper will:

  • Determine the number of barcodes from your config
  • Create and submit a SLURM job array with the correct range (e.g. --array=01-88)
  • Organise log files under logs/<job_name>/
  • Submit the write_files.sh script to run after all the barcodes are complete, which populates the summarised files
  • Create plots from the summarised output with the create_risk_plots.R script
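The array-range step can be sketched roughly as follows. This is a minimal sketch only: the variable names are illustrative and the wrapper's actual internals may differ.

```shell
#!/usr/bin/env bash
# Derive a SLURM array spec from the configured barcode list (sketch).
barcode_list=($(seq -w 01 88))      # as set in the config file
n=${#barcode_list[@]}
array_spec="--array=${barcode_list[0]}-${barcode_list[n-1]}"
echo "$array_spec"                  # prints --array=01-88
# A real submission would then look something like:
# sbatch "$array_spec" submit_batch_jobs.sh config_template.sh
```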

Output Structure

The pipeline generates the following output:

  • A directory for each barcode

    • Containing result files
    • Log directory with output/error messages for the individual jobs.
    • Created in the specified output directory
  • A log directory storing output/error messages for each submission batch job.

    • Created where the job is submitted from
  • Above the barcode directories:

    • In the specified output directory there are a number of summary files
    • lcaparse_summary.txt: Contains the Barcode, Read Count and % of reads assigned to each taxon
    • lcaparse_perread.txt: Read-level taxonomy assignments
    • percent_reads_retained_length_filter.txt: Percentage of reads retained after the removal of reads below the length filter threshold
    • no_fail_reads.txt: Number of fail reads per barcode from the nanopore sequencing run
      • Only created if a fastq_fail directory is in the location input
    • genome_coverage_all.txt: Genome coverage percentages per taxon
    • read_numbers.tsv: Total and filtered read counts per barcode
  • Graphical outputs: different plots are included in the defra_risk/, pathogen_graphs/, and heatmap/ folders, described in more detail below.

Example Output Files for a Barcode Directory

Each barcode directory contains the following example files:

  • 02_barcode_percent_retained.txt: Percentage of reads retained after filtering.
  • 02_contig_stats_filtered_300.txt: Contig statistics for reads filtered to a minimum length of 300bp.
  • 02_read_no.tsv: Total pass reads before and after filtering.
  • 02_mapped.paf: Mapped reads in Pairwise Alignment Format (PAF).

Scripts Overview

submit_with_config.sh

Creates a temporary script using the config file to submit submit_batch_jobs.sh with specified parameters. Also submits the write_files.sh script to run once all the barcodes are processed.

Inputs:

  • Config file

submit_batch_jobs.sh

Runs single_barcode_process.sh for each barcode provided.

Outputs:

  • Unzipped fastq file of all pass reads
  • Unzipped fastq file of all reads > filter length

single_barcode_process.sh

Submits a sequence of sbatch jobs in the following order:

  1. prep_reads.sh
  2. minimap2
  3. paf_parse.py
  4. run_lcaparse.sh
  5. pathogen_genome_coverage_from_paf.py

Inputs:

Provided via submit_batch_jobs.sh:

  • Barcode number
  • Location
  • Filter length

Outputs:

  • All outputs from prep_reads.sh, minimap2, and paf_parse.py
  • percent_retained_results.txt: Percentage of reads retained after filtering
  • fail_number_all.txt: Number of failed reads per barcode
  • number_ignored_reads_from_parse.txt: Reads ignored due to multiple mapping or barcode issues
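The chain of steps above can be sketched with SLURM's afterok dependencies. This is a sketch only: the step-wrapper names after prep_reads.sh are illustrative stand-ins, and sbatch is stubbed so the snippet runs off-cluster.

```shell
#!/usr/bin/env bash
# Sketch: submit the five per-barcode steps as a SLURM dependency chain.
# Remove this stub on the cluster; the real `sbatch --parsable` prints the job ID.
sbatch() { echo "100"; }

barcode="02"; location="/path/to/run"; filter_length=300
prev=""

submit() {   # submit one step, waiting on the previous job if there is one
  local dep=""
  [ -n "$prev" ] && dep="--dependency=afterok:$prev"
  prev=$(sbatch --parsable $dep "$@")
}

submit prep_reads.sh    "$barcode" "$location" "$filter_length"
submit run_minimap2.sh  "$barcode"   # minimap2 alignment (wrapper name illustrative)
submit run_paf_parse.sh "$barcode"   # paf_parse.py (wrapper name illustrative)
submit run_lcaparse.sh  "$barcode"
submit run_coverage.sh  "$barcode"   # pathogen_genome_coverage_from_paf.py (illustrative)
echo "last submitted job: $prev"
```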

prep_reads.sh

Unzips and concatenates pass reads, generates contig statistics, and applies length filtering.

Inputs:

  • Barcode
  • Raw read location
  • Filter length (e.g., 300)

Outputs:

Located in barcodeXX directory:

  • XX_contig_stats.txt
  • XX_barcode.fastq (all pass reads)
  • XX_num_fail.txt (number of failed reads; only produced if there is a fastq_fail directory)

In scratch_dir, above the barcode directory:

  • XX_barcode_300bp.fastq (reads filtered to >300bp)
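The core length-filtering step can be sketched with plain awk, assuming standard 4-line FASTQ records. This is a minimal approximation: the real prep_reads.sh also produces contig statistics and fail-read counts, which are omitted here.

```shell
#!/usr/bin/env bash
# Minimal sketch of the length-filtering step in prep_reads.sh.
barcode="02"; filter_length=300
mkdir -p "barcode${barcode}"

# Tiny demo FASTQ standing in for the concatenated pass reads:
# one 350 bp read (kept) and one 4 bp read (dropped).
long=$(printf 'A%.0s' $(seq 350)); qual=$(printf 'I%.0s' $(seq 350))
{
  printf '@read1\n%s\n+\n%s\n' "$long" "$qual"
  printf '@read2\nACGT\n+\nIIII\n'
} > "barcode${barcode}/${barcode}_barcode.fastq"

# Keep only records whose sequence line is at least filter_length bases long.
awk -v min="$filter_length" \
    'NR%4==1{h=$0} NR%4==2{s=$0} NR%4==3{p=$0}
     NR%4==0{ if (length(s) >= min) print h "\n" s "\n" p "\n" $0 }' \
    "barcode${barcode}/${barcode}_barcode.fastq" \
    > "barcode${barcode}/${barcode}_barcode_${filter_length}bp.fastq"
```

After running, the filtered file holds a single 4-line record (read1); the short read2 is gone.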

paf_parse.py

Processes mapped .paf files, filtering out low-quality mappings (MQ < 5) and resolving reads that map to multiple references.

Inputs:

  • XX_mapped.paf

Outputs:

  • XX_taxaID_count.txt (tab-delimited TaxaID & counts)
  • XX_ignored_reads_query_ID.txt (query, MQ values, multiple Taxa IDs)
  • XX_number_ignored_reads.txt (number of ignored reads)
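The core filter can be sketched in awk form: discard mappings whose mapping quality (PAF column 12) is below 5 and count hits per target name (column 6), assuming the target name carries the TaxaID. The real script additionally resolves reads mapping to multiple TaxaIDs; that logic is not reproduced here, and the taxa_* names below are made up.

```shell
#!/usr/bin/env bash
# Build a three-line demo PAF file (fields are tab-separated; col 12 is MAPQ).
printf 'read1\t400\t0\t390\t+\ttaxa_1234\t50000\t100\t490\t380\t390\t60\n'  > demo.paf
printf 'read2\t350\t0\t300\t+\ttaxa_1234\t50000\t0\t300\t290\t300\t3\n'   >> demo.paf
printf 'read3\t500\t0\t480\t+\ttaxa_5678\t60000\t10\t490\t470\t480\t30\n' >> demo.paf

# Keep mappings with MAPQ >= 5 and tally hits per target.
awk -F'\t' '$12 >= 5 {count[$6]++}
            END {for (t in count) print t "\t" count[t]}' demo.paf | sort
```

read2's mapping quality of 3 falls below the threshold, so each taxon ends up with one counted read.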

create_risk_plots.R

An R script to create plots from the pipeline summary outputs. To run it on the HPC you will need to set up an environment:

  1. Install Mamba (if not already installed). For detailed instructions on installing Mamba on the NRP HPC, see the NRP Research Computing guide to Using Mamba.

  2. Create the environment from the environment.yml file:

mamba env create -f environment.yml

  3. Activate the environment:

mamba activate r-marmot_env

Inputs:

  • Path to the output directory (from the config), which contains these files generated by the earlier write_files.sh script:
    • lcaparse_perread.txt: Read-level taxonomy assignments.
    • genome_coverage_all.txt: Genome coverage percentages per taxon.
    • read_numbers.tsv: Total and filtered read counts per barcode.
  • Risk table, generated with the generate_risk_table.py script while the reference database is created.

Outputs:

Summary tables (in the input directory):

  • genus_summary.tsv: Normalized genus-level read counts per barcode.
  • RedRisk_ReadIDs_all.tsv: All read IDs matching high-risk species.
  • RedRisk_ReadIDs_noWidespread.tsv: As above, but excluding widespread UK-present species.
  • RedRisk_Species_Summary.tsv: Read counts and genome coverage for red-risk species.

Plots and graphs (inside Graphs/):

defra_risk/:

  • HP100k_risk.svg: Stacked bar chart of reads per 100k for species with DEFRA-assigned risk categories.
  • HP100k_all.svg: Includes all species, even those without DEFRA classifications.
  • Filtered_HP100k_risk.svg: Same as above but normalized to filtered read count.
  • Filtered_HP100k_all.svg: All species, normalized to filtered read count.
  • *_facet.svg variants: Same as above, but faceted by presence in the UK.

pathogen_graphs/:

  • Individual bar plots per target genus (e.g., Fusarium_HP100k.svg) showing abundance across barcodes.
  • Faceted plots for all genera: facet_pathogen_HP100k_fixed.svg, facet_pathogen_Filtered_HP100k_free_y.svg
  • Genome coverage plots: one per genus, with error bars.
  • Faceted coverage plots across genera: facet_coverage_fixed.svg, facet_coverage_free_y.svg

heatmap/:

  • Heatmaps of genus-level abundance: genus_heatmap_HP100k_min50.svg, genus_heatmap_Filtered_HP100k_min100.svg
  • Colors are scaled log-wise using the viridis palette.
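The reads-per-100k normalisation behind the HP100k plots can be sketched as follows. The counts and the two-decimal rounding are illustrative; the R script's exact handling may differ.

```shell
#!/usr/bin/env bash
# Reads-per-100k normalisation (sketch): scale a taxon's assigned read count
# by the barcode's total (or, for Filtered_* plots, filtered) read count.
#   per_100k = assigned_reads * 100000 / total_reads
assigned=57; total=1200000   # illustrative counts, not real data
per_100k=$(awk -v a="$assigned" -v t="$total" 'BEGIN { printf "%.2f", a * 100000 / t }')
echo "$per_100k"             # 4.75
```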

License & Citation

Please cite this repository if you use it in your research. License details can be found in the repository.

For any issues or contributions, please submit a GitHub issue or pull request.
