
๐Ÿน MARMOT

Metagenomic Alignment and Reporting for Monitoring Of Threats MARMOT is a modular pipeline for detecting potential pathogens from Nanopore sequencing data. It performs quality filtering, alignment, and summary reporting โ€” making it easy to go from raw reads to actionable microbial insights.

Updating the Reference Database

The emergent pathogen database should be updated regularly; the latest update was on 08/05/24. The reference database can be updated with the build_reference_database.sh script (Conda environment: pathogen_database), usually on a local machine with internet access. These scripts can be found at https://github.com/Mia-FGB/MARMOT_ref_db/tree/main

Usage

To use the pipeline, ensure the config_template.sh script is configured correctly with the following parameters:

Config file

  • sample: Name of the sample being processed; used for labelling and organisation. Example: CF_2023
  • location: Path to raw / rebasecalled reads. Should be the level above fastq. Example: /ei/projects/9/9742f7cc-c169-405d-bf27-cd520e26f0be/data/raw/RL_24hCubTests_21092023/RL_24hCubTests_21092023/20230921_1707_X4_FAW72641_f261fc8c
  • filter_length: Reads shorter than this length will not be included in the analysis. Example: 300
  • reference_database: Path to the reference database generated by build_reference_database.sh. Example: /ei/projects/9/9742f7cc-c169-405d-bf27-cd520e26f0be/data/results/phibase/pathogen_database_080524.fa
  • scratch_dir: A directory for storing large temporary files; can be the same as output_dir. Example: /ei/projects/9/9742f7cc-c169-405d-bf27-cd520e26f0be/scratch/pipeline_test
  • output_dir: Directory for output files. Example: /ei/projects/9/9742f7cc-c169-405d-bf27-cd520e26f0be/scratch/pipeline_test
  • barcode_list: List of barcodes to process; can be generated with seq or given as an explicit list. Example: $(seq -w 01 88) or ("11" "02" "05" "08")
  • concatenated: Set to "yes" if the reads are already concatenated into one file, else "no". Example: yes
  • contig_stats: Set to "yes" to calculate contig stats for the barcode, else "no". Example: yes
  • genome_lengths_file: File with TaxaIDs and genome lengths, generated by build_reference_database.sh. Example: /ei/projects/9/9742f7cc-c169-405d-bf27-cd520e26f0be/data/results/phibase/080524_genome_lengths.tsv
  • risk_table_file: File with DEFRA risk information, generated by generate_risk_table.py after creating the database. Example: /ei/projects/9/9742f7cc-c169-405d-bf27-cd520e26f0be/data/results/Pathogen_Database/Pathogen_Database_042025_v2/risk_table.csv
  • barcode_labels: Tab-delimited file with barcode number and SampleID, used as labels in graph creation. Example: /ei/projects/9/9742f7cc-c169-405d-bf27-cd520e26f0be/data/results/Experiment/barcode_labels.tsv
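Filled in, a config might look like the following minimal sketch. All paths and values here are placeholders, not real project paths:

```shell
# config_template.sh -- example values only; replace every path with your own
sample="CF_2023"
location="/path/to/run_dir"                        # directory above fastq_pass/
filter_length=300
reference_database="/path/to/pathogen_database.fa" # from build_reference_database.sh
scratch_dir="/path/to/scratch"
output_dir="/path/to/output"
barcode_list=($(seq -w 01 88))                     # or explicit: ("02" "05" "08" "11")
concatenated="no"
contig_stats="yes"
genome_lengths_file="/path/to/genome_lengths.tsv"
risk_table_file="/path/to/risk_table.csv"
barcode_labels="/path/to/barcode_labels.tsv"
```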

Submitting the Job Array

After configuring the file, submit your batch using the wrapper script:

./submit_with_config.sh config_template.sh

This wrapper will:

  • Determine the number of barcodes from your config
  • Create and submit a SLURM job array with the correct range (e.g. --array=01-88)
  • Organise log files under logs/<job_name>/
  • Submit the write_files.sh script to run after all the barcodes are complete, which populates the summarised files
  • Create plots from the summarised output with the create_risk_plots.R script
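The array-range step can be sketched roughly as follows. This is a minimal sketch only: the variable names are illustrative and the wrapper's actual internals may differ.

```shell
#!/usr/bin/env bash
# Derive a SLURM array spec from the configured barcode list (sketch).
barcode_list=($(seq -w 01 88))      # as set in the config file
n=${#barcode_list[@]}
array_spec="--array=${barcode_list[0]}-${barcode_list[n-1]}"
echo "$array_spec"                  # prints --array=01-88
# A real submission would then look something like:
# sbatch "$array_spec" submit_batch_jobs.sh config_template.sh
```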

Output Structure

The pipeline generates the following output:

  • A directory for each barcode

    • Containing result files
    • Log directory with output/error messages for the individual jobs.
    • Created in the specified output directory
  • A log directory storing output/error messages for each submission batch job.

    • Created where the job is submitted from
  • Above the barcode directories:

    • In the specified output directory there are a number of summary files
    • lcaparse_summary.txt: Contains the Barcode, Read Count and % of reads assigned to each taxon
    • lcaparse_perread.txt: Read-level taxonomy assignments
    • percent_reads_retained_length_filter.txt: Percentage of reads retained after the removal of reads below the length filter threshold
    • no_fail_reads.txt: Number of fail reads per barcode from the nanopore sequencing run
      • Only created if a fastq_fail directory is in the location input
    • genome_coverage_all.txt: Genome coverage percentages per taxon
    • read_numbers.tsv: Total and filtered read counts per barcode
  • Graphical outputs: different plots are included in the defra_risk/, pathogen_graphs/, and heatmap/ folders, described in more detail below.

Example Output Files for a Barcode Directory

Each barcode directory contains the following example files:

  • 02_barcode_percent_retained.txt: Percentage of reads retained after filtering.
  • 02_contig_stats_filtered_300.txt: Contig statistics for reads filtered to a minimum length of 300bp.
  • 02_read_no.tsv: Total pass reads before and after filtering.
  • 02_mapped.paf: Mapped reads in Pairwise Alignment Format (PAF).

Scripts Overview

submit_with_config.sh

Creates a temporary script using the config file to submit submit_batch_jobs.sh with specified parameters. Also submits the write_files.sh script to run once all the barcodes are processed.

Inputs:

  • Config file

submit_batch_jobs.sh

Runs single_barcode_process.sh for each barcode provided.

Outputs:

  • Unzipped fastq file of all pass reads
  • Unzipped fastq file of all reads > filter length

single_barcode_process.sh

Submits a sequence of sbatch jobs in the following order:

  1. prep_reads.sh
  2. minimap2
  3. paf_parse.py
  4. run_lcaparse.sh
  5. pathogen_genome_coverage_from_paf.py

Inputs:

Provided via submit_batch_jobs.sh:

  • Barcode number
  • Location
  • Filter length

Outputs:

  • All outputs from prep_reads.sh, minimap2, and paf_parse.py
  • percent_retained_results.txt: Percentage of reads retained after filtering
  • fail_number_all.txt: Number of failed reads per barcode
  • number_ignored_reads_from_parse.txt: Reads ignored due to multiple mapping or barcode issues
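The chain of steps above can be sketched with SLURM's afterok dependencies. This is a sketch only: the step-wrapper names after prep_reads.sh are illustrative stand-ins, and sbatch is stubbed so the snippet runs off-cluster.

```shell
#!/usr/bin/env bash
# Sketch: submit the five per-barcode steps as a SLURM dependency chain.
# Remove this stub on the cluster; the real `sbatch --parsable` prints the job ID.
sbatch() { echo "100"; }

barcode="02"; location="/path/to/run"; filter_length=300
prev=""

submit() {   # submit one step, waiting on the previous job if there is one
  local dep=""
  [ -n "$prev" ] && dep="--dependency=afterok:$prev"
  prev=$(sbatch --parsable $dep "$@")
}

submit prep_reads.sh    "$barcode" "$location" "$filter_length"
submit run_minimap2.sh  "$barcode"   # minimap2 alignment (wrapper name illustrative)
submit run_paf_parse.sh "$barcode"   # paf_parse.py (wrapper name illustrative)
submit run_lcaparse.sh  "$barcode"
submit run_coverage.sh  "$barcode"   # pathogen_genome_coverage_from_paf.py (illustrative)
echo "last submitted job: $prev"
```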

prep_reads.sh

Unzips and concatenates pass reads, generates contig statistics, and applies length filtering.

Inputs:

  • Barcode
  • Raw read location
  • Filter length (e.g., 300)

Outputs:

Located in barcodeXX directory:

  • XX_contig_stats.txt
  • XX_barcode.fastq (all pass reads)
  • XX_num_fail.txt (number of failed reads; only produced if there is a fastq_fail directory)

In scratch_dir, above the barcode directory:

  • XX_barcode_300bp.fastq (reads filtered to >300bp)
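The core length-filtering step can be sketched with plain awk, assuming standard 4-line FASTQ records. This is a minimal approximation: the real prep_reads.sh also produces contig statistics and fail-read counts, which are omitted here.

```shell
#!/usr/bin/env bash
# Minimal sketch of the length-filtering step in prep_reads.sh.
barcode="02"; filter_length=300
mkdir -p "barcode${barcode}"

# Tiny demo FASTQ standing in for the concatenated pass reads:
# one 350 bp read (kept) and one 4 bp read (dropped).
long=$(printf 'A%.0s' $(seq 350)); qual=$(printf 'I%.0s' $(seq 350))
{
  printf '@read1\n%s\n+\n%s\n' "$long" "$qual"
  printf '@read2\nACGT\n+\nIIII\n'
} > "barcode${barcode}/${barcode}_barcode.fastq"

# Keep only records whose sequence line is at least filter_length bases long.
awk -v min="$filter_length" \
    'NR%4==1{h=$0} NR%4==2{s=$0} NR%4==3{p=$0}
     NR%4==0{ if (length(s) >= min) print h "\n" s "\n" p "\n" $0 }' \
    "barcode${barcode}/${barcode}_barcode.fastq" \
    > "barcode${barcode}/${barcode}_barcode_${filter_length}bp.fastq"
```

After running, the filtered file holds a single 4-line record (read1); the short read2 is gone.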

paf_parse.py

Processes mapped .paf files, filtering out low-quality mappings (MQ < 5) and resolving reads that map to multiple references.

Inputs:

  • XX_mapped.paf

Outputs:

  • XX_taxaID_count.txt (tab-delimited TaxaID & counts)
  • XX_ignored_reads_query_ID.txt (query, MQ values, multiple Taxa IDs)
  • XX_number_ignored_reads.txt (number of ignored reads)
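The core filter can be sketched in awk form: discard mappings whose mapping quality (PAF column 12) is below 5 and count hits per target name (column 6), assuming the target name carries the TaxaID. The real script additionally resolves reads mapping to multiple TaxaIDs; that logic is not reproduced here, and the taxa_* names below are made up.

```shell
#!/usr/bin/env bash
# Build a three-line demo PAF file (fields are tab-separated; col 12 is MAPQ).
printf 'read1\t400\t0\t390\t+\ttaxa_1234\t50000\t100\t490\t380\t390\t60\n'  > demo.paf
printf 'read2\t350\t0\t300\t+\ttaxa_1234\t50000\t0\t300\t290\t300\t3\n'   >> demo.paf
printf 'read3\t500\t0\t480\t+\ttaxa_5678\t60000\t10\t490\t470\t480\t30\n' >> demo.paf

# Keep mappings with MAPQ >= 5 and tally hits per target.
awk -F'\t' '$12 >= 5 {count[$6]++}
            END {for (t in count) print t "\t" count[t]}' demo.paf | sort
```

read2's mapping quality of 3 falls below the threshold, so each taxon ends up with one counted read.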

create_risk_plots.R

An R script to create plots from the pipeline summary outputs. To run it on the HPC you will need to set up an environment:

  1. Install Mamba (if not already installed). For detailed instructions on installing Mamba on the NRP HPC, see the NRP Research Computing guide to Using Mamba.

  2. Create the environment from the environment.yml file:

mamba env create -f environment.yml

  3. Activate the environment:

mamba activate r-marmot_env

Inputs:

  • Path to the output directory (from the config), which contains these files generated by the earlier write_files.sh script:
    • lcaparse_perread.txt: Read-level taxonomy assignments.
    • genome_coverage_all.txt: Genome coverage percentages per taxon.
    • read_numbers.tsv: Total and filtered read counts per barcode.
  • Risk table, generated with the generate_risk_table.py script while the reference database is created.

Outputs:

Summary tables (in the input directory):

  • genus_summary.tsv: Normalized genus-level read counts per barcode.
  • RedRisk_ReadIDs_all.tsv: All read IDs matching high-risk species.
  • RedRisk_ReadIDs_noWidespread.tsv: As above, but excluding widespread UK-present species.
  • RedRisk_Species_Summary.tsv: Read counts and genome coverage for red-risk species.

Plots and graphs (inside Graphs/):

defra_risk/:

  • HP100k_risk.svg: Stacked bar chart of reads per 100k for species with DEFRA-assigned risk categories.
  • HP100k_all.svg: Includes all species, even those without DEFRA classifications.
  • Filtered_HP100k_risk.svg: Same as above but normalized to filtered read count.
  • Filtered_HP100k_all.svg: All species, normalized to filtered read count.
  • *_facet.svg variants: Same as above, but faceted by presence in the UK.

pathogen_graphs/:

  • Individual bar plots per target genus (e.g., Fusarium_HP100k.svg) showing abundance across barcodes.
  • Faceted plots for all genera: facet_pathogen_HP100k_fixed.svg, facet_pathogen_Filtered_HP100k_free_y.svg
  • Genome coverage plots: one per genus, with error bars.
  • Faceted coverage plots across genera: facet_coverage_fixed.svg, facet_coverage_free_y.svg

heatmap/:

  • Heatmaps of genus-level abundance: genus_heatmap_HP100k_min50.svg, genus_heatmap_Filtered_HP100k_min100.svg
  • Colors are scaled log-wise using the viridis palette.
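The reads-per-100k normalisation behind the HP100k plots can be sketched as follows. The counts and the two-decimal rounding are illustrative; the R script's exact handling may differ.

```shell
#!/usr/bin/env bash
# Reads-per-100k normalisation (sketch): scale a taxon's assigned read count
# by the barcode's total (or, for Filtered_* plots, filtered) read count.
#   per_100k = assigned_reads * 100000 / total_reads
assigned=57; total=1200000   # illustrative counts, not real data
per_100k=$(awk -v a="$assigned" -v t="$total" 'BEGIN { printf "%.2f", a * 100000 / t }')
echo "$per_100k"             # 4.75
```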

License & Citation

Please cite this repository if you use it in your research. License details can be found in the repository.

For any issues or contributions, please submit a GitHub issue or pull request.
