Long-Read Read Mapping on the OSPool

This tutorial walks you through a long-read mapping workflow on the OSPool high-throughput computing ecosystem, using Oxford Nanopore data from the C. elegans CB4856 strain and the C. elegans N2 reference genome. You'll learn how to:

Map your reads to a reference genome using Minimap2
Break down massive bioinformatics workflows into many independent smaller tasks
Submit hundreds to thousands of jobs with a few simple commands
Use the Open Science Data Federation (OSDF) to manage file transfer during job submission

All of these steps are distributed across hundreds (or thousands!) of jobs using the HTCondor workload manager and Apptainer containers to run your software reliably and reproducibly at scale. The tutorial is built around realistic genomics use cases and emphasizes performance, reproducibility, and portability. You'll work with real data and see how high-throughput computing (HTC) can accelerate your genomics workflows.

Note

If you're brand new to running jobs on the OSPool, we recommend completing the HTCondor "Hello World" exercise before diving into this tutorial.

Let’s get started!

Jump to...

Tutorial Setup
Mapping Sequencing Reads to Genome
- Data Wrangling and Splitting Reads
  - Splitting the FASTQ reads
  - Pre-staging our files on the Open Science Data Federation (OSDF)
- Running Minimap to Map Reads to the Reference Genome
Next Steps
- Software
- Data
- GPUs
Getting Help

Tutorial Setup

Assumptions

This tutorial assumes that you:

Have basic command-line experience (e.g., navigating directories, using bash, editing text files).
Have a working OSPool account and can log into an Access Point (e.g., .uw.osg-htc.org).
Are familiar with HTCondor job submission, including writing simple .sub files and tracking job status with condor_q.
Understand the general workflow of long-read sequencing analysis: basecalling → mapping → variant calling.
Have access to a machine with a GPU-enabled execution environment (provided automatically via the OSPool).
Have sufficient disk quota and file permissions in your OSPool home and OSDF directories.

Tip

You do not need to be a genomics expert to follow this tutorial. The commands and scripts are designed to be beginner-friendly and self-contained, while still reflecting real-world research workflows.

Materials

To obtain a copy of the files used in this tutorial, you can

Clone the repository, with

git clone https://github.com/osg-htc/tutorial-ospool-minimap.git

or the equivalent for your device

To copy the data files for the tutorial, we need two files: minimap2.sif and wgs_reads_cb4856.fastq. The preferred tool is the pelican object get <object> <destination> command (see the tip below), but the cp commands shown here are more resilient. Run the following commands from the Access Point:

If you are on AP20 or AP21

cp /ospool/uc-shared/public/osg-training/tutorial-ospool-minimap/software/minimap2.sif ~/tutorial-ospool-minimap/software/minimap2.sif
cp /ospool/uc-shared/public/osg-training/tutorial-ospool-minimap/data/fastq_reads/wgs_reads_cb4856.fastq ~/tutorial-ospool-minimap/data/fastq_reads/wgs_reads_cb4856.fastq

If you are on AP40

cp /ospool/ap40/osg-staff/tutorial-ospool-minimap/software/minimap2.sif ~/tutorial-ospool-minimap/software/minimap2.sif
cp /ospool/ap40/osg-staff/tutorial-ospool-minimap/data/fastq_reads/wgs_reads_cb4856.fastq ~/tutorial-ospool-minimap/data/fastq_reads/wgs_reads_cb4856.fastq

Tip

You may be able to use:

pelican object get pelican://osg-htc.org/ospool/uc-shared/public/osg-training/tutorial-ospool-minimap/software/minimap2.sif ~/tutorial-ospool-minimap/software/minimap2.sif

pelican object get pelican://osg-htc.org/ospool/uc-shared/public/osg-training/tutorial-ospool-minimap/data/fastq_reads/wgs_reads_cb4856.fastq ~/tutorial-ospool-minimap/data/fastq_reads/wgs_reads_cb4856.fastq

While this method is preferred, if you run into any errors, the cp commands above are more resilient to most intermittent OSDF issues.
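The "try pelican, fall back to cp" advice above can be sketched as a small shell helper. This is only an illustration: fetch_with_fallback is a hypothetical function name of our own, and it assumes the same object is reachable both through its pelican:// URL and as a path on the Access Point, as in the commands above.

```shell
# fetch_with_fallback is a hypothetical helper, not part of Pelican or HTCondor.
# It tries `pelican object get` first and falls back to `cp` if the client is
# missing or the transfer fails.
fetch_with_fallback() {
    pelican_url=$1   # the object as a pelican:// URL
    local_copy=$2    # the same object as a path on the Access Point
    destination=$3   # where the file should end up

    if command -v pelican >/dev/null 2>&1 &&
       pelican object get "$pelican_url" "$destination"; then
        echo "fetched with pelican: $destination"
    elif cp "$local_copy" "$destination"; then
        echo "fetched with cp: $destination"
    else
        echo "could not fetch $destination" >&2
        return 1
    fi
}

# Example call (paths taken from the AP20/AP21 commands above):
# fetch_with_fallback \
#   pelican://osg-htc.org/ospool/uc-shared/public/osg-training/tutorial-ospool-minimap/software/minimap2.sif \
#   /ospool/uc-shared/public/osg-training/tutorial-ospool-minimap/software/minimap2.sif \
#   ~/tutorial-ospool-minimap/software/minimap2.sif
```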

Setting up your software environment

For this tutorial, we will be using an Apptainer/Singularity container to run minimap2. The container starts from the continuumio/miniconda3:latest base image on Docker Hub and installs minimap2 with conda. An Apptainer/Singularity definition file has been provided to you in this repository and can be found in ./tutorial-ospool-minimap/software/minimap2.def.

Build the container by running the following commands:

cd ~/tutorial-ospool-minimap/software/
mkdir -p $HOME/tmp
export TMPDIR=$HOME/tmp
export APPTAINER_TMPDIR=$HOME/tmp
export APPTAINER_CACHEDIR=$HOME/tmp

apptainer build minimap2.sif minimap2.def

Tip

For more information on using containers on the OSPool, visit our guide on Apptainer/Singularity Containers

Mapping Sequencing Reads to Genome

Data Wrangling and Splitting Reads

To get ready for our mapping step, we need to prepare our read files. This includes two crucial steps, splitting our reads and saving the read subset file names to a file.

Splitting the FASTQ reads

Navigate to your fastq_reads directory

cd ~/tutorial-ospool-minimap/data/fastq_reads/

Split the FASTQ file into subsets of 5,000 reads each. Since each FASTQ read consists of four lines in the FASTQ file, we can split the file every 20,000 lines (5,000 reads × 4 lines per read).
```
split -l 20000 wgs_reads_cb4856.fastq cb4856_fastq_chunk_
rm wgs_reads_cb4856.fastq
```
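Because the split relies on every read occupying exactly four lines, it can be worth confirming that no chunk ends mid-record. The sketch below is one way to do that; check_fastq_chunks is a hypothetical helper name, and it assumes unwrapped four-line FASTQ records, the same assumption the split command makes.

```shell
# check_fastq_chunks is a hypothetical helper. It verifies that every file
# starting with the given prefix has a line count divisible by 4, then prints
# the total number of reads across all chunks.
check_fastq_chunks() {
    prefix=$1          # e.g. cb4856_fastq_chunk_
    total_reads=0
    for chunk in "$prefix"*; do
        # If the glob matched nothing, "$chunk" is the literal pattern.
        [ -f "$chunk" ] || { echo "no files match ${prefix}*" >&2; return 1; }
        lines=$(( $(wc -l < "$chunk") ))
        if [ $((lines % 4)) -ne 0 ]; then
            echo "$chunk: $lines lines is not a multiple of 4" >&2
            return 1
        fi
        total_reads=$((total_reads + lines / 4))
    done
    echo "$total_reads"
}

# Example call from inside the fastq_reads directory:
# check_fastq_chunks cb4856_fastq_chunk_
```

Every chunk except possibly the last should hold 5,000 reads, so the printed total should equal the number of reads in the original FASTQ file.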
Generate a list of the split FASTQ subset files. Save it as list_of_FASTQs.txt in your ~/tutorial-ospool-minimap/ directory, which is where the mapping submit file looks for it (./list_of_FASTQs.txt).
```
ls cb4856_fastq_chunk_* > ~/tutorial-ospool-minimap/list_of_FASTQs.txt
```

Pre-staging our files on the Open Science Data Federation (OSDF)

There are some files we will be using frequently that do not change often. One example is the Apptainer/Singularity container image we will use to run our minimap2 mappings. The Open Science Data Federation is a data lake accessible to the OSPool with built-in caching. The OSDF can significantly improve throughput for jobs by caching files closer to the execution points.

Warning

The OSDF caches files aggressively. If you reuse a file name on the OSDF after changing the file's contents, your job may download a stale cached version of the file. We recommend using unique, version-controlled names for your files, such as data_file_04JAN2025_version4.txt, which includes the date of the last update and a version identifier. This ensures HTCondor retrieves the correct file from the OSDF.

Move your minimap2.sif container to your OSDF directory. Make sure to change <ap##> and <user.name> below to the AP number (ap20, ap21, or ap40) and the OSPool username assigned to you, respectively.
```
mkdir -p /ospool/<ap##>/data/<user.name>/tutorial-ospool-minimap/

mv ~/tutorial-ospool-minimap/software/minimap2.sif /ospool/<ap##>/data/<user.name>/tutorial-ospool-minimap/minimap2.sif
```
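To follow the naming advice in the warning above, a date-and-version stamp can be generated rather than typed by hand. The helper below is a minimal sketch: versioned_name is a hypothetical function of our own, and it assumes the file name has an extension. If you rename the container this way, remember that every submit file referencing minimap2.sif must use the new name as well.

```shell
# versioned_name is a hypothetical helper that builds a name such as
# <stem>_<DDMONYYYY>_version<N>.<ext>, following the pattern recommended above.
# It assumes the input file name contains an extension.
versioned_name() {
    file_name=$1     # e.g. minimap2.sif
    version=$2       # e.g. 1
    stem=${file_name%.*}
    extension=${file_name##*.}
    # Today's date, upper-cased to match the style 04JAN2025.
    stamp=$(date +%d%b%Y | tr '[:lower:]' '[:upper:]')
    echo "${stem}_${stamp}_version${version}.${extension}"
}

# Prints a name of the form minimap2_<today's date>_version1.sif
versioned_name minimap2.sif 1
```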

Running Minimap to Map Reads to the Reference Genome

Indexing our reference genome - Generating Celegans_ref.mmi

Create minimap2_index.sh using either vim or nano

#!/bin/bash
minimap2 -x map-ont -d Celegans_ref.mmi Celegans_ref.fa

Create minimap2_index.sub using either vim or nano

+SingularityImage      = "osdf:///ospool/<ap##>/data/<user.name>/tutorial-ospool-minimap/minimap2.sif"

executable             = ./minimap2_index.sh

transfer_input_files   = ./data/ref_genome/Celegans_ref.fa

transfer_output_files  = ./Celegans_ref.mmi 
transfer_output_remaps = "Celegans_ref.mmi = /ospool/<ap##>/data/<user.name>/tutorial-ospool-minimap/Celegans_ref.mmi"
output                 = ./log/$(Cluster)_$(Process)_indexing_step1.out
error                  = ./log/$(Cluster)_$(Process)_indexing_step1.err
log                    = ./log/$(Cluster)_$(Process)_indexing_step1.log

request_cpus           = 4
request_disk           = 5 GB
request_memory         = 5 GB 

queue 1

Important

Notice that we are using the transfer_output_remaps attribute in our submit file. By default, HTCondor transfers outputs to the directory we submitted the job from. Since we want the indexed reference genome file Celegans_ref.mmi to land in a specific directory, we use the transfer_output_remaps attribute. The syntax of this attribute is:

transfer_output_remaps = "<file_on_execution_point>=<desired_path_to_file_on_access_point>"

It is also important to note that we are transferring Celegans_ref.mmi to the OSDF directory /ospool/<ap##>/data/<user.name>/tutorial-ospool-minimap/. Since every mapping job in the next step reuses the indexed reference genome file, we benefit from the caching feature of the OSDF. That is why transfer_output_remaps sends the Celegans_ref.mmi file to our OSDF directory.

Submit your minimap2_index.sub job to the OSPool
```
condor_submit minimap2_index.sub
```

Warning

Indexing will take a few minutes to complete. Do not proceed until your indexing job has finished.
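If you would rather script the wait than watch the queue by hand, one option is to poll for the index file. This is a rough sketch under stated assumptions: wait_for_file is a hypothetical helper, the 5-second polling interval is arbitrary, and a file that exists may still be mid-transfer, so checking that the job has left the queue with condor_q remains the authoritative test.

```shell
# wait_for_file is a hypothetical helper. It polls until the given path exists
# and is non-empty, or gives up after roughly timeout_seconds.
wait_for_file() {
    path=$1
    timeout_seconds=$2
    waited=0
    until [ -s "$path" ]; do
        if [ "$waited" -ge "$timeout_seconds" ]; then
            echo "timed out waiting for $path" >&2
            return 1
        fi
        sleep 5
        waited=$((waited + 5))
    done
    echo "found $path"
}

# Example call, waiting up to 30 minutes (replace the placeholders first):
# wait_for_file /ospool/<ap##>/data/<user.name>/tutorial-ospool-minimap/Celegans_ref.mmi 1800
```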

Map our basecalled reads to the reference C. elegans indexed genome - Celegans_ref.mmi

Create minimap2_mapping.sh using either vim or nano

#!/bin/bash
# Use minimap2 to map the basecalled reads to the reference genome
minimap2 -ax map-ont Celegans_ref.mmi "$1" > "mapped_${1}_reads_to_genome.sam"

Create minimap2_mapping.sub using either vim or nano

+SingularityImage      = "osdf:///ospool/<ap##>/data/<user.name>/tutorial-ospool-minimap/minimap2.sif"

executable             = ./minimap2_mapping.sh
arguments              = $(read_subset_file)
transfer_input_files   = osdf:///ospool/<ap##>/data/<user.name>/tutorial-ospool-minimap/Celegans_ref.mmi, ./data/fastq_reads/$(read_subset_file)

transfer_output_files  = ./mapped_$(read_subset_file)_reads_to_genome.sam
transfer_output_remaps = "mapped_$(read_subset_file)_reads_to_genome.sam = ./data/mappedSAM/mapped_$(read_subset_file)_reads_to_genome.sam"
 
output                 = ./log/$(Cluster)_$(Process)_mapping_$(read_subset_file)_step2.out
error                  = ./log/$(Cluster)_$(Process)_mapping_$(read_subset_file)_step2.err
log                    = ./log/$(Cluster)_$(Process)_mapping_$(read_subset_file)_step2.log
 
request_cpus           = 2
request_disk           = 4 GB
request_memory         = 4 GB 
 
queue read_subset_file from ./list_of_FASTQs.txt

Important

In this step, we are not transferring our outputs using the OSDF. The mapped SAM files are intermediate temporary files in our analysis and do not benefit from the aggressive caching of the OSDF.
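Before submitting, it can help to confirm that every file named in list_of_FASTQs.txt actually exists, since each line becomes one job. The sketch below uses a hypothetical helper, check_mapping_inputs. The commented mkdir line reflects our understanding that the log and remap destination directories need to exist before the jobs run; treat that as an assumption to verify on your Access Point.

```shell
# check_mapping_inputs is a hypothetical helper. It reads one file name per
# line from list_file and reports any that are missing or empty in fastq_dir.
# It prints the number of problem files and fails if that number is not zero.
check_mapping_inputs() {
    list_file=$1     # e.g. list_of_FASTQs.txt
    fastq_dir=$2     # e.g. data/fastq_reads
    missing=0
    while IFS= read -r name; do
        [ -n "$name" ] || continue          # skip blank lines
        if [ ! -s "$fastq_dir/$name" ]; then
            echo "missing or empty: $fastq_dir/$name" >&2
            missing=$((missing + 1))
        fi
    done < "$list_file"
    echo "$missing"
    [ "$missing" -eq 0 ]
}

# Example, run from ~/tutorial-ospool-minimap:
# mkdir -p log data/mappedSAM
# check_mapping_inputs list_of_FASTQs.txt data/fastq_reads
```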

Submit your cluster of minimap2 jobs to the OSPool
```
condor_submit minimap2_mapping.sub
```

Caution

You may notice some jobs go on hold during your run. While job holds are typically associated with errors on the user's submission side, they can also be caused by errors on our side. If you notice jobs go on hold, you can use the command condor_q -held to print the hold error message. Some errors can be fixed by simply releasing the jobs, which re-queues them for another attempt. To release a held job, use the command condor_release <jobID>, for example, condor_release 12636387.9. Other errors may be indicative of a typo or other user error. If you're ever not sure what an error means, reach out to us at [email protected] and send us the .log, .err, .out, and .sub files as attachments.
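Once the queue has drained, you can check which read subsets never produced a SAM file. The helper below is a minimal sketch with a hypothetical name, list_missing_sam. It assumes the output naming used in minimap2_mapping.sub (mapped_<subset>_reads_to_genome.sam) and treats an empty SAM file as missing.

```shell
# list_missing_sam is a hypothetical helper. For every subset named in
# list_file, it prints the subset name if the matching SAM file in sam_dir
# is absent or empty.
list_missing_sam() {
    list_file=$1   # e.g. list_of_FASTQs.txt
    sam_dir=$2     # e.g. data/mappedSAM
    while IFS= read -r name; do
        [ -n "$name" ] || continue          # skip blank lines
        [ -s "$sam_dir/mapped_${name}_reads_to_genome.sam" ] || echo "$name"
    done < "$list_file"
}

# Example, run from ~/tutorial-ospool-minimap:
# list_missing_sam list_of_FASTQs.txt data/mappedSAM > list_of_missing_FASTQs.txt
```

A file such as list_of_missing_FASTQs.txt (our own name) could then replace list_of_FASTQs.txt in the queue statement to rerun only the subsets that failed.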

Next Steps

Now that you've completed the long-read minimap tutorial on the OSPool, you're ready to adapt these workflows for your own data and research questions. Here are some suggestions for what you can do next:

🧬 Apply the Workflow to Your Own Data

Replace the tutorial datasets with your own FASTQ files and reference genome.
Modify the mapping submit files to fit your data size, read type, and resource needs.

🧰 Customize or Extend the Workflow

Incorporate quality control steps (e.g., filtering or read statistics) using FastQC.
Use other mappers or variant callers, such as ngmlr, pbsv, or cuteSV.
Add downstream tools for annotation, comparison, or visualization (e.g., IGV, bedtools, SURVIVOR).

📦 Create Your Own Containers

Extend the Apptainer containers used here with additional tools, reference data, or dependencies.
For help with this, see our Containers Guide.

🚀 Run Larger Analyses

Submit thousands of mappings or alignment jobs across the OSPool.
Explore data staging best practices using the OSDF for large-scale genomics workflows.
Consider using workflow managers (e.g., DAGMan or Pegasus) with HTCondor.
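As an illustration of the workflow-manager suggestion, the two steps of this tutorial could be chained with DAGMan so that mapping starts only after indexing succeeds. This is a sketch, not a tested workflow for this repository: the file name minimap2_workflow.dag and the node names INDEX and MAP are our own choices, and it assumes both submit files sit in the directory you submit from.

```shell
# Write a two-node DAG description. DAGMan runs the MAP node (all mapping
# jobs queued by minimap2_mapping.sub) only after the INDEX node succeeds.
cat > minimap2_workflow.dag <<'EOF'
JOB INDEX minimap2_index.sub
JOB MAP minimap2_mapping.sub
PARENT INDEX CHILD MAP
EOF

# Submit the whole workflow with one command:
# condor_submit_dag minimap2_workflow.dag
```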

🧑‍💻 Get Help or Collaborate

Reach out to [email protected] for one-on-one help with scaling your research.
Attend office hours or training sessions—see the OSPool Help Page for details.

Software

In this tutorial, we created a starter Apptainer container for Minimap2. This container can serve as a jumping-off point if you need to install additional software for your workflows.

Our recommendation for most users is to use "Apptainer" containers for deploying their software. For instructions on how to build an Apptainer container, see our guide Using Apptainer/Singularity Containers. If you are familiar with Docker, or want to learn how to use Docker, see our guide Using Docker Containers.

This information can also be found in our guide Using Software on the Open Science Pool.

Data

The ecosystem for moving data to, from, and within the HTC system can be complex, especially if trying to work with large data (> gigabytes). For guides on how data movement works on the HTC system, see our Data Staging and Transfer to Jobs guides.

GPUs

The OSPool has GPU nodes available for common use. If you would like to learn more about our GPU capacity, please visit our GPU Guide on the OSPool Documentation Portal.

Getting Help

The OSPool Research Computing Facilitators are here to help researchers using the OSPool for their research. We provide a broad swath of research facilitation services, including:

Web guides: OSPool Guides - instructions and how-tos for using the OSPool and OSDF.
Email support: get help within 1-2 business days by emailing [email protected].
Virtual office hours: live discussions with facilitators - see the Email, Office Hours, and 1-1 Meetings page for current schedule.
One-on-one meetings: dedicated meetings to help new users and groups get started on the system; email [email protected] to request a meeting.

This information, and more, is provided in our Get Help page.
