Nextflow pipeline to perform basecalling of nanopore reads on cpu/gpu on a cluster using SLURM. Basecalling can be executed online while reads are uploaded to the input folder, in the form of fast5 files.
- Have nextflow (version > 21.10) available in your path. This can be easily installed using conda:
conda create --name nextflow -c bioconda python=3.9.12 nextflow=22.10
conda activate nextflow
- Download guppy binaries for gpu/cpu. Symlinks to the binaries should be placed in guppy_bin/guppy_basecaller_cpu and guppy_bin/guppy_basecaller_gpu. Alternatively, the location of the binaries can be specified with the --guppyCpu and --guppyGpu options.

Nb: the pipeline has been developed using guppy version 6.3.7. Compatibility with other versions is not guaranteed.
Given an input directory containing fast5 files and a desired output directory, the workflow basecalls these files and saves the resulting reads in the output directory as barcodeXX.fastq.gz files, where XX indicates the number of the barcode used.
nextflow run basecall.nf \
-profile cluster \
--inputDir test_dataset/raw \
--outputDir test_dataset/basecalled \
--parameterFile test_dataset/params.tsv \
--setWatcher true \
--gpu true

- the -profile cluster option is used to trigger SLURM execution, as opposed to local execution.
- the --inputDir option is used to specify the input directory, in which fast5 files are stored.
- --outputDir indicates the output directory, in which reads are stored in barcodeXX.fastq.gz format.
- --parameterFile is used to specify the parameter file, from which some options for the basecalling are parsed. See below for the format of this file.
- if --setWatcher true is specified then new basecalling jobs are dispatched live as new fast5 files get uploaded in the input directory. See below for details.
- if the option --gpu true is specified then basecalling is performed on gpu. Otherwise the cpu version of guppy is used.
For convenience, the run.sh bash script is provided to launch a standard run with the watcher activated. It takes as its only argument the data folder in which the raw and basecalled subfolders are located. It can be modified by the user according to the desired pipeline parameters.
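As an illustration of how the options above fit together, the following sketch assembles the same command line for a given data folder. It is not the content of run.sh: `build_basecall_command` is a hypothetical helper, and the location of the parameter file (`params.tsv` inside the data folder) is an assumption taken from the example command.

```python
from pathlib import Path


def build_basecall_command(data_dir, gpu=True, watcher=True):
    """Return the nextflow command as a list of arguments (hypothetical helper)."""
    data = Path(data_dir)
    return [
        "nextflow", "run", "basecall.nf",
        "-profile", "cluster",
        "--inputDir", (data / "raw").as_posix(),
        "--outputDir", (data / "basecalled").as_posix(),
        # Assumption: the parameter file sits in the data folder, as in the example.
        "--parameterFile", (data / "params.tsv").as_posix(),
        "--setWatcher", str(watcher).lower(),
        "--gpu", str(gpu).lower(),
    ]


cmd = build_basecall_command("test_dataset")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where nextflow is available.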
The parameter file, passed with the --parameterFile option, must be a tsv file in which each row corresponds to a different barcode. The file can be generated from the template nanopore_sequencing_params_template.ods. The relevant columns are:
- barcode_id: the barcode number.
- guppy_config_file: the config file used (e.g. dna_r9.4.1_450bps_hac.cfg), passed as the -c option to the guppy basecaller.
- barcode_kits: list of barcode kits separated by spaces, e.g. EXP-NBD114 EXP-NBD104. This is passed to guppy as the --barcode_kits option. It must be the same for all rows. It can also be left empty, in which case the --barcode_kits argument is not passed to guppy.
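The rules for these columns can be checked before launching a run. The sketch below is an assumption about how such a check might look, not the parsing code used by the workflow: `read_params`, `barcode_kits_args` and `config_args` are hypothetical helpers, and only the three column names and the two guppy options come from the description above.

```python
import csv
import io


def read_params(handle):
    """Parse the parameter TSV into a list of dicts, one per barcode row."""
    return list(csv.DictReader(handle, delimiter="\t"))


def barcode_kits_args(rows):
    """Return the --barcode_kits argument, or nothing if the column is empty."""
    kits = {(row.get("barcode_kits") or "").strip() for row in rows}
    if len(kits) > 1:
        raise ValueError(f"barcode_kits differs between rows: {sorted(kits)}")
    value = kits.pop() if kits else ""
    # Assumption: the space-separated kits are passed as a single argument.
    return ["--barcode_kits", value] if value else []


def config_args(row):
    """Return the -c argument for one row of the parameter file."""
    return ["-c", row["guppy_config_file"]]


# Made-up example content with only the relevant columns.
example = io.StringIO(
    "barcode_id\tguppy_config_file\tbarcode_kits\n"
    "1\tdna_r9.4.1_450bps_hac.cfg\tEXP-NBD114 EXP-NBD104\n"
    "2\tdna_r9.4.1_450bps_hac.cfg\tEXP-NBD114 EXP-NBD104\n"
)
rows = read_params(example)
print(config_args(rows[0]) + barcode_kits_args(rows))
```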
If --setWatcher true is specified, then the workflow instantiates a watcher that continually checks for uploads in the input folder. If a new fast5 file is uploaded then a new basecalling job is dispatched.
To terminate the workflow and produce the fastq files with the reads, it is sufficient to create an empty file named end-signal.fast5. This stops the watcher and, after all basecalling jobs have completed, triggers the creation of the fastq files containing the reads.
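The end signal can be created from any script that has write access to the data. A minimal sketch follows, assuming that the file has to be placed in the watched input folder (the text above does not state the location explicitly); `send_end_signal` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def send_end_signal(input_dir):
    """Create the empty end-signal.fast5 file (assumed to belong in the input folder)."""
    signal = Path(input_dir) / "end-signal.fast5"
    signal.touch()  # creates an empty file, or leaves an existing one in place
    return signal


# Demonstration in a temporary folder standing in for the real input directory.
with tempfile.TemporaryDirectory() as tmp:
    created = send_end_signal(tmp)
    print(created.name, created.stat().st_size)
```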
Unless --liveStats false is specified, the workflow will produce a csv file named {params}_basecalling_stats.csv, where {params} is the prefix of the parameters tsv file. This is placed in the same folder as the parameter file. This file is updated live as the basecalling proceeds, with each row corresponding to a single read. The file contains three columns: len, barcode, time. These contain the read length, the assigned barcode and the time at which the read was basecalled.
This file can be used to estimate the length of reads and the barcode distribution, while the basecalling workflow is in progress. The script scripts/basecall_stats_plots.py can be used to produce plots to visualize read count and length distribution stratified by barcode. See scripts/basecall_stats_plots.py --help for usage.
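For a quick look at the numbers without plotting, the stats file can also be summarized directly. The sketch below is independent of scripts/basecall_stats_plots.py and only assumes that the csv has a header row with the three column names; the example rows, including the barcode and time formats, are made up.

```python
import csv
import io
from collections import defaultdict


def summarize_stats(handle):
    """Return {barcode: (read count, mean read length)} from the stats csv."""
    counts = defaultdict(int)
    total_len = defaultdict(int)
    for row in csv.DictReader(handle):
        counts[row["barcode"]] += 1
        total_len[row["barcode"]] += int(row["len"])
    return {bc: (counts[bc], total_len[bc] / counts[bc]) for bc in counts}


# Made-up rows; the real barcode and time formats may differ.
example = io.StringIO(
    "len,barcode,time\n"
    "1200,1,2023-01-05 10:00:00\n"
    "800,1,2023-01-05 10:00:02\n"
    "500,2,2023-01-05 10:00:03\n"
)
summary = summarize_stats(example)
for barcode, (n_reads, mean_len) in sorted(summary.items()):
    print(barcode, n_reads, mean_len)
```

Because the file is appended to while the workflow runs, repeated calls give a progressively updated summary.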
Every time the workflow is launched a log file named {params}_{time}.log is created, where {params} is the prefix of the parameters tsv file and {time} is a timestamp in the form yyyy-MM-dd--HH-mm-ss. This file contains the following information:
- execution time and id of the nextflow run
- remote and current commit of the repository containing this basecalling workflow.
- version of the guppy basecaller used, and whether the gpu version was used.
- path of the parameter file and relevant parameters (list of barcodes, flowcell id, flowcell type, ligation kit, barcode kits)
- input and output directories
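When looking for the log of a given run, it can help to reproduce the naming scheme. In the sketch below the timestamp pattern yyyy-MM-dd--HH-mm-ss is translated to the equivalent Python `strftime` format; `log_file_name` is a hypothetical helper, and taking the "prefix" to be the file name without its extension is an assumption.

```python
from datetime import datetime
from pathlib import Path


def log_file_name(param_file, when):
    """Build {params}_{time}.log for a parameter file and a launch time."""
    stamp = when.strftime("%Y-%m-%d--%H-%M-%S")
    # Assumption: the prefix is the parameter file name without the .tsv extension.
    return f"{Path(param_file).stem}_{stamp}.log"


print(log_file_name("test_dataset/params.tsv", datetime(2023, 1, 5, 14, 3, 9)))
```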
If --filterBarcodes true is specified, then only the barcodeXX.fastq.gz files corresponding to barcodes present in the parameter file are produced. Other barcodes (usually corresponding to misclassifications) are excluded.
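The effect of this option can be pictured as a filter on the output file names. The following sketch is only an illustration of that selection rule, not the workflow's implementation; `filter_barcode_files` is a hypothetical helper and the file names are made up.

```python
import re


def filter_barcode_files(file_names, barcode_ids):
    """Keep only barcodeXX.fastq.gz files whose number is in barcode_ids."""
    wanted = {int(b) for b in barcode_ids}
    kept = []
    for name in file_names:
        match = re.fullmatch(r"barcode(\d+)\.fastq\.gz", name)
        if match and int(match.group(1)) in wanted:
            kept.append(name)
    return kept


files = ["barcode01.fastq.gz", "barcode02.fastq.gz", "barcode57.fastq.gz"]
# Barcodes 1 and 2 are in the parameter file; 57 is a likely misclassification.
print(filter_barcode_files(files, ["1", "2"]))
```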
After basecalling is complete, the script scripts/archive_run.py can be used to archive the reads in an experiment folder.
It requires pandas to be installed. This can simply be installed with conda install -c conda-forge pandas.
The script has the following usage:
usage: archive_run.py [-h] --reads_fld READS_FLD --param_file PARAM_FILE [--archive_fld ARCHIVE_FLD] [--overwrite] [--only_barcodes [ONLY_BARCODES ...]]

Script used to archive the results of basecalling in the experiment folder.
It subdivides the reads in folders based on the experiment id, creating symlinks
to the original files. It also creates (or updates) a sample_info.csv file
containing the information on the samples stored in each folder.

optional arguments:
  -h, --help            show this help message and exit
  --reads_fld READS_FLD
                        the source folder, containing the reads for the sequencing run. These are in fastq.gz format.
  --param_file PARAM_FILE
                        the parameters.tsv file containing information on every sample.
  --archive_fld ARCHIVE_FLD
                        the destination archive folder, containing one subfolder per experiment.
                        (default: /scicore/home/nccr-antiresist/GROUP/unibas/neher/experiments)
  --overwrite           Do not raise an error if one or more barcodes are already present and overwrite them.
                        (default: False)
  --only_barcodes [ONLY_BARCODES ...]
                        Only process the specified barcodes. Space-separated list of numbers (e.g. --only_barcodes 1 2 44 )
                        (default: None)
The mandatory arguments are:
- reads_fld is the folder containing the reads for every barcode, saved as barcodeXX.fastq.gz.
- param_file is the .tsv file containing info on the link between each barcode and the corresponding experimental conditions.
- archive_fld is the experiments archive folder. The script will take care of creating sub-folders with the name of the experiments, where links to the reads are stored. These are named as <date>_<research_group>_<experiment_id>, where the last two parts are extracted from the parameter file, and the date is the date of first archiving.

The optional arguments are:
- if --overwrite is specified then barcodes that are already present in the experiment folders are removed and substituted.
- if --only_barcodes 3 5 7 is specified then only samples corresponding to barcodes 3, 5 and 7 are archived.
Upon successful completion the script also updates an archive_log.txt file in the --archive_fld directory, with a list of archived barcodes.
The generated folder structure looks like this:
experiments/
├── <date>_<research-group>_<experiment-id>
│   ├── sample_info.csv (dataframe with list of samples archived, one per file)
│   └── samples (folder with one sample per subfolder)
│       ├── <sample-id-1>
│       │   └── <sample-id-1>_<flowcell-id-1>_barcode<barcode-1>.fastq.gz -> symlink to corresponding file
│       ...
│       └── <sample-id-n>
│           └── <sample-id-n>_<flowcell-id-n>_barcode<barcode-n>.fastq.gz -> symlink to corresponding file
└── archive_log.txt
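To locate an archived file programmatically, the layout above can be expressed as a path template. The sketch below only composes the path and does not reproduce archive_run.py; `archive_link_path` is a hypothetical helper, and all example values (date format, group, experiment, sample, flowcell and barcode formatting) are made up.

```python
from pathlib import PurePosixPath


def archive_link_path(archive_fld, date, research_group, experiment_id,
                      sample_id, flowcell_id, barcode):
    """Compose the symlink location for one sample in the archive layout."""
    experiment = f"{date}_{research_group}_{experiment_id}"
    link_name = f"{sample_id}_{flowcell_id}_barcode{barcode}.fastq.gz"
    return PurePosixPath(archive_fld) / experiment / "samples" / sample_id / link_name


# Made-up values, for illustration only.
link = archive_link_path("experiments", "2023-01-05", "groupA", "exp1",
                         "sampleA", "flowcellX", "01")
print(link)
```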