Skip to content

widdowquinn/find_differential_primers

Repository files navigation

README.md (pdp/diagnostic_primers)

Documentation

Full documentation about usage of pdp/diagnostic_primers can be found on ReadTheDocs:

CITATIONS

If you use pdp/diagnostic_primers, please cite one or both of the following papers:

  • Pritchard L et al. (2012) "Alignment-Free Design of Highly Discriminatory Diagnostic Primer Sets for Escherichia coli O104:H4 Outbreak Strains." PLoS ONE 7(4): e34498. doi:10.1371/journal.pone.0034498 - Method description and application to human bacterial pathogens, sub-serotype resolution
  • Pritchard L et al. (2013) "Detection of phytopathogens of the genus Dickeya using a PCR primer prediction pipeline for draft bacterial genome sequences." Plant Pathology, 62, 587-596 doi:10.1111/j.1365-3059.2012.02678.x - Application to plant pathogens, species-level resolution

NOTE FOR USERS

The default branch for this repository is a development branch: diagnostic_primers. If you are looking for code to reproduce work from Pritchard et al. (2012) or Pritchard et al. (2013), please checkout the master branch, or download release v0.1.3.

  • diagnostic_primers:

codecov Build Status Codacy Badge CodeFactor

  • master:

codecov Build Status

NOTE FOR DEVELOPERS

The default master branch for development is diagnostic_primers. We appreciate it when you raise issues, or make contributions via pull request, especially if you follow the guidelines on the wiki. Writing tests to demonstrate/catch bugs, or to improve coverage, is especially appreciated!

  • Current test coverage (diagnostic_primers): codecov

Overview

This package automates discovery of discriminatory PCR or qPCR primers that can distinguish between arbitrary subgroups of input genomes or other biological sequences of interest. The package also aids identification of novel metabarcoding marker sequences. The package can be used in a number of ways:

  • The pdp command-line program can be used to conduct this analysis in an interactive or scripted manner.
  • The diagnostic_primers Python model can be used within your own programs to automate primer and/or marker design.

Third-party Packages

This tool depends on some third-party software packages

  • Primer3 VERSION 1.1.4: Primer3 is the tool used to design primers. For compatibility with EMBOSS, version 1 of the software is essential. [webpage]
  • EMBOSS: This suite of tools is used to interact with Primer3 and to perform in silico cross-hybridisation checks with primersearch. [webpage]
  • BLAST+: This tool is used to screen primers against a database of off-target sequences with the blastscreen command. [webpage]
  • prodigal: This program is used to identify candidate prokaryotic CDS features when using the pdp filter subcommand. [webpage]
  • MAFFT: This is required to align candidate metabarcoding marker sequences, when using the pdp extract subcommand. [webpage]

These packages can all be installed using the conda package manager, and the bioconda channel.

  • bioconda: a channel for the conda package manager, specialising in bioinformatics software. [webpage]
conda install primer3=1.1.4 emboss blast prodigal mafft

Recent changes

The new version of diagnostic_primers (formerly find_differential_primers) now uses a subcommand model (like the tools git and subversion). These execute the following subtasks, some or all of which may be required to perform a specific primer/marker design run. This change has been made for flexibility in composing workflows.

  • pdp config: Process/validate the configuration file and stitch input contig fragments/replace ambiguity symbols as necessary.
  • pdp filter: Filter input genome sequences so that primers are designed only to targeted regions.
  • pdp eprimer3/e3: Design amplifying primers on the input sequences
  • pdp blastscreen/bs: Filter designed primers against a database of negative examples
  • pdp primersearch/ps: Filter designed primers on their ability to amplify each input sequence
  • pdp classify/cl: Classify designed primers by specificity for each class of input sequence
  • pdp extract/ex: Extract amplicon sequences corresponding to diagnostic primer sets

Each of these subcommands has specific help, accessible with pdp <subcommand> -h or pdp <subcommand> --help.

Walkthrough

This section describes a worked example analysis, proceeding from defining a config file to producing a diagnostic primer set result. All the files required for this analysis can be found in the subdirectory tests/walkthrough.

1. Producing and validating the config file

We begin with a small set of bacterial genomes: three Pectobacterium species. These are defined as .fasta sequences in the directory tests/walkthrough/sequences:

$ ls tests/walkthrough/sequences/
GCF_000011605.1.fasta	GCF_000291725.1.fasta	GCF_000749845.1.fasta

A basic config file defining the three genomes is also provided as tests/walkthrough/pectoconf.tab in tab-separated tabular format (you can generate equivalent files for your own data using Excel, and saving the sheet as tab-separated text format).

# Pectobacterium genomes downloaded from GenBank/NCBI; genomovars inferred from ANIm
# Annotated Pba: genomovar 1
Pba_SCRI1043	Pectobacterium,atrosepticum_NCBI,gv1	tests/walkthrough/sequences/GCF_000011605.1.fasta	-
# Annotated Pwa: genomovars 2, 3
Pwa_CFBP_3304	Pectobacterium,wasabiae_NCBI,gv2	tests/walkthrough/sequences/GCF_000291725.1.fasta	-
# Annotated Pb	: genomovar 7
Pbe_NCPPB_2795	Pectobacterium,betavasculorum_NCBI,gv7	tests/walkthrough/sequences/GCF_000749845.1.fasta	-

Four columns are required (no headers are read), one row per input genome:

  • a unique name for the genome
  • a list of comma-separated labels for this genome - this will determine specificity of predicted primer sets
  • path to a FASTA file of the input genome sequence - this can be a draft genome in multiple fragments
  • (optional) path to a .gff file descrigbing genome features - if no features are provided, this should be a hyphen/dash (-).

In the walkthrough config file no features are defined, so this column contains only the symbol - to mark it as being empty.

Comment lines can be included, starting with # as the first character. These are ignored in the analysis

To confirm that the config file is correctly-formatted, we use the pdp config --validate command:

$ pdp config --validate tests/walkthrough/pectoconf.tab
WARNING: Validation problems
    Pbe_NCPPB_2795 requires stitch (tests/walkthrough/sequences/GCF_000749845.1.fasta)
    Pwa_CFBP_3304 requires stitch (tests/walkthrough/sequences/GCF_000291725.1.fasta)
    Pwa_CFBP_3304 has non-N ambiguities (tests/walkthrough/sequences/GCF_000291725.1.fasta)

This tells us that the first two genome files are in multiple parts so must be concatenated for this analysis, and that the second file also has ambiguity base symbols that are not N. These must be replaced by N for the analysis to proceed using the third-party tools.

2. Fix sequences for analysis

The pdp config command will fix input genomes for analysis, when the --fix_sequences argument is supplied. In this case, pdp writes new sequences having the necessary changes (stitching, replacing ambiguity symbols) and generates a new config file that refers to the fixed sequences. The path to the new config file is given with the argument --fix_sequences <NEWCONFIG>.json:

$ pdp config --fix_sequences tests/walkthrough/fixed.json \
                tests/walkthrough/pectoconf.tab

This writes corrected sequences to the tests/walkthrough/sequences subdirectory, and a new config file to tests/walkthrough/fixed.json (in JSON, rather than tabular format):

$ tree tests/walkthrough/
tests/walkthrough/
├── fixed.json
├── pectoconf.tab
└── sequences
    ├── GCF_000011605.1.fasta
    ├── GCF_000291725.1.fasta
    ├── GCF_000291725.1_concat.fas
    ├── GCF_000291725.1_concat_noambig.fas
    ├── GCF_000749845.1.fasta
    └── GCF_000749845.1_concat.fas

3. Defining CDS features on each genome (optional)

For prokaryotic genomes, we can use a genecaller to predict gene features with the pdp filter --prodigal command. This generates a GFF file describing CDS feature locations on each genome, and FASTA files of the CDS sequences. This can be useful because, in the primer design stage, we can take CDS locations into account to retain only primers that amplify within CDS regions.

To use the genecaller, we must provide an appropriate config file (the fixed.json config file), and the path to a new config file that will contain information about the predicted features (we'll call this fixed_with_features.json). We will tell prodigal to place the predicted gene locations in the subdirectory tests/walkthrough/prodigal:

pdp filter --prodigal --outdir tests/walkthrough/prodigal \
                tests/walkthrough/fixed.json \
                tests/walkthrough/fixed_with_features.json

The new directory containing genecaller output is created, as is the new config file:

$ tree tests/walkthrough/
tests/walkthrough/
├── fixed.json
├── fixed_with_features.json
├── pectoconf.tab
├── prodigal
│   ├── GCF_000011605.1.features
│   ├── GCF_000011605.1.gff
│   ├── GCF_000291725.1_concat_noambig.features
│   ├── GCF_000291725.1_concat_noambig.gff
│   ├── GCF_000749845.1_concat.features
│   └── GCF_000749845.1_concat.gff
└── sequences
[…]

4. Design primers to each genome in bulk

Using the config file fixed_with_features.json we can design primers to each input genome with the EMBOSS ePrimer3 package. At a minimum, we need to give the pdp eprimer3 command the input config file, and the path to an output config file that will contain additional information about the primers designed to each genome. We can also use the --outdir argument to provide a path to a directory where pdp will put the ePrimer3 output files.

If we wish to use only primers that amplify within the CDS regions we designed in the previous step, we need to give the --filter argument. Otherwise pdp eprimer3 will ignore this information.

$ pdp eprimer3 --filter --outdir tests/walkthrough/eprimer3 \
                  tests/walkthrough/fixed_with_features.json \
                  tests/walkthrough/with_primers.json

This places the output of ePrimer3 into its own directory (.eprimer3 files), and generates JSON files that describe the primers for each of the genomes. Additionally, a .bed file is produced describing the location of each primer set on the source genome, which can be visualised using a tool such as IGV.

$ tree tests/walkthrough/
tests/walkthrough/
├── eprimer3
│   ├── GCF_000011605.1.eprimer3
│   ├── GCF_000011605.1_named.bed
│   ├── GCF_000011605.1_named.eprimer3
│   ├── GCF_000011605.1_named.json
│   ├── GCF_000291725.1_concat_noambig.eprimer3
│   ├── GCF_000291725.1_concat_noambig_named.bed
│   ├── GCF_000291725.1_concat_noambig_named.eprimer3
│   ├── GCF_000291725.1_concat_noambig_named.json
│   ├── GCF_000749845.1_concat.eprimer3
│   ├── GCF_000749845.1_concat_named.bed
│   ├── GCF_000749845.1_concat_named.eprimer3
│   └── GCF_000749845.1_concat_named.json
├── fixed.json
├── fixed_with_features.json
├── pectoconf.tab
├── prodigal
[…]
├── sequences
[…]
└── with_primers.json

5. Deduplicate primer sets (optional)

When designing thermodynamically plausible primers to closely-related genomes, it is very likely that identical primer sets will be created on distinct genomes due to their sequence conservation. Carrying these duplicate primer sets through to later analysis stages can result in significant unnecessary computational load. To remove identical primer sets, we can use the pdp dedupe command:

$ pdp dedupe --dedupedir tests/walkthrough/deduped \
                    tests/walkthrough/with_primers.json \
                    tests/walkthrough/deduped_primers.json

This examines the primers described in the input config file (with_primers.json) and places equivalent sets of new JSON and .bed files, with all redundant primers removed, in the specified directory (tests/walkthrough/deduped), and creates a new config file (deduped_primers.json) that points to these files.

6. Screen primers against BLASTN database (optional)

Now that primers have been designed and deduplicated, it can be useful to screen them against a BLASTN database to identify and filter out any primers that have potential for off-target amplification. In general, we advise that this step is used not to demonstrate potential for target amplification, but only to exclude primers that might have potential for off-target binding, in order to remove non-specific primer sets.

The pdp blastscreen subcommand expects the path to a suitable pre-compiled BLASTN database to be provided with the argument --db (there is a BLASTN E. coli genome database in the subdirectory tests/walkthrough/blastdb/). By specifiying --outdir, we place BLAST output in the tests/walkthrough/blastn subdirectory. The input config file use is the deduped_primers.json file produced in the last step, having locations of the non-redundant primer files, and we specify that the command writes a new config file called screened.json that points to a reduced set of primers, with the potentially off-target/non-specific sets excluded.

No sequences are deleted as a result of this action.

$ pdp blastscreen --db tests/walkthrough/blastdb/e_coli_screen.fna \
                     --outdir tests/walkthrough/blastn \
                     tests/walkthrough/deduped_primers.json \
                     tests/walkthrough/screened.json

This screen produces the new subdirectory tests/walkthrough/blastn that contains all the primer sequences derived from each genome in FASTA format, and the tabular format output (.blasttab) of each BLAST search. New files are added to the deduped subdirectory (with suffix _screened), describing the reduced sets of primers, post-screening.

$ tree tests/walkthrough/
tests/walkthrough/
├── blastdb
│   ├── e_coli_screen.fna.nhr
│   ├── e_coli_screen.fna.nin
│   └── e_coli_screen.fna.nsq
├── blastn
│   ├── GCF_000011605.1_primers.blasttab
│   ├── GCF_000011605.1_primers.fasta
│   ├── GCF_000291725.1_concat_noambig_primers.blasttab
│   ├── GCF_000291725.1_concat_noambig_primers.fasta
│   ├── GCF_000749845.1_concat_primers.blasttab
│   └── GCF_000749845.1_concat_primers.fasta
[…]
├── deduped
│   ├── GCF_000011605.1_named_deduped.bed
│   ├── GCF_000011605.1_named_deduped.json
│   ├── GCF_000011605.1_named_deduped_screened.bed
│   ├── GCF_000011605.1_named_deduped_screened.fasta
│   ├── GCF_000011605.1_named_deduped_screened.json
│   ├── GCF_000291725.1_concat_noambig_named_deduped.bed
│   ├── GCF_000291725.1_concat_noambig_named_deduped.json
│   ├── GCF_000291725.1_concat_noambig_named_deduped_screened.bed
│   ├── GCF_000291725.1_concat_noambig_named_deduped_screened.fasta
│   ├── GCF_000291725.1_concat_noambig_named_deduped_screened.json
│   ├── GCF_000749845.1_concat_named_deduped.bed
│   ├── GCF_000749845.1_concat_named_deduped.json
│   ├── GCF_000749845.1_concat_named_deduped_screened.bed
│   ├── GCF_000749845.1_concat_named_deduped_screened.fasta
│   └── GCF_000749845.1_concat_named_deduped_screened.json
[…]
├── fixed.json
├── fixed_with_features.json
├── pectoconf.tab
├── prodigal
[…]
├── sequences
[…]
└── with_primers.json

7. Test primers against input sequences for crosshybridisation with primersearch

To identify which primers might be diagnostically useful for any of the classes/labels defined in the config file, we test how they potentially amplify the other genomes from the input set. We use the EMBOSS tool primersearch to do this.

Using the pdp primersearch subcommand, we pass the appropriate .json config file and specify a directory to hold primersearch output with --outdir, as well provide the path to write a new config file that will hold data about this crosshybridisation screen.

pdp primersearch \
       --outdir tests/walkthrough/primersearch \
       tests/walkthrough/screened.json \
       tests/walkthrough/primersearch.json

After running the command, a new directory tests/walkthrough/primersearch is produced, containing a new primer file (.tab - required for primersearch input) and .primersearch and .json files for each genome's primersearch results. In addition, there are .json and .bed files describing all amplicons produced on a specific genome: _amplicons.bed and _amplicons.json. The .bed files can be visualised using a tool like IGV.

$ tree tests/walkthrough
tests/walkthrough
[...]
├── primersearch
│   ├── Pba_SCRI1043_amplicons.bed
│   ├── Pba_SCRI1043_amplicons.json
│   ├── Pba_SCRI1043_primers.primertab
│   ├── Pba_SCRI1043_primersearch.json
│   ├── Pba_SCRI1043_ps_Pba_SCRI1043.primersearch
│   ├── Pba_SCRI1043_ps_Pbe_NCPPB_2795.primersearch
│   ├── Pba_SCRI1043_ps_Pwa_CFBP_3304.primersearch
│   ├── Pbe_NCPPB_2795_amplicons.bed
│   ├── Pbe_NCPPB_2795_amplicons.json
│   ├── Pbe_NCPPB_2795_primers.primertab
│   ├── Pbe_NCPPB_2795_primersearch.json
│   ├── Pbe_NCPPB_2795_ps_Pba_SCRI1043.primersearch
│   ├── Pbe_NCPPB_2795_ps_Pbe_NCPPB_2795.primersearch
│   ├── Pbe_NCPPB_2795_ps_Pwa_CFBP_3304.primersearch
│   ├── Pwa_CFBP_3304_amplicons.bed
│   ├── Pwa_CFBP_3304_amplicons.json
│   ├── Pwa_CFBP_3304_primers.primertab
│   ├── Pwa_CFBP_3304_primersearch.json
│   ├── Pwa_CFBP_3304_ps_Pba_SCRI1043.primersearch
│   ├── Pwa_CFBP_3304_ps_Pbe_NCPPB_2795.primersearch
│   ├── Pwa_CFBP_3304_ps_Pwa_CFBP_3304.primersearch
│   └── target_amplicons.json
├── primersearch.json
[...]

The new primersearch.json config file contains information about this crosshybridisation screen, and can be used for identification and extraction of diagnostic primer sequence sets.

8. Classify primers by predicted diagnostic capability with classify

To classify primer sets by their ability to amplify only genomes belonging to a specific named group/label in the configuration file, we use the pdp classify subcommand. This examines the primersearch output and reports back primer sets that amplify all genomes having a specific label/group, and only those genomes.

We pass as input the .json file produced by the pdp primersearch run, and the path to a directory for the output of the pdp classify subcommand:

pdp classify \
       tests/walkthrough/primersearch.json \
       tests/walkthrough/classify

The new directory contains .json and .ePrimer3 format files for each set of primers diagnostic to a given class, and summary information for thos primers in summary.tab and results.json files.

The output directory also contains .json and .bed files describing all regions on each genome that are amplified by primers having a particular specificity. For instance, the Pbe_NCPPB_2795_Pectobacterium_amplicons.* files contain information about the regions of the genome Pbe_NCPPB_2795 that are amplified by all primer sets with specificity to the label/group Pectobacterium.

$ tree tests/walkthrough/classify
tests/walkthrough/classify
├── Pba_SCRI1043_Pectobacterium_amplicons.bed
├── Pba_SCRI1043_Pectobacterium_amplicons.json
├── Pba_SCRI1043_atrosepticum_NCBI_amplicons.bed
├── Pba_SCRI1043_atrosepticum_NCBI_amplicons.json
├── Pba_SCRI1043_gv1_amplicons.bed
├── Pba_SCRI1043_gv1_amplicons.json
├── Pbe_NCPPB_2795_Pectobacterium_amplicons.bed
├── Pbe_NCPPB_2795_Pectobacterium_amplicons.json
├── Pbe_NCPPB_2795_betavasculorum_NCBI_amplicons.bed
├── Pbe_NCPPB_2795_betavasculorum_NCBI_amplicons.json
├── Pbe_NCPPB_2795_gv7_amplicons.bed
├── Pbe_NCPPB_2795_gv7_amplicons.json
├── Pectobacterium_primers.ePrimer3
├── Pectobacterium_primers.json
├── Pwa_CFBP_3304_Pectobacterium_amplicons.bed
├── Pwa_CFBP_3304_Pectobacterium_amplicons.json
├── Pwa_CFBP_3304_gv2_amplicons.bed
├── Pwa_CFBP_3304_gv2_amplicons.json
├── Pwa_CFBP_3304_wasabiae_NCBI_amplicons.bed
├── Pwa_CFBP_3304_wasabiae_NCBI_amplicons.json
├── atrosepticum_NCBI_primers.ePrimer3
├── atrosepticum_NCBI_primers.json
├── betavasculorum_NCBI_primers.ePrimer3
├── betavasculorum_NCBI_primers.json
├── gv1_primers.ePrimer3
├── gv1_primers.json
├── gv2_primers.ePrimer3
├── gv2_primers.json
├── gv7_primers.ePrimer3
├── gv7_primers.json
├── results.json
├── summary.tab
├── wasabiae_NCBI_primers.ePrimer3
└── wasabiae_NCBI_primers.json

The summary.tab file contains a table showing how many primer sets were designed that are potentially specific to each of the input genomes and classes/labels:

$ cat tests/walkthrough/classify/summary.tab 
Group	NumPrimers	Primers
Pectobacterium	4	tests/walkthrough/classify/Pectobacterium_primers.json
atrosepticum_NCBI	1	tests/walkthrough/classify/atrosepticum_NCBI_primers.json
betavasculorum_NCBI	2	tests/walkthrough/classify/betavasculorum_NCBI_primers.json
gv1	1	tests/walkthrough/classify/gv1_primers.json
gv2	2	tests/walkthrough/classify/gv2_primers.json
gv7	2	tests/walkthrough/classify/gv7_primers.json

9. Extract and assess candidate metabarcoding markers

The candidate diagnostic primer sets designed using pdp can be useful for generating group-specific metabarcoding markers. Using the pdp extract subcommand, the region(s) predicted to be amplified by each candidate primer set (in the primersearch step) are extracted from the target genomes, aligned (using MAFFT) and then compared to each other to estimate their sequence diversity.

Where a set of primers is predicted to amplify a specific subgroup of the input genomes, and the predicted amplicons are sequence-diverse, those regions and primers are potentially useful markers for metabarcoding. pdp extract produces an output table of diversity measures for each primer set to assess the sequence diversity of each primer set's predicted amplicons.

We provide pdp extract with paths to the JSON config file describing primersearch results, and to a JSON output output from the pdp classify step, corresponding to one of the groups/labels that the resulting primers are specific to. We also supply the path to an output directory for the command to write to.

pdp extract \
    tests/walkthrough/primersearch.json \
    tests/walkthrough/classify/Pectobacterium_primers.json \
    tests/walkthrough/extract

The new output directory is created (if needed), and a further subdirectory named for primer specificity. Within that directory, all predicted amplicons for each group-specific primer are written as FASTA files, and a corresponding .aln alignment file (produced using MAFFT). An additional distances_summary.tab file is also produced, describing sequence diversity measures for amplicons produced by each primer set:

$ tree tests/walkthrough/extract
tests/walkthrough/extract
└── Pectobacterium_primers
    ├── GCF_000011605.1_primer_00001.aln
    ├── GCF_000011605.1_primer_00001.fasta
    ├── GCF_000011605.1_primer_00002.aln
    ├── GCF_000011605.1_primer_00002.fasta
    ├── GCF_000011605.1_primer_00003.aln
    ├── GCF_000011605.1_primer_00003.fasta
    ├── GCF_000749845.1_concat_primer_00003.aln
    ├── GCF_000749845.1_concat_primer_00003.fasta
    └── distances_summary.tab

The distances_summary file describes, for each primer set, sequence distance (min, max, mean and stdev), and diversity/evenness measures (Shannon Index and Shannon Evenness), to aid selection of candidate markers. It also provides a count of the number of unique amplicon sequences observed, and the count of sequences that are not unique (unique + nonunique = total count).

$ cat tests/walkthrough/extract/Pectobacterium_primers/distances_summary.tab 
primer	dist_mean	dist_sd	dist_min	dist_max	unique	nonunique	shannon_index	shannon_evenness
GCF_000011605.1_primer_00001	0.0267	0.0153	0.0100	0.0400	3	0	1.10	1.00
GCF_000011605.1_primer_00002	0.0600	0.0000	0.0600	0.0600	3	0	1.10	1.00
GCF_000011605.1_primer_00003	0.0400	0.0100	0.0300	0.0500	3	0	1.10	1.00
GCF_000749845.1_concat_primer_00003	0.1133	0.0115	0.1000	0.1200	3	0	1.10	1.00

FURTHER INFORMATION

For further technical information, please read the comments contained within the top of each '*.py' file as well as the Supporting Information ('Methods S1' document) of doi:10.1371/journal.pone.0034498.

CONTRIBUTORS