Skip to content

Edinburgh-Genome-Foundry/Sequeduct_demo

Repository files navigation

Sequeduct demo

This repository contains demonstration data and results for the Sequeduct pipeline. Guidelines for interpreting the results are also provided here. For older versions of the pipeline, check releases.

View

Example files and results (results_example) can be browsed here or after downloading (cloning) the repository. Certain intermediate analysis files and directories were not included, due to their size.

Run

The pipeline can also be run on the provided data. After installation of the pipeline as described on the Sequeduct website, download (clone) the repository (click on the "<> Code" button at the top of this page) and run the below commands.

All results are output in a newly created results directory.


Preview

nextflow run edinburgh-genome-foundry/Sequeduct -r v0.4.2 -entry preview \
    --fastq_dir='fastq_pass' \
    --sample_sheet='sample_sheet.csv'

The Preview pipeline runs a NanoPlot analysis on the raw reads. This is useful for getting an overview of our sequencing data. The results are saved in the results/dir1_preview directory.

The pipeline requires a sample sheet that lists the selected barcodes (directories in the FASTQ folders) that we want to analyse. This allows us to analyse a subset of the data, for example when not all barcodes are used, or when we have multiple projects in the same flowcell run. The sheet must come with the header line as shown in the example.

Please see NanoPlot documentation for details.

It's recommended to examine the plots in order to check the overall quality of the run. The histograms provide information on the number of reads obtained, the quality distribution and the potential presence of plasmid dimers. With this information, we can proceed with selected samples in the Analysis pipeline.


Analysis

nextflow run edinburgh-genome-foundry/Sequeduct -r v0.4.2 -entry analysis \
    --fastq_dir='fastq_pass' \
    --reference_dir='genbank' \
    --sample_sheet='sample_sheet.csv' \
    --projectname='EGF demo'

The Analysis pipeline compares the Nanopore reads against the expected (designed) sequence.

The pipeline requires the following input data: reference Genbank files (in standard format, and with extension .gb), the FASTQ file folder, and a sample sheet that maps reference filenames (without extension) with the subdirectories in the FASTQ folder. (EGF's Convert Sequence Files can convert FASTA, Genbank or other formats into the required standard format.) As the pipeline was designed to work with multiple barcodes, the FASTQ folder must be structured with subdirectories for each barcode.

The output files are saved in the results/dir2_analysis directory. In this, n8_results contain the final results: a PDF report, a list of the sample analysed, and a summary of the results in the CSV file format. The PDF report describes the results in detail, as shown below (see its Appendix for a detailed description). The report is also provided in a HTML format. The sample list details the relevant result files for for each analysed sample entry (entries.csv). The results summary file can be opened in a spreadsheet software and revised based on the report, following the procedure described below.

The report is structured into chapters, one for each FASTQ subdirectory (i.e. barcode). Note that the reads are not required to be derived from barcoded samples, as the pipeline does not utilise any barcode sequences. Therefore any set of FASTQ files can be used, whether they originated from non-barcoded sequencing, or from demultiplexed data from a custom barcoding protocol. The pipeline expects one DNA sequence (reference) for each sample. For multiplexed (pooled) samples, use the demultiplexing pipeline first, described further below.

Report cover page:

Report

The cover page contains basic information about the run, such as the number of reads analysed.

First chapter or FASTQ directory (barcode)

Chapter cover page:

Report

The first line summarises the results for the FASTQ directory, followed by a table of key statistics. A histogram of filtered read lengths is displayed. If we have mostly full length reads (e.g. linearised plasmids or plasmids from rapid barcoding kit), then the peak can be used for estimating the size of the plasmid. Wrong size indicates large structural variation (insertion / deletion), sample mixup or errors in the sample spreadsheet.

The pipeline investigates the sequencing data in two parallel ways: (i) create a de novo assembly (asm) entirely from the reads (using Canu), without utilising the reference file and align to the reference (ref) to find large structural variations; (ii) align the reads to the reference and call variants (SNPs etc).

The cover page also displays the results of the asm analysis. The alignment is shown as a plot and in the PAF (pairwise mapping format) table. The difference between the assembly and the reference length is also shown (∆ = asm − ref).

Here we see that the de novo assembled sequence and the reference sequence are identical.

Note that the assembly workflow does not contain a polishing step. Current sequencing accuracy is above 99%, which is good enough to detect large errors. Point mutations may show in asm, but these can be ignored. Canu also requires at least 10× coverage for the de novo assembly. Canu sometimes creates multiple assembled sequences, of which only the first one is used.

Chapter (second page): read alignment analysis

Report

Each reference DNA (plasmid) is analysed separately, with its corresponding reads. A result call (fail / pass / low coverage / warning) is assigned to each section. It's recommended to revise this decision, based on the contents of the report. A coverage plot is displayed under the reference sequence (plasmid) map. In this example, we have ~500× coverage, evenly covering the full length of the sequence, therefore we can exclude large deletions.

A cumulative plot of longest unaligned intervals is also shown to help detecting insertions. (This plot was developed before the asm approach was implemented.) The proportion of reads with inserts above the cutoff (50 bp, vertical grey line) is also calculated. Inserts smaller than this cutoff are detected in the variant call analysis. The plot suggests that there are no regions in the reads that don't align to the reference, suggesting that there are no large insertions.

Finally, a simplified variant call format (VCF) table lists all detected small variants (SNPs and indels). Homopolymer stretches and methylated nucleotides (at Dam/Dcm patterns) are known to produce systemic sequencing errors, therefore these are not considered true variants (marked in column T). In these cases we usually also see a disagreement between the reads (in the reference / alternate allele observation counts or RO / AO columns). The table is empty, as there are no variations detected.

Chapter (third page): variant plot

Report

This page shows the plasmid map with variant annotations, for an easy overview. (Variants at homopolymers are ignored as explained above.) Here we do not have annotations because the reference sequence matches the DNA sequence. A consensus file of true variants is also provided for each barcode. This consensus is provided in the FASTA format and is derived from the reference by applying the variant modifications.

Second chapter

Report

The second example plasmid was flagged as a failed sample. The histogram of reads suggests that the plasmid is smaller than what we expect. This example was constructed by inserting 120 bp into the reference, using a sequence editor. This appears as deletion in the analysis, because the FASTQ reads do not contain this extra sequence.

We see this deletion in the assembly alignment results. We also see an artefact, that the assembled sequence is twice as long as the reference, and is made up of two consecutive sequences of the reference. This "duplication" happens when reads are derived from random segments of a circular sequence, such as a plasmid, and joined by an assembler. Canu, the assembler, tries to identify whether the sequence is duplicated, but this automatic identification is not always successful. The result of the identification is shown in the suggestCircular parameter of the consensus FASTA header (sequence name).

For an example of real data from a failed plasmid sample, see documentation for an earlier release.

Report

We see zero coverage for the affected segment on the coverage plot. In real samples, this could indicate recombination or DNA cloning error.

When there are large errors, then the VCF table can be ignored. In these cases the variant annotation plot is not displayed, and the consensus FASTA sequence derived from variant calls should not be used.

Please see the Appendix of the PDF report, the publication, and the Nextflow pipeline code and documentation for more details.

Results

The results.csv file contains these columns:

Barcode and Sample : the original sample sheet values.

Retained_bp_pct : the amount of data (basepairs) retained after quality filtering (bp %). To avoid unnecessary work, this is calculated from the results of the Preview pipeline. If the Preview pipeline was not run, NA is displayed in the column.

Aligned_read_pct : proportion of aligned filtered reads (%).

Aligned_reads : number of aligned reads.

Result : the result of the analysis.

Review_de_novo : a column to mark samples for the Review pipeline.

Setting parameters

Key pipeline parameters can be set by the user. The list of parameters and their default settings can be found in the nextflow.config file. The default parameters work very well for most sequencing runs, but we provide a few suggestions below for setting different values, depending on use-case.

--max_len_fraction=1.5 : the maximum read length cutoff is used to filter reads by NanoFilt. The default is 1.5x of the reference sequence length. Set this to a higher value to include plasmid dimers, or if the reference sequence is a short subsegment of the sequenced DNA.

--min_length=500 : the value is in number of nucleotides (bp). Most plasmids are longer than 1 kbp and linearisation or fragmentation results in mostly full length reads. Set the value lower if the DNA is shorter or the reads are more fragmented.

--quality_cutoff=10 : this PHRED quality score cutoff parameter is used by NanoFilt. Set to a higher value to work with better reads or if there is an overabundance of reads. Set to a lower value to use more reads if the reads are lower quality than usual, e.g. due to an issue with a sequencing run or sequencer software. Results of the Preview pipeline can be used for determining an appropriate cutoff.

--freebayes.args : these are used by the freebayes variant detector. Set --min-base-quality PHRED score cutoff higher/lower to use fewer/more bases during variant call. The variant call table (DP column) in the report can help in determining a different cutoff. If the values in the DP column are very low in contrast to high sequencing depth or coverage, then try a lower value such as 15. In most cases the default will be the most suitable.

--save_bam=false : by default, the BAM files are not output in the results folder. Set to true to change.

--save_filtered_fastq=false : filtered FASTQ files are also not output.

Summary of revising steps

  • Open a copy of results.csv in a spreadsheet software.
  • Create a new column and review the results for each sample in the PDF report.
  • Check number of reads and coverage. Insufficient coverage is usually marked with 'low coverage'.
  • On the histogram, check read length distribution (fragmentation). If there are full-length reads, then compare the peak against the expected length (vertical red line). Differences indicate large indels.
  • Inspect the de novo assembly analysis plot and table to find large indels.
  • Inspect the coverage plot for uncovered regions, which indicate deletions.
  • Interpret the cumulative plot of longest unaligned intervals to find large inserts.
  • Review the variant call format table and note variants. Note that large indels are not displayed in the table.
  • Reject or accept samples, depending on the requirements of the project.
  • For large indels, the Review pipeline can help in clarifying the nature of the error.

Review

nextflow run edinburgh-genome-foundry/Sequeduct -r v0.4.2 -entry review \
    --reference_dir='genbank' \
    --results_csv='results_reviewed.csv' \
    --projectname='EGF demo review' \
    --all_parts='parts_fasta/parts.fasta' \
    --assembly_plan='demo_assembly_plan.csv'

The Review pipeline aligns a user-defined list of sequences against a de novo plasmid sequence, and then reports the alignments. This is useful for evaluating plasmids that are constructed from parts to clarify whether we have part or sample mix-ups, recombination events or overhang (DNA "sticky end") misannealing in a Type IIS cloning (Golden Gate).

This pipeline can only be run after running the Analysis pipeline, as it uses the generated files. It requires the reference Genbank files, the sequences in a single FASTA file, and a sheet specifying which samples we want to review. This sample sheet is the results.csv file from the Analysis run, with requested samples marked with 1 in the Review_de_novo column. (Other marker value can be set with the --denovo_true parameter.) Optionally, an assembly plan can be specified, which lists which sequences we expect to be present for each reference sequence. This information is used in the report for an easier interpretation of results. The assembly plan CSV file must have header line: e.g. construct,parts.

The results are saved in the results/dir3_review directory. Please see the Appendix of the consensus review PDF report, or the de novo review PDF report for a description. The de novo consensus sequence is provided in the FASTA format.

Note that this example utilises data from an earlier release.

The report greets the user with a cover page:

Report

This is followed by one chapter for each sample:

Report

On this page, a plot of the aligning parts is displayed against the de novo assembled sequence. There are no feature annotations as the assembly was created from the reads, and the parts were supplied in unannotated FASTA format. The reference sequence is also aligned against the de novo assembly, shown in grey. The annotations are coloured based on the assembly plan. Green colour indicates expected parts, and red colour indicates unexpected parts.

We can see that the green parts cover the full region of the de novo sequence, and the grey reference also fully aligns, indicating that the assembly is correct. We note that the sequence assembled from reads do not (necessarily) have the same origin as our reference. (Circular sequences are stored in a "linear" format as a sequence of letters, and the first nucleotide is the origin of the sequence, as stored in the file.) The DNA parts were in a carrier backbone plasmid, and this was also supplied with the name 'part_carrier'. The ori region of this carrier plasmid is nearly identical to the ori of the backbone, 'HC_Amp_ccdB', therefore it aligns and shows on the plot. The software automatically recognises if the de novo assembly is in reverse complement to the reference, as noted on the top of the page.

The next chapter shows results for a failed Type IIS cloning (Golden Gate assembly) product:

Report

We can see that the part carrier backbone, displayed in red, was assembled into the product, instead of the intended insert. Upon inspection of the corresponding de novo assembly FASTA, we can find the presence of KanR, a resistance marker of the carrier plasmid, and a recognition site for the restriction enzyme used for the assembly (BsmBI). A separate part carrier ori sequence alignment is also present, as discussed above.

Note that the Review pipeline cannot handle two samples with the same name in the same run.


Assembly

nextflow run edinburgh-genome-foundry/Sequeduct -r v0.4.2 -entry assembly \
    --fastq_dir='fastq_pass' \
    --assembly_sheet='de_novo_assembly_sheet.csv'

This pipeline is currently under development, and does not yet contain a polishing step.

The standalone Assembly pipeline creates de novo assembly sequences, without any reference files. It requires the FASTQ files, and a sample sheet listing the barcodes and corresponding expected DNA (plasmid) length (in kbp). The results are saved in the results/dir4_assembly directory.


Demultiplex

nextflow run edinburgh-genome-foundry/Sequeduct -r v0.4.2 -entry demultiplex \
    --fastq_dir='fastq_pass' \
    --reference_dir='genbank' \
    --sample_sheet='sample_sheet.csv'

The optional Demultiplex pipeline is used for multiplexed (pooled) samples, where multiple plasmids are sequenced with the same barcode. For each sample, the pipeline aligns the FASTQ reads to the references, assigns each read to the best matching reference, then creates new samples and directories with FASTQ files that can be used in the Analysis pipeline.

This pipeline requires the sample sheet in the following CSV format:

Sample,Barcode_dir
EGF2_2;EGF2_2,barcode02
EGF2_13,barcode02

The references for the multiplexed samples are separated by a semicolon (;). The number of references per sample is not limited. The sample sheet may also include non-multiplexed ("singleplex") samples. For these, the corresponding FASTQ directories are simply copied to the output, without further analysis. This can be turned off using the parameter --singleplex_out=false. The pipeline also creates a new sample sheet that can be used with the other pipelines. The results are saved in the results/dir0_demultiplex directory.

Note that the new FASTQ directory names are derived from the original names, using a suffix (_1 etc). (If the FASTQ directories are named in such a way that they are suffixed versions of each other, then there may be a clash. The simplest way to prevent this is by avoiding underscore (_) characters in folder names, or by avoiding suffixing barcodes.)

If none of the samples are multiplexed, then this pipeline is not needed.


Citation

Biofoundry-scale DNA assembly validation using cost-effective high-throughput long-read sequencing, Peter Vegh, Sophie Donovan, Susan Rosser, Giovanni Stracquadanio, Rennos Fragkoudis. ACS Synthetic Biology (2024) 13, 2, 683–686


Notes

Please use the Sequeduct project's page to file any issues or comments. See the main page for other ways of contact.

Copyright 2023 Edinburgh Genome Foundry, University of Edinburgh

About

Demonstration data for the Sequeduct pipeline

Resources

Stars

Watchers

Forks

Languages