Merge branch 'master' into DOR-993_fix_auto_batchsize_for_short_chunk_supv5
iiSeymour committed Dec 17, 2024
2 parents 5cbfbcd + f55612c commit 24c274a
Showing 46 changed files with 1,406 additions and 391 deletions.
29 changes: 29 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,35 @@

All notable changes to Dorado will be documented in this file.

# [0.9.0] (16 Dec 2024)

This major release of Dorado introduces several new features and enhancements. The `polish` command, currently experimental, is optimised for refining draft assemblies of human genomes. This release also adds faster DNA modification calling models and improved 6mA false positive rate (FPR) in native human samples. Barcode demultiplexing accuracy has been significantly enhanced for kits with barcodes at both ends, including `SQK-NBD114`. Note that using custom barcode kits now requires the `--kit-name` option. A feature has been added to enable running `dorado correct` in blocks, allowing work to be divided into smaller pieces for easy submission to a compute cluster. Additional updates include the `qs` tag for mean basecall Q-scores in FASTQ output, an upgrade to POD5 to support systems with large page sizes, improvements to Poly(A) tail length estimation, and various bug fixes to enhance stability and functionality.

* 2b96c0b3d421e7729b6189f334cd7c1be50d53a6 - New Dorado `polish` feature for assembly polishing
* 0bab1669df18689aafbc6d42403aedacc36d297d - Faster modified base models for DNA `4mC_5mC`, `5mC_5hmC`, `5mCG_5hmCG`, and `6mA`
* e6371667f8c2fbd66e2ef5b1b91f36d47b3767f5 - Enable running dorado correct in blocks, for easy submission to a compute cluster
* 40296da37d815a1ae2dc7f5739daeabd76dc5767 - Reduced false positive classification rates for kits with barcodes at both ends
* 35da003bfc50afdeef0462cfd15f2d2fc237e538 - Improve barcode classification when barcodes can be on either end
* cbcdf38faaac21dc6d805ed70e14e1786b739d02 - Only classify barcodes which are present on sample sheet if provided
* 2449d03c23c577fb281e9fa02d414289ae8e2c08 - Correct `AF02F_14` and `AH10R_80` barcodes from `TWIST-96A-UDI`
* 631e94c823463156188d0e3364505c7f39d3327a - Prevent Dorado `demux` from stripping alignment information when `--no-trim` is specified
* affea85594a405a2cb162637e1c22e8a71ac7cc5 - Prevent missing filenames when using `--emit-summary` with Dorado `demux`
* 3dec15a4ae6f2acc7953c8f81e34bae63a659372 - Improve poly(A) tail estimation accuracy, including with interrupted tails
* df57d34665d8c6ab6d45bc893c5f5ac95d54f58e - Limit poly(A) estimation to reads with plausible signal to prevent stalls in calculation
* 6cf701a825959cf14079b976f2199e0643b450b2 - Add `min_primer_separation` option to custom poly(A) configuration
* bf51bd492618a896afc33cbe246f76e8f62e852f - Add `qs` tag with mean basecall Q-score to FASTQ output
* dac076de03f323bef273a159debfd458629befa6 - Upgrade to POD5 v0.3.23 to support systems with large page sizes for POD5 and .fast5
* c7a7a58e9f5f264ae2c57982fe02bc6bb28fc6bd - Prevent silent failure or segfault on Windows with bad custom barcode files
* 1e829d5494d5b3cf6c7bbc67697d0561e966c7d0 - Do not allow basecalling if target directory includes both POD5 and .fast5 files
* 05d0981cbe5a05ca910806ace5bfe02fd8baef00 - Fix modified base trim for reverse-aligned BAM records
* afdb06837706e69d06095c7652ef3fcea700bfa4 - Fix invalid `MM` tag after trimming when no mods are present
* 0d788d7df7edcbc2e828985b5ff894f5535f5e49 - Prevent crash when insufficient permissions to read an input file/folder
* dbece016fe88c1fded10162fd770bd1bdbf5ebed - Update custom barcoding documentation to accurately reflect demultiplexing logic
* 6db40ec1f33b2a2974d475f701e3e48ee745421e - Correct model context info shown in `dorado download --list-structured`
* 03acc12855fc944e071fef98ce89c99171ef465e - Use the `-o` short option only for `--output-dir` and not for `--overlap`
* 8d9c017097b6dc5fa6b3f2f40f2a9851f88383cb - Added support for reading gzipped compressed FASTQ files


# [0.8.3] (11 Nov 2024)

This release of Dorado includes fixes and improvements to the Dorado 0.8.2 release, including a fix to SUP basecalling on Apple Silicon.
2 changes: 2 additions & 0 deletions CMakeLists.txt
@@ -177,6 +177,8 @@ add_library(dorado_lib
dorado/api/runner_creation.h
dorado/api/pipeline_creation.cpp
dorado/api/pipeline_creation.h
dorado/demux/adapter_primer_kits.cpp
dorado/demux/adapter_primer_kits.h
dorado/demux/adapter_info.h
dorado/demux/AdapterDetector.cpp
dorado/demux/AdapterDetector.h
8 changes: 4 additions & 4 deletions README.md
@@ -24,10 +24,10 @@ If you encounter any problems building or running Dorado, please [report an issu

First, download the relevant installer for your platform:

- [dorado-0.8.3-linux-x64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.8.3-linux-x64.tar.gz)
- [dorado-0.8.3-linux-arm64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.8.3-linux-arm64.tar.gz)
- [dorado-0.8.3-osx-arm64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.8.3-osx-arm64.zip)
- [dorado-0.8.3-win64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.8.3-win64.zip)
- [dorado-0.9.0-linux-x64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.9.0-linux-x64.tar.gz)
- [dorado-0.9.0-linux-arm64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.9.0-linux-arm64.tar.gz)
- [dorado-0.9.0-osx-arm64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.9.0-osx-arm64.zip)
- [dorado-0.9.0-win64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.9.0-win64.zip)

Once the relevant `.tar.gz` or `.zip` archive is downloaded, extract the archive to your desired location.

2 changes: 1 addition & 1 deletion cmake/HDF5.cmake
@@ -3,7 +3,7 @@ option(DYNAMIC_HDF "Link HDF as dynamic libs" OFF)
if((CMAKE_SYSTEM_NAME STREQUAL "Linux") AND (CMAKE_SYSTEM_PROCESSOR STREQUAL "aarch64"))
# download the pacakge for arm, we want to package this due to hdf5's dependencies
set(DYNAMIC_HDF ON)
set(HDF_VER hdf5-1.10.0-1-aarch64)
set(HDF_VER hdf5-1.10.0-aarch64)
download_and_extract(https://cdn.oxfordnanoportal.com/software/analysis/${HDF_VER}.zip ${HDF_VER})
list(PREPEND CMAKE_PREFIX_PATH ${DORADO_3RD_PARTY_DOWNLOAD}/${HDF_VER}/${HDF_VER})

56 changes: 56 additions & 0 deletions documentation/CustomPrimers.md
@@ -0,0 +1,56 @@
### Custom Adapter and Primer Sequences

Dorado will normally automatically detect and trim any adapter or primer sequences it finds. The specific sequences it searches for depend on the specified sequencing kit. This applies to both the basecaller subcommand, where the kit name is expected to be embedded in the read in the input pod5 file, and the trim subcommand, where the kit must be specified as a command-line option to dorado.

In some cases, it may be necessary to find and remove adapter and/or primer sequences that would not normally be associated with the sequencing kit that was used, or you may be working with older data for which the sequencing kit and/or primers being used are no longer directly supported by dorado (for example, anything prior to kit14). In such cases, you can specify a custom adapter/primer file, using the command-line option `--primer-sequences`.

If this option is used, then the sequences encoded in the specified file will be used instead of the built-in sequences that dorado normally searches for.

#### Custom adapter/primer file format

The custom adapter/primer file is really just a fasta file, with the desired sequences specified within. However, some additional metadata is needed to allow dorado to properly interpret how the sequences should be used.

* The record name for each sequence must be of the form `[id]_front` or `[id]_rear`.
* The `id` part of the record name may occur, at most, twice in the file: Once with `_front` and once with `_rear`.
* Immediately following the record name must be a space, followed by either `type=adapter` or `type=primer`.
* Following the type designator, you can have an additional space, followed by `kits=[kit1],[kit2],[kit3][etc...]`.

The `_front` and `_rear` parts of the record name tell dorado how to search for the sequence. In the case of adapters, dorado will look for the `front` sequence near the beginning of the read, and for the `rear` sequence near the end of the read. For primers, dorado also looks for the `front` and `rear` sequences at the beginning and end of the read, just as with adapters, but it will also look for the reverse-complement of the `rear` sequence near the beginning of the read, and for the reverse-complement of the `front` sequence near the end of the read.
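
The search rules in the paragraph above can be sketched in C++ as follows. This is only an illustration, not dorado's implementation: `reverse_complement`, `ReadEnd`, `Query` and `build_queries` are hypothetical names, and the sketch assumes sequences contain only upper-case `A`, `C`, `G` and `T`.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: reverse-complement an upper-case DNA sequence.
std::string reverse_complement(const std::string& seq) {
    std::string out(seq.rbegin(), seq.rend());
    for (char& base : out) {
        switch (base) {
        case 'A': base = 'T'; break;
        case 'C': base = 'G'; break;
        case 'G': base = 'C'; break;
        case 'T': base = 'A'; break;
        default: throw std::invalid_argument("unexpected base in sequence");
        }
    }
    return out;
}

// Which end of the read a sequence is expected near.
enum class ReadEnd { Start, End };

struct Query {
    std::string sequence;
    ReadEnd where;
};

// Adapters: front near the start, rear near the end.
// Primers: the same two searches, plus the reverse-complement of the rear
// sequence near the start and of the front sequence near the end.
std::vector<Query> build_queries(const std::string& front,
                                 const std::string& rear,
                                 bool is_primer) {
    std::vector<Query> queries = {{front, ReadEnd::Start}, {rear, ReadEnd::End}};
    if (is_primer) {
        queries.push_back({reverse_complement(rear), ReadEnd::Start});
        queries.push_back({reverse_complement(front), ReadEnd::End});
    }
    return queries;
}
```

Under these assumptions an adapter pair yields two searches and a primer pair yields four.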

The `type` designator is required to indicate whether the sequence is an adapter or a primer sequence, so that dorado knows how it should be used.

The `kits` designator is optional. If provided, then the sequence will only be searched for if the sequencing-kit information in the read matches one of the kit names in the custom file. If the `kits` designator is not provided, then the sequence will be searched for in all reads, regardless of the kit that was used. Note that the kit names are case-insensitive.
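
As a rough illustration of the record-name rules above, the following C++ sketch parses a single header line. It is not the parser dorado uses: `HeaderInfo` and `parse_header` are hypothetical names, and the sketch assumes the fields are separated by whitespace and appear in the order `name`, `type`, `kits`.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for the metadata carried by one header line.
struct HeaderInfo {
    std::string id;
    bool is_front = false;
    std::string type;               // "adapter" or "primer"
    std::vector<std::string> kits;  // empty means "search in all reads"
};

HeaderInfo parse_header(const std::string& line) {
    if (line.empty() || line[0] != '>') {
        throw std::invalid_argument("header must start with '>'");
    }
    std::istringstream fields(line.substr(1));
    std::string name, token;
    fields >> name;

    // The record name must end in _front or _rear.
    auto ends_with = [&name](const std::string& suffix) {
        return name.size() > suffix.size() &&
               name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
    };
    HeaderInfo info;
    const std::string front = "_front", rear = "_rear";
    if (ends_with(front)) {
        info.is_front = true;
        info.id = name.substr(0, name.size() - front.size());
    } else if (ends_with(rear)) {
        info.id = name.substr(0, name.size() - rear.size());
    } else {
        throw std::invalid_argument("record name must end in _front or _rear");
    }

    // The type designator is mandatory.
    if (!(fields >> token) || (token != "type=adapter" && token != "type=primer")) {
        throw std::invalid_argument("expected type=adapter or type=primer");
    }
    info.type = token.substr(5);  // drop the "type=" prefix

    // The kits designator is optional: a comma-separated list of kit names.
    if (fields >> token) {
        if (token.rfind("kits=", 0) != 0) {
            throw std::invalid_argument("expected kits=[kit1],[kit2],...");
        }
        std::istringstream kit_list(token.substr(5));
        std::string kit;
        while (std::getline(kit_list, kit, ',')) {
            if (!kit.empty()) info.kits.push_back(kit);
        }
    }
    return info;
}
```

A caller would still need to check that each `id` appears at most once with `_front` and at most once with `_rear`, which this sketch does not do.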

#### Example custom adapter/primer file.

The following could be used to detect the PCR_PSK_rev1 and PCR_PSK_rev2 primers, along with the LSK109 adapters, for older data.

```
>LSK109_front type=adapter
AATGTACTTCGTTCAGTTACGTATTGCT
>LSK109_rear type=adapter
AGCAATACGTAACTGAACGAAGT
>PCR_PSK_front type=primer
ACTTGCCTGTCGCTCTATCTTCGGCGTCTGCTTGGGTGTTTAACC
>PCR_PSK_rear type=primer
AGGTTAAACACCCAAGCAGACGCCGCAATATCAGCACCAACAGAAA
```

In this case, the above adapters and primers would be searched for in all reads, regardless of the sequencing-kit information encoded in the read file, or in the case of dorado trim, regardless of the sequencing-kit specified on the command-line. If you wanted to restrict the software so that the primers would only be searched for in reads with `SQK-PSK004` specified as the kit name, and the adapters would only be searched for if the kit name was specified as either `SQK-PSK004` or `SQK-LSK109`, then the following could be used.

```
>LSK109_front type=adapter kits=SQK-PSK004,SQK-LSK109
AATGTACTTCGTTCAGTTACGTATTGCT
>LSK109_rear type=adapter kits=SQK-PSK004,SQK-LSK109
AGCAATACGTAACTGAACGAAGT
>PCR_PSK_front type=primer kits=SQK-PSK004
ACTTGCCTGTCGCTCTATCTTCGGCGTCTGCTTGGGTGTTTAACC
>PCR_PSK_rear type=primer kits=SQK-PSK004
AGGTTAAACACCCAAGCAGACGCCGCAATATCAGCACCAACAGAAA
```
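
The kit restriction in the second example can be pictured with the C++ sketch below. It is an illustration under stated assumptions rather than dorado's code: `to_upper` and `applies_to_kit` are hypothetical helpers, and the case-insensitive match is implemented here by upper-casing both names.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: upper-case a kit name so comparisons ignore case.
std::string to_upper(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return s;
}

// Returns true if a sequence with the given `kits` list should be searched
// for in a read whose kit name is `read_kit`.
bool applies_to_kit(const std::vector<std::string>& kits, const std::string& read_kit) {
    if (kits.empty()) {
        return true;  // no kits designator: search in all reads
    }
    const std::string wanted = to_upper(read_kit);
    return std::any_of(kits.begin(), kits.end(),
                       [&wanted](const std::string& kit) { return to_upper(kit) == wanted; });
}
```

With the second example file, the `PCR_PSK` primers would be considered only for reads from `SQK-PSK004`, while the `LSK109` adapters would be considered for both `SQK-PSK004` and `SQK-LSK109`.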
7 changes: 5 additions & 2 deletions dorado/cli/basecaller.cpp
@@ -760,8 +760,11 @@ int basecaller(int argc, char* argv[]) {
parser.visible.present<std::string>("--barcode-sequences");
if (custom_seqs.has_value()) {
try {
std::unordered_map<std::string, std::string> custom_barcodes =
demux::parse_custom_sequences(*custom_seqs);
std::unordered_map<std::string, std::string> custom_barcodes;
auto custom_sequences = demux::parse_custom_sequences(*custom_seqs);
for (const auto& entry : custom_sequences) {
custom_barcodes.emplace(std::make_pair(entry.name, entry.sequence));
}
barcode_kits::add_custom_barcodes(custom_barcodes);
} catch (const std::exception& e) {
spdlog::error(e.what());
7 changes: 5 additions & 2 deletions dorado/cli/demux.cpp
@@ -269,8 +269,11 @@ int demuxer(int argc, char* argv[]) {
parser.visible.present<std::string>("--barcode-sequences");
if (custom_seqs.has_value()) {
try {
std::unordered_map<std::string, std::string> custom_barcodes =
demux::parse_custom_sequences(*custom_seqs);
std::unordered_map<std::string, std::string> custom_barcodes;
auto custom_sequences = demux::parse_custom_sequences(*custom_seqs);
for (const auto& entry : custom_sequences) {
custom_barcodes.emplace(std::make_pair(entry.name, entry.sequence));
}
barcode_kits::add_custom_barcodes(custom_barcodes);
} catch (const std::exception& e) {
spdlog::error(e.what());
6 changes: 0 additions & 6 deletions dorado/cli/polish.cpp
@@ -588,9 +588,6 @@ const polisher::ModelConfig resolve_model(const polisher::BamInfo& bam_info,
// Example: dna_r10.4.1_e8.2_400bps_hac@v5.0.0_polish_rl_mv
std::string model_name = basecaller_model + polish_model_suffix;

// Example: dna_r10.4.1_e8.2_400bps_hac_v5.0.0_polish_rl_mv
std::replace(std::begin(model_name), std::end(model_name), '@', '_');

spdlog::info("Downloading model: '{}'", model_name);
model_dir = download_model(model_name);

@@ -613,9 +610,6 @@ const polisher::ModelConfig resolve_model(const polisher::BamInfo& bam_info,
// Example: dna_r10.4.1_e8.2_400bps_hac@v5.0.0_polish_rl_mv
std::string model_name = basecaller_model + polish_model_suffix;

// Example: dna_r10.4.1_e8.2_400bps_hac_v5.0.0_polish_rl_mv
std::replace(std::begin(model_name), std::end(model_name), '@', '_');

spdlog::info("Downloading model: '{}'", model_name);
model_dir = download_model(model_name);

16 changes: 14 additions & 2 deletions dorado/cli/trim.cpp
@@ -40,15 +40,16 @@ int trim(int argc, char* argv[]) {
.nargs(argparse::nargs_pattern::any);
parser.add_argument("-t", "--threads")
.help("Combined number of threads for adapter/primer detection and output generation. "
"Default uses "
"all available threads.")
"Default uses all available threads.")
.default_value(0)
.scan<'i', int>();
parser.add_argument("-n", "--max-reads")
.help("Maximum number of reads to process. Mainly for debugging. Process all reads by "
"default.")
.default_value(0)
.scan<'i', int>();
parser.add_argument("-k", "--sequencing-kit")
.help("Sequencing kit name to use for selecting adapters and primers to trim.");
parser.add_argument("-l", "--read-ids")
.help("A file with a newline-delimited list of reads to trim.")
.default_value(std::string(""));
@@ -82,6 +83,16 @@ int trim(int argc, char* argv[]) {
utils::SetVerboseLogging(static_cast<dorado::utils::VerboseLogLevel>(verbosity));
}

if (!parser.is_used("--sequencing-kit")) {
spdlog::error("The sequencing kit name must be specified with --sequencing-kit.");
return EXIT_FAILURE;
}
auto kit_name = parser.get<std::string>("--sequencing-kit");
if (kit_name.empty()) {
spdlog::error("Sequencing kit name must be non-empty.");
return EXIT_FAILURE;
}

auto reads(parser.get<std::vector<std::string>>("reads"));
auto threads(parser.get<int>("threads"));
auto max_reads(parser.get<int>("max-reads"));
@@ -146,6 +157,7 @@ int trim(int argc, char* argv[]) {
auto adapter_info = std::make_shared<demux::AdapterInfo>();
adapter_info->trim_adapters = true;
adapter_info->trim_primers = !parser.get<bool>("--no-trim-primers");
adapter_info->kit_name = kit_name;
adapter_info->custom_seqs = custom_primer_file;

auto client_info = std::make_shared<DefaultClientInfo>();
1 change: 1 addition & 0 deletions dorado/data_loader/DataLoader.cpp
@@ -180,6 +180,7 @@ SimplexReadPtr process_pod5_thread_fn(
new_read->start_sample = read_data.start_sample;
new_read->end_sample = read_data.start_sample + read_data.num_samples;
new_read->read_common.flowcell_id = run_info_data->flow_cell_id;
new_read->read_common.sequencing_kit = run_info_data->sequencing_kit;
new_read->read_common.flow_cell_product_code = run_info_data->flow_cell_product_code;
new_read->read_common.position_id = run_info_data->sequencer_position;
new_read->read_common.experiment_id = run_info_data->experiment_name;