Commit 24c274a

Merge branch 'master' into DOR-993_fix_auto_batchsize_for_short_chunk_supv5
2 parents 5cbfbcd + f55612c commit 24c274a

46 files changed, +1406 -391 lines changed

CHANGELOG.md

Lines changed: 29 additions & 0 deletions
@@ -2,6 +2,35 @@
 
 All notable changes to Dorado will be documented in this file.
 
+# [0.9.0] (16 Dec 2024)
+
+This major release of Dorado introduces several new features and enhancements. The `polish` command, currently experimental, is optimised for refining draft assemblies of human genomes. This release also adds faster DNA modification calling models and improved 6mA false positive rate (FPR) in native human samples. Barcode demultiplexing accuracy has been significantly enhanced for kits with barcodes at both ends, including `SQK-NBD114`. Note that using custom barcode kits now requires the `--kit-name` option. A feature has been added to enable running `dorado correct` in blocks, allowing work to be divided into smaller pieces for easy submission to a compute cluster. Additional updates include the `qs` tag for mean basecall Q-scores in FASTQ output, an upgrade to POD5 to support systems with large page sizes, improvements to Poly(A) tail length estimation, and various bug fixes to enhance stability and functionality.
+
+* 2b96c0b3d421e7729b6189f334cd7c1be50d53a6 - New Dorado `polish` feature for assembly polishing
+* 0bab1669df18689aafbc6d42403aedacc36d297d - Faster modified base models for DNA `4mC_5mC`, `5mC_5hmC`, `5mCG_5hmCG`, and `6mA`
+* e6371667f8c2fbd66e2ef5b1b91f36d47b3767f5 - Enable running dorado correct in blocks, for easy submission to a compute cluster
+* 40296da37d815a1ae2dc7f5739daeabd76dc5767 - Reduced false positive classification rates for kits with barcodes at both ends
+* 35da003bfc50afdeef0462cfd15f2d2fc237e538 - Improve barcode classification when barcodes can be on either end
+* cbcdf38faaac21dc6d805ed70e14e1786b739d02 - Only classify barcodes which are present on sample sheet if provided
+* 2449d03c23c577fb281e9fa02d414289ae8e2c08 - Correct `AF02F_14` and `AH10R_80` barcodes from `TWIST-96A-UDI`
+* 631e94c823463156188d0e3364505c7f39d3327a - Prevent Dorado `demux` from stripping alignment information when `--no-trim` is specified
+* affea85594a405a2cb162637e1c22e8a71ac7cc5 - Prevent missing filenames when using `--emit-summary` with Dorado `demux`
+* 3dec15a4ae6f2acc7953c8f81e34bae63a659372 - Improve poly(A) tail estimation accuracy, including with interrupted tails
+* df57d34665d8c6ab6d45bc893c5f5ac95d54f58e - Limit poly(A) estimation to reads with plausible signal to prevent stalls in calculation
+* 6cf701a825959cf14079b976f2199e0643b450b2 - Add `min_primer_separation` option to custom poly(A) configuration
+* bf51bd492618a896afc33cbe246f76e8f62e852f - Add `qs` tag with mean basecall Q-score to FASTQ output
+* dac076de03f323bef273a159debfd458629befa6 - Upgrade to POD5 v0.3.23 to support systems with large page sizes for POD5 and .fast5
+* c7a7a58e9f5f264ae2c57982fe02bc6bb28fc6bd - Prevent silent failure or segfault on Windows with bad custom barcode files
+* 1e829d5494d5b3cf6c7bbc67697d0561e966c7d0 - Do not allow basecalling if target directory includes both POD5 and .fast5 files
+* 05d0981cbe5a05ca910806ace5bfe02fd8baef00 - Fix modified base trim for reverse-aligned BAM records
+* afdb06837706e69d06095c7652ef3fcea700bfa4 - Fix invalid `MM` tag after trimming when no mods are present
+* 0d788d7df7edcbc2e828985b5ff894f5535f5e49 - Prevent crash when insufficient permissions to read an input file/folder
+* dbece016fe88c1fded10162fd770bd1bdbf5ebed - Update custom barcoding documentation to accurately reflect demultiplexing logic
+* 6db40ec1f33b2a2974d475f701e3e48ee745421e - Correct model context info shown in `dorado download --list-structured`
+* 03acc12855fc944e071fef98ce89c99171ef465e - Use the `-o` short option only for `--output-dir` and not for `--overlap`
+* 8d9c017097b6dc5fa6b3f2f40f2a9851f88383cb - Added support for reading gzipped compressed FASTQ files
+
+
 # [0.8.3] (11 Nov 2024)
 
 This release of Dorado includes fixes and improvements to the Dorado 0.8.2 release, including a fix to SUP basecalling on Apple Silicon.

CMakeLists.txt

Lines changed: 2 additions & 0 deletions
@@ -177,6 +177,8 @@ add_library(dorado_lib
     dorado/api/runner_creation.h
     dorado/api/pipeline_creation.cpp
     dorado/api/pipeline_creation.h
+    dorado/demux/adapter_primer_kits.cpp
+    dorado/demux/adapter_primer_kits.h
     dorado/demux/adapter_info.h
     dorado/demux/AdapterDetector.cpp
     dorado/demux/AdapterDetector.h

README.md

Lines changed: 4 additions & 4 deletions
@@ -24,10 +24,10 @@ If you encounter any problems building or running Dorado, please [report an issu
 
 First, download the relevant installer for your platform:
 
-- [dorado-0.8.3-linux-x64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.8.3-linux-x64.tar.gz)
-- [dorado-0.8.3-linux-arm64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.8.3-linux-arm64.tar.gz)
-- [dorado-0.8.3-osx-arm64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.8.3-osx-arm64.zip)
-- [dorado-0.8.3-win64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.8.3-win64.zip)
+- [dorado-0.9.0-linux-x64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.9.0-linux-x64.tar.gz)
+- [dorado-0.9.0-linux-arm64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.9.0-linux-arm64.tar.gz)
+- [dorado-0.9.0-osx-arm64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.9.0-osx-arm64.zip)
+- [dorado-0.9.0-win64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.9.0-win64.zip)
 
 Once the relevant `.tar.gz` or `.zip` archive is downloaded, extract the archive to your desired location.

cmake/HDF5.cmake

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ option(DYNAMIC_HDF "Link HDF as dynamic libs" OFF)
 if((CMAKE_SYSTEM_NAME STREQUAL "Linux") AND (CMAKE_SYSTEM_PROCESSOR STREQUAL "aarch64"))
     # download the package for arm, we want to package this due to hdf5's dependencies
     set(DYNAMIC_HDF ON)
-    set(HDF_VER hdf5-1.10.0-1-aarch64)
+    set(HDF_VER hdf5-1.10.0-aarch64)
     download_and_extract(https://cdn.oxfordnanoportal.com/software/analysis/${HDF_VER}.zip ${HDF_VER})
     list(PREPEND CMAKE_PREFIX_PATH ${DORADO_3RD_PARTY_DOWNLOAD}/${HDF_VER}/${HDF_VER})

documentation/CustomPrimers.md

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+### Custom Adapter and Primer Sequences
+
+Dorado will normally automatically detect and trim any adapter or primer sequences it finds. The specific sequences it searches for depend on the specified sequencing kit. This applies to both the basecaller subcommand, where the kit name is expected to be embedded in the read in the input pod5 file, and the trim subcommand, where the kit must be specified as a command-line option to dorado.
+
+In some cases, it may be necessary to find and remove adapter and/or primer sequences that would not normally be associated with the sequencing kit that was used, or you may be working with older data for which the sequencing kit and/or primers being used are no longer directly supported by dorado (for example, anything prior to kit14). In such cases, you can specify a custom adapter/primer file, using the command-line option `--primer-sequences`.
+
+If this option is used, then the sequences encoded in the specified file will be used instead of the built-in sequences that dorado normally searches for.
+
+#### Custom adapter/primer file format
+
+The custom adapter/primer file is really just a fasta file, with the desired sequences specified within. However, some additional metadata is needed to allow dorado to properly interpret how the sequences should be used.
+
+* The record name for each sequence must be of the form `[id]_front` or `[id]_rear`.
+* The `id` part of the record name may occur, at most, twice in the file: Once with `_front` and once with `_rear`.
+* Immediately following the record name must be a space, followed by either `type=adapter` or `type=primer`.
+* Following the type designator, you can have an additional space, followed by `kits=[kit1],[kit2],[kit3][etc...]`.
+
+The `_front` and `_rear` part of the record name tells dorado how to search for the sequence. In the case of adapters, dorado will look for the `front` sequence near the beginning of the read, and for the `rear` sequence near the end of the read. For primers, dorado also looks for the `front` and `rear` sequences at the beginning and end of the read, just as with adapters, but it will also look for the reverse-complement of the `rear` sequence near the beginning of the read, and for the reverse-complement of the `front` sequence near the end of the read.
+
+The `type` designator is required to designate whether the sequence is an adapter or a primer sequence, so that dorado knows how it should be used.
+
+The `kits` designator is optional. If provided, then the sequence will only be searched for if the sequencing-kit information in the read matches one of the kit names in the custom file. If the `kits` designator is not provided, then the sequence will be searched for in all reads, regardless of the kit that was used. Note that the kit names are case-insensitive.
+
+#### Example custom adapter/primer file.
+
+The following could be used to detect the PCR_PSK_rev1 and PCR_PSK_rev2 primers, along with the LSK109 adapters, for older data.
+
+```
+>LSK109_front type=adapter
+AATGTACTTCGTTCAGTTACGTATTGCT
+
+>LSK109_rear type=adapter
+AGCAATACGTAACTGAACGAAGT
+
+>PCR_PSK_front type=primer
+ACTTGCCTGTCGCTCTATCTTCGGCGTCTGCTTGGGTGTTTAACC
+
+>PCR_PSK_rear type=primer
+AGGTTAAACACCCAAGCAGACGCCGCAATATCAGCACCAACAGAAA
+```
+
+In this case, the above adapters and primers would be searched for in all reads, regardless of the sequencing-kit information encoded in the read file, or in the case of dorado trim, regardless of the sequencing-kit specified on the command-line. If you wanted to restrict the software so that the primers would only be searched for in reads with `SQK-PSK004` specified as the kit name, and the adapters would only be searched for if the kit name was specified as either `SQK-PSK004` or `SQK-LSK109`, then the following could be used.
+
+```
+>LSK109_front type=adapter kits=SQK-PSK004,SQK-LSK109
+AATGTACTTCGTTCAGTTACGTATTGCT
+
+>LSK109_rear type=adapter kits=SQK-PSK004,SQK-LSK109
+AGCAATACGTAACTGAACGAAGT
+
+>PCR_PSK_front type=primer kits=SQK-PSK004
+ACTTGCCTGTCGCTCTATCTTCGGCGTCTGCTTGGGTGTTTAACC
+
+>PCR_PSK_rear type=primer kits=SQK-PSK004
+AGGTTAAACACCCAAGCAGACGCCGCAATATCAGCACCAACAGAAA
+```
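
Review note: the header rules described in `CustomPrimers.md` (an `[id]_front` or `[id]_rear` record name, a mandatory `type=` field, an optional comma-separated `kits=` field matched case-insensitively) can be sketched as a small standalone parser. This is only an illustration of the documented format, not Dorado's implementation: `PrimerHeader`, `parse_header`, and `applies_to_kit` are hypothetical names, and the sketch accepts any whitespace between fields, which is more lenient than the single space the document asks for.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type for one FASTA header; not Dorado's own structure.
struct PrimerHeader {
    std::string id;                 // e.g. "LSK109"
    bool is_front = false;          // true for "_front", false for "_rear"
    std::string type;               // "adapter" or "primer"
    std::vector<std::string> kits;  // upper-cased; empty means "all kits"
};

static std::string to_upper(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return s;
}

static bool has_suffix(const std::string& s, const std::string& suffix) {
    return s.size() > suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Parses a line such as ">LSK109_front type=adapter kits=SQK-PSK004,SQK-LSK109".
// Returns false if the line does not follow the documented layout.
bool parse_header(const std::string& line, PrimerHeader& out) {
    out = PrimerHeader{};
    if (line.empty() || line[0] != '>') return false;

    std::istringstream fields(line.substr(1));
    std::string name, type_field, kits_field;
    if (!(fields >> name >> type_field)) return false;
    fields >> kits_field;  // optional third field; stays empty if absent

    if (has_suffix(name, "_front")) {
        out.is_front = true;
        out.id = name.substr(0, name.size() - 6);
    } else if (has_suffix(name, "_rear")) {
        out.id = name.substr(0, name.size() - 5);
    } else {
        return false;
    }

    if (type_field != "type=adapter" && type_field != "type=primer") return false;
    out.type = type_field.substr(5);  // drop the "type=" prefix

    if (!kits_field.empty()) {
        if (kits_field.compare(0, 5, "kits=") != 0) return false;
        std::istringstream kits(kits_field.substr(5));
        std::string kit;
        while (std::getline(kits, kit, ',')) {
            if (!kit.empty()) out.kits.push_back(to_upper(kit));
        }
    }
    return true;
}

// A sequence with no kits list applies to every read; otherwise the read's
// kit name must match one of the listed kits, ignoring case.
bool applies_to_kit(const PrimerHeader& header, const std::string& read_kit) {
    if (header.kits.empty()) return true;
    return std::find(header.kits.begin(), header.kits.end(), to_upper(read_kit)) !=
           header.kits.end();
}
```

Upper-casing kit names once at parse time keeps the per-read check a plain string comparison, which is one simple way to honour the case-insensitive matching the document describes.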

dorado/cli/basecaller.cpp

Lines changed: 5 additions & 2 deletions
@@ -760,8 +760,11 @@ int basecaller(int argc, char* argv[]) {
             parser.visible.present<std::string>("--barcode-sequences");
     if (custom_seqs.has_value()) {
         try {
-            std::unordered_map<std::string, std::string> custom_barcodes =
-                    demux::parse_custom_sequences(*custom_seqs);
+            std::unordered_map<std::string, std::string> custom_barcodes;
+            auto custom_sequences = demux::parse_custom_sequences(*custom_seqs);
+            for (const auto& entry : custom_sequences) {
+                custom_barcodes.emplace(std::make_pair(entry.name, entry.sequence));
+            }
             barcode_kits::add_custom_barcodes(custom_barcodes);
         } catch (const std::exception& e) {
             spdlog::error(e.what());

dorado/cli/demux.cpp

Lines changed: 5 additions & 2 deletions
@@ -269,8 +269,11 @@ int demuxer(int argc, char* argv[]) {
             parser.visible.present<std::string>("--barcode-sequences");
     if (custom_seqs.has_value()) {
         try {
-            std::unordered_map<std::string, std::string> custom_barcodes =
-                    demux::parse_custom_sequences(*custom_seqs);
+            std::unordered_map<std::string, std::string> custom_barcodes;
+            auto custom_sequences = demux::parse_custom_sequences(*custom_seqs);
+            for (const auto& entry : custom_sequences) {
+                custom_barcodes.emplace(std::make_pair(entry.name, entry.sequence));
+            }
             barcode_kits::add_custom_barcodes(custom_barcodes);
         } catch (const std::exception& e) {
             spdlog::error(e.what());
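
Review note: `basecaller.cpp` and `demux.cpp` now carry the same conversion loop. A shared helper along the following lines might remove the duplication. This is a sketch under assumptions: `CustomSequence` is a hypothetical stand-in for whatever element type `demux::parse_custom_sequences` returns (the diff only shows that each entry has `name` and `sequence` members), and `to_barcode_map` is not an existing Dorado function.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the parsed entry type; only the two members
// used by the loop in the diff are modelled here.
struct CustomSequence {
    std::string name;
    std::string sequence;
};

// Builds the name -> sequence map used for custom barcodes.
// emplace() keeps the first entry when a name appears more than once,
// which matches the behaviour of the loop in the diff.
std::unordered_map<std::string, std::string> to_barcode_map(
        const std::vector<CustomSequence>& entries) {
    std::unordered_map<std::string, std::string> barcodes;
    barcodes.reserve(entries.size());
    for (const auto& entry : entries) {
        barcodes.emplace(entry.name, entry.sequence);
    }
    return barcodes;
}
```

Passing `entry.name, entry.sequence` directly to `emplace` constructs the pair in place, so the `std::make_pair` wrapper in the diff is not needed for this to work.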

dorado/cli/polish.cpp

Lines changed: 0 additions & 6 deletions
@@ -588,9 +588,6 @@ const polisher::ModelConfig resolve_model(const polisher::BamInfo& bam_info,
         // Example: dna_r10.4.1_e8.2_400bps_hac@v5.0.0_polish_rl_mv
         std::string model_name = basecaller_model + polish_model_suffix;
 
-        // Example: dna_r10.4.1_e8.2_400bps_hac_v5.0.0_polish_rl_mv
-        std::replace(std::begin(model_name), std::end(model_name), '@', '_');
-
         spdlog::info("Downloading model: '{}'", model_name);
         model_dir = download_model(model_name);
 
@@ -613,9 +610,6 @@ const polisher::ModelConfig resolve_model(const polisher::BamInfo& bam_info,
         // Example: dna_r10.4.1_e8.2_400bps_hac@v5.0.0_polish_rl_mv
         std::string model_name = basecaller_model + polish_model_suffix;
 
-        // Example: dna_r10.4.1_e8.2_400bps_hac_v5.0.0_polish_rl_mv
-        std::replace(std::begin(model_name), std::end(model_name), '@', '_');
-
         spdlog::info("Downloading model: '{}'", model_name);
         model_dir = download_model(model_name);
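
Review note: the effect of dropping the `std::replace` call can be seen in isolation with the sketch below. It assumes the basecaller model name carries an `@` before the version (as the removed `'@'` to `'_'` substitution implies) and that the suffix is `_polish_rl_mv`, as in the example comments. Both function names are hypothetical.

```cpp
#include <algorithm>
#include <string>

// Name as composed after this commit: the '@' separator is kept as-is.
std::string polish_model_name(const std::string& basecaller_model,
                              const std::string& suffix) {
    return basecaller_model + suffix;
}

// Name as composed before this commit: every '@' was rewritten to '_'.
std::string legacy_polish_model_name(const std::string& basecaller_model,
                                     const std::string& suffix) {
    std::string name = basecaller_model + suffix;
    std::replace(name.begin(), name.end(), '@', '_');
    return name;
}
```

With this change the name passed to `download_model` keeps the `@` separator, so the model lookup would presumably need to accept names in that form.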

dorado/cli/trim.cpp

Lines changed: 14 additions & 2 deletions
@@ -40,15 +40,16 @@ int trim(int argc, char* argv[]) {
             .nargs(argparse::nargs_pattern::any);
     parser.add_argument("-t", "--threads")
             .help("Combined number of threads for adapter/primer detection and output generation. "
-                  "Default uses "
-                  "all available threads.")
+                  "Default uses all available threads.")
             .default_value(0)
             .scan<'i', int>();
     parser.add_argument("-n", "--max-reads")
             .help("Maximum number of reads to process. Mainly for debugging. Process all reads by "
                   "default.")
             .default_value(0)
             .scan<'i', int>();
+    parser.add_argument("-k", "--sequencing-kit")
+            .help("Sequencing kit name to use for selecting adapters and primers to trim.");
     parser.add_argument("-l", "--read-ids")
             .help("A file with a newline-delimited list of reads to trim.")
             .default_value(std::string(""));
@@ -82,6 +83,16 @@ int trim(int argc, char* argv[]) {
         utils::SetVerboseLogging(static_cast<dorado::utils::VerboseLogLevel>(verbosity));
     }
 
+    if (!parser.is_used("--sequencing-kit")) {
+        spdlog::error("The sequencing kit name must be specified with --sequencing-kit.");
+        return EXIT_FAILURE;
+    }
+    auto kit_name = parser.get<std::string>("--sequencing-kit");
+    if (kit_name.empty()) {
+        spdlog::error("Sequencing kit name must be non-empty.");
+        return EXIT_FAILURE;
+    }
+
     auto reads(parser.get<std::vector<std::string>>("reads"));
     auto threads(parser.get<int>("threads"));
     auto max_reads(parser.get<int>("max-reads"));
@@ -146,6 +157,7 @@ int trim(int argc, char* argv[]) {
     auto adapter_info = std::make_shared<demux::AdapterInfo>();
     adapter_info->trim_adapters = true;
     adapter_info->trim_primers = !parser.get<bool>("--no-trim-primers");
+    adapter_info->kit_name = kit_name;
     adapter_info->custom_seqs = custom_primer_file;
 
     auto client_info = std::make_shared<DefaultClientInfo>();
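
Review note: the two checks added to `trim()` make `--sequencing-kit` effectively mandatory. Pulling them into a small pure function, as sketched below, would make the rule testable without going through the argument parser. `check_sequencing_kit` is a hypothetical helper, not part of Dorado; the error strings are copied from the diff.

```cpp
#include <string>

// Mirrors the two checks added to trim(): the option must have been supplied
// and its value must be non-empty. Returns an empty string on success,
// otherwise the message that the caller would log before exiting.
std::string check_sequencing_kit(bool option_was_used, const std::string& kit_name) {
    if (!option_was_used) {
        return "The sequencing kit name must be specified with --sequencing-kit.";
    }
    if (kit_name.empty()) {
        return "Sequencing kit name must be non-empty.";
    }
    return "";
}
```

In `trim()` the caller would pass `parser.is_used("--sequencing-kit")` and the parsed value, then log the message and return `EXIT_FAILURE` when the result is non-empty.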

dorado/data_loader/DataLoader.cpp

Lines changed: 1 addition & 0 deletions
@@ -180,6 +180,7 @@ SimplexReadPtr process_pod5_thread_fn(
     new_read->start_sample = read_data.start_sample;
     new_read->end_sample = read_data.start_sample + read_data.num_samples;
     new_read->read_common.flowcell_id = run_info_data->flow_cell_id;
+    new_read->read_common.sequencing_kit = run_info_data->sequencing_kit;
     new_read->read_common.flow_cell_product_code = run_info_data->flow_cell_product_code;
     new_read->read_common.position_id = run_info_data->sequencer_position;
     new_read->read_common.experiment_id = run_info_data->experiment_name;
