From 533ccefc4bb842cb843b5e92b7cd60dc2b80017e Mon Sep 17 00:00:00 2001
From: Art Rand
Date: Sat, 23 Dec 2023 01:20:29 +0000
Subject: [PATCH] [adjust, update, call-mods] Allow parsing of valid
non-primary
---
CHANGELOG.md | 7 ++++++
book/src/advanced_usage.md | 6 +++++
book/src/intro_adjust.md | 4 ++--
book/src/intro_call_mods.md | 5 ++---
book/src/intro_extract.md | 20 +++++++++++++++--
book/src/limitations.md | 5 +----
book/src/troubleshooting.md | 13 +++++++++++
docs/advanced_usage.html | 6 +++++
docs/intro_adjust.html | 4 ++--
docs/intro_call_mods.html | 5 ++---
docs/intro_extract.html | 17 ++++++++++++--
docs/limitations.html | 3 ---
docs/print.html | 45 ++++++++++++++++++++++++++++---------
docs/searchindex.js | 2 +-
docs/searchindex.json | 2 +-
docs/troubleshooting.html | 10 +++++++++
16 files changed, 121 insertions(+), 33 deletions(-)
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 5e28004..5efbe2f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,13 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [v0.2.4]
+### Adds
+- [extract, adjust-mods, update-tags, call-mods] Parse MN tag in order to use secondary and supplementary alignments.
+### Fixes
+- [all] Improve performance slightly when using short and frequent motifs with `--motif` option.
+
+
## [v0.2.3]
### Adds
- [dmr, multi] Allow site-level scoring by omitting the `--regions` argument. Sites will be collected from the input bedMethyl files.
diff --git a/book/src/advanced_usage.md b/book/src/advanced_usage.md
index 4332527..35daf8f 100644
--- a/book/src/advanced_usage.md
+++ b/book/src/advanced_usage.md
@@ -732,6 +732,12 @@ Options:
--mapped-only
Include only mapped bases in output. (alias: mapped)
+ --allow-non-primary
+ Output aligned secondary and supplementary base modification probabilities as additional
+ rows. The primary alignment will have all of the base modification probabilities
+ (including soft-clipped ones, unless --mapped-only is used). The non-primary alignments
+ will only have mapped bases in the output.
+
--num-reads
Number of reads to use. Note that when using a sorted, indexed modBAM that the sampling
algorithm will attempt to sample records evenly over the length of the reference sequence.
diff --git a/book/src/intro_adjust.md b/book/src/intro_adjust.md
index fa3ba40..9555c68 100644
--- a/book/src/intro_adjust.md
+++ b/book/src/intro_adjust.md
@@ -2,8 +2,8 @@
The `adjust-mods` subcommand can be used to manipulate MM (and corresponding ML) tags in a
modBam. In general, these simple commands are run prior to `pileup`, visualization, or
-other analysis. If alignment information is present, only the **primary alignment** is used,
-and supplementary alignments will not be in the output (see [limitations](./limitations.md)).
+other analysis. For `adjust-mods` and `update-tags`, if a correct `MN` tag is found, secondary and supplementary
+alignments will be output. See [troubleshooting](./troubleshooting.md) for details.
## Ignoring a modification class.
diff --git a/book/src/intro_call_mods.md b/book/src/intro_call_mods.md
index 4721a50..2af300b 100644
--- a/book/src/intro_call_mods.md
+++ b/book/src/intro_call_mods.md
@@ -6,9 +6,8 @@ modBAM where the base modification probabilities have been clamped to 100% and
[options](./advanced_usage.md#call-mods) are provided, base modification calls
failing the threshold will be removed prior to changing the probabilities. The
output modBAM can be used for visualization, `pileup`, or other applications.
-If alignment information is present, only the **primary alignment** is used,
-and supplementary alignments will not be in the output (see
-[limitations](./limitations.md)).
+For `call-mods`, if a correct `MN` tag is found, secondary and supplementary
+alignments will be output. See [troubleshooting](./troubleshooting.md) for details.
A modBAM that has been transformed with `call-mods` using `--filter-threshold`
and/or `--mod-threshold` cannot be re-transformed with different thresholds.
diff --git a/book/src/intro_extract.md b/book/src/intro_extract.md
index fd5b7f6..f1bbc1e 100644
--- a/book/src/intro_extract.md
+++ b/book/src/intro_extract.md
@@ -1,8 +1,9 @@
# Extracting base modification information
The `modkit extract` sub-command will produce a table containing the base modification probabilities,
-the read sequence context, and optionally aligned reference information. If alignment information is
-present, only the **primary alignment** is used.
+the read sequence context, and optionally aligned reference information.
+For `extract`, if a correct `MN` tag is found, secondary and supplementary alignments may be output with the `--allow-non-primary` flag.
+See [troubleshooting](./troubleshooting.md) for details.
The table will by default contain unmapped sections of the read (soft-clipped sections, for example).
To only include mapped bases use the `--mapped` flag. To only include sites of interest, pass a
@@ -34,6 +35,7 @@ or `stdout` and filter the columns before writing to disk.
| 16 | canonical_base | canonical base from the query sequence, from the MM tag | str |
| 17 | modified_primary_base | primary sequence base with the modification | str |
| 18 | inferred | whether the base modification call is implicit canonical | str |
+| 19 | flag | FLAG from alignment record | str |
# Tabulating base modification _calls_ for each read position
@@ -65,6 +67,7 @@ reserved for "any modification"). The full schema of the table is below:
| 18 | fail | true if the base modification call fell below the pass threshold | str |
| 19 | inferred | whether the base modification call is implicit canonical | str |
| 20 | within_alignment | when alignment information is present, is this base aligned to the reference | str |
+| 21 | flag | FLAG from alignment record | str |
## Note on implicit base modification calls.
@@ -75,6 +78,14 @@ called on that read. For example, if you have a `A+a.` MM tag, and there are `A`
there aren't base modification calls (identifiable as non-0s in the MM tag) will be rows where the `mod_code`
is `a` and the `mod_qual` is 0.0.
+## Note on non-primary alignments
+If a valid `MN` tag is found, secondary and supplementary alignments can be output in the `modkit extract` tables above.
+See [troubleshooting](./troubleshooting.md) for details on how to get valid `MN` tags.
+To have non-primary alignments appear in the output, the `--allow-non-primary` flag must be passed.
+By default, the primary alignment will have all base modification information contained on the read, including soft-clipped and unaligned read positions.
+If the `--mapped-only` flag is used, soft clipped sections of the read will not be included.
+For secondary and supplementary alignments, soft-clipped positions are not repeated. See [advanced usage](./advanced_usage.md) for more details.
+
## Example usages:
### Extract a table from an aligned and indexed BAM
@@ -111,5 +122,10 @@ to /dev/null, to keep this output specify a file or `-` for standard out.
```
modkit extract --read-calls
```
+Use `--allow-non-primary` to get secondary and supplementary mappings in the output.
+```
+modkit extract --read-calls --allow-non-primary
+```
+
See the help string and/or [advanced_usage](./advanced_usage.md) for more details.
diff --git a/book/src/limitations.md b/book/src/limitations.md
index 7fdb24d..ff6d98f 100644
--- a/book/src/limitations.md
+++ b/book/src/limitations.md
@@ -8,7 +8,4 @@ Known limitations and forecasts for when they will be removed.
is detected more than once, the occurrence is logged but both alignments will be used. This limitation may be
removed in the future with a form of dynamic de-duplication.
3. Only one MM-flag (`.`, `?`) per-canonical base is supported within a read.
- - This limitation may be removed in the future.
-4. Functions that transform a modBAM into another modBAM (and manipulate the MM and ML tags) can only do so
- with the primary alignments. Supplementary and secondary alignments will not be present in the output.
- There are plans to remove this limitation in the near future.
+ - This limitation may be removed in the future.
\ No newline at end of file
diff --git a/book/src/troubleshooting.md b/book/src/troubleshooting.md
index ecb7bfe..42fd86e 100644
--- a/book/src/troubleshooting.md
+++ b/book/src/troubleshooting.md
@@ -4,6 +4,19 @@ It's recommended to run all `modkit` commands with the `--log-filepath
To remove a base modification class from a modBAM and produce a new modBAM, use the
--ignore
option for adjust-mods
.
diff --git a/docs/intro_call_mods.html b/docs/intro_call_mods.html
index bf6333d..fa578f5 100644
--- a/docs/intro_call_mods.html
+++ b/docs/intro_call_mods.html
@@ -154,9 +154,8 @@ are provided, base modification calls
failing the threshold will be removed prior to changing the probabilities. The
output modBAM can be used for visualization, pileup
, or other applications.
-If alignment information is present, only the primary alignment is used,
-and supplementary alignments will not be in the output (see
-limitations).
+For call-mods
, if a correct MN
tag is found, secondary and supplementary
+alignments will be output. See troubleshooting for details.
A modBAM that has been transformed with call-mods
using --filter-threshold
and/or --mod-threshold
cannot be re-transformed with different thresholds.
Note on pileup
with clamped probabilities: modkit pileup
will attempt to
diff --git a/docs/intro_extract.html b/docs/intro_extract.html
index c9ba39d..0451f2e 100644
--- a/docs/intro_extract.html
+++ b/docs/intro_extract.html
@@ -149,8 +149,9 @@
The modkit extract
sub-command will produce a table containing the base modification probabilities,
-the read sequence context, and optionally aligned reference information. If alignment information is
-present, only the primary alignment is used.
+the read sequence context, and optionally aligned reference information.
+For extract
, if a correct MN
tag is found, secondary and supplementary alignments may be output with the --allow-non-primary
flag.
+See troubleshooting for details.
The table will by default contain unmapped sections of the read (soft-clipped sections, for example).
To only include mapped bases use the --mapped
flag. To only include sites of interest, pass a
BED-formatted file to the --include-bed
option. Similarly, to exclude sites, pass a BED-formatted
@@ -178,6 +179,7 @@
@@ -207,6 +209,7 @@
@@ -216,6 +219,13 @@
+If a valid MN
tag is found, secondary and supplementary alignments can be output in the modkit extract
tables above.
+See troubleshooting for details on how to get valid MN
tags.
+To have non-primary alignments appear in the output, the --allow-non-primary
flag must be passed.
+By default, the primary alignment will have all base modification information contained on the read, including soft-clipped and unaligned read positions.
+If the --mapped-only
flag is used, soft clipped sections of the read will not be included.
+For secondary and supplementary alignments, soft-clipped positions are not repeated. See advanced usage for more details.
modkit extract <input.bam> <output.tsv>
@@ -240,6 +250,9 @@
diff --git a/docs/limitations.html b/docs/limitations.html
index 3fe6fbb..081d28e 100644
--- a/docs/limitations.html
+++ b/docs/limitations.html
@@ -163,9 +163,6 @@
@@ -503,6 +505,7 @@
@@ -512,6 +515,13 @@
+
If a valid MN
tag is found, secondary and supplementary alignments can be output in the modkit extract
tables above.
+See troubleshooting for details on how to get valid MN
tags.
+To have non-primary alignments appear in the output, the --allow-non-primary
flag must be passed.
+By default, the primary alignment will have all base modification information contained on the read, including soft-clipped and unaligned read positions.
+If the --mapped-only
flag is used, soft clipped sections of the read will not be included.
+For secondary and supplementary alignments, soft-clipped positions are not repeated. See advanced usage for more details.
modkit extract <input.bam> <output.tsv>
@@ -536,6 +546,9 @@ The call-mods
subcommand in modkit
transforms one modBAM into another
@@ -544,9 +557,8 @@
A modBAM that has been transformed with call-mods
using --filter-threshold
and/or --mod-threshold
cannot be re-transformed with different thresholds.
Note on pileup
with clamped probabilities: modkit pileup
will attempt to
@@ -1605,6 +1617,12 @@
--mapped-only
Include only mapped bases in output. (alias: mapped)
+ --allow-non-primary
+ Output aligned secondary and supplementary base modification probabilities as additional
+ rows. The primary alignment will have all of the base modification probabilities
+ (including soft-clipped ones, unless --mapped-only is used). The non-primary alignments
+ will only have mapped bases in the output.
+
--num-reads <NUM_READS>
Number of reads to use. Note that when using a sorted, indexed modBAM that the sampling
algorithm will attempt to sample records evenly over the length of the reference sequence.
@@ -2059,6 +2077,16 @@
It's recommended to run all modkit
commands with the --log-filepath <path-to-file>
option set. When unexpected outputs are produced inspecting this file will often indicate
the reason.
+
+As of v0.2.4 secondary and supplementary alignments are supported in adjust-mods
, update-tags
, call-mods
, and (optionally) in extract
.
+However, in order to use these alignment records correctly, the MN
tag must be present and correct in the record.
+The MN
tag indicates the length of the sequence corresponding to the MM
and ML
tags.
+As of dorado v0.5.0 the MN
tag is output when modified base calls are produced.
+If the aligner has hard-clipped the sequence, this number will not match the sequence length and the record cannot be used.
+Similarly, if the SEQ field is empty (sequence length zero), the record cannot be used.
+One way to use supplementary alignments is to specify the -Y
flag when using dorado or minimap2.
+For these programs, when -Y
is specified, the sequence will not be hardclipped in supplementary alignments and will be present in secondary alignments.
+Other mapping algorithms that are "MM tag-aware" may allow hard-clipping and update the MM
and ML
tags, modkit
will accept these records as long as the MN
tag indicates the correct sequence length.
First, check the logfile, there may be many lines with a variant of
record 905b4cb4-15e2-4060-9c98-866d0aff78bb has un-allowed mode
@@ -2111,9 +2139,6 @@