Commit

Merge branch 'ar/update-docs-0.2.4' into 'master'
[adjust, update, call-mods] Allow parsing of valid non-primary

See merge request machine-learning/modkit!134
ArtRand committed Dec 23, 2023
2 parents 3600d3a + 533ccef commit 13b7b2e
Showing 16 changed files with 121 additions and 33 deletions.
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,13 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [v0.2.4]
### Adds
- [extract, adjust-mods, update-tags, call-mods] Parse MN tag in order to use secondary and supplementary alignments.
### Fixes
- [all] Improve performance slightly when using short and frequent motifs with `--motif` option.


## [v0.2.3]
### Adds
- [dmr, multi] Allow site-level scoring by omitting the `--regions` argument. Sites will be collected from the input bedMethyl files.
6 changes: 6 additions & 0 deletions book/src/advanced_usage.md
@@ -732,6 +732,12 @@ Options:
--mapped-only
Include only mapped bases in output. (alias: mapped)
--allow-non-primary
Output aligned secondary and supplementary base modification probabilities as additional
rows. The primary alignment will have all of the base modification probabilities
(including soft-clipped ones, unless --mapped-only is used). The non-primary alignments
will only have mapped bases in the output.
--num-reads <NUM_READS>
Number of reads to use. Note that when using a sorted, indexed modBAM that the sampling
algorithm will attempt to sample records evenly over the length of the reference sequence.
4 changes: 2 additions & 2 deletions book/src/intro_adjust.md
@@ -2,8 +2,8 @@

The `adjust-mods` subcommand can be used to manipulate MM (and corresponding ML) tags in a
modBam. In general, these simple commands are run prior to `pileup`, visualization, or
-other analysis. If alignment information is present, only the **primary alignment** is used,
-and supplementary alignments will not be in the output (see [limitations](./limitations.md)).
+other analysis. For `adjust-mods` and `update-tags`, if a correct `MN` tag is found, secondary and supplementary
+alignments will be output. See [troubleshooting](./troubleshooting.md) for details.


## Ignoring a modification class.
5 changes: 2 additions & 3 deletions book/src/intro_call_mods.md
@@ -6,9 +6,8 @@ modBAM where the base modification probabilities have been clamped to 100% and
[options](./advanced_usage.md#call-mods) are provided, base modification calls
failing the threshold will be removed prior to changing the probabilities. The
output modBAM can be used for visualization, `pileup`, or other applications.
-If alignment information is present, only the **primary alignment** is used,
-and supplementary alignments will not be in the output (see
-[limitations](./limitations.md)).
+For `call-mods`, if a correct `MN` tag is found, secondary and supplementary
+alignments will be output. See [troubleshooting](./troubleshooting.md) for details.

A modBAM that has been transformed with `call-mods` using `--filter-threshold`
and/or `--mod-threshold` cannot be re-transformed with different thresholds.
20 changes: 18 additions & 2 deletions book/src/intro_extract.md
@@ -1,8 +1,9 @@
# Extracting base modification information

The `modkit extract` sub-command will produce a table containing the base modification probabilities,
-the read sequence context, and optionally aligned reference information. If alignment information is
-present, only the **primary alignment** is used.
+the read sequence context, and optionally aligned reference information.
+For `extract`, if a correct `MN` tag is found, secondary and supplementary alignments may be output with the `--allow-non-primary` flag.
+See [troubleshooting](./troubleshooting.md) for details.

The table will by default contain unmapped sections of the read (soft-clipped sections, for example).
To only include mapped bases use the `--mapped` flag. To only include sites of interest, pass a
@@ -34,6 +35,7 @@ or `stdout` and filter the columns before writing to disk.
| 16 | canonical_base | canonical base from the query sequence, from the MM tag | str |
| 17 | modified_primary_base | primary sequence base with the modification | str |
| 18 | inferred | whether the base modification call is implicit canonical | str |
| 19 | flag | FLAG from alignment record | str |


# Tabulating base modification _calls_ for each read position
@@ -65,6 +67,7 @@ reserved for "any modification"). The full schema of the table is below:
| 18 | fail | true if the base modification call fell below the pass threshold | str |
| 19 | inferred | whether the base modification call is implicit canonical | str |
| 20 | within_alignment | when alignment information is present, is this base aligned to the reference | str |
| 21 | flag | FLAG from alignment record | str |
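
The `flag` column carries the SAM FLAG of the alignment record, so rows can be grouped by alignment type. The following is a minimal sketch in Python, assuming the column value parses as a base-10 integer. The bit values come from the SAM specification; `alignment_type` is a hypothetical helper and not part of `modkit`.

```python
# SAM FLAG bits, as defined in the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def alignment_type(flag):
    """Classify a SAM FLAG value (int or decimal string) by alignment type."""
    value = int(flag)
    if value & UNMAPPED:
        return "unmapped"
    if value & SECONDARY:
        return "secondary"
    if value & SUPPLEMENTARY:
        return "supplementary"
    return "primary"


if __name__ == "__main__":
    # 16 is a reverse-strand primary record; 2064 = 2048 + 16.
    for example in ("0", "16", "256", "2064"):
        print(example, alignment_type(example))
```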


## Note on implicit base modification calls.
@@ -75,6 +78,14 @@ called on that read. For example, if you have a `A+a.` MM tag, and there are `A`
there aren't base modification calls (identifiable as non-0s in the MM tag) will be rows where the `mod_code`
is `a` and the `mod_qual` is 0.0.

## Note on non-primary alignments
If a valid `MN` tag is found, secondary and supplementary alignments can be output in the `modkit extract` tables above.
See [troubleshooting](./troubleshooting.md) for details on how to get valid `MN` tags.
To have non-primary alignments appear in the output, the `--allow-non-primary` flag must be passed.
By default, the primary alignment will have all base modification information contained on the read, including soft-clipped and unaligned read positions.
If the `--mapped-only` flag is used, soft-clipped sections of the read will not be included.
For secondary and supplementary alignments, soft-clipped positions are not repeated. See [advanced usage](./advanced_usage.md) for more details.
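
One way to check that non-primary rows are actually present in a table is to tally rows by alignment type using the `flag` column. This is only a sketch: it assumes the table is tab-separated with a header line that names the `flag` column as in the schema above, and that the flag is written as a decimal integer. `count_rows_by_alignment` is a hypothetical helper, not a `modkit` function.

```python
import csv
import io
from collections import Counter


def count_rows_by_alignment(handle):
    """Tally rows of an extract table by alignment type, using the `flag` column."""
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        flag = int(row["flag"])
        if flag & 0x4:        # unmapped bit (SAM specification)
            counts["unmapped"] += 1
        elif flag & 0x100:    # secondary alignment bit
            counts["secondary"] += 1
        elif flag & 0x800:    # supplementary alignment bit
            counts["supplementary"] += 1
        else:
            counts["primary"] += 1
    return counts


if __name__ == "__main__":
    # Tiny made-up table with only two columns, for illustration.
    demo = io.StringIO("read_id\tflag\nr1\t0\nr1\t2048\nr2\t16\nr2\t256\n")
    print(count_rows_by_alignment(demo))
```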

## Example usages:

### Extract a table from an aligned and indexed BAM
@@ -111,5 +122,10 @@ to /dev/null, to keep this output specify a file or `-` for standard out.
```
modkit extract <input.bam> <output.tsv> --read-calls <calls.tsv>
```
Use `--allow-non-primary` to get secondary and supplementary mappings in the output.
```
modkit extract <input.bam> <output.tsv> --read-calls <calls.tsv> --allow-non-primary
```


See the help string and/or [advanced_usage](./advanced_usage.md) for more details.
5 changes: 1 addition & 4 deletions book/src/limitations.md
@@ -8,7 +8,4 @@ Known limitations and forecasts for when they will be removed.
is detected more than once, the occurrence is logged but both alignments will be used. This limitation may be
removed in the future with a form of dynamic de-duplication.
3. Only one MM-flag (`.`, `?`) per-canonical base is supported within a read.
-   - This limitation may be removed in the future.
-4. Functions that transform a modBAM into another modBAM (and manipulate the MM and ML tags) can only do so
-   with the primary alignments. Supplementary and secondary alignments will not be present in the output.
-   There are plans to remove this limitation in the near future.
+   - This limitation may be removed in the future.
13 changes: 13 additions & 0 deletions book/src/troubleshooting.md
@@ -4,6 +4,19 @@ It's recommended to run all `modkit` commands with the `--log-filepath <path-to-
option set. When unexpected outputs are produced inspecting this file will often indicate
the reason.


## Missing secondary and supplementary alignments in output

As of v0.2.4, secondary and supplementary alignments are supported in `adjust-mods`, `update-tags`, `call-mods`, and (optionally) in `extract`.
However, to use these alignment records correctly, the `MN` tag must be present and correct in the record.
The `MN` tag indicates the length of the sequence corresponding to the `MM` and `ML` tags.
As of dorado v0.5.0, the `MN` tag is output when modified base calls are produced.
If the aligner has hard-clipped the sequence, this number will not match the sequence length and the record cannot be used.
Similarly, if the SEQ field is empty (sequence length zero), the record cannot be used.
One way to use supplementary alignments is to specify the `-Y` flag when using [dorado](https://github.com/nanoporetech/dorado/) or [minimap2](https://lh3.github.io/minimap2/minimap2.html).
For these programs, when `-Y` is specified, the sequence will not be hard-clipped in supplementary alignments and will be present in secondary alignments.
Other mapping algorithms that are "MM tag-aware" may allow hard-clipping and update the `MM` and `ML` tags; `modkit` will accept these records as long as the `MN` tag indicates the correct sequence length.
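
The rule described above (an `MN` tag that is present and equal to the length of SEQ) can be sketched as a check on a single SAM text record. This is an illustration of the stated rule only, not `modkit`'s implementation. `mn_tag_usable` is a hypothetical helper, and it assumes a tab-separated SAM line with the 11 mandatory fields followed by optional tags.

```python
def mn_tag_usable(sam_line):
    """Return True if the record has an MN tag equal to the length of SEQ."""
    fields = sam_line.rstrip("\n").split("\t")
    seq = fields[9]  # SEQ is the 10th mandatory SAM field
    if seq == "*" or len(seq) == 0:
        return False  # empty SEQ: the record cannot be used
    for tag in fields[11:]:  # optional fields follow the 11 mandatory ones
        if tag.startswith("MN:i:"):
            return int(tag[5:]) == len(seq)
    return False  # no MN tag present


if __name__ == "__main__":
    # Made-up supplementary record: hard-clipped, so SEQ has 4 bases
    # while MN still records the original 8-base read length.
    record = "\t".join([
        "read1", "2048", "chr1", "100", "60", "4M4H", "*", "0", "0",
        "ACGT", "IIII", "MN:i:8",
    ])
    print(mn_tag_usable(record))
```

Running the same check on a record aligned with `-Y` (soft-clipped, full SEQ retained) would be expected to pass, since the sequence length then matches `MN`.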

## No rows in `modkit pileup` output.

First, check the logfile, there may be many lines with a variant of
6 changes: 6 additions & 0 deletions docs/advanced_usage.html
@@ -862,6 +862,12 @@ <h2 id="extract"><a class="header" href="#extract">extract</a></h2>
--mapped-only
Include only mapped bases in output. (alias: mapped)

--allow-non-primary
Output aligned secondary and supplementary base modification probabilities as additional
rows. The primary alignment will have all of the base modification probabilities
(including soft-clipped ones, unless --mapped-only is used). The non-primary alignments
will only have mapped bases in the output.

--num-reads &lt;NUM_READS&gt;
Number of reads to use. Note that when using a sorted, indexed modBAM that the sampling
algorithm will attempt to sample records evenly over the length of the reference sequence.
4 changes: 2 additions & 2 deletions docs/intro_adjust.html
@@ -150,8 +150,8 @@ <h1 class="menu-title">Modkit</h1>
<h1 id="updating-and-adjusting-mm-tags"><a class="header" href="#updating-and-adjusting-mm-tags">Updating and Adjusting MM tags.</a></h1>
<p>The <code>adjust-mods</code> subcommand can be used to manipulate MM (and corresponding ML) tags in a
modBam. In general, these simple commands are run prior to <code>pileup</code>, visualization, or
-other analysis. If alignment information is present, only the <strong>primary alignment</strong> is used,
-and supplementary alignments will not be in the output (see <a href="./limitations.html">limitations</a>).</p>
+other analysis. For <code>adjust-mods</code> and <code>update-tags</code>, if a correct <code>MN</code> tag is found, secondary and supplementary
+alignments will be output. See <a href="./troubleshooting.html">troubleshooting</a> for details.</p>
<h2 id="ignoring-a-modification-class"><a class="header" href="#ignoring-a-modification-class">Ignoring a modification class.</a></h2>
<p>To remove a base modification class from a modBAM and produce a new modBAM, use the
<code>--ignore</code> option for <code>adjust-mods</code>.</p>
5 changes: 2 additions & 3 deletions docs/intro_call_mods.html
@@ -154,9 +154,8 @@ <h1 id="calling-mods-in-a-modbam"><a class="header" href="#calling-mods-in-a-mod
<a href="./advanced_usage.html#call-mods">options</a> are provided, base modification calls
failing the threshold will be removed prior to changing the probabilities. The
output modBAM can be used for visualization, <code>pileup</code>, or other applications.
-If alignment information is present, only the <strong>primary alignment</strong> is used,
-and supplementary alignments will not be in the output (see
-<a href="./limitations.html">limitations</a>).</p>
+For <code>call-mods</code>, if a correct <code>MN</code> tag is found, secondary and supplementary
+alignments will be output. See <a href="./troubleshooting.html">troubleshooting</a> for details.</p>
<p>A modBAM that has been transformed with <code>call-mods</code> using <code>--filter-threshold</code>
and/or <code>--mod-threshold</code> cannot be re-transformed with different thresholds.</p>
<p>Note on <code>pileup</code> with clamped probabilities: <code>modkit pileup</code> will attempt to
17 changes: 15 additions & 2 deletions docs/intro_extract.html
@@ -149,8 +149,9 @@ <h1 class="menu-title">Modkit</h1>
<main>
<h1 id="extracting-base-modification-information"><a class="header" href="#extracting-base-modification-information">Extracting base modification information</a></h1>
<p>The <code>modkit extract</code> sub-command will produce a table containing the base modification probabilities,
-the read sequence context, and optionally aligned reference information. If alignment information is
-present, only the <strong>primary alignment</strong> is used.</p>
+the read sequence context, and optionally aligned reference information.
+For <code>extract</code>, if a correct <code>MN</code> tag is found, secondary and supplementary alignments may be output with the <code>--allow-non-primary</code> flag.
+See <a href="./troubleshooting.html">troubleshooting</a> for details.</p>
<p>The table will by default contain unmapped sections of the read (soft-clipped sections, for example).
To only include mapped bases use the <code>--mapped</code> flag. To only include sites of interest, pass a
BED-formatted file to the <code>--include-bed</code> option. Similarly, to exclude sites, pass a BED-formatted
@@ -178,6 +179,7 @@ <h2 id="description-of-output-table"><a class="header" href="#description-of-out
<tr><td>16</td><td>canonical_base</td><td>canonical base from the query sequence, from the MM tag</td><td>str</td></tr>
<tr><td>17</td><td>modified_primary_base</td><td>primary sequence base with the modification</td><td>str</td></tr>
<tr><td>18</td><td>inferred</td><td>whether the base modification call is implicit canonical</td><td>str</td></tr>
<tr><td>19</td><td>flag</td><td>FLAG from alignment record</td><td>str</td></tr>
</tbody></table>
</div>
<h1 id="tabulating-base-modification-calls-for-each-read-position"><a class="header" href="#tabulating-base-modification-calls-for-each-read-position">Tabulating base modification <em>calls</em> for each read position</a></h1>
@@ -207,6 +209,7 @@ <h1 id="tabulating-base-modification-calls-for-each-read-position"><a class="hea
<tr><td>18</td><td>fail</td><td>true if the base modification call fell below the pass threshold</td><td>str</td></tr>
<tr><td>19</td><td>inferred</td><td>whether the base modification call is implicit canonical</td><td>str</td></tr>
<tr><td>20</td><td>within_alignment</td><td>when alignment information is present, is this base aligned to the reference</td><td>str</td></tr>
<tr><td>21</td><td>flag</td><td>FLAG from alignment record</td><td>str</td></tr>
</tbody></table>
</div>
<h2 id="note-on-implicit-base-modification-calls"><a class="header" href="#note-on-implicit-base-modification-calls">Note on implicit base modification calls.</a></h2>
@@ -216,6 +219,13 @@ <h2 id="note-on-implicit-base-modification-calls"><a class="header" href="#note-
called on that read. For example, if you have a <code>A+a.</code> MM tag, and there are <code>A</code> bases in the read for which
there aren't base modification calls (identifiable as non-0s in the MM tag) will be rows where the <code>mod_code</code>
is <code>a</code> and the <code>mod_qual</code> is 0.0.</p>
<h2 id="note-on-non-primary-alignments"><a class="header" href="#note-on-non-primary-alignments">Note on non-primary alignments</a></h2>
<p>If a valid <code>MN</code> tag is found, secondary and supplementary alignments can be output in the <code>modkit extract</code> tables above.
See <a href="./troubleshooting.html">troubleshooting</a> for details on how to get valid <code>MN</code> tags.
To have non-primary alignments appear in the output, the <code>--allow-non-primary</code> flag must be passed.
By default, the primary alignment will have all base modification information contained on the read, including soft-clipped and unaligned read positions.
If the <code>--mapped-only</code> flag is used, soft-clipped sections of the read will not be included.
For secondary and supplementary alignments, soft-clipped positions are not repeated. See <a href="./advanced_usage.html">advanced usage</a> for more details.</p>
<h2 id="example-usages"><a class="header" href="#example-usages">Example usages:</a></h2>
<h3 id="extract-a-table-from-an-aligned-and-indexed-bam"><a class="header" href="#extract-a-table-from-an-aligned-and-indexed-bam">Extract a table from an aligned and indexed BAM</a></h3>
<pre><code>modkit extract &lt;input.bam&gt; &lt;output.tsv&gt;
@@ -240,6 +250,9 @@ <h3 id="extract-read-level-base-modification-calls"><a class="header" href="#ext
to /dev/null, to keep this output specify a file or <code>-</code> for standard out.</p>
<pre><code>modkit extract &lt;input.bam&gt; &lt;output.tsv&gt; --read-calls &lt;calls.tsv&gt;
</code></pre>
<p>Use <code>--allow-non-primary</code> to get secondary and supplementary mappings in the output.</p>
<pre><code>modkit extract &lt;input.bam&gt; &lt;output.tsv&gt; --read-calls &lt;calls.tsv&gt; --allow-non-primary
</code></pre>
<p>See the help string and/or <a href="./advanced_usage.html">advanced_usage</a> for more details.</p>

</main>
3 changes: 0 additions & 3 deletions docs/limitations.html
@@ -163,9 +163,6 @@ <h1 id="current-limitations"><a class="header" href="#current-limitations">Curre
<li>This limitation may be removed in the future.</li>
</ul>
</li>
-<li>Functions that transform a modBAM into another modBAM (and manipulate the MM and ML tags) can only do so
-with the primary alignments. Supplementary and secondary alignments will not be present in the output.
-There are plans to remove this limitation in the near future.</li>
</ol>

</main>

0 comments on commit 13b7b2e
