Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Methylation tag (MM:Z) broken when there are no methylated cytosines called for [email protected] basecalling model #1107

Open
jenni-westoby opened this issue Oct 29, 2024 · 1 comment
Labels
mods For issues related to modified base calling

Comments

@jenni-westoby
Copy link

Issue Report

Running the latest version of dorado (0.8.2) with the [email protected] and 5mC_5hmC modifed bases, I have discovered an issue with the MM:Z tag for reads where 0 methylated cytosines are called. Here is an example of one of the reads with 0 methylated cytosines in my dataset

554afa26-4ebf-4bd2-970e-63b3514fa982    4       *       0       0       *       *       0       0       GATTAACTATTAA   %',-0,''''''0   qs:f:6.27569  du:f:0.1202   mx:i:4  ch:i:130        st:Z:2024-10-11T04:33:19.124+00:00      rn:i:2661       fn:Z:PBA24182_cf1dc3d4_1fd54266_13.pod5 sm:f:765.842    sd:f:125.658        sv:Z:pa dx:i:0  RG:Z:1fd542667ad8922880ec7dd46d1ff099569ad6c5_dna_r10.4.1_e8.2_400bps_hac@v5.0.0        MN:i:13 BC:Z:SQK-NBD114-24_barcode24    MM:Z:C+h.,;C+m.,;     ML:B:C

The MM:Z tag reads MM:Z:C+h.,;C+m.,; . Based on section 1.7 of the samtools documentation (link), I think this tag should be MM:Z:C+h.;C+m.; in the case where 0 methylated cytosines are called (ie. there shouldn't be commas in the tag). The current tag with trailing commas breaks downstream bioinformatics tools that try to extract methylation information from the reads. Using a sed command to convert the MM:Z:C+h.,;C+m.,; tags to MM:Z:C+h.;C+m.; fixed the downstream tools.

Steps to reproduce the issue:

I suspect we encountered this issue partly because there were some issues with this sequencing run and we had a lot of very short low quality reads, like the one above. This meant we had a lot of reads with no/few cytosines or for which no confident modified basecalls could be made. These were the commands used for basecalling/demultiplexing. There were no error messages during either command, which ran to the end successfully

dorado basecaller [email protected] pod5_dir --modified-bases 5mC_5hmC > hac_mods.bam
dorado demux --output-dir demuxed_bams --kit-name SQK-NBD114-24 hac_mods.bam

Run environment:

  • Dorado version: 0.8.2
  • Dorado command: dorado basecaller [email protected] pod5_dir --modified-bases 5mC_5hmC > hac_mods.bam
  • Operating system: Linux
  • Hardware (CPUs, Memory, GPUs): NVIDIA A10G
  • Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): pod5
  • Source data location (on device or networked drive - NFS, etc.): On device
  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): Promethion P2 Solo, SQK-NBD114-24. Most reads ~2000-3000 bases long but some very short reads in dataset - let me know if further details are needed. Total dataset size ~0.5TB.
@iiSeymour iiSeymour added the mods For issues related to modified base calling label Oct 29, 2024
@HalfPhoton
Copy link
Collaborator

Hi @jenni-westoby,
Thanks for bringing this to our attention. We'll get this fixed.

Kind regards,
Rich

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
mods For issues related to modified base calling
Projects
None yet
Development

No branches or pull requests

3 participants