Skip to content

Invalid CRAM file: some paired and mapped reads missing from recalibrated CRAM #1919

Open
@jcslimmer

Description

@jcslimmer

Description of the bug

There is something occurring after the mark duplicates step where some reads are lost and generating an invalid CRAM file. These reads have their flags set as being paired and mapped, but one of the mates is missing from the recalibrated CRAM, including the supplementary alignments. The mates of these same reads and their supplementary alignments can be found in the duplicate marked CRAM.

For example:

From the .recal.cram file using samtools view and grep for A00744:46:HV3C3DSXX:2:2456:7681:26694.

A00744:46:HV3C3DSXX:2:2456:7681:26694       97      chr1    125038009       60      23S99M29S       chrUn_KI270435v1        44788   0       AGAGTTGAACCTATCTTAATATTGAGCAGTTTTGACACACTGTTTTCGTAGAATCTGCAAGTGGATAT
TTGGACAGCTTTGAGGCCTATTGTGGAAAAGGAAATATCTTCACATAAAAACTACACGTACGCATTCTGAGAGAGTTCTTTTT     FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:
FFFFF:FFFFFF:F:FFFF     SA:Z:chrUn_KN707973v1_decoy,1870,-,151M,32,0;   MC:Z:133M18S    PG:Z:MarkDuplicates     MQ:i:1  AS:i:59 XS:i:0  pa:f:0.391      MD:Z:12A5C45T0C0G0T6C2G21       NM:i:8  RG:Z:HV3C3DSXX.HG002.1

From the .md.cram file using samtools view and grep for the same QNAME

A00744:46:HV3C3DSXX:2:2456:7681:26694       97      chr1    125038009       60      23S99M29S       chrUn_KI270435v1        44788   0       AGAGTTGAACCTATCTTAATATTGAGCAGTTTTGACACACTGTTTTCGTAGAATCTGCAAGTGGATAT
TTGGACAGCTTTGAGGCCTATTGTGGAAAAGGAAATATCTTCACATAAAAACTACACGTACGCATTCTGAGAGAGTTCTTTTT     FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:
FFFFF:FFFFFF:F:FFFF     SA:Z:chrUn_KN707973v1_decoy,1870,-,151M,32,0;   MC:Z:133M18S    PG:Z:MarkDuplicates     MQ:i:1  AS:i:59 XS:i:0  pa:f:0.391      MD:Z:12A5C45T0C0G0T6C2G21       NM:i:8  RG:Z:HV3C3DSXX.HG002.1
A00744:46:HV3C3DSXX:2:2456:7681:26694       145     chrUn_KI270435v1        44788   1       133M18S chr1    125038009       0       TTGTGATGTGTGCTTTCAACTCACAGAGTTGAACCAATCTTTTGATTGAGCAGTTTTGAATCTCTCTTTCTGTAGAAACTTCAACTGGATATTTGGAGCCCTTATTGGCCTATGGTGGAAAAGGACATATCTTGAAATAAACTCTATATGA     FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF     SA:Z:chrUn_KN707973v1_decoy,1552,+,151M,28,0;   MC:Z:23S99M29S  PG:Z:MarkDuplicates     MQ:i:60 AS:i:63 XS:i:58 pa:f:0.417      MD:Z:8A4A4G2A13T10A5A7A8T7T2G3G18T0G28  NM:i:14 RG:Z:HV3C3DSXX.HG002.1
A00744:46:HV3C3DSXX:2:2456:7681:26694       2177    chrUn_KN707973v1_decoy  1552    28      151M    chr1    125038009       0       TCATATAGAGTTTATTTCAAGATATGTCCTTTTCCACCATAGGCCAATAAGGGCTCCAAATATCCAGTTGAAGTTTCTACAGAAAGAGAGATTCAAAACTGCTCAATCAAAAGATTGGTTCAACTCTGTGAGTTGAAAGCACACATCACAA     FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFF:FFFFFF:FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF     SA:Z:chrUn_KI270435v1,44788,-,133M18S,1,14;     XA:Z:chrUn_KN707899v1_decoy,+430,151M,3;        MC:Z:23S99M29S  PG:Z:MarkDuplicates     MQ:i:60 AS:i:151        XS:i:138        MD:Z:151        NM:i:0  RG:Z:HV3C3DSXX.HG002.1
A00744:46:HV3C3DSXX:2:2456:7681:26694       2161    chrUn_KN707973v1_decoy  1870    32      151M    chrUn_KI270435v1        44788   0       AAAAAGAACTCTCTCAGAATGCGTACGTGTAGTTTTTATGTGAAGATATTTCCTTTTCCACAATAGGCCTCAAAGCTGTCCAAATATCCACTTGCAGATTCTACGAAAACAGTGTGTCAAAACTGCTCAATATTAAGATAGGTTCAACTCT     FFFF:F:FFFFFF:FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF     SA:Z:chr1,125038009,+,23S99M29S,60,8;   XA:Z:chrUn_KN707899v1_decoy,-748,151M,3;        MC:Z:133M18S    PG:Z:MarkDuplicates     MQ:i:1  AS:i:151        XS:i:136        MD:Z:151        NM:i:0  RG:Z:HV3C3DSXX.HG002.1

I ran a bunch of different tools using these CRAM files as input, mostly related to SV calling, so I thought maybe the CRAMs had been mistakenly modified after the sarek read processing. I reran sarek with one of the samples and it gave the same result though.

Command used and terminal output

nextflow run /data/sarek/nf-core-sarek_3.4.4/3_4_4/ \
	--input /data/menzies_projects/ms-family/share/benchmarking/sarek-3.4.4_HG002_bwa_hg38_fastp_noqualfilter_dedup6_test/GIAB_HG002_google_novaseq_input.csv \
	-profile singularity \
	--aligner bwa-mem \
	--igenomes_base /data/iGenomes/ \
	-c /data/menzies_projects/ms-family/share/benchmarking/sarek-3.4.4_HG002_bwa_hg38_fastp_noqualfilter_dedup6_test/fastp_noqualfilter_dedup.manysamples.custom.config \
	--outdir /data/menzies_projects/ms-family/share/benchmarking/sarek-3.4.4_HG002_bwa_hg38_fastp_noqualfilter_dedup6_test \
	--save_trimmed \
	--save_mapped \
	-resume

Relevant files

Attached is a zip with the .nextflow.log, config and stdout files.

sarek_logs_cram_issue.zip

System information

Nextflow v23.10.1
HPC
PBS Pro
Singularity
CentOS 7 Linux
sarek v3.4.4

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions