ValidateSamFile Exception counting mismatches on read with * in seq column #2011

Open
@tobsecret

Problem

ValidateSamFile fails with the error "ValidateSamFile Exception counting mismatches for read 2a883449-02ce-4bca-8f6b-3cf7857caf61 0b aligned to chr1:11425-11571." when validating a BAM file produced by aligning with minimap2.

Setup

❯ gatk ValidateSamFile --version
Using GATK jar /home/myuser/miniforge3/envs/bioinfo/share/gatk4-4.6.2.0-0/gatk-package-4.6.2.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/myuser/miniforge3/envs/bioinfo/share/gatk4-4.6.2.0-0/gatk-package-4.6.2.0-local.jar ValidateSamFile --version
The Genome Analysis Toolkit (GATK) v4.6.2.0
HTSJDK Version: 4.2.0
Picard Version: 3.4.0

Runs on CentOS on a cluster.

Example data

The .dict file (zipped, because GitHub apparently doesn't allow .dict attachments): GRCh38.primary_assembly.genome.fa.dict.zip
I can't share the genome itself because it is too large to attach to an issue.

All records with the read ID 2a883449-02ce-4bca-8f6b-3cf7857caf61:

read_2a883449-02ce-4bca-8f6b-3cf7857caf61.sam.zip
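
For anyone who wants to build a similar single-read test file from their own data, here is a rough Python sketch of one way to do it: keep the header lines plus every record whose read name matches. The function name `extract_read` and the tiny example input are made up for illustration, and this is not necessarily how the attached file was produced.

```python
def extract_read(sam_lines, read_name):
    """Return header lines plus every alignment record whose QNAME equals read_name."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):
            # SAM header lines start with '@' and are kept unchanged.
            kept.append(line)
        elif line.split("\t", 1)[0] == read_name:
            # The first tab-separated column of a record is the read name (QNAME).
            kept.append(line)
    return kept


# Tiny made-up input, only to show the behaviour.
example = [
    "@HD\tVN:1.6\tSO:unsorted\n",
    "@SQ\tSN:chr1\tLN:248956422\n",
    "read-a\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "read-b\t16\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "read-a\t272\tchr1\t300\t0\t4M\t*\t0\t0\t*\t*\n",
]
subset = extract_read(example, "read-a")
print("".join(subset), end="")
```

For a real file, the same function can be fed an open file handle instead of a list, since it only iterates over lines. It works on SAM text only, so a BAM file would need to be converted to SAM first.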

ValidateSamFile call and error trace

❯ gatk ValidateSamFile -I read_2a883449-02ce-4bca-8f6b-3cf7857caf61.sam -R /path/to/assemblies/GRCh38-P14/GRCh38.primary_assembly.genome.fa

Using GATK jar /home/myuser/miniforge3/envs/bioinfo/share/gatk4-4.6.2.0-0/gatk-package-4.6.2.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/myuser/miniforge3/envs/bioinfo/share/gatk4-4.6.2.0-0/gatk-package-4.6.2.0-local.jar ValidateSamFile -I read_2a883449-02ce-4bca-8f6b-3cf7857caf61.sam -R /path/to/assemblies/GRCh38-P14/GRCh38.primary_assembly.genome.fa
16:26:34.351 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/myuser/miniforge3/envs/bioinfo/share/gatk4-4.6.2.0-0/gatk-package-4.6.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed May 21 16:26:34 EDT 2025] ValidateSamFile --INPUT read_2a883449-02ce-4bca-8f6b-3cf7857caf61.sam --REFERENCE_SEQUENCE /path/to/assemblies/GRCh38-P14/GRCh38.primary_assembly.genome.fa --MODE VERBOSE --MAX_OUTPUT 100 --IGNORE_WARNINGS false --VALIDATE_INDEX true --INDEX_VALIDATION_STRINGENCY EXHAUSTIVE --IS_BISULFITE_SEQUENCED false --MAX_OPEN_TEMP_FILES 8000 --SKIP_MATE_VALIDATION false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Wed May 21 16:26:34 EDT 2025] Executing as [email protected] on Linux 4.18.0-425.19.2.el8_7.x86_64 amd64; OpenJDK 64-Bit Server VM 23.0.1+11-39; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.6.2.0
ERROR::MISSING_READ_GROUP:Read groups is empty
WARNING::RECORD_MISSING_READ_GROUP:Read name 2a883449-02ce-4bca-8f6b-3cf7857caf61, A record is missing a read group
ERROR   2025-05-21 16:26:35     ValidateSamFile Exception counting mismatches for read 2a883449-02ce-4bca-8f6b-3cf7857caf61 0b aligned to chr1:11425-11571.
[Wed May 21 16:26:35 EDT 2025] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=811073536
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Tool returned:
-1

The record in question:

2a883449-02ce-4bca-8f6b-3cf7857caf61    272     chr1    11425   0       188S147M80S     *       0       0       *       *       NM:i:3  ms:i:138       AS:i:138        nn:i:0  tp:A:S  cm:i:33 s1:i:129     de:f:0.0204     rl:i:46

The * in the SEQ column indicates that the read sequence was omitted for this record, which the SAM/BAM spec allows.
Running CleanSam does not fix the error; in fact, the attached SAM file had already been processed with CleanSam.
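
As a sanity check on the record above, the following Python sketch decodes its flag and CIGAR string and compares the read length implied by the CIGAR with the length of the stored sequence. The helper name `describe` is hypothetical. The flag bits and the CIGAR operator rules come from the SAM specification. That this length mismatch is what trips the mismatch counting is only an assumption here and has not been checked against the Picard source.

```python
import re

# SAM flag bits used here (values from the SAM specification).
FLAG_REVERSE = 0x10
FLAG_SECONDARY = 0x100

# CIGAR operations that consume read (query) bases and reference bases.
QUERY_OPS = set("MIS=X")
REF_OPS = set("MDN=X")


def describe(record_line):
    """Summarise flag, CIGAR lengths and SEQ status of one SAM record."""
    fields = record_line.split()
    flag = int(fields[1])
    pos = int(fields[3])
    cigar = fields[5]
    seq = fields[9]
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    ref_len = sum(n for n, op in ops if op in REF_OPS)
    return {
        "secondary": bool(flag & FLAG_SECONDARY),
        "reverse": bool(flag & FLAG_REVERSE),
        "seq_omitted": seq == "*",
        "seq_len": 0 if seq == "*" else len(seq),
        "cigar_query_len": sum(n for n, op in ops if op in QUERY_OPS),
        "ref_end": pos + ref_len - 1,
    }


# The first 11 columns of the record from this issue (optional tags left out).
record = (
    "2a883449-02ce-4bca-8f6b-3cf7857caf61 272 chr1 11425 0 "
    "188S147M80S * 0 0 * *"
)
info = describe(record)
for key, value in info.items():
    print(f"{key}: {value}")
```

Flag 272 is 256 + 16, so this is a secondary alignment on the reverse strand. The CIGAR describes 188 + 147 + 80 = 415 read bases, while the stored sequence has length 0. The computed reference end, 11425 + 147 - 1 = 11571, matches the interval chr1:11425-11571 in the error message. The "0b" in that message may be the zero-length sequence being reported, though that is a guess.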
