Description
Problem
ValidateSamFile fails with the following error when validating a BAM file produced by aligning with minimap2:
ValidateSamFile Exception counting mismatches for read 2a883449-02ce-4bca-8f6b-3cf7857caf61 0b aligned to chr1:11425-11571.
Setup
❯ gatk ValidateSamFile --version
Using GATK jar /home/myuser/miniforge3/envs/bioinfo/share/gatk4-4.6.2.0-0/gatk-package-4.6.2.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/myuser/miniforge3/envs/bioinfo/share/gatk4-4.6.2.0-0/gatk-package-4.6.2.0-local.jar ValidateSamFile --version
The Genome Analysis Toolkit (GATK) v4.6.2.0
HTSJDK Version: 4.2.0
Picard Version: 3.4.0
GATK runs on a CentOS cluster.
Example data
The .dict file, zipped because GitHub apparently does not allow .dict attachments: GRCh38.primary_assembly.genome.fa.dict.zip
I cannot share the genome FASTA because it is too large to attach to an issue.
All records with the read ID 2a883449-02ce-4bca-8f6b-3cf7857caf61:
read_2a883449-02ce-4bca-8f6b-3cf7857caf61.sam.zip
ValidateSamFile call and error trace
❯ gatk ValidateSamFile -I read_2a883449-02ce-4bca-8f6b-3cf7857caf61.sam -R /path/to/assemblies/GRCh38-P14/GRCh38.primary_assembly.genome.fa
Using GATK jar /home/myuser/miniforge3/envs/bioinfo/share/gatk4-4.6.2.0-0/gatk-package-4.6.2.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/myuser/miniforge3/envs/bioinfo/share/gatk4-4.6.2.0-0/gatk-package-4.6.2.0-local.jar ValidateSamFile -I read_2a883449-02ce-4bca-8f6b-3cf7857caf61.sam -R /path/to/assemblies/GRCh38-P14/GRCh38.primary_assembly.genome.fa
16:26:34.351 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/myuser/miniforge3/envs/bioinfo/share/gatk4-4.6.2.0-0/gatk-package-4.6.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed May 21 16:26:34 EDT 2025] ValidateSamFile --INPUT read_2a883449-02ce-4bca-8f6b-3cf7857caf61.sam --REFERENCE_SEQUENCE /path/to/assemblies/GRCh38-P14/GRCh38.primary_assembly.genome.fa --MODE VERBOSE --MAX_OUTPUT 100 --IGNORE_WARNINGS false --VALIDATE_INDEX true --INDEX_VALIDATION_STRINGENCY EXHAUSTIVE --IS_BISULFITE_SEQUENCED false --MAX_OPEN_TEMP_FILES 8000 --SKIP_MATE_VALIDATION false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Wed May 21 16:26:34 EDT 2025] Executing as [email protected] on Linux 4.18.0-425.19.2.el8_7.x86_64 amd64; OpenJDK 64-Bit Server VM 23.0.1+11-39; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.6.2.0
ERROR::MISSING_READ_GROUP:Read groups is empty
WARNING::RECORD_MISSING_READ_GROUP:Read name 2a883449-02ce-4bca-8f6b-3cf7857caf61, A record is missing a read group
ERROR 2025-05-21 16:26:35 ValidateSamFile Exception counting mismatches for read 2a883449-02ce-4bca-8f6b-3cf7857caf61 0b aligned to chr1:11425-11571.
[Wed May 21 16:26:35 EDT 2025] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=811073536
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Tool returned:
-1
The record in question:
2a883449-02ce-4bca-8f6b-3cf7857caf61 272 chr1 11425 0 188S147M80S * 0 0 * * NM:i:3 ms:i:138 AS:i:138 nn:i:0 tp:A:S cm:i:33 s1:i:129 de:f:0.0204 rl:i:46
The * in the SEQ and QUAL fields indicates that the read sequence and base qualities were omitted for this record, which the SAM/BAM specification allows.
Running CleanSam does not fix this error. In fact, the attached SAM file was already processed with CleanSam.
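To make the conflict concrete, here is a minimal Python sketch that parses the record above. It is not part of GATK, Picard, or HTSJDK, and the summarize helper and its result keys are hypothetical names chosen for illustration. It shows that the CIGAR describes 415 read bases while SEQ stores none, which is consistent with the 0b in the error message. My assumption is that the NM check needs the read bases to count mismatches against the reference and fails when they are absent; I have not confirmed this in the source.

```python
import re

# The record from this issue. Fields are split on whitespace because the
# tabs were lost when the record was pasted.
RECORD = (
    "2a883449-02ce-4bca-8f6b-3cf7857caf61 272 chr1 11425 0 188S147M80S "
    "* 0 0 * * NM:i:3 ms:i:138 AS:i:138 nn:i:0 tp:A:S cm:i:33 s1:i:129 "
    "de:f:0.0204 rl:i:46"
)

QUERY_OPS = set("MIS=X")  # CIGAR operations that consume read bases
REF_OPS = set("MDN=X")    # CIGAR operations that consume reference bases


def summarize(record):
    """Hypothetical helper: extract the fields relevant to this issue."""
    fields = record.split()
    flag, pos = int(fields[1]), int(fields[3])
    cigar, seq = fields[5], fields[9]
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    ref_len = sum(n for n, op in ops if op in REF_OPS)
    tag_names = {tag.split(":", 2)[0] for tag in fields[11:]}
    return {
        "secondary": bool(flag & 0x100),   # 0x100 = secondary alignment
        "reverse": bool(flag & 0x10),      # 0x10 = reverse strand
        "seq_omitted": seq == "*",
        "stored_bases": 0 if seq == "*" else len(seq),
        "cigar_query_len": sum(n for n, op in ops if op in QUERY_OPS),
        "ref_end": pos + ref_len - 1,      # 1-based inclusive end position
        "has_nm": "NM" in tag_names,
    }


if __name__ == "__main__":
    for key, value in summarize(RECORD).items():
        print(f"{key}: {value}")
```

The record is a secondary alignment (flag 272 = 256 + 16) that carries an NM tag but no bases, and its computed reference end of 11571 matches the interval chr1:11425-11571 reported in the error.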