Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read Group not added to bam files generated by bowtie2, which causes GATK genotyping to fail #655

Closed
5 tasks done
IdoBar opened this issue Jan 10, 2021 · 6 comments
Closed
5 tasks done
Assignees
Labels
bug Something isn't working

Comments

@IdoBar
Copy link
Contributor

IdoBar commented Jan 10, 2021

Check Documentation

I have checked the following places for your error:

Description of the bug

Read Group information is not added to the bam files generated by bowtie2, which causes GATK ug genotyping to fail.

Steps to reproduce

Steps to reproduce the behaviour:

  1. Command line: nextflow run nf-core/eager -r 2.2.2 -params-file bt2local.gatkug.json -c /home/ibar/.nextflow/awoonga.config -resume
    Content of bt2local.gatkug.json:
{
    "input": "Dingo_aDNA_NF1.tsv",
    "fasta": "\/30days\/ibar/\/data\/Dingo\/reference_genome\/CanFam3.1.fasta",
    "fasta_index": "\/30days\/ibar\/data\/Dingo\/reference_genome\/CanFam3.1.fasta.fai",
    "seq_dict": "\/30days\/ibar\/data\/Dingo\/reference_genome\/CanFam3.1.dict",
    "save_reference": "true",
    "email": "[email protected]",
    "skip_fastqc": "true",
    "skip_trim": "true",
    "mergedonly": "true",
    "mapper": "bowtie2",
    "hostremoval_mode": "",
    "run_bam_filtering": "true",
    "bam_mapping_quality_threshold": "10",
    "bam_filter_minreadlength": "20",
    "dedupper": "dedup",
    "dedup_all_merged": "true",
    "run_pmdtools": "true",
    "run_bedtools_coverage": "true",
    "anno_file": "\/30days\/ibar\/data\/Dingo\/reference_genome\/CanFam3.1.gff",
    "run_trim_bam": "true",
    "bamutils_softclip": "true",
    "run_genotyping": "true",
    "genotyping_source": "trimmed",
    "run_mtnucratio": "true",
    "mtnucratio_header": "NC_002008.4"
}
  1. See error:
Error executing process > 'genotyping_ug (D12)'

Caused by:
  Process `genotyping_ug (D12)` terminated with an error exit status (1)

Command executed:

  samtools index -b D12.trimmed.bam
  gatk3 -T RealignerTargetCreator -R CanFam3.1.fasta -I D12.trimmed.bam -nt 2 -o D12.intervals 
  gatk3 -T IndelRealigner -R CanFam3.1.fasta -I D12.trimmed.bam -targetIntervals D12.intervals -o D12.realign.bam 
  gatk3 -T UnifiedGenotyper -R CanFam3.1.fasta -I D12.realign.bam -o D12.unifiedgenotyper.vcf -nt 2 --genotype_likelihoods_model SNP -stand_call_conf 30 --sample_ploidy 2 -dcov 250 --output_mode EMIT_VARIANTS_ONLY 
  
  rm D12.realign.{bam,bai}
  
  pigz -p 2 D12.unifiedgenotyper.vcf

Command exit status:
  1

Command output:
  (empty)

Command error:
  INFO  20:30:52,376 HelpFormatter - -------------------------------------------------------------------------------- 
  INFO  20:30:52,380 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56 
  INFO  20:30:52,381 HelpFormatter - Copyright (c) 2010 The Broad Institute 
  INFO  20:30:52,381 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
  INFO  20:30:52,384 HelpFormatter - Program Args: -T RealignerTargetCreator -R CanFam3.1.fasta -I D12.trimmed.bam -nt 2 -o D12.intervals 
  INFO  20:30:52,390 HelpFormatter - Executing as [email protected] on Linux 3.10.0-693.5.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_144-b01. 
  INFO  20:30:52,390 HelpFormatter - Date/Time: 2021/01/10 20:30:52 
  INFO  20:30:52,391 HelpFormatter - -------------------------------------------------------------------------------- 
  INFO  20:30:52,391 HelpFormatter - -------------------------------------------------------------------------------- 
  INFO  20:30:52,514 GenomeAnalysisEngine - Strictness is SILENT 
  INFO  20:30:52,772 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
  INFO  20:30:52,778 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
  INFO  20:30:52,919 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.14 
  INFO  20:30:54,823 GATKRunReport - Uploaded run statistics report to AWS S3 
  ##### ERROR ------------------------------------------------------------------------------------------
  ##### ERROR A USER ERROR has occurred (version 3.5-0-g36282e4): 
  ##### ERROR
  ##### ERROR This means that one or more arguments or inputs in your command are incorrect.
  ##### ERROR The error message below tells you what is the problem.
  ##### ERROR
  ##### ERROR If the problem is an invalid argument, please check the online documentation guide
  ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
  ##### ERROR
  ##### ERROR Visit our website and forum for extensive documentation and answers to 
  ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
  ##### ERROR
  ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
  ##### ERROR
  ##### ERROR MESSAGE: SAM/BAM/CRAM file D12.trimmed.bam is malformed. Please see http://gatkforums.broadinstitute.org/discussion/1317/collected-faqs-about-input-files-for-sequence-read-data-bam-cramfor more information. Error details: SAM file doesn't have any read groups defined in the header.  The GATK no longer supports SAM files without read groups

Expected behaviour

The bam files should have Read Group information to allow for subsequent genotyping with gatk.

Log files

Have you provided the following extra information/files:

  • The command used to run the pipeline
  • The .nextflow.log file
    Dingo_aDNA_NF2.log
  • The exact error: see above

System

  • Hardware: HPC
  • Executor: PBSPro
  • OS: RHEL
  • Version Linux 3.10.0-693.5.2.el7.x86_64
  • Runtime: Groovy 3.0.5 on OpenJDK 64-Bit Server VM 11.0.1+13-LTS

Nextflow Installation

  • Version: 20.10.0 build 5430

Container engine

  • Engine: Singularity
  • version: 3.6.3
  • Image tag: nfcore/eager:2.2.2

Additional context

This is the output of picard ValidateSamFile -I D12_PE.mapped.bam:

08:51:19.221 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gpfs1/homes/ibar/.pyenv/versions/miniconda-latest/envs/aDNA/share/picard-2.23.3-0/picard.jar!/com/intel/gkl/native
/libgkl_compression.so
[Mon Jan 11 08:51:19 AEST 2021] ValidateSamFile --INPUT D12_PE.mapped.bam --MODE VERBOSE --MAX_OUTPUT 100 --IGNORE_WARNINGS false --VALIDATE_INDEX true --INDEX_VALIDATION_STRINGENCY EXHAUSTIVE --IS_BI
SULFITE_SEQUENCED false --MAX_OPEN_TEMP_FILES 8000 --SKIP_MATE_VALIDATION false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE
_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Mon Jan 11 08:51:19 AEST 2021] Executing as [email protected] on Linux 3.10.0-693.5.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_265-b11; Deflater: Intel; Inflater: Intel; Provider GCS is not
 available; Picard version: Version:2.23.3
WARNING 2021-01-11 08:51:19     ValidateSamFile NM validation cannot be performed without the reference. All other validations will still occur.
ERROR::POORLY_FORMATTED_HEADER_TAG:File /gpfs1/scratch/30days/ibar/data/Dingo/Dingo_aDNA_NF2_process_10_01_2021/work/2d/c7fc0b43e86023978259162367df06/D12_PE.mapped.bam, Error parsing SAM header. Prob
lem parsing @PG key:value pair. Line:
@PG     ID:bowtie2      PN:bowtie2      VN:     CL:"/opt/conda/envs/nf-core-eager-2.2.2/bin/bowtie2-align-s --wrapper basic-0 -x reference_genome/CanFam3.1.fasta -p 4 --sensitive-local -1 D12_1.sample
d1M.trimmed.fq.gz -2 D12_2.sampled1M.trimmed.fq.gz"
ERROR::MISSING_READ_GROUP:Read groups is empty
WARNING::RECORD_MISSING_READ_GROUP:Read name ST-E00127:1042:H7TV7CCX2:6:2104:31893:69045, A record is missing a read group
WARNING::RECORD_MISSING_READ_GROUP:Read name ST-E00127:1042:H7TV7CCX2:6:2104:31893:69045, A record is missing a read group
WARNING::RECORD_MISSING_READ_GROUP:Read name ST-E00127:1042:H7TV7CCX2:6:2109:21186:69414, A record is missing a read group
WARNING::RECORD_MISSING_READ_GROUP:Read name ST-E00127:1042:H7TV7CCX2:6:2109:21186:69414, A record is missing a read group
WARNING::RECORD_MISSING_READ_GROUP:Read name ST-E00127:1042:H7TV7CCX2:6:2115:4178:71629, A record is missing a read group
...
@IdoBar IdoBar added the bug Something isn't working label Jan 10, 2021
@IdoBar
Copy link
Contributor Author

IdoBar commented Jan 10, 2021

I looked a bit further into the bowtie2 commands and found out that RG are not specified at all:

#!/bin/bash -euo pipefail
bowtie2 -x reference_genome/CanFam3.1.fasta -1 D12_1.sampled1M.trimmed.fq.gz -2 D12_2.sampled1M.trimmed.fq.gz -p 4 --sensitive-local     2> "D12"_bt2.log | samtools sort -@ 4 -O bam > "D12"_"PE".mapped.bam
samtools index "D12"_"PE".mapped.bam

The RG info might be added in a later step that I may be skipping, but it would be best to add them during the initial alignment (or right after).
An example of how the RG info should be added to the bowtie2 command (see Read Group requirements for GATK):

bowtie2 -x reference_genome/CanFam3.1.fasta -1 D12_1.sampled1M.trimmed.fq.gz -2 D12_2.sampled1M.trimmed.fq.gz -p 4 --sensitive-local --rg-id <Sample_Name> --rg SM:<Sample_Name> --rg LB:<Library_ID> --rg PU:<Sample_Name_Library_ID> PL:<ILLUMINA>

IdoBar added a commit to IdoBar/eager that referenced this issue Jan 10, 2021
Add Read Group information to `bowtie2` mapping to fix issue [nf-core#655 ]
@IdoBar
Copy link
Contributor Author

IdoBar commented Jan 10, 2021

I've added the PR to fix this issue (adding exactly the same RG info as the bwa alignment).
Sorry, but I'm not familiar with how to run the tests and checks, so I'll leave it like that for you to implement this properly.

@jfy133 jfy133 linked a pull request Jan 11, 2021 that will close this issue
11 tasks
@jfy133 jfy133 self-assigned this Jan 11, 2021
@jfy133 jfy133 added this to the 2.3.1 - Patch Release milestone Jan 11, 2021
IdoBar added a commit to IdoBar/eager that referenced this issue Jan 13, 2021
@jfy133 jfy133 closed this as completed Jan 14, 2021
@kclem
Copy link

kclem commented Sep 28, 2023

Updating @IdoBar 's solution for an example of how the RG info should be added to the bowtie2 command: The last PL tag also needs the --rg parameter.

bowtie2 -x reference_genome/CanFam3.1.fasta -1 D12_1.sampled1M.trimmed.fq.gz -2 D12_2.sampled1M.trimmed.fq.gz -p 4 --sensitive-local --rg-id <Sample_Name> --rg SM:<Sample_Name> --rg LB:<Library_ID> --rg PU:<Sample_Name_Library_ID> --rg PL:<ILLUMINA>

@jfy133
Copy link
Member

jfy133 commented Sep 28, 2023

@TCLamnidis could you also add to your 2.5 PR?

@TCLamnidis
Copy link
Collaborator

#1021

@jfy133
Copy link
Member

jfy133 commented Oct 13, 2023

@kclem @TCLamnidis will have a couple of questions for you regarding the exact implementation going into here #1021, as per your request

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants