Skip to content

GATK GenotypeGVCFs memory issue when including non-variant-sites and using GenomicsDB as input #1901

Open
@percyfal

Description

@percyfal

This is more of an issue with GATK itself but since there is a known workaround I post my attempts to resolve it, as requested by @maxulysse .

I am running Sarek on non-model organism with a smallish genome. I include non variant sites (see #1870) which requires the inclusion of the intervals file when creating the GenomicsDB (#1872). I have multiple populations and the workflow works fine for the smaller samples, but for the largest (n~100), the workflow fails due to a memory leak, despite the modest sample size. This turned out to be a known GATK issue (broadinstitute/gatk#8989). The proposed fix is

Our temporary solution until we make an updated release would be to convert imported genomicsDB instances to GVCF using

gatk SelectVariants -V gendb://instancename -O GVCF_export.g.vcf.gz -R ref.fa -L whateverintervalusedinGDBimport

Therefore, as a quick fix I have modified modules/nf-core/gatk4/genotypegvcfs/main.nf to perform two steps, namely 1. extract regions from the DB to a GVCF and 2. use GVCF as input to GenotypeGVCFs:

gatk --java-options "-Xmx${avail_mem}M -XX:-UsePerfData" \\
        SelectVariants \\
        --variant $input_command \\
        --output ${prefix}.g.vcf.gz \\
        --reference $fasta \\
        $interval_command \\
        --tmp-dir .

    gatk --java-options "-Xmx${avail_mem}M -XX:-UsePerfData" \\
        GenotypeGVCFs \\
        --variant ${prefix}.g.vcf.gz \\
        --output ${prefix}.vcf.gz \\
        --reference $fasta \\
        $interval_command \\
        $dbsnp_command \\
        --tmp-dir . \\
        $args

    rm -f ${prefix}.g.vcf.gz

I am running the workflow on a SLURM cluster. Although the fix does create the desired output file, Nextflow exits with a cryptic error along the lines of not being able to write to a trace file. I have set trace.overwrite = true (see nextflow-io/nextflow#3317), to no avail.

Since the genotyping step takes a day or two to finish on my data, I have tried to reproduce the error using one of the test data sets:

nextflow run main.nf -profile test,uppmax -params-file params.yml --project $NAISS_COMPUTE_ID --input ./tests/csv/3.0/fastq_pair.csv --output results

Omitting the uppmax profile works fine, confirming that the fix is ok, but the above command fails with the error

ERROR ~ Cannot invoke method getName() on null object

 -- Check script './workflows/sarek/../../subworkflows/local/channel_variant_calling_create_csv/main.nf' at line: 16 or see '.nextflow.log' file for more details

The full java stack trace does not really help. The line that seems to be the culprit in subworkflows/local/channel_variant_calling_create_csv/main.nf is

    main:
        // Creating csv files to restart from this step
        vcf_to_annotate.collectFile(keepHeader: true, skip: 1,sort: true, storeDir: "${outdir}/csv"){ meta, vcf ->
            patient       = meta.patient
            sample        = meta.id
            variantcaller = meta.variantcaller
            vcf = "${outdir}/variant_calling/${variantcaller}/${meta.id}/${vcf.getName()}"
            ["variantcalled.csv", "patient,sample,variantcaller,vcf\n${patient},${sample},${variantcaller},${vcf}\n"]
        }

where the vcf variable is defined.

The proper solution would be to run SelectVariants separately after the creation of GenomicsDB. I was hoping I could cut a corner, but apparently not, it seems.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions