GATK GenotypeGVCFs memory issue when including non-variant-sites and using GenomicsDB as input

This is more of an issue with GATK itself but since there is a known workaround I post my attempts to resolve it, as requested by @maxulysse .

I am running Sarek on non-model organism with a smallish genome. I include non variant sites (see #1870) which requires the inclusion of the intervals file when creating the GenomicsDB (#1872). I have multiple populations and the workflow works fine for the smaller samples, but for the largest (n~100), the workflow fails due to a memory leak, despite the modest sample size. This turned out to be a known GATK issue (https://github.com/broadinstitute/gatk/issues/8989). The proposed fix is

> Our temporary solution until we make an updated release would be to convert imported genomicsDB instances to GVCF using
> 
> gatk SelectVariants -V gendb://instancename -O GVCF_export.g.vcf.gz -R ref.fa -L whateverintervalusedinGDBimport

Therefore, as a quick fix I have modified `modules/nf-core/gatk4/genotypegvcfs/main.nf` to perform two steps, namely 1. extract regions from the DB to a GVCF and 2. use GVCF as input to GenotypeGVCFs:

```
gatk --java-options "-Xmx${avail_mem}M -XX:-UsePerfData" \\
        SelectVariants \\
        --variant $input_command \\
        --output ${prefix}.g.vcf.gz \\
        --reference $fasta \\
        $interval_command \\
        --tmp-dir .

    gatk --java-options "-Xmx${avail_mem}M -XX:-UsePerfData" \\
        GenotypeGVCFs \\
        --variant ${prefix}.g.vcf.gz \\
        --output ${prefix}.vcf.gz \\
        --reference $fasta \\
        $interval_command \\
        $dbsnp_command \\
        --tmp-dir . \\
        $args

    rm -f ${prefix}.g.vcf.gz
```

I am running the workflow on a SLURM cluster. Although the fix does create the desired output file, Nextflow exits with a cryptic error along the lines of not being able to write to a trace file. I have set `trace.overwrite = true` (see https://github.com/nextflow-io/nextflow/issues/3317), to no avail.

Since the genotyping step takes a day or two to finish on my data, I have tried to reproduce the error using one of the test data sets:

```
nextflow run main.nf -profile test,uppmax -params-file params.yml --project $NAISS_COMPUTE_ID --input ./tests/csv/3.0/fastq_pair.csv --output results
```

Omitting the `uppmax` profile works fine, confirming that the fix is ok, but the above command fails with the error 

```
ERROR ~ Cannot invoke method getName() on null object

 -- Check script './workflows/sarek/../../subworkflows/local/channel_variant_calling_create_csv/main.nf' at line: 16 or see '.nextflow.log' file for more details
```

The full java stack trace does not really help. The line that seems to be the culprit in `subworkflows/local/channel_variant_calling_create_csv/main.nf` is 

```
    main:
        // Creating csv files to restart from this step
        vcf_to_annotate.collectFile(keepHeader: true, skip: 1,sort: true, storeDir: "${outdir}/csv"){ meta, vcf ->
            patient       = meta.patient
            sample        = meta.id
            variantcaller = meta.variantcaller
            vcf = "${outdir}/variant_calling/${variantcaller}/${meta.id}/${vcf.getName()}"
            ["variantcalled.csv", "patient,sample,variantcaller,vcf\n${patient},${sample},${variantcaller},${vcf}\n"]
        }
```

where the `vcf` variable is defined.

The proper solution would be to run `SelectVariants` separately after the creation of `GenomicsDB`. I was hoping I could cut a corner, but apparently not, it seems.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

GATK GenotypeGVCFs memory issue when including non-variant-sites and using GenomicsDB as input #1901

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

GATK GenotypeGVCFs memory issue when including non-variant-sites and using GenomicsDB as input #1901

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions