Description
This is more of an issue with GATK itself but since there is a known workaround I post my attempts to resolve it, as requested by @maxulysse .
I am running Sarek on non-model organism with a smallish genome. I include non variant sites (see #1870) which requires the inclusion of the intervals file when creating the GenomicsDB (#1872). I have multiple populations and the workflow works fine for the smaller samples, but for the largest (n~100), the workflow fails due to a memory leak, despite the modest sample size. This turned out to be a known GATK issue (broadinstitute/gatk#8989). The proposed fix is
Our temporary solution until we make an updated release would be to convert imported genomicsDB instances to GVCF using
gatk SelectVariants -V gendb://instancename -O GVCF_export.g.vcf.gz -R ref.fa -L whateverintervalusedinGDBimport
Therefore, as a quick fix I have modified modules/nf-core/gatk4/genotypegvcfs/main.nf
to perform two steps, namely 1. extract regions from the DB to a GVCF and 2. use GVCF as input to GenotypeGVCFs:
gatk --java-options "-Xmx${avail_mem}M -XX:-UsePerfData" \\
SelectVariants \\
--variant $input_command \\
--output ${prefix}.g.vcf.gz \\
--reference $fasta \\
$interval_command \\
--tmp-dir .
gatk --java-options "-Xmx${avail_mem}M -XX:-UsePerfData" \\
GenotypeGVCFs \\
--variant ${prefix}.g.vcf.gz \\
--output ${prefix}.vcf.gz \\
--reference $fasta \\
$interval_command \\
$dbsnp_command \\
--tmp-dir . \\
$args
rm -f ${prefix}.g.vcf.gz
I am running the workflow on a SLURM cluster. Although the fix does create the desired output file, Nextflow exits with a cryptic error along the lines of not being able to write to a trace file. I have set trace.overwrite = true
(see nextflow-io/nextflow#3317), to no avail.
Since the genotyping step takes a day or two to finish on my data, I have tried to reproduce the error using one of the test data sets:
nextflow run main.nf -profile test,uppmax -params-file params.yml --project $NAISS_COMPUTE_ID --input ./tests/csv/3.0/fastq_pair.csv --output results
Omitting the uppmax
profile works fine, confirming that the fix is ok, but the above command fails with the error
ERROR ~ Cannot invoke method getName() on null object
-- Check script './workflows/sarek/../../subworkflows/local/channel_variant_calling_create_csv/main.nf' at line: 16 or see '.nextflow.log' file for more details
The full java stack trace does not really help. The line that seems to be the culprit in subworkflows/local/channel_variant_calling_create_csv/main.nf
is
main:
// Creating csv files to restart from this step
vcf_to_annotate.collectFile(keepHeader: true, skip: 1,sort: true, storeDir: "${outdir}/csv"){ meta, vcf ->
patient = meta.patient
sample = meta.id
variantcaller = meta.variantcaller
vcf = "${outdir}/variant_calling/${variantcaller}/${meta.id}/${vcf.getName()}"
["variantcalled.csv", "patient,sample,variantcaller,vcf\n${patient},${sample},${variantcaller},${vcf}\n"]
}
where the vcf
variable is defined.
The proper solution would be to run SelectVariants
separately after the creation of GenomicsDB
. I was hoping I could cut a corner, but apparently not, it seems.