Releases: broadinstitute/gatk
4.6.2.0
Download release: gatk-4.6.2.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.6.2.0 release:
-
Funcotator Data Location Moved
We've moved the location that FuncotatorDataSourceDownloader pulls data from because it turned out to be rather expensive to host it there. If you use this in a pipeline we would appreciate it if you updated to the new version. (#9131)
- Old:
  - gs://broad-public-datasets/funcotator/
  - https://console.cloud.google.com/storage/browser/broad-public-datasets/funcotator
- New:
  - gs://gcp-public-data--broad-references/funcotator/
  - https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/funcotator
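For pipelines that store Funcotator data source paths in their own configuration, the bucket move amounts to a prefix swap. The following is a minimal sketch of that rewrite; the helper name `migrate_funcotator_uri` and the example file name are hypothetical and not part of GATK.

```python
# Bucket prefixes taken from the release note above.
OLD_PREFIX = "gs://broad-public-datasets/funcotator/"
NEW_PREFIX = "gs://gcp-public-data--broad-references/funcotator/"


def migrate_funcotator_uri(uri: str) -> str:
    """Rewrite a URI under the old Funcotator bucket to the new bucket.

    URIs that do not start with the old prefix are returned unchanged.
    """
    if uri.startswith(OLD_PREFIX):
        return NEW_PREFIX + uri[len(OLD_PREFIX):]
    return uri


# Hypothetical object name, used only to show the rewrite.
example = migrate_funcotator_uri(OLD_PREFIX + "example-datasources.tar.gz")
print(example)
```

Upgrading to the new GATK version remains the preferred fix; a rewrite like this only helps where the old bucket path is hard-coded outside of GATK.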
-
New SV Tools
There are several new tools to work with SV data from GATK-SV:
- SVStratify and GroupedSVCluster (#8990)
-
CallableLoci was ported from GATK3 since it is useful in some situations. (#9031)
-
New BQSR argument --allow-missing-read-group to work around a rare but annoying issue where BQSR fails if a read group is completely filtered from the training data but present at application time. (#9020)
Full list of changes:
-
New Tools
-
Flow Mode Calling
- Tiny performance improvement #9077
-
Mutect2+
-
Funcotator
- Updated references to the funcotator datasets bucket to point to the new google bucket by @KevinCLydon in #9131
-
SV Calling
- Prioritize het calls when merging clustered SVs #9058
-
Notable Enhancements
- BQSR: avoid throwing an error when read group is missing in the recal table, and some refactoring. by @takutosato in #9020
-
Bug Fixes
-
Miscellaneous Changes
- Option to retain source IDs on VariantContext merge #9032
-
Documentation
- Update Python compatibility information in README.md #9047
-
Dependencies
Many dependencies updated, including bug fixes and security patches
- Update Htsjdk 4.1.3 -> 4.2.0 in
- Update Picard 3.3.0 -> 3.4.0 #9143
- Update logback-core from 1.4.14 to 1.5.13 #9079
- Update GenomicsDB #9059
- Update Netty #9120
- Exclude bad version of bouncycastle library #9129
- Bump org.apache.commons:commons-vfs2 from 2.9.0 to 2.10.0 #9130
- Update parquet to 1.15.1 #9144
-
Developer Infrastructure
Full Changelog: 4.6.1.0...4.6.2.0
4.6.1.0
Download release: gatk-4.6.1.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.6.1.0 release:
- Modernize the aging Conda environment with up to date python dependencies. All the python tools have been updated appropriately. This will enable easier integration of new machine learning tools.
- If you use python tools outside of the docker, you must rebuild your conda environment for this release
- CNNScoreVariants has been replaced by NVScoreVariants, a rewritten and modernized version. The python code for this tool was written by members of NVIDIA Genomics Research.
  - Thank you Babak Zamirai, Ankit Sethia, Mehrzad Samadi, George Vacek and the whole NVIDIA genomics team!
- This GATK blog post has more of the story from when we first made the tool available for testing.
- New Funcotator argument --prefer-mane-transcripts which improves transcript selection and lays groundwork for upcoming improvements.
- New argument --variant-output-filtering which lets you restrict output variants based on the input intervals. This replaces and improves on --only-output-calls-starting-in-interval and works with SelectVariants and other VariantWalkers. This is useful to prevent duplicating variants when splitting an input VCF into multiple shards.
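To see why interval-based output filtering prevents duplicates when sharding, consider a variant that spans a shard boundary. The sketch below compares an "overlaps the shard" rule with a "starts in the shard" rule. The function names and coordinates are illustrative assumptions and do not reflect the argument's actual option values.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def overlaps(variant: Interval, shard: Interval) -> bool:
    """True if the variant shares at least one base with the shard."""
    return variant[0] <= shard[1] and shard[0] <= variant[1]


def starts_in(variant: Interval, shard: Interval) -> bool:
    """True if the variant's first base lies inside the shard."""
    return shard[0] <= variant[0] <= shard[1]


# Two adjacent shards and a deletion that crosses the boundary between them.
shards: List[Interval] = [(1, 100), (101, 200)]
deletion: Interval = (98, 105)

emitted_by_overlap = [s for s in shards if overlaps(deletion, s)]
emitted_by_start = [s for s in shards if starts_in(deletion, s)]

print(len(emitted_by_overlap))  # the deletion is written by both shards
print(len(emitted_by_start))    # the deletion is written by exactly one shard
```

Because every variant has exactly one start position, a start-based rule assigns it to a single shard, so concatenating the shard outputs does not repeat it.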
Full list of changes:
-
CNNScoreVariants -> NVScoreVariants (#8004, #9010, #9009)
- CNNScoreVariants has been replaced by NVScoreVariants. Scripts that use it should be updated to use NVScoreVariants instead.
- The training tools (CNNVariantTrain, CNNVariantWriteTensors) have been removed. If you need to retrain the model for your data type you should continue to use GATK 4.6.0.0. New training tools are in development to work alongside NVScoreVariants and will be added in subsequent releases.
-
New Tools
-
Joint Calling GVS
- Adds QD and AS_QD emission from VariantAnnotator on GVS input (#8978)
-
GenomicsDB
- Switch to logging a warning instead of an exception for intervals in query that were not part of GenomicsDBImport (#8987)
-
Funcotator
- Added a '--prefer-mane-transcripts' mode that enforces MANE_Select tagged Gencode transcripts where possible (#9012)
-
SV Calling
- Handle CTX_PP/QQ and CTX_PQ/QP CPX_TYPE values in SVConcordance (#8885)
- Complex SV intervals support by @mwalker174 (#8521)
- Require both overlap and breakend proximity for depth-only SV clustering (#8962)
-
Flow Based Calling
- Modified HaplotypeBasedVariantRecaller to support non-flow reads (#8896)
- FlowFeatureMapper: X_FILTERED_COUNT semantics adjusted and documented more accurately (#8894)
- Changes to flow arguments in haplotype caller from Picard (see Picard release notes)
-
Miscellaneous Features
- Added a check for whether files can be created and executed within the configured tmp-dir (#8951)
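A check of this kind can be approximated outside GATK before launching a long job, for example to catch a tmp-dir mounted with noexec. The sketch below is not GATK's implementation. It assumes a POSIX system with /bin/sh, and the function name `tmp_dir_is_usable` is hypothetical.

```python
import os
import stat
import subprocess
import tempfile


def tmp_dir_is_usable(tmp_dir: str) -> bool:
    """Return True if a file can be created and executed inside tmp_dir."""
    try:
        fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".sh")
    except OSError:
        # Directory is missing or not writable.
        return False
    try:
        # Write a trivial script and close it before trying to execute it.
        with os.fdopen(fd, "w") as handle:
            handle.write("#!/bin/sh\nexit 0\n")
        os.chmod(path, stat.S_IRWXU)
        result = subprocess.run([path], check=False)
        return result.returncode == 0
    except OSError:
        # Execution was refused, e.g. on a noexec mount.
        return False
    finally:
        os.remove(path)


print(tmp_dir_is_usable(tempfile.gettempdir()))
```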
-
Documentation
- Clarify in the README which git lfs files are required to build GATK (#8914)
- Add docs about citing GATK (#8947)
- Update Mutect2.java Documentation (#8999)
- Add more detailed conda setup instructions to the GATK README (#9001)
- Add small warning messages advising users not to feed GVCF files to these tools (#9008)
-
Refactoring
- Swapped mito mode in Mutect to use the mode argument utils (#8986)
-
Tests
-
Dependencies
Updating dependencies to make use of modern frameworks with fewer vulnerabilities was a focus of this release.
-
Updated Python and PyMC, removed TensorFlow, and added PyTorch in conda environment. (#8561)
-
Rebuild gatk-base docker image (3.3.1) in order to pull in recent patches (#9005)
-
Updates to java build and dependencies (#8998, #9006, #9016)
- Update to Gradle 8.10.2
- Improvements to build.gradle to make use of features like consuming published Bills of Materials (BOMs)
- Update many direct and transitive java dependencies to fix security vulnerabilities.
- Update Htsjdk 4.1.1 to 4.1.3
- Update Picard 3.2.0 to 3.3.0
- Update hdf5-java-bindings to version 1.2.0-hdf5_2.11.0 (#8908)
-
4.6.0.0
Download release: gatk-4.6.0.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.6.0.0 release:
-
We've fixed a serious CRAM writing bug that affects GATK versions 4.3 through 4.5 and Picard versions 2.27.3 through 3.1.1. This bug can, in limited cases, lead to reads with an incorrect base sequence being written. See this comment to GATK issue 8768 and the full release notes below for more details on what conditions trigger the bug.
- To help users detect whether their CRAM files are affected, we've released a CRAM scanning tool called CRAMIssue8768Detector that can detect whether a particular CRAM file is affected by this bug. If you suspect that some of your CRAM files may have been affected, please run this tool on them for confirmation!
-
By overwhelming popular demand, we've switched back to using the standard ./. representation for no-calls in GenotypeGVCFs and GenomicsDB instead of 0/0 with DP=0. This reverts the change described in our article GenotypeGVCFs and the death of the dot.
- We intend to publish a new article shortly to replace that older article with further details on this change. When we do so, we'll link to it from here.
-
The Mutect2 germline resource can now have split multiallelic format
-
Added an --inverted-read-filter argument to allow for selecting reads that fail read filters from the command line easily
-
We've fixed a number of issues with HTTP support, mainly affecting the loading of side inputs such as indices over HTTP
-
Reduced the number of layers in the GATK docker image to help users running into docker quota issues
Full list of changes:
-
Important CRAM writing bug fix and detection tool
- We've updated to HTSJDK 4.1.1 and Picard 3.2.0 (#8900), which fix a serious bug in the CRAM writing code first reported in GATK issue 8768
- This issue affects GATK versions 4.3.0.0 through 4.5.0.0, and is fixed in GATK 4.6.0.0.
- This issue also affects Picard versions 2.27.3 through 3.1.1, and is fixed in Picard 3.2.0.
- The bug is triggered when writing a CRAM file using one of the affected GATK/Picard versions, and both of the following conditions are met:
  - At least one read is mapped to the very first base of a reference contig
  - The file contains more than one CRAM container (10,000 reads) with reads mapped to that same reference contig
- When both of these conditions are met, the resulting CRAM file may have corrupt containers associated with that contig containing reads with an incorrect sequence.
- Since many common references such as hg38 have N's at the very beginning of the autosomes and X/Y, many pipelines will not be affected by this bug. However, users of a telomere-to-telomere reference, users doing mitochondrial calling, and users with reads aligned to the alt sequences will want to scan their CRAM files for possible corruption.
- The other mitigating circumstance is that when a CRAM is affected, the signal will be overwhelmingly obvious, with the mismatch rate typically jumping from sub-1% to 80-90% for the affected regions, making it likely to be caught by standard QC processes.
- We've released a CRAM scanning tool called CRAMIssue8768Detector (#8819) that can detect whether a particular CRAM file is affected by this bug. If you suspect that some of your CRAM files may have been affected, please run this tool on them for confirmation!
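The two trigger conditions above can be restated as a small predicate over per-container summaries. This is only a sketch of the stated conditions, not a replacement for CRAMIssue8768Detector, which inspects the actual file. The `ContainerSummary` type and the example values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class ContainerSummary(NamedTuple):
    """Hypothetical summary of one CRAM container."""
    contig: str
    min_alignment_start: int  # 1-based start of the leftmost read in the container


def contigs_matching_trigger(containers: Iterable[ContainerSummary]) -> List[str]:
    """Return contigs that satisfy both trigger conditions from the release note."""
    container_count: Dict[str, int] = defaultdict(int)
    has_read_at_first_base: Dict[str, bool] = defaultdict(bool)
    for container in containers:
        container_count[container.contig] += 1
        if container.min_alignment_start == 1:
            has_read_at_first_base[container.contig] = True
    return sorted(
        contig
        for contig, count in container_count.items()
        if count > 1 and has_read_at_first_base[contig]
    )


summaries = [
    ContainerSummary("chrM", 1),       # read at the first base, and ...
    ContainerSummary("chrM", 9000),    # ... a second container on the same contig
    ContainerSummary("chr1", 10001),   # two containers, but nothing at base 1
    ContainerSummary("chr1", 50000),
    ContainerSummary("chr2", 1),       # read at base 1, but only one container
]
print(contigs_matching_trigger(summaries))
```

A contig reported here only meets the preconditions. Per the note above, the written file "may" have corrupt containers, so confirmation still requires scanning it.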
-
Joint Calling
- We've switched back to using the standard ./. representation for no-calls in GenotypeGVCFs and GenomicsDB instead of 0/0 with DP=0 (#8715) (#8741) (#8759)
  - This reverts the change described in our article GenotypeGVCFs and the death of the dot
- Fix for GenotypeGVCFs with mixed ploidy sites (#8862)
- Fix for GnarlyGenotyper when PLs are null (#8878)
- Fixed bug in ReblockGVCF when removing annotations (#8870)
- Enable ReblockGVCF to subset AS annotations that aren't "raw" (pipe-delimited) (#8771)
- Remove header lines in ReblockGVCF when we remove FORMAT annotations (#8895)
- ReblockGVCF: Add malaria spanning deletion exception regression test with fix (#8802)
- Restore some GnarlyGenotyper tests (#8893)
-
HaplotypeCaller
- Fix to long deletions that overhang into the assembly window causing exceptions in HaplotypeCaller (#8731)
-
Mutect2
- The Mutect2 germline resource can now have split multiallelic format (#8837)
- Make the Mutect2 haplotype and clustered events filters smarter about germline events (#8717)
- Added the DragSTR model to the Mutect2 WDL (#8716)
- Improvements to Mutect2's Permutect training data mode (#8663)
- Bigger Permutect tensors and Permutect test datasets can be annotated with truth VCF (#8836)
- Mutect2 WDL and GetSampleName can handle multiple sample names in BAM headers (#8859)
- Permutect dataset engine outputs contig and read group indices, not names (#8860)
- Normal artifact LOD is now defined without the extra minus sign (#8668)
-
CNV Calling
- Fixed the GT header in PostprocessGermlineCNVCalls's --output-genotyped-intervals output (#8621)
-
SV Calling
-
Flow-based Calling
-
Notable Enhancements
- Added an --inverted-read-filter argument to allow for selecting reads that fail read filters from the command line easily (#8724)
- Inverted SoftClippedReadFilter to conform to the standard filtering logic (#8888)
- Reduced the number of docker layers in the GATK image from 44 to 16 (#8808)
- VariantFiltration: added a --mask-description argument to write custom mask filter description in VCF header (#8831)
- GatherVcfsCloud is no longer beta (#8680)
-
Miscellaneous Changes
- GetPileupSummaries now uses the standard MappingQualityReadFilter instead of a custom --min-mapping-quality argument (#8781)
- Funcotator: suppress a log message about b37 contigs when not doing b37/hg19 conversion (#8758)
- Output the new image name at the end of a successful cloud docker build (#8627)
- Exclude the test folder from code coverage calculations (#8744)
- Removed deprecated genomes in the cloud docker image that was causing CNN WDL test failures (#8891)
- Re-commit large test files as lfs stubs (#8769)
- Standardize test results directory between normal/docker tests (#8718)
- Improve failure message in VariantContextTestUtils (#8725)
- Update the setup_cloud github action (#8651)
- Parameterize the logging frequency for ProgressLogger in GatherVcfsCloud (#8662)
-
Documentation
- Updated the README to include list of popular software included in docker image (#8745)
-
Dependencies
- Updated HTSJDK to 4.1.1, which fixes the CRAM writing bug described above (#8900)
- Updated Picard to 3.2.0, which fixes the CRAM writing bug described above (#8900)
- Updated GenomicsDB to 1.5.3, which supports M1 Macs and switches no-call representation back to ./. (#8710) (#8759)
- Updated http-nio to 1.1.1, which fixes several URL-handling bugs with HTTP support (#8889)
- Updated several miscellaneous dependencies to fix security vulnerabilities (#8898)
4.5.0.0
Download release: gatk-4.5.0.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.5.0.0 release:
-
HaplotypeCaller now supports custom ploidy regions that can be specified via a new --ploidy-regions argument, overriding the global -ploidy setting
-
The default SmithWaterman implementation for HaplotypeCaller and Mutect2 is now the hardware-accelerated version, resulting in a significant speedup
-
Funcotator has a new datasource release that brings in the latest version of Gencode and several other key data sources
-
We've updated our dependencies and our docker environment to greatly cut down on known security vulnerabilities
-
We've greatly improved support for http/https inputs in GATK-native tools (though most Picard tools bundled with GATK do not yet support it)
-
We've ported some additional DRAGEN features to HaplotypeCaller that bring us closer to functional equivalence with DRAGEN v3.7.8
-
GenomicsDBImport now has support for Azure storage az:// URIs
-
GnarlyGenotyper now has haploid support
-
Lots of important bug fixes, including a fix for a bug in the Intel GKL that could cause output files to intermittently fail to be compressed properly
Full list of changes:
-
HaplotypeCaller
- HaplotypeCaller now supports custom ploidy regions (#8609)
  - Added a new argument to HaplotypeCaller called --ploidy-regions which allows the user to input a .bed or .interval_list with the "name" column equal to a positive integer for the ploidy to use when calling variants in that region
  - The main use case is for calling haploid variants outside the PAR for XY individuals as required by the VCF spec, but this provides a much more flexible interface for other similar niche applications, like genotyping individuals with other known aneuploidies
  - The global -ploidy flag will still provide the background default (or the built-in ploidy of 2 for humans), but the user-supplied values will supersede these in overlapping regions
- Changed the SmithWaterman implementation to default to FASTEST_AVAILABLE (#8485)
- Fixed a bug in pileup calling mode relating to the number of haplotypes (#8489)
- Huge simplification of genotyping likelihoods calculations, with no change in output (#6351)
- Be explicit about when variants are biallelic (#8332)
- Fixed debug log severity for read threading assembler messages (#8419)
- Fixed issue with visibility of the --dont-use-softclipped-bases argument (#8271)
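The override behavior described for --ploidy-regions can be pictured as a lookup: a position inside a user-supplied region takes that region's ploidy, and everything else falls back to the global default. The sketch below parses a BED-style table whose fourth ("name") column holds the ploidy. The coordinates are invented for illustration, and resolving multiple matching regions by taking the first one is an assumption of this sketch, not documented GATK behavior.

```python
from typing import List, Tuple

# contig, start (0-based), end (exclusive), ploidy
Region = Tuple[str, int, int, int]


def parse_ploidy_bed(text: str) -> List[Region]:
    """Parse tab-separated BED lines whose name column is a positive integer ploidy."""
    regions: List[Region] = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        contig, start, end, name = line.split("\t")[:4]
        ploidy = int(name)
        if ploidy <= 0:
            raise ValueError(f"ploidy must be a positive integer: {line!r}")
        regions.append((contig, int(start), int(end), ploidy))
    return regions


def ploidy_at(contig: str, pos0: int, regions: List[Region], default: int = 2) -> int:
    """Return the region ploidy covering a 0-based position, else the global default."""
    for region_contig, start, end, ploidy in regions:
        if region_contig == contig and start <= pos0 < end:
            return ploidy
    return default


# Invented coordinates; real PAR boundaries depend on the reference build.
BED_TEXT = "chrX\t5000\t9000\t1\nchrY\t0\t4000\t1\n"
regions = parse_ploidy_bed(BED_TEXT)

print(ploidy_at("chrX", 6000, regions))  # inside a haploid region
print(ploidy_at("chrX", 100, regions))   # outside any region, global default applies
```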
-
Mutect2
- Added a --base-qual-correction-factor argument to allow a scale factor to be provided to modify the base qualities reported by the sequencer and used in the Mutect2 substitution error model (#8447)
  - Set to zero to turn off the error model changes introduced in GATK 4.1.9.0
- Fixed a bug in FilterMutectCalls for GVCFs (#8458)
  - When using GVCFs with Mutect2 (for example with the Mitochondria mode), in the filtering step ADs for symbolic alleles are set to 0 so they don't contribute to overall AD. There was an off-by-one error that removed the alt allele AD rather than the <NON_REF> allele AD. This led to NaNs and errors when a site had no ref reads (for example a GT of [ref,alt,<NON_REF>] and AD of [0,300,0] would accidentally be changed to an AD of [0,0,0] if the alt index was removed instead of the <NON_REF> index).
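The off-by-one error in the FilterMutectCalls fix can be reproduced on the example from the note. The sketch below is illustrative Python, not the GATK source. The allele bases are hypothetical, and the buggy variant only mimics the described effect of zeroing the wrong index.

```python
from typing import List

NON_REF = "<NON_REF>"


def zero_symbolic_ad(alleles: List[str], ad: List[int]) -> List[int]:
    """Intended behavior: zero only the AD entry of the symbolic <NON_REF> allele."""
    return [0 if allele == NON_REF else depth for allele, depth in zip(alleles, ad)]


def zero_symbolic_ad_off_by_one(alleles: List[str], ad: List[int]) -> List[int]:
    """Mimics the bug: the index is shifted by one, so the alt AD is zeroed instead."""
    wrong_index = alleles.index(NON_REF) - 1
    return [0 if i == wrong_index else depth for i, depth in enumerate(ad)]


alleles = ["A", "T", NON_REF]  # hypothetical ref and alt bases
ad = [0, 300, 0]

print(zero_symbolic_ad(alleles, ad))             # alt depth is preserved
print(zero_symbolic_ad_off_by_one(alleles, ad))  # all depths become zero
```

With every depth at zero, any downstream fraction computed as depth over total depth divides zero by zero, which is consistent with the NaNs mentioned in the note.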
-
DRAGEN-GATK
- Added implementations of the "columnwise detection" and "PDHMM" (partially-determined HMM) features from DRAGEN to bring us much closer to functional equivalence with DRAGEN v3.7.8 (#8083)
- Development work to prepare the way for the final missing DRAGEN 3.7.8 feature, "joint detection":
  - Graph method for PDHMM event groups that unifies finding/merging and overlap/mutual exclusion (#8366)
  - Rewrote haplotype construction methods in PartiallyDeterminedHaplotypeComputationEngine (#8367)
  - More refactoring in PartiallyDeterminedHaplotypeComputationEngine and preparing for joint detection (#8492)
  - Innocuous housekeeping changes in the partially-determined haplotypes code (#8361)
  - Clarify cryptic bitwise operations in the partially-determined haplotype EventGroup subclass (#8400)
-
Joint Calling
- Added haploid support to GnarlyGenotyper (#7750)
- Fix to allow GenotypeGVCFs to properly handle events not in minimal representation (#8567)
- ReblockGVCF: added a --keep-site-filters argument to keep site-level filters (#8304) (#8308)
- ReblockGVCF: added a --add-site-filters-to-genotype argument to move site-level filters to genotype-level filters (#8484)
- ReblockGVCF: added a --format-annotations-to-remove argument to specify format-level annotations to remove from all genotypes in final GVCF (#8411)
- ReblockGVCF: added a check to make sure the input VCF is a GVCF rather than a single sample VCF (#8411)
- Improved an error message in GnarlyGenotyper (#8270)
- Added a mergeWithRemapping() method in ReferenceConfidenceVariantContextMerger to perform allele remapping prior to genotyping (#8318)
- GVS (Genomic Variant Store) development:
-
GenomicsDB
-
Funcotator
- New data source release V1.8 (#8512)
  - Updated Gencode to version 43, and also updated COSMIC, Clinvar, and several other datasources to their latest versions
  - The data sources are now split by reference into separate hg19 and hg38 bundles to cut down on size
- Fixed support for newer Gencode GTF versions by making the GencodeGTFField parsing more permissive (#8351)
- Fixed Funcotator VCF output renderer to correctly preserve B37 contig names on output for B37 aligned files (#8539)
- Fix bug in VCF comparison code that causes Funcotator to crash with certain datasources (#8445)
- Connected the splice site window size to CLI parameters (#8463)
- Allow LocatableXsvFuncotationFactory to read gzipped files (#8363)
-
CNV Calling
-
SV Calling
- Added support for breakend replacement alleles in SVCluster (#8408)
  - Implements allele collapsing for "breakend replacement" BND alleles, as described in section 5.4 of the VCFv4.2 spec
- Size similarity linkage and bug fixes for SV matching tools (#8257)
  - Added size similarity criterion to the SVConcordance and SVCluster tools. This is particularly useful for accurately matching smaller SVs that have a high degree of breakpoint uncertainty, in which case reciprocal overlap does not work well. PESR/mixed variant types must have size similarity, reciprocal overlap, and breakend window criteria met. Depth-only variants may have either size similarity + reciprocal overlap OR breakend window criteria met (or both).
- Updated SV split-read strand validation and clustering (#8378)
  - Adds some flexibility to the allowed split-read strand annotations on SV records:
    - Allow INS -+ strands
    - Allow INV null strands
    - When clustering, only require that strands match for INV/BND records
- Sample set and annotation improvements for SVConcordance (#8211)
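The linkage rules stated for #8257 can be written as a boolean combination of three criteria. In the sketch below, the definitions of size similarity (ratio of shorter to longer length), reciprocal overlap (shared bases over each length, taking the smaller fraction), and breakend window (both ends within a fixed distance), as well as all thresholds, are assumptions for illustration. The tools' actual definitions and defaults may differ, and the 4.6.1.0 notes (#8962) later tighten the depth-only rule.

```python
from typing import NamedTuple


class SV(NamedTuple):
    start: int  # 1-based
    end: int    # inclusive
    depth_only: bool

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def size_similarity(a: SV, b: SV) -> float:
    return min(a.length, b.length) / max(a.length, b.length)


def reciprocal_overlap(a: SV, b: SV) -> float:
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    if shared <= 0:
        return 0.0
    return min(shared / a.length, shared / b.length)


def breakends_within(a: SV, b: SV, window: int) -> bool:
    return abs(a.start - b.start) <= window and abs(a.end - b.end) <= window


def linked(a: SV, b: SV, min_size_sim: float = 0.5,
           min_overlap: float = 0.5, window: int = 100) -> bool:
    """Apply the 4.5.0.0 rule: stricter for PESR/mixed, either-or for depth-only."""
    size_ok = size_similarity(a, b) >= min_size_sim
    overlap_ok = reciprocal_overlap(a, b) >= min_overlap
    window_ok = breakends_within(a, b, window)
    if a.depth_only and b.depth_only:
        return (size_ok and overlap_ok) or window_ok
    return size_ok and overlap_ok and window_ok


# Same coordinates, evaluated once as depth-only calls and once as PESR calls.
depth_pair = (SV(1000, 1999, True), SV(1300, 2299, True))
pesr_pair = (SV(1000, 1999, False), SV(1300, 2299, False))

print(linked(*depth_pair))  # size and overlap suffice for depth-only
print(linked(*pesr_pair))   # breakends are 300 bp apart, outside the window
```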
-
Mitochondrial pipeline
-
Flow-based Calling
- New/updated flow-based read tools (#8579)
  - Added a new GroundTruthScorer tool to score reads against a reference/ground truth
  - Updated FlowFeatureMapper
- Created an AddFlowBaseQuality tool that writes reads from flow-based SAM/BAM/CRAM files that pass criteria to a new file while adding a base-quality attribute (BQ) (#8235)
- Added an experimental tool FlowPairHMMAlignReadsToHaplotypes that aligns flow-based reads to a set of haplotypes / templates (#8305)
- Fixed an issue with reads that contain the tp tag sometimes being incorrectly identified as flow-based (#8337)
- Minor changes and fixes to flow-based annotations (#8442)
- Removed a line in FlowBasedAnnotation that contained a bug and thus was meaningless (#8421)
- Additional annotation in FeatureMap (#8347)
- Removed unnecessary flow-based argument and option (#8342)
- GroundTruthScorer doc update (#8597)
- Removed unnecessary and buggy validation check (#8580)
-
Notable Enhancements
- Major security fixes in our dependencies and docker environment
- Greatly improved HTTP support (#8611)
  - Updated the http-nio library and made tweaks to HTSJDK to make it available in more places. The new version of http-nio should provide much more reliable access to http(s) file paths. This is supported by all methods accessing Paths, and includes SAM/BAM/CRAM and VCF/Feature file...
4.4.0.0
Download release: gatk-4.4.0.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.4.0.0 release:
-
We've moved to Java 17, the latest long-term support (LTS) Java release, for building and running GATK! Previously we required Java 8, which is now end-of-life.
- Newer non-LTS Java releases such as Java 18 or Java 19 may work as well, but since they are untested by us we only officially support running with Java 17.
-
Significant enhancements to SelectVariants, including arguments to enable GVCF filtering support and to work with genotype fields more easily.
-
A new tool, SVConcordance, that calculates SV genotype concordance between an "evaluation" VCF and a "truth" VCF
-
Bug fixes and enhancements to the support for the Ultima Genomics flow-based sequencing platform introduced in GATK 4.3.0.0
Full list of changes:
-
Flow-based Variant Calling
- FlowFeatureMapper: added surrounding-median-quality-size feature (#8222)
- Removed hardcoded limit on max homopolymer call (#8088)
- Fixed bug in dynamic read disqualification (#8171)
- Fixed a bug in the parsing of the T0 tag (#8185)
- Updated flow-based calling Mutect2 parameters to make them consistent with the HaplotypeCaller parameters (#8186)
-
SelectVariants
- Enabled GVCF type filtering support in SelectVariants (#7193)
  - Added an optional argument --ignore-non-ref-in-types to support correct handling of VariantContexts that contain a NON_REF allele. This is necessary because every variant in a GVCF file would otherwise be assigned the type MIXED, which makes it impossible to filter for e.g. SNPs.
  - Note that this only enables correct handling of GVCF input. The filtered output files are VCF (not GVCF) files, since reference blocks are not extended when a variant is filtered out.
- SelectVariants: added new arguments for controlling genotype JEXL filtering (#8092)
  - -select-genotype: with this new genotype-specific JEXL argument, we support easily filtering by genotype fields with expressions like 'GQ > 0', where the behavior in the multi-sample case is 'GQ > 0' in at least one sample. It's still possible to manually access genotype fields using the old -select argument and expressions such as vc.getGenotype('NA12878').getGQ() > 0.
  - --apply-jexl-filters-first: This flag is provided to allow the user to do JEXL filtering before subsetting the format fields, in particular the case where the filtering is done on INFO fields only, which may improve speed when working with a large cohort VCF that contains genotypes for thousands of samples.
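The multi-sample semantics described for -select-genotype ("in at least one sample") correspond to an any-of test over the genotypes at a site. The sketch below models that rule with plain dictionaries. It does not evaluate JEXL, and the second sample name is a placeholder added for illustration.

```python
from typing import Callable, Dict

GenotypeFields = Dict[str, int]


def passes_in_any_sample(genotypes: Dict[str, GenotypeFields],
                         predicate: Callable[[GenotypeFields], bool]) -> bool:
    """Keep the site if the genotype-level predicate holds for at least one sample."""
    return any(predicate(fields) for fields in genotypes.values())


# One sample fails 'GQ > 0', the other passes, so the site is kept.
site = {"NA12878": {"GQ": 0}, "SAMPLE_B": {"GQ": 35}}
keep = passes_in_any_sample(site, lambda g: g["GQ"] > 0)
print(keep)
```

Requiring the condition in one named sample is a different test, which is what the older -select form with vc.getGenotype(...) expresses.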
-
SV Calling
-
Notable Enhancements
- GenotypeGVCFs: added a --keep-specific-combined-raw-annotation argument to keep specified raw annotations (#7996)
- VariantAnnotator now warns instead of failing when the variant contains too many alleles (#8075)
- Read filters now output total reads processed in addition to the number of reads filtered (#7947)
- Added GenomicsDB arguments to the CreateSomaticPanelOfNormals tool (#6746)
- Added a DeprecatedFeature annotation and a process for officially marking GATK tools as deprecated (#8100)
- Prevent tool close() methods from hiding underlying errors (#7764)
-
Bug Fixes
- Fixed issue causing VariantRecalibrator to sometimes fail if user provided duplicate -an options (#8227)
- ReblockGVCF: remove A, R, and G length attributes when ReblockGVCF subsets an allele (#8209)
  - Previously if an input gVCF had allele length, reference length, or genotype length annotations in the FORMAT field, ReblockGVCF would not remove all of them at sites where an allele was dropped. This makes the output gVCF invalid since the annotation length no longer matches the length described in the header at those sites. Now we fix up F1R2, F2R1, and AF annotations and remove any other annotations that are not already handled that are defined as A, R, or G length in the header.
- Fixed a gCNV bug that breaks the inference when only 2 intervals are provided (#8180)
- Fixed NPE from uninitialized logger in GenotypingEngine (#8159)
- Fixed asynchronous Python exception propagation in StreamingPythonExecutor/CNNScoreVariants (#7402)
- Fixed issue in ShiftFasta where the interval list output was never written (#8070)
- Bugfix for the type of some output files in the somatic CNV WDL (#6735) (#8130)
- MergeAnnotatedRegions now requires a reference as asserted in its documentation (#8067)
-
Miscellaneous Changes
- Deprecated an untested VariantRecalibrator argument and an old ReblockGVCF argument that produced invalid GVCFs (#8140)
- Removed old GnarlyGenotyper code with a diploid assumption to prepare for adding haploid support to GnarlyGenotyper (#8140)
- ReblockGVCF: add error message for when tree-score-threshold is set but the TREE_SCORE annotation is not present (#8218)
- TransferReadTags: allow empty unaligned bams as input (#8198)
- Refactored JointVcfFiltering WDL and expanded tests. (#8074)
- Updated the carrot github action workflow to the most recent version, which supports using #carrot_pr to trigger branch vs master comparison runs (#8084)
- Replaced uses of File.createTempFile() with IOUtils.createTempFile() to ensure that temp files are deleted on shutdown (#6780)
- Don't require python just to instantiate the CNNScoreVariants tool classes. (#8128)
- Made several Funcotator methods and fields protected so it is easier to extend the tool (#8124) (#8166)
- Test for presence of ack result message and simplify ProcessControllerAckResult API (#7816)
- Fixed the path reported by the gatkbot when there are test failures (#8069)
- Fixed incorrect boolean value in DirichletAlleleDepthAndFractionIntegrationTest (#7963)
- Removed two ancient and unused HaplotypeCaller test files that are no longer needed (#7634)
- Added scattered gCNV case WDL to dockstore file (#8217)
-
Documentation
- Updated instructions for installing Java in the README (#8089)
- Added documentation on OMP_NUM_THREADS and MKL_NUM_THREADS to GermlineCNVCaller and DetermineGermlineContigPloidy (#8223)
- Improvements to PileupDetectionArgumentCollection documentation (#8050)
- Fixed typo in documentation for VariantAnnotator (#8145)
-
Dependencies
- Moved to Java 17, the latest LTS Java release, for building/running GATK (#8035)
- Updated Gradle to 7.5.1 (#8098)
- Updated the GATK base docker image to 3.0.0 (#8228)
- Updated HTSJDK to 3.0.5 (#8035)
- Updated Picard to 3.0.0 (#8035)
- Updated Barclay to 5.0.0 (#8035)
- Updated GenomicsDB to 1.4.4 (#7978)
- Updated Spark to 3.3.1 (#8035)
- Updated Hadoop to 3.3.1 (#8102)
- Require commons-text 1.10.0 to fix a security vulnerability (#8071)
4.3.0.0
Download release: gatk-4.3.0.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.3.0.0 release:
-
Support for the Ultima Genomics flow-based sequencing platform
-
A next-generation suite of tools for variant filtration based on site-level annotation, intended to eventually supersede the older VariantRecalibrator workflow
-
CompareReferences and CheckReferenceCompatibility: new tools for comparing and checking compatibility with genomic references
-
Support in HaplotypeCaller/Mutect2 for supplementing the variants discovered in local assembly with variants discovered via a pileup-based approach
Full list of changes:
-
Support for the Ultima Genomics flow-based sequencing platform (#7876)
- Added a new --flow-mode argument to HaplotypeCaller which better supports flow-based calling
  - Added a new Haplotype Filtering step after assembly which removes suspicious haplotypes from the genotyper
  - Added two new likelihood models, FlowBasedHMM and the FlowBasedAlignmentLikelihoodEngine
- Added a new --flow-mode argument to Mutect2 which better supports flow-based calling
- Added support for uncertain read end-positions in MarkDuplicatesSpark
- Added a new tool FlowFeatureMapper for quick heuristic calling of bams for diagnostics
- Added a new tool GroundTruthReadsBuilder to generate ground truth files for Basecalling
- Added a new diagnostic tool HaplotypeBasedVariantRecaller for recalling VCF files using the HaplotypeCallerEngine
- Added a new tool, SplitCram, for breaking up CRAM files by their blocks
- Added a new read interface called FlowBasedRead that manages the new features for FlowBased data
- Added a number of flow-specific read filters
- Added a number of flow-specific variant annotations
- Added support for read annotation-clipping as part of clipreads and GATKRead
- Added a new PartialReadsWalker that supports terminating before traversal is finished
-
Next-generation suite of tools for variant filtration based on site-level annotations (#7954) (#8049)
- This tool suite is intended to eventually supersede the older VariantRecalibrator workflow
- The new tools include:
  - ExtractVariantAnnotations: extracts site-level variant annotations, labels, and other metadata from a VCF file to HDF5 files
  - TrainVariantAnnotationsModel: trains a model for scoring variant calls based on site-level annotations
  - ScoreVariantAnnotations: scores variant calls in a VCF file based on site-level annotations using a previously trained model
-
New Reference Comparison Tools
- CompareReferences: a new tool for analyzing the differences between references at both the dictionary and the base level (#7930) (#7987) (#7973)
  - In its default mode, this tool uses the reference dictionaries to generate an MD5-keyed table comparing the specified references, and does an analysis to summarize the differences between the references provided.
  - Comparisons are made against a "primary" reference, specified with the -R argument. Subsequent references to be compared may be specified using the --references-to-compare argument.
  - A supplementary table keyed by sequence name can be displayed using the --display-sequences-by-name argument; to display only sequence names for which the references are not consistent, run with the --display-only-differing-sequences argument as well.
  - MD5s can be recalculated from the actual sequence when missing from the dictionary
  - When run with --base-comparison FULL_ALIGNMENT, the tool performs full-sequence alignment on the differing reference sequences to produce a VCF with SNPs and Indels. However, this mode ignores IUPAC / N bases.
  - Running with --base-comparison FIND_SNPS_ONLY finds single-base differences between differing reference sequences of the same length. This mode can handle IUPAC / N bases correctly, but not indels.
  - To perform the full-sequence alignment, GATK now packages a distribution of MUMmer for x86_64 Mac and Linux, which can be invoked from within the GATK using the new MummerExecutor class.
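To illustrate what an MD5-keyed comparison table looks like, here is a minimal Python sketch. It is not the CompareReferences implementation. The function names, the in-memory input layout, and the toy sequences are all hypothetical, and the sketch assumes the usual convention that a dictionary MD5 is computed over the upper-cased bases.

```python
import hashlib
from collections import defaultdict


def sequence_md5(bases):
    """MD5 of the upper-cased bases (assumed convention for dictionary MD5s)."""
    return hashlib.md5(bases.upper().encode("ascii")).hexdigest()


def md5_keyed_table(references):
    """Group sequences from several references by MD5.

    references: {reference_label: {sequence_name: bases}}  (hypothetical layout)
    returns:    {md5: {reference_label: sequence_name}}
    """
    table = defaultdict(dict)
    for label, sequences in references.items():
        for name, bases in sequences.items():
            table[sequence_md5(bases)][label] = name
    return dict(table)


# Toy data: the first sequence is identical in both references but named
# differently; the second differs by one base.
references = {
    "primary": {"chr1": "ACGT", "chrM": "GGCC"},
    "other": {"1": "acgt", "MT": "GGCA"},
}

for md5, names in md5_keyed_table(references).items():
    print(md5, names)
```

A row that lists every reference indicates a sequence shared by all of them, possibly under different names. A row that lists only one reference points at a sequence that differs or is absent elsewhere.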
- CheckReferenceCompatibility: a new tool to check a BAM/CRAM/VCF for compatibility against a set of references (#7959) (#7973)
  - This tool generates a table analyzing the compatibility of a BAM/CRAM/VCF input file against provided references.
  - The tool works to compare BAM/CRAMs (specified using the -I argument) as well as VCFs (specified using the -V argument) against provided reference(s), specified using the --references-to-compare argument.
  - When MD5s are present, the tool decides compatibility based on all sequence information (MD5, name, length); when MD5s are missing, the tool makes compatibility calls based only on sequence name and length.
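The MD5 fallback rule described for CheckReferenceCompatibility can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's code: SequenceRecord and records_compatible are hypothetical names, and the sketch assumes the fallback applies whenever either side lacks an MD5.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SequenceRecord:
    """One sequence dictionary entry (hypothetical container for this sketch)."""
    name: str
    length: int
    md5: Optional[str] = None  # None when the dictionary carries no MD5


def records_compatible(a, b):
    """Compare on MD5, name and length when MD5s exist, else on name and length."""
    if a.name != b.name or a.length != b.length:
        return False
    if a.md5 is not None and b.md5 is not None:
        return a.md5 == b.md5
    # Assumption: with an MD5 missing on either side, name and length decide.
    return True


input_record = SequenceRecord("chr1", 248956422)
reference_record = SequenceRecord("chr1", 248956422, "placeholder-md5")
print(records_compatible(input_record, reference_record))
```

A match made without MD5s is weaker evidence than one made with them, since two different sequences can share a name and a length.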
-
HaplotypeCaller/Mutect2
- Added an optional "Pileup Detection" step to Mutect2 and HaplotypeCaller before assembly that supplements the variants from local assembly with variants that show up in the pileups (#7432)
- Fixed a Mutect2 IndexOutOfBoundException with germline resource (#7979)
- Mutect3 dataset enhancements: optional truth VCF for labels, seq error likelihood annotation (#7975)
- Added Mutect3 dataset generation to the Mutect2 WDL (#7992)
- GetPileupSummaries now streams its output rather than storing it in memory (#7664)
- Fixed a rare edge case in the AdaptiveChainPruner where the JavaPriorityQueue is undefined for tied elements (#7851)
-
SV Calling
- CondenseDepthEvidence: a new tool that combines adjacent intervals in DepthEvidence files (#7926)
- LocusDepthtoBAF: a new tool that merges locus-sorted LocusDepth evidence files, calculates the bi-allelic frequency (baf) for each sample and site, and writes these values as a BafEvidence output file (#7776)
- PrintReadCounts: a new tool that prints (and optionally subsets) a read depth (DepthEvidence) file or a counts file as one or more (for multi-sample DepthEvidence files) counts files for CNV determination (#8015)
- CollectSVEvidence: fixed a bug where trailing SNP sites and depth intervals without read coverage were being omitted from the output (#8045)
- CollectSVEvidence: added read depth generation and raw-counts output (#8015)
- Improved PrintSVEvidence performance by tweaking the MultiFeatureWalker traversal (#7869)
- Fixes related to BafEvidence (biallelic-frequency of a sample at some locus) (#7861)
- Fixed a bug where the end coordinate was being incorrectly compared when sorting discordant read pair evidence (#7835)
- Sort output from SVClusterEngine (#7779)
- Remove abandoned SV filtering project and unneeded build dependency (#7950)
-
CNV Calling
-
GenomicsDB
- GenomicsDBImport: added the ability to specify explicit index locations via the sample name map file (#7967)
  - Each line in the sample name map file may now optionally contain a third column with the path/URI to the index. This is useful when the index is not in the same location as the corresponding GVCF.
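As a rough illustration of the two-column and three-column forms of the sample name map, the Python sketch below builds and parses such lines. It assumes tab-separated columns. The helper names and every sample name and path are hypothetical, and none of this is GATK code.

```python
def sample_map_line(sample, gvcf, index=None):
    """Build one sample name map line; the third (index) column is optional."""
    fields = [sample, gvcf]
    if index is not None:
        fields.append(index)
    return "\t".join(fields)


def parse_sample_map_line(line):
    """Split a line into (sample, gvcf, index); index is None when absent."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (2, 3):
        raise ValueError("expected 2 or 3 tab-separated columns, got %d" % len(fields))
    index = fields[2] if len(fields) == 3 else None
    return fields[0], fields[1], index


# Hypothetical entries: sample1 keeps its index next to the GVCF, while the
# index for sample2 lives in a different location and is listed explicitly.
lines = [
    sample_map_line("sample1", "gs://my-bucket/gvcfs/sample1.g.vcf.gz"),
    sample_map_line("sample2", "gs://my-bucket/gvcfs/sample2.g.vcf.gz",
                    "gs://my-indexes/sample2.g.vcf.gz.tbi"),
]
for line in lines:
    print(parse_sample_map_line(line))
```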
-
Bug Fixes
- Fixed an issue where we weren't properly merging AD values when combining GVCFs and no PLs were present (#7836)
- Fixed a bug in ReblockGVCF that could cause the first position on a contig to be dropped (#8028)
- Fixed an allele-ordering issue in the allele-specific annotation code (#7585)
- VariantRecalibrator: type change int -> long to prevent tranche novel variant count overflow (#7864)
- Fixed an issue with tabix index generation (#7858)
- Fixed a bug in SiteDepthCodec (#7910)
-
Miscellaneous Changes
- VariantsToTable now includes all fields when none are specified (#7911)
- SelectVariants now warns the user about poor performance when the sample names in the VCF header are unsorted (#7887)
- VariantRecalibrator now has a --dont-run-rscript argument to disable execution of its R script but still output the actual R script file (#7900)
- Added some generic read tag/expression filters for use on numeric tags (#7746)
- Replaced Travis CI with Github Actions for our continuous testing (#7754)
- Switched over to Github Actions for building our nightly docker image (#7775)
- Created a new build_docker_remote.sh script for building the docker image remotely with Google Cloud Build (#7951)
- Added an argument mode manager for group arguments and a demonstration of how it might be used in HaplotypeCaller --dragen-mode (#7745)
- Added unit tests for the Utils.concat() methods (#7918)
- Added a test to validate WDLs in the scripts directory. (#7826)
- Added a use_allele_specific_annotation arg and fixed task with empty input in the JointVcfFiltering WDL (#8027)
- Fixed an issue in the GATK stats script in which the first day's downloads on a new release were set to 0 (#7794)
- Fixed a typo in the Dockerfile that broke git lfs pull (#7806)
- Removed unused code in the utils.solver package (#7922)
- Corrected the time for GATK nightly build cron jobs (#7784)
- Disabled the red "X" from failing CodeCov builds and de...
4.2.6.1
Download release: gatk-4.2.6.1.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.2.6.1 release:
This release contains a single bug fix for GenotypeGVCFs to fix an erroneous IllegalStateException ("No likelihood sum exceeded zero -- method was called for variant data with no variant information.") in the edge case where unnormalized PLs are present at monomorphic sites.
4.2.6.0
Download release: gatk-4.2.6.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.2.6.0 release:
-
Important bug fixes for the joint calling tools (GenotypeGVCFs / GenomicsDB)
- GATK 4.2.5.0 contained two joint genotyping bugs that are now fixed in GATK 4.2.6.0:
  - GenotypeGVCFs can throw NullPointerExceptions in some cases with many alternate alleles.
  - The expectation-maximization component of the QUAL calculation was disabled, leading to false positive, low quality alleles at some multi-allelic sites.
- If you are running these tools in 4.2.5.0 we strongly recommend updating to 4.2.6.0
-
Fixed a "Bucket is a requester pays bucket but no user project provided" error that occurred when accessing requester pays buckets in Google Cloud Storage even when the --gcs-project-for-requester-pays argument was specified
- If you continue to encounter problems accessing requester pays Google Cloud Storage buckets in 4.2.6.0, please let us know by filing a Github issue!
-
Two new tools for the Structural Variation calling pipeline: SVAnnotate and PrintSVEvidence
-
Some fixes to genotype-given-alleles mode in HaplotypeCaller and Mutect2
Full list of changes:
-
Joint Calling (GenotypeGVCFs / GenomicsDB)
- GATK 4.2.5.0 contained two joint genotyping bugs which are now fixed in 4.2.6.0:
  - GenotypeGVCFs can throw NullPointerExceptions in some cases with many alternate alleles.
    - Fixed in:
      - Fix for NullPointerException when GenomicsDB has more ALT alleles than specified maximum and many GQ0 hom-ref genotypes allow variants to pass the QUAL filter (#7738)
  - The expectation-maximization component of the QUAL calculation was disabled, leading to false positive, low quality alleles at some multi-allelic sites.
    - Fixed in:
      - Fix multi-allelic QUAL calculation and restore some missing ALT annotation data in ReblockGVCFs (#7670)
- Mention acceptable compressed VCF file extensions in GenomicsDBImport error message (#7692)
-
SV Calling
- Added a new tool SVAnnotate (#7431)
  - SVAnnotate adds functional annotations for SVs called by GATK-SV (#7431)
- Added a new tool PrintSVEvidence (#7695)
  - PrintSVEvidence is a tool that can merge any number of files containing one of five types of evidence of structural variation. It's also capable of subsetting regions or samples. It's used to merge evidence from a cohort in the GATK-SV pipeline.
- Added start/end coordinate validation to SVCallRecord (#7714)
-
HaplotypeCaller / Mutect2
- Fixed an edge case in HaplotypeCaller where filtered alleles in the vicinity of forced-calling alleles could result in empty calls (#7740)
  - This affects users who run genotype given alleles mode in non-GVCF mode
- Fixed a bug in HaplotypeCaller and Mutect2 where force-calling alleles were lost upon trimming; allele injection now happens after trimming (#7679)
- Added a debug --pair-hmm-results-file argument that dumps the exact inputs/outputs of the PairHMM to a file (#7660)
- Some changes to Mutect2 to support the future Mutect3 (#7663)
  - Added training data for the Mutect3 normal artifact filter
  - Output tensors for Mutect3 as plain text rather than VCF
-
RNA Tools
- TransferReadTags: a new tool that transfers a read tag from an unaligned bam to the matching aligned bam (#7739).
  - This tool allows us to retrieve read tags that get lost when converting a SAM file to fastqs, then back to SAM (which is necessary if e.g. running fastp to clip adapter bases before alignment).
- PostProcessReadsForRSEM: a new tool that re-orders and filters reads before running RSEM, which has stringent requirements on the input SAM (https://github.com/deweylab/RSEM) (#7752).
-
Funcotator
- Added custom VariantClassification severity ordering. (#7673)
  - Users can now customize the severity ratings of the various VariantClassifications using the new --custom-variant-classification-order argument
- Added logging statements to the b37 conversion process explaining why the automatic b37 conversion does or does not take place on their VCFs (#7760)
-
VariantRecalibrator
- Added regularization to covariance in GMM maximization step to fix convergence issues in VariantRecalibrator (#7709)
  - This makes the tool more robust in cases where annotations are highly correlated
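To show why highly correlated annotations cause trouble and how regularization helps, here is a small Python sketch. The release note does not say which form of regularization #7709 uses. The sketch assumes one common choice, adding a small constant to the diagonal of the covariance matrix, and the function names and the epsilon value are hypothetical.

```python
def determinant_2x2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def regularize(cov, epsilon):
    """Return cov + epsilon * I (assumed form of regularization)."""
    return [
        [value + (epsilon if i == j else 0.0) for j, value in enumerate(row)]
        for i, row in enumerate(cov)
    ]


# Covariance of two perfectly correlated annotations (the second is exactly
# twice the first). The matrix is singular, so it cannot be inverted when
# evaluating a Gaussian density.
cov = [[1.0, 2.0], [2.0, 4.0]]

print(determinant_2x2(cov))
print(determinant_2x2(regularize(cov, 1e-3)))
```

With the diagonal term added, the determinant is strictly positive, so the matrix becomes invertible while the fitted Gaussian changes only slightly.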
-
Bug Fixes
- Fixed a "Bucket is a requester pays bucket but no user project provided" error that occurred when accessing requester pays buckets in Google Cloud Storage even when --gcs-project-for-requester-pays was specified (#7700) (#7730)
- Fix for the PossibleDeNovo annotation to work without Genotype Likelihoods (#7662)
  - PossibleDeNovo previously checked each trio's genotype (including parent hom ref genotypes) for likelihoods even though it doesn't actually use the PLs. The PLs can get dropped if GVCFs are reblocked, which meant this annotation no longer worked as expected. The check now looks for GQs instead of PLs, as the GQs are used as part of the annotation.
- Fixed a bug with the --mate-too-distant-length argument in MateDistantReadFilter not being configurable (#7701)
-
GATK Engine
-
Miscellaneous Changes
- Added back the jcenter repository resolver to our gradle build, fixing a "Could not find biz.k11i:xgboost-predictor:0.3.0" error when building GATK from source (#7665)
- We now properly update the latest tag in the broadinstitute/gatk-nightly Dockerhub repo (#7703)
- The docker build now only does a git lfs pull on src/main/resources/large (#7727)
- Install git lfs with --force in the Dockerfile (#7682)
- Fix WDL generation for MultiVariantWalkers by adding a companion index to the MultiVariantWalker input variant arg (#7689)
- Added google apps script to automatically update GATK release stats. (#7637)
- Updated the GATK stats script to be more universally usable (#7759)
- Added JointCallExomeCNVs to .dockstore.yml and included a note in the WDL (#7719)
-
Documentation
- Corrected the docs for the --heterozygosity argument in the GenotypeCalculationArgumentCollection (#7661)
-
Dependencies
4.2.5.0
Download release: gatk-4.2.5.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.2.5.0 release:
-
Fixed a GenotypeGVCFs IllegalStateException error reported by multiple users in #7639
-
Added a new tool SVCluster that clusters structural variants based on coordinates, event type, and supporting algorithms.
Full list of changes:
-
Joint Calling (GenotypeGVCFs / GenomicsDB)
- Fixed an IllegalStateException in GenotypeGVCFs arising from GenomicsDB output with too many alts and no likelihoods, and also added a --genomicsdb-max-alternate-alleles argument that is separate from the --max-alternate-alleles argument used by GenotypeGVCFs (#7655)
  - This fixes the GenotypeGVCFs error reported in #7639
  - The new --genomicsdb-max-alternate-alleles argument is required to be at least one greater than the --max-alternate-alleles argument, to account for the NON_REF allele.
- ReblockGVCF: fixed an edge case where hom-ref "variant" records with no data had wrong-sized PLs and didn't merge with adjacent blocks (#7644)
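The relationship between the two limits can be checked before launching a job. The Python sketch below is a hypothetical pre-flight check for a pipeline wrapper, not GATK code. The function name and the example values are assumptions, and only the "at least one greater" rule comes from the note above.

```python
def check_alt_allele_limits(max_alternate_alleles, genomicsdb_max_alternate_alleles):
    """Raise if the GenomicsDB limit leaves no room for the NON_REF allele."""
    required = max_alternate_alleles + 1  # one extra slot for NON_REF
    if genomicsdb_max_alternate_alleles < required:
        raise ValueError(
            "--genomicsdb-max-alternate-alleles (%d) must be at least %d when "
            "--max-alternate-alleles is %d"
            % (genomicsdb_max_alternate_alleles, required, max_alternate_alleles)
        )


# Example values chosen for illustration only.
check_alt_allele_limits(max_alternate_alleles=6, genomicsdb_max_alternate_alleles=7)
print("limits are consistent")
```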
-
SV Calling
- Added a new tool
SVClusterthat clusters structural variants based on coordinates, event type, and supporting algorithms. (#7541)- Primary use cases include:
- Clustering SVs produced by multiple callers, based on interval overlap, breakpoint proximity, and sample overlap.
- Merging multiple SV VCFs with disjoint sets of samples and/or variants.
- Defragmentation of copy number variants produced with depth-based callers.
- Primary use cases include:
- Added a new tool
-
Mutect2
-
GATK Engine
- Added a new read filter, ExcessiveEndClippedReadFilter (#7638)
  - This filter will keep reads that have fewer than the specified number of clipped bases on either end.
  - Designed with long reads in mind, and as a result has a default value of 1000.
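The filtering rule can be sketched in Python from a read's CIGAR string. This is an illustration, not the filter's implementation. It assumes that both soft (S) and hard (H) clips count, that each end is tested separately, and that "fewer than" is a strict comparison; the function names are hypothetical.

```python
import re

# CIGAR operations: a length followed by one operator character.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def end_clip_lengths(cigar):
    """Return (leading, trailing) clipped base counts for a CIGAR string."""
    ops = [(int(length), op) for length, op in _CIGAR_OP.findall(cigar)]

    def clipped(sequence):
        total = 0
        for length, op in sequence:
            if op not in "SH":  # stop at the first non-clipping operation
                break
            total += length
        return total

    return clipped(ops), clipped(reversed(ops))


def keep_read(cigar, max_clipped_bases=1000):
    """Keep the read only if both ends have fewer clipped bases than the limit."""
    leading, trailing = end_clip_lengths(cigar)
    return leading < max_clipped_bases and trailing < max_clipped_bases


print(end_clip_lengths("1500S8000M20S"))
print(keep_read("1500S8000M20S"))
print(keep_read("20S8000M20S"))
```

The default of 1000 in keep_read mirrors the default value stated above.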
4.2.4.1 the log4j strikes back
Download release: gatk-4.2.4.1.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.2.4.1 release:
- Fix more newly discovered log4j2 vulnerabilities. Now that people are paying attention they are finding all sorts of things.
Full list of changes:
-
Build System
- Upgrade our build from Gradle 5.6 to the newest 7.3.2 (#7609)
- This fixes some gradle bugs which were blocking development
-
GenomicsDB
-
Miscellaneous Changes
-
Dependencies