Releases: fulcrumgenomics/fgbio
Release 0.6.0
Release 0.6.0 introduces the following changes to existing tools:
- ReviewConsensusVariants: output `PASS` when there are no filters on the variant; fix format of bases output
- MaskPrimers: improved usage documentation to make primer file format clearer
The following API changes were also introduced:
- Added constants to `SamRecord` for SAM/BAM related constant values
- NeedlemanWunchAligner renamed to Aligner (old name deprecated but still works)
  - Implemented Glocal (or semi-global) alignment mode
  - Implemented Local alignment mode
  - Fixed affine gap implementation
- Fixed `Alignment.subByQuery/subByTarget` to correctly handle adjacent deletions
- In metrics files, ensure 0.0 always formats as `0` and not `0E0`
- Updated how `Rscript` finds resources in the classpath to support local paths and absolute paths with and without leading slashes
Release 0.5.1
Release 0.5.1 is a minor bug-fix release and introduces the following changes:
- ExtractUmisFromBam
  - Improved error messaging
  - Fixed bug that prevented it from working when only one read per pair contained a UMI
- GroupReadsByUmi now adds the sub-sort `SS` tag to the header of BAMs produced
- CallMolecularConsensusReads and CallDuplexConsensusReads attempt to detect the sort order of input data and will fail if the sort order is incompatible
- DemuxFastqs changed some output metrics from 32-bit `Int` to 64-bit `Long` to avoid overflows on NovaSeq data
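The reason for widening a counter from 32 to 64 bits can be illustrated with a small sketch. Python integers never overflow, so the example emulates fixed-width signed integers with `ctypes`. The read count used here is a hypothetical figure chosen only because it exceeds the 32-bit limit; it is not taken from fgbio or from any instrument specification.

```python
import ctypes

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def as_int32(value: int) -> int:
    """Emulate storing `value` in a signed 32-bit integer (wraps on overflow)."""
    return ctypes.c_int32(value).value


def as_int64(value: int) -> int:
    """Emulate storing `value` in a signed 64-bit integer."""
    return ctypes.c_int64(value).value


# Hypothetical count, used only for illustration.
hypothetical_read_count = 3_000_000_000

wrapped = as_int32(hypothetical_read_count)  # silently wraps to a negative number
widened = as_int64(hypothetical_read_count)  # preserved exactly

print(f"32-bit: {wrapped}, 64-bit: {widened}")
```

A count that wraps to a negative value in a metrics file is easy to miss, which is why a wider type is the safer choice for per-run totals.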
Release 0.5.0
Release 0.5.0 introduces the following changes to existing tools:
- CallDuplexConsensusReads: Fixed a rare bug where the consensus base quality could be zero or one if the two strands' base qualities differ by two or less.
- FilterConsensusReads: Fix for bug where duplex reads formed from raw reads from a single strand only could be incorrectly filtered.
- CorrectUmis: Now stores the original UMI sequences in the `OX` tag upon correction.
- DemuxFastqs: Bug fix to correct quality scores in output BAM files
- ClipOverlappingReads: Removed previously deprecated tool. Use `ClipBam` instead.
- ClipBam:
  - Now optionally outputs metrics about clipping present in reads before and after execution.
  - New option to "upgrade" clipping, e.g. replace existing soft-clipping with hard-clipping
Changes to APIs were as follows:
- Various deprecated methods were removed this release.
- `Metric` formatting now prints smaller `Double`s in scientific notation, and the formatting is generally more efficient.
- `NeedlemanWunchAligner` gained a `Glocal` alignment mode for aligning all of a query sequence to a sub-region of a target sequence
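The idea behind a glocal (semi-global) mode, aligning the whole query to some sub-region of the target, can be sketched with a small dynamic-programming routine. This is an illustrative sketch and not fgbio's implementation: it uses a simple linear gap penalty rather than affine gaps, and the function name and scoring values are assumptions made for the example.

```python
from typing import Tuple


def glocal_score(query: str, target: str,
                 match: int = 1, mismatch: int = -1, gap: int = -2) -> Tuple[int, int]:
    """Best score for aligning ALL of `query` to ANY sub-region of `target`.

    Returns (score, target_end), where target_end is the exclusive offset in
    `target` at which the best alignment ends.
    """
    # Row 0 is all zeros: skipping a prefix of the target costs nothing.
    prev = [0] * (len(target) + 1)
    for i in range(1, len(query) + 1):
        # Column 0: query bases with no target bases must all be gaps.
        curr = [prev[0] + gap]
        for j in range(1, len(target) + 1):
            diag = prev[j - 1] + (match if query[i - 1] == target[j - 1] else mismatch)
            up = prev[j] + gap        # query base aligned against a gap
            left = curr[j - 1] + gap  # target base aligned against a gap
            curr.append(max(diag, up, left))
        prev = curr
    # Skipping a suffix of the target is also free: take the best column
    # of the last row instead of the bottom-right cell.
    end = max(range(len(prev)), key=lambda j: prev[j])
    return prev[end], end


score, end = glocal_score("ACGT", "TTTACGTTTT")
print(score, end)
```

The only differences from a global alignment are the zero-initialized first row and the maximum taken over the last row; the query itself is still consumed end to end.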
Release 0.4.0
Release 0.4.0 introduces the following changes to existing tools:
- CallDuplexConsensusReads
  - The single strand consensus bases and quals for each duplex consensus read are output into tags on the duplex consensus read
  - Added option to output consensus reads that are formed from only a single strand
- FilterConsensusReads
  - New option to filter out reads with low mean base quality
  - New option to filter out reads whose minimum depth is too low
  - New option to filter duplex consensus reads where the single strand consensuses disagree
  - New optional tags will store the single-strand consensus bases and qualities for duplex consensus reads.
- DemuxFastqs
  - will no longer output `/1` and `/2` on read names when running in Illumina standards mode
  - fixed a bug causing an exception when the sample barcode is found in multiple reads (ex. i5 and i7)
- ErrorRateByReadPosition - fixed bug that resulted in `C>G` errors being counted as `A>G` errors
- GroupReadsByUmi
  - Reads with UMIs with `N`s in them are now rejected
  - Log messages added with counts of reads filtered out by reason
  - Memory usage improvements when grouping reads at very, very high depth.
  - Supports enforcing a minimum UMI length and partial UMIs except for the `paired` strategy (duplex sequencing).
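The GroupReadsByUmi filtering rules listed above (rejecting UMIs that contain `N`, enforcing a minimum UMI length, and reporting counts per rejection reason) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and reason strings are hypothetical, only single (non-paired) UMIs are handled, and the sketch does not reproduce how fgbio treats partial UMIs.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple


def reject_reason(umi: str, min_umi_length: Optional[int] = None) -> Optional[str]:
    """Return why a UMI would be rejected, or None if it is accepted."""
    if "N" in umi.upper():
        return "contains N"
    if min_umi_length is not None and len(umi) < min_umi_length:
        return "too short"
    return None


def filter_umis(umis: Iterable[str],
                min_umi_length: Optional[int] = None) -> Tuple[List[str], Dict[str, int]]:
    """Split UMIs into those kept and a per-reason count of those rejected."""
    kept: List[str] = []
    rejected: Counter = Counter()
    for umi in umis:
        reason = reject_reason(umi, min_umi_length)
        if reason is None:
            kept.append(umi)
        else:
            rejected[reason] += 1
    return kept, dict(rejected)


kept, counts = filter_umis(["ACGT", "ACNT", "AC", "ACGTA"], min_umi_length=4)
for reason, n in counts.items():
    print(f"filtered {n} read(s): {reason}")
```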
Finally, changes to various APIs were as follows:
- Method in `Bams` to sort records by tag, or by a function applied to a tag
- Improve speed of `Metric.read` for loading large numbers of rows from metrics files
- Changed `SamSource` to extend `IterableView` instead of `Iterable` so that `map()`, `filter()`, etc. return lazy views
- Fixed a bug where the specified temporary directory was not being used for sorting.
- Added a `BinomialDistribution` class implemented using unlimited precision decimal math which is slower, but allows computation of cumulative probabilities where other implementations overflow or underflow
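The trade-off described for `BinomialDistribution` (slower arithmetic in exchange for results that neither overflow nor underflow) can be illustrated with a short sketch. Note the swapped-in technique: the example uses exact rational arithmetic from Python's `fractions` module rather than arbitrary-precision decimals, and the function names are assumptions for illustration, not fgbio's API.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(k: int, n: int, p: Fraction) -> Fraction:
    """Exact P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_cdf(k: int, n: int, p: Fraction) -> Fraction:
    """Exact P(X <= k), summed term by term with no rounding."""
    return sum((binomial_pmf(i, n, p) for i in range(k + 1)), Fraction(0))


# With 64-bit floats this probability underflows to exactly zero...
float_result = (1e-3) ** 200

# ...whereas exact arithmetic keeps the (tiny) non-zero value.
exact_result = binomial_pmf(200, 200, Fraction(1, 1000))

print(float_result == 0.0, exact_result > 0)
```

Exact arithmetic grows its numerators and denominators without bound, which is where the extra cost comes from.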
Release 0.3.0
Release 0.3.0 introduces the following changes to existing tools:
- ClipBam - The `--overlapping-reads` option was not being used internally, which caused overlapping reads to always be clipped. It is deprecated in favor of `--clip-overlapping-reads`.
- CollectDuplexSeqMetrics - Added the optional output of duplex-umi frequencies with `DuplexUmiMetric`s.
- DemuxFastqs - The default output sort order is changed from `Unsorted` to `Queryname`. Added an option `--illumina-standards` to output file names using Illumina naming conventions. Tuned the amount of memory used, especially for a large number of samples (>96).
- CallDuplexConsensusReads - No longer throws an exception when potential collisions are found in duplex molecules; instead, no consensus read is generated.
- FilterBam - added a few more filters.
- Added a global parameter for log-level.
In addition, the following new tools were added:
- CollectErccMetrics - This will collect metrics for analyzing ERCC spike-ins in RNA-Seq experiments for dose response but not fold-change response.
Finally, changes to various APIs were as follows:
- ReferenceSetBuilder - Moved to the `testing` packages for use in projects that extend `fgbio`.
- Alignment - Added `subByQuery()` and `subByTarget()` methods to `Alignment`.
Release 0.2.0
Release 0.2.0 introduces the following changes to existing tools:
- added global arguments accessible to all tools, which are given as arguments prior to the tool name:
  - `--tmp-dir`: directory to use for temporary files.
  - `--compression`: default GZIP compression level, BAM compression level.
  - `--async-io`: use asynchronous I/O where possible, e.g. for SAM and BAM files.
- numerous changes to the tool documentation to support output in MarkDown format.
- DuplexConsensusCaller:
  - adding logging statistics for DuplexConsensusCaller.
  - adding quality trimming.
  - improved the method for finding the set of "compatible" cigars used to select the reads from which a consensus is built
- DemuxFastqs:
  - the output directory should be created if it does not exist
  - change to the new quality format detector caused the detected encoding not to be printed
- ClipOverlappingReads is deprecated in favor of ClipBam.
- SampleSheet and ExtractBasecallingParamsForPicard
  - if the library identifier (`Library_Id` column) does not exist, it will default to the sample identifier (`Sample_Id` column); previously it defaulted to the sample name (`Sample_Name` column).
- HapCutToVcf: updated to support updated HapCut2 outputs.
  - the full FORMAT field in the VCF is printed, including trailing missing values.
In addition, the following new tools were added:
- FastqToBam: generates an unmapped BAM (or SAM or CRAM) file from fastq files.
- BuildToolDocs: generates the suite of per-tool MarkDown documents.
- SplitBam: splits a BAM into multiple BAMs, one per-read group (or library).
- ClipBam: clips reads from the same template; replaces ClipOverlappingReads.
- CollectDuplexSeqMetrics: generates metrics for duplex sequencing quality control.
Next, a new API for reading and writing SAM/BAM files, built around Scala idioms, was added:
- SamRecord: a replacement for htsjdk's `SAMRecord` with more scala-esque fields and methods.
- SamSource: a class for reading SAM/BAM/CRAM files and for querying them.
- SamWriter: a class for writing SAM/BAM/CRAM files and sorting them.
- SamOrder: a trait for specifying SAM/BAM orderings; in addition to `coordinate` and `queryname` sort orders, includes useful and novel sorts such as:
  - `random`: generates a random order over all the reads.
  - `randomquery`: generates a random order with `queryname` grouping.
  - `templatecoordinate`: the sort order used by `GroupReadsByUmi`; sorts reads by the earlier unclipped 5' coordinate of the read pair, followed by the higher unclipped 5' coordinate of the read pair.
  - `unsorted`: the official "unsorted" ordering.
  - `unknown`: the official "unknown" ordering.
- Bams: methods for manipulating sequences of `SamRecord`s and other useful utility methods.
  - contains sorting methods that have better disk-backed sorting than htsjdk's for faster sorting of SAM/BAM files.
- SamBuilder: a class for building SAM/BAM files and records; useful for generating test-cases for unit tests.
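The template-coordinate ordering described above can be sketched as a sort key built from the unclipped 5' coordinates of the two reads in a pair. This is a simplified illustration, not fgbio's implementation: the `Read` type and function names are hypothetical, and a real ordering would also need to account for fields such as the reference sequence, which are ignored here.

```python
import re
from typing import List, NamedTuple, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operators that advance along the reference
CLIPS = set("SH")             # soft and hard clipping operators


class Read(NamedTuple):
    """Hypothetical minimal read record used only for this sketch."""
    name: str
    start: int      # 1-based leftmost aligned position (SAM POS)
    cigar: str
    reverse: bool   # True if the read maps to the reverse strand


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def unclipped_five_prime(read: Read) -> int:
    """Reference coordinate of the read's 5' end as if clipped bases were aligned."""
    ops = parse_cigar(read.cigar)
    if not read.reverse:
        # Forward strand: the 5' end is on the left, so extend past leading clips.
        leading = 0
        for length, op in ops:
            if op not in CLIPS:
                break
            leading += length
        return read.start - leading
    # Reverse strand: the 5' end is on the right, so extend past trailing clips.
    ref_length = sum(length for length, op in ops if op in REF_CONSUMING)
    trailing = 0
    for length, op in reversed(ops):
        if op not in CLIPS:
            break
        trailing += length
    return read.start + ref_length - 1 + trailing


def template_key(r1: Read, r2: Read) -> Tuple[int, int]:
    """(earlier unclipped 5' coordinate, later unclipped 5' coordinate)."""
    a, b = unclipped_five_prime(r1), unclipped_five_prime(r2)
    return (min(a, b), max(a, b))


pairs = [
    (Read("q1", 100, "5S45M", False), Read("q1", 200, "40M10S", True)),
    (Read("q2", 90, "50M", False), Read("q2", 150, "50M", True)),
]
ordered = sorted(pairs, key=lambda pair: template_key(*pair))
print([pair[0].name for pair in ordered])
```

Using unclipped coordinates means that two pairs from the same original fragment sort together even when their reads were clipped differently.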
Finally the following other changes were made:
Release 0.1.4
Release 0.1.4 introduces the following changes to existing tools:
- CallMolecularConsensusReads
  - Added the ability to filter the maximum number of reads going into a consensus read
- CallMolecularConsensusReads and FilterConsensusReads
  - No longer have default values for their `--min-reads` and `--min-consensus-base-quality`/`--min-base-quality` parameters. The correct values for these parameters are highly library/coverage dependent and are best set by the user.
- CallMolecularConsensusReads and CallDuplexConsensusReads
  - Raw reads are end-trimmed for `N`s after low-quality masking, prior to consensus calling
  - Raw reads that are FR pairs with read length > insert size are trimmed to the insert size prior to consensus calling
- ErrorRateByReadPosition
  - Fixed a bug whereby the cumulative error plot produced in the PDF incorrectly started the R2 error count at the cumulative sum of the R1 error count.
  - Added the count of errors (in addition to error rate) to the output file
- FilterSomaticVcf
  - Now gracefully handles reads whose insert size and mapping information disagree. Warnings will be logged for all such reads, but the tool will not stop/exit upon finding such reads. Should reduce the frequency of "genomicPosition is outside of template" error messages
  - Works with VCFs that do not contain `#contig` lines in the header
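The order of operations described for raw reads (mask low-quality bases first, then trim `N`s from the end) can be sketched as below. This is a minimal illustration under assumptions: the function name and threshold parameter are hypothetical, "end" is taken to mean the trailing end of the read as stored, and the sketch does not claim to match how fgbio handles the masked bases' quality values.

```python
from typing import List, Tuple


def mask_and_trim(bases: str, quals: List[int],
                  min_base_quality: int) -> Tuple[str, List[int]]:
    """Mask bases below `min_base_quality` to N, then drop trailing Ns."""
    # Step 1: low-quality masking.
    masked = "".join(
        base if qual >= min_base_quality else "N"
        for base, qual in zip(bases, quals)
    )
    # Step 2: end-trimming of Ns (interior Ns are kept).
    trimmed_length = len(masked.rstrip("N"))
    return masked[:trimmed_length], quals[:trimmed_length]


bases, quals = mask_and_trim("ACGTAC", [30, 30, 5, 30, 2, 3], min_base_quality=10)
print(bases, quals)
```

Masking before trimming matters: a read whose tail is merely low quality, with no `N` calls from the sequencer, is still shortened.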
In addition the following new tools were added:
- DemuxFastqs: Performs sample demultiplexing on FASTQs
- CorrectUmis: Corrects UMI sequences in BAM files when a set of fixed UMIs (not random-mers) are used
Miscellaneous:
- Added support for cross-building scala 2.11 and 2.12
- Tools that invoke R scripts will now produce less noisy output
Release 0.1.3
Release 0.1.3 introduces the following changes to existing tools:
- CallMolecularConsensusReads now produces detailed information about consensus reads in new optional tags
- MakeTwoSampleMixtureVcf now propagates the ID field from the source VCF into the mixture VCF
- ErrorRateByReadPosition now masks out known variants, provides per-substitution type error rates and produces summary plots
- ReviewConsensusVariants now generates a detailed output file with a row per variant-supporting-read
In addition the following new tools were added:
- ClipOverlappingReads: clips alignments from read pairs whose alignments overlap
- FilterConsensusReads: filters consensus reads generated by CallMolecularConsensusReads
- EstimatePoolingFractions: estimates the fractional contribution of individual samples with known genotypes to a pooled sample
- EstimateRnaSeqInsertSize: estimates insert size distributions of RNA sequencing experiments in the presence of splicing
- CallDuplexConsensusReads: generates consensus reads from duplex-sequencing protocols that embed a UMI at the start of each read in a pair
- MakeMixtureVcf: generates a VCF for a mixture sample created from many individual samples
- FilterSomaticVcf: applies filters to VCFs of somatic variants
- RemoveSamTags: strips out optional tags/attributes from a SAM/BAM file to reduce size
- ExtractBasecallingParamsForPicard: parses an Illumina Experiment Manager sample sheet and generates the files needed to run Picard's basecalling tools
- ExtractIlluminaRunInfo: extracts information from Illumina's `RunInfo.xml` file into a simple tab-delimited table
fgbio release version 0.1.2
Release of fgbio that contains tools:
- `ErrorRateByReadPosition`: Calculates the error rate by read position on mapped BAMs.
- `ReviewConsensusVariants`: Extracts data to make reviewing of variant calls from consensus reads easier.
- `PickIlluminaIndices`: Picks a set of molecular indices that should work well together.
- `AssessPhasing`: Assess the accuracy of phasing for a set of variants.
- `AutoGenerateReadGroupsByName`: Adds read groups to a BAM file for a single sample by parsing the read names.
- `MakeTwoSampleMixtureVcf`: Tool to make a VCF with genotypes constructed by mixing the genotypes of two other samples.
Numerous bug fixes, performance improvements, and changes have been made to existing tools and classes. Refer to the commit history for such changes.
fgbio release version 0.1.1
Release of `fgbio` that contains tools:
- `HardMaskFasta`: Converts soft-masked sequence to hard-masked in a FASTA file.
- `TrimFastq`: Trims reads in one or more line-matched fastq files to a specific read length.
- `ExtractUmisFromBam`: Extracts unique molecular indexes from reads in a BAM file into tags.
- `FindTechnicalReads`: Find reads that are from technical or synthetic sequences in a BAM file.
- `RandomizeBam`: Randomizes the order of reads in a SAM or BAM file.
- `SetMateInformation`: Adds and/or fixes mate information on paired-end reads.
- `UpdateReadGroups`: Updates one or more read groups and their identifiers.
- `CallMolecularConsensusReads`: Calls consensus sequences from reads with the same unique molecular tag.
- `GroupReadsByUmi`: Groups reads together that appear to have come from the same original molecule.
- `HapCutToVcf`: Converts the output of HapCut to a VCF.