Releases: waveygang/wfmash
memory correctness
Buildable source tarball: wfmash-v0.24.1.tar.gz
These changes ensure that the updates to the mapping system are memory safe. In the previously released version, some uninitialized variables were used in computation, which could lead to instability.
mapping scaffolding in less memory
Buildable source tarball: wfmash-v0.24.0.tar.gz
wfmash v0.24.0 Release
This release brings significant memory optimizations, improved mapping scaffolding capabilities, and enhanced ANI-based identity estimation.
Major Improvements
Memory Optimization
- Drastically reduced memory usage during mapping phase (~66% reduction)
- Optimized alignment phase with on-demand record loading
- Clean memory separation between mapping and alignment phases
- Optimized ANI sketching phase memory consumption
- New compact mapping structures for better memory efficiency
Mapping Scaffolding
- New 2D distance graph scaffolding algorithm for improved syntenic block detection
- Enhanced scaffold filtering with plane sweep optimization
- Configurable minimum scaffold length (default: 5kb)
- Support for scaffold mapping output via new options
- Better handling of boundary mappings for improved leniency
ANI-based Identity Estimation
- New ANI preset system (ani25, ani25-5, etc.) for automatic identity threshold selection
- Automatic identity estimation with -p auto
- Streaming MinHash implementation for efficient ANI computation
- Parallel ANI estimation with TaskFlow
- Per-group identity calculations with better CPU utilization
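To illustrate the idea behind MinHash-based identity estimation, here is a minimal Python sketch. It is not wfmash's implementation: the k-mer size, sketch size, hash function, and the Mash-style conversion from Jaccard similarity to identity are assumptions chosen for illustration, and all function names are hypothetical.

```python
import hashlib
import math

def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only, for simplicity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def sketch(seq, k=15, size=64):
    """Bottom-`size` MinHash sketch: the `size` smallest distinct k-mer hashes."""
    hashes = {
        int.from_bytes(hashlib.blake2b(km.encode(), digest_size=8).digest(), "big")
        for km in kmers(seq, k)
    }
    return sorted(hashes)[:size]

def jaccard(a, b, size=64):
    """Estimate Jaccard similarity from two bottom-k sketches."""
    union = sorted(set(a) | set(b))[:size]
    if not union:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in union if h in shared) / len(union)

def ani_estimate(j, k=15):
    """Mash-style identity estimate: 1 + ln(2j / (1 + j)) / k, clamped at 0."""
    if j <= 0:
        return 0.0
    return max(0.0, 1.0 + math.log(2 * j / (1 + j)) / k)
```

An estimate like this could then be used to pick an identity threshold, which is the role the -p auto option plays in the notes above; how wfmash derives its threshold from the estimate is not shown here.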
New Features
Build System
- Added VENDOR_HTSLIB CMake option for building without system htslib
- Updated WFA2-lib submodule integration
- Improved build optimization flags
Command Line Interface
- Redesigned CLI parameters for better usability
- Changed sketch parameter to -s (was -S)
- Changed window-size parameter to -w (was segLength)
- Updated default overlap threshold from 1.0 to 0.95
- Minimum L1 hits now defaults to 3 (configurable with -H)
- Map sparsification parameter for controlling mapping density
Performance
- ~25% speedup for small genomes through optimized reverseComplement function
- Per-group mutexes for better parallel scaling
- Thread-local reader functions for improved I/O performance
- Progress reporting for all pipeline phases
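The lookup-table approach to reverse complementing mentioned above can be illustrated with a short Python sketch. The table contents and the function name are assumptions for illustration; wfmash's own function is written in C++ and may treat ambiguous bases differently.

```python
# Translation table built once: maps each base to its complement.
# Bytes outside ACGTN/acgtn are left unchanged (an assumption of this sketch).
_COMPLEMENT = bytes.maketrans(b"ACGTNacgtn", b"TGCANtgcan")

def reverse_complement(seq: bytes) -> bytes:
    """Complement every base via a single table lookup, then reverse."""
    return seq.translate(_COMPLEMENT)[::-1]
```

Building the table once and indexing into it avoids a per-base branch, which is the general reason a lookup table tends to be faster than a chain of comparisons.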
Bug Fixes
- Fixed critical bug in MinHash sketch computation for groups
- Resolved stoi conversion errors with invalid records
- Fixed type conversion overflow in parameter handling
- Corrected mapping merge logic for query and reference spans
- Fixed boundary mapping criteria for better edge case handling
Technical Details
- Maintained chain identity information for ch:Z: tag
- Helper function to merge adjacent CIGAR operations
- Improved sequence loading patterns for ANI estimation
- Better progress reporting throughout all phases
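As an illustration of what merging adjacent CIGAR operations means, here is a minimal Python sketch. The function name and the string-based representation are hypothetical; the helper in wfmash operates on its own internal structures.

```python
import re

# One CIGAR element: a length followed by an operation character.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def merge_adjacent_cigar(cigar: str) -> str:
    """Collapse runs of the same operation, e.g. 3=2=1X becomes 5=1X."""
    merged = []  # list of [length, op] pairs
    for length, op in _CIGAR_OP.findall(cigar):
        if merged and merged[-1][1] == op:
            merged[-1][0] += int(length)  # same op as previous: extend it
        else:
            merged.append([int(length), op])
    return "".join(f"{n}{op}" for n, op in merged)
```

Merging matters when alignments are stitched from several pieces, since each piece may end and begin with the same operation.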
Contributors
Thanks to all contributors who made this release possible, with special mentions to those who worked on memory optimization, scaffolding improvements, and the ANI estimation system.
What's Changed
- Save mappings at the boundaries by @AndreaGuarracino in #350
- Created a submodule for WFA2-lib to reflect credits by @pjotrp in #351
- Improved ReverseComplement function to use a lookup table by @pjotrp in #352
- Update WFA2-lib subproject to latest commit 49c255df by @AndreaGuarracino in #353
- Sparsify again by @AndreaGuarracino in #360
- Fix type conversion in handy_parameter function to use int64_t for co… by @AndreaGuarracino in #363
- Lower density minmers by @ekg in #362
- fix: apply query padding consistently in total work calculation by @ekg in #364
- Merge adjacent CIGAR operations by @AndreaGuarracino in #365
- Resolved skipping invalid record: stoi by @unavailable-2374 in #367
- Low memory mapper by @ekg in #369
- improve readme to match current version by @ekg in #370
- -p from ANI estimate by @ekg in #371
- Add VENDOR_HTSLIB CMake option for building without system htslib by @ekg in #372
- squeeze down memory by @ekg in #373
New Contributors
- @unavailable-2374 made their first contribution in #367
Full Changelog: v0.23.0...v0.24.0
v0.23.0 - First HPRCv2-iteration release
Buildable source tarball: wfmash-v0.23.0.tar.gz
What's Changed
- fix: Update script to generate git version with additional source path by @AndreaGuarracino in #328
- Mapping memory cleanup by @ekg in #329
- Optimize mapping process for memory efficiency by @ekg in #330
- Alignment badness by @ekg in #331
- Fix index management when target and queries are in different files by @AndreaGuarracino in #332
- Thread safe faidx by @ekg in #333
- Head/tail patching + new alignment penalties by @AndreaGuarracino in #335
- Indicators progress logging by @ekg in #334
- Add option to disable alignment patching at chain boundaries by @AndreaGuarracino in #338
- query padding by @ekg in #339
- feat: Update WFA scoring parameters to [5,8,2,24,1] by @AndreaGuarracino in #340
- Apply target padding everywhere, while query padding only at the ends by @AndreaGuarracino in #341
- Add --progress-bar option by @AndreaGuarracino in #342
- fix: change batch size type from uint64_t to int64_t for consistency by @AndreaGuarracino in #343
- guix: fix profiler to run with two tests by @pjotrp in #336
- remove unused code by @AndreaGuarracino in #344
- refactor: update align progress meter to use shared_ptr for consistency by @AndreaGuarracino in #345
- refactor mergeMappingsInRange by @AndreaGuarracino in #346
- Fix progress again by @ekg in #348
Full Changelog: v0.22.0...v0.23.0
Refresh: mapping chaining, biWFA, scaffolding, and scaling
Buildable source tarball: wfmash-v0.22.0.tar.gz
Overview
wfmash v0.22.0 represents a significant evolution in our pangenome-scale alignment approach, featuring fundamental algorithmic changes to the mapping and alignment processes and a cleaner command line interface. It reworks key parts of the mapping and alignment pipeline to improve reliability, sensitivity, and accuracy. This version introduces mutual-best-buddy based mapping chaining, smaller segment sizes for greater SV breakpoint sensitivity, scaffold-mapping based filtering to detect large homology regions, splitting of long mappings (which allows direct biWFA alignment), improved memory management with batch indexing, and a completely rewritten TaskFlow-based execution model that delivers substantial performance improvements.
Alignment Engine Improvements
Direct biWFA Integration
The most substantial change in this release is the transition from WFlign to biWFA as the default alignment algorithm. Previously, wfmash used a complex hierarchical approach with WFlign and several intermediate alignment steps, which sometimes led to inconsistent results and performance issues. The new direct biWFA implementation provides several benefits:
- Simpler, more reliable alignment process with fewer intermediate steps
- More consistent alignment results across diverse sequence types
- Improved handling of complex structural variations
- Better performance through vectorized code for lower divergence sequences
The alignment engine now uses a direct approach to apply biWFA to mappings found in the initial MashMap phase, resulting in cleaner alignments while reducing computational overhead.
Target Padding
We've implemented target padding around mapping boundaries to improve alignment quality. When requested through the new -E/--target-padding parameter, wfmash extends the reference sequence by a specified amount on both sides of the mapped region before alignment. This helps capture sequence context that might be missed with exact boundaries and ensures more complete alignments at sequence edges.
To maintain valid alignments, we employ coordinate swizzling during the alignment process, ensuring that any indels resulting from target padding are always placed at alignment boundaries, even though we're using global alignment with biWFA.
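The padding step itself amounts to simple interval arithmetic. The following Python sketch shows one way to extend a mapped target interval while staying inside the sequence; the function name and the half-open coordinate convention are assumptions, and the coordinate swizzling described above is not modelled.

```python
def pad_target_interval(start, end, target_len, padding):
    """Extend the half-open interval [start, end) by `padding` bases on both
    sides, clamped so it never leaves the target sequence [0, target_len)."""
    padded_start = max(0, start - padding)
    padded_end = min(target_len, end + padding)
    return padded_start, padded_end
```

For example, padding the interval 100..200 on a 1000 bp target by 50 bp yields 50..250, while a mapping that already touches a sequence end is only extended on the other side.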
Mapping and Filtration Improvements
Fundamental Changes to Mapping and Chaining
One of the most significant architectural changes in this version is the complete rewrite of the mapping and chaining logic. These changes fundamentally alter how wfmash detects and represents sequence homology:
- Segment length defaults to 1kb (versus previous 5kb), allowing detection of much finer-grained homology
- Chain gap parameter now defaults to 2kb (versus previous 30kb), providing more precise control over what constitutes a chain
- Chain selection now uses a more sophisticated distance-based metric that considers both reference and query coordinates
- Chains are now properly tracked with unique identifiers, positions, and lengths throughout the process
- Chain information is preserved in output formats with chain:i:id.pos.length tags
- Improved merging logic respects maximum mapping length while maintaining chain integrity
- Better handling of divergent regions with intelligent chain splitting and statistics computation
- Support for merging chains on either forward or reverse strand with proper coordinate handling
These changes collectively result in much more accurate mapping chains that better represent the underlying biological reality, especially for sequences with complex evolutionary histories.
Scaffold-Based Mapping
A major advancement in this release is the introduction of scaffold-based mapping, which substantially improves our ability to detect and represent structural variations. The scaffolding process works by:
- Creating "super chains" from maximally merged mappings with aggressive gap parameters
- Using a rotated coordinate system to efficiently filter mappings based on their relationship to these scaffolds
- Employing a plane sweep algorithm with interval trees to identify mappings that fall within scaffold envelopes
This approach preserves mappings that contribute to larger structural patterns while filtering out spurious alignments. Users can control this process with the new -S/--scaffolding parameter which accepts gap size, minimum length, and maximum deviation values.
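To make the rotated-coordinate idea concrete, here is a simplified Python sketch. It is an illustration under several assumptions, not wfmash's algorithm: it uses a brute-force loop instead of a plane sweep with interval trees, handles forward-strand mappings only, tests only each mapping's midpoint, and all names (Mapping, rotate, filter_by_scaffolds, max_dev) are hypothetical.

```python
import math
from typing import NamedTuple

class Mapping(NamedTuple):
    q_start: int
    q_end: int
    t_start: int
    t_end: int

def rotate(q, t):
    """Rotate (query, target) by 45 degrees: u runs along the diagonal, v across it."""
    s = math.sqrt(2)
    return (q + t) / s, (t - q) / s

def filter_by_scaffolds(mappings, scaffolds, max_dev):
    """Keep mappings whose midpoint lies within max_dev of some scaffold's
    bounding box in rotated (u, v) coordinates."""
    kept = []
    for m in mappings:
        u, v = rotate((m.q_start + m.q_end) / 2, (m.t_start + m.t_end) / 2)
        for s in scaffolds:
            u0, v0 = rotate(s.q_start, s.t_start)
            u1, v1 = rotate(s.q_end, s.t_end)
            in_u = min(u0, u1) - max_dev <= u <= max(u0, u1) + max_dev
            in_v = min(v0, v1) - max_dev <= v <= max(v0, v1) + max_dev
            if in_u and in_v:
                kept.append(m)
                break  # one matching scaffold is enough
    return kept
```

In rotated coordinates a collinear scaffold becomes a narrow band along the u axis, so "close to the scaffold" reduces to two range checks; that is what makes a sweep over sorted intervals attractive at scale.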
Improved Chaining
The chaining logic has been completely rewritten to better handle complex genomic arrangements. The new approach:
- Focuses on finding optimal chain pairs based on precise distance metrics
- Uses shorter segment lengths (default 1kb versus previous 5kb) to capture finer-grained homology
- Implements more intelligent and flexible chain gap parameters (default 2kb vs. previous 30kb)
- Respects maximum mapping length constraints when merging chains
These changes allow wfmash to detect smaller structural variants directly in the mapping phase, rather than relying on the more complex and sometimes error-prone WFlign approach used in previous versions.
Memory Optimization and Scalability
Batch Indexing
This release introduces batch indexing for reference sequences, a critical feature for working with very large genomes on memory-constrained systems. With the new -b/--batch parameter, wfmash:
- Partitions reference sequences into batches based on specified size
- Builds and processes indices for each batch independently
- Combines mapping results across all batches for consistent output
This enables processing of reference collections that would otherwise exceed available memory, such as large mammalian pangenome projects spanning terabases of sequence.
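The partitioning step can be pictured with a small Python sketch. This is a greedy illustration under assumed semantics (batch size as a cap on total bases per batch, input order preserved); the function name is hypothetical and wfmash's actual batching policy may differ.

```python
def partition_into_batches(seq_lengths, batch_size):
    """Greedily group (name, length) pairs so each batch totals at most
    batch_size bases. A sequence longer than batch_size gets its own batch."""
    batches, current, current_bases = [], [], 0
    for name, length in seq_lengths:
        # Close the current batch if adding this sequence would exceed the cap.
        if current and current_bases + length > batch_size:
            batches.append(current)
            current, current_bases = [], 0
        current.append(name)
        current_bases += length
    if current:
        batches.append(current)
    return batches
```

Each batch would then be indexed and mapped on its own, with the per-batch results combined at the end, so peak memory is bounded by the largest batch rather than by the whole reference collection.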
Parallel Index Building
Index construction is now fully parallelized, with significant performance improvements:
- Parallel k-mer frequency counting across all sequences
- Thread-local processing during index construction
- Improved synchronization between indexing threads
- Better memory utilization through optimized data structures
The result is much faster index building, particularly for large reference collections, with reduced peak memory usage.
Performance Enhancements
TaskFlow-Based Execution Model
The execution engine has been completely rewritten using the TaskFlow library, replacing the previous atomic queue-based system. This new model provides:
- More efficient task scheduling with explicit dependencies
- Better load balancing across all available threads
- Improved pipeline parallelism for multi-stage processing
- Reduced thread contention and synchronization overhead
The TaskFlow implementation manages the entire workflow from reading input files to writing output, with appropriate parallelization at each stage.
Enhanced FASTA I/O Performance
FASTA input/output operations now benefit from several optimizations:
- Integrated thread pooling for FASTA reading with optimized BGZF queue sizes
- Zero-copy sequence view processing with string_view
- Batch reading of input mapping files
- Efficient memory handling with custom allocators
These improvements significantly reduce I/O bottlenecks, especially for highly compressed reference files.
Command Line Interface Improvements
The command-line interface has been thoroughly reorganized for better usability:
- More logical grouping of related parameters
- Short options for commonly used parameters (e.g., -g for alignment scoring)
- Clearer parameter names that better reflect their functionality
- More descriptive help text with improved formatting
- Simplified parameter handling with sensible defaults
New parameters include:
- -S/--scaffolding for controlling scaffold-based mapping
- -E/--target-padding for reference sequence padding
- -b/--batch for controlling batch size in indexing
- -g/--wfa-params for alignment scoring configuration
Default values have been carefully tuned based on extensive testing across diverse genome types, providing good out-of-the-box performance for most use cases.
Output and Reporting Enhancements
Chain Information in PAF Output
The PAF output format now includes a chain:i field that exposes detailed information about mapping chains:
- Chain ID to identify mappings that belong together
- Position within the chain (1-based)
- Total length of the chain
This makes it easier to track related mappings and understand the structure of complex alignments, particularly when processing outputs with downstream tools.
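For downstream tools, reading this field could look like the following Python sketch. It assumes the tag is written as chain:i:id.pos.length, as described in these notes, and appears among the optional fields after the 12 mandatory PAF columns; the ChainInfo type and function name are hypothetical.

```python
from typing import NamedTuple, Optional

class ChainInfo(NamedTuple):
    chain_id: int
    position: int  # 1-based position of this mapping within its chain
    length: int    # total length of the chain, as reported in the tag

def parse_chain_tag(paf_line: str) -> Optional[ChainInfo]:
    """Return the chain info from a PAF line, or None if no chain tag is present."""
    fields = paf_line.rstrip("\n").split("\t")
    for field in fields[12:]:  # optional tags follow the 12 mandatory columns
        if field.startswith("chain:i:"):
            chain_id, position, length = field[len("chain:i:"):].split(".")
            return ChainInfo(int(chain_id), int(position), int(length))
    return None
```

Grouping records by chain_id and sorting by position would then reconstruct each chain in order.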
Enhanced Progress Reporting
Progress reporting has been significantly improved:
- More accurate time estimates during long-running operations
- Detailed statistics about reference and query datasets
- Information about sequence groups and average sizes
- Clear reporting of filtering parameters and their effects
- Better error messages with more context and recovery options
These improvements make wfmash more informative during execution and help troubleshoot potential issues.
Other Notable Changes
- HTSlib thread pooling integration for better performance with compressed files
- Improved error handling and validation for all input parameters
- More robust handling of sequence naming and ID management
- Enhanced detection and processing of overlapping mappings
- Optimization of minimum hits calculation for better sensitivity
- Support for chain field exposed in PAF output for downstream processing
- Improved hypergeometric filtering with configurable parameters
Conclusion
wfmash v0.22.0 represents a substantial step forward in pangenome alignment capability, with fundamental improvements to core algorith...
high sensitivity mapping by default
Buildable source tarball: wfmash-v0.21.0.tar.gz
Previously, settings that might make runtime slightly better when aligning pangenomes hurt performance in comparative genomics contexts. Updates related to mashmap3 and alignment have made us much more robust to defaults that are more sensitive.
In this release, we're setting a bunch of defaults which have become standard in testing:
- Default minimum mapping identity reduced from 90% to 70%.
- Set maximum mapping length to 50k by default (previously unlimited).
- Changed block length default from 5x segment length to 3x segment length.
- Set default chain gap to 30kb (previously was 6x segment length, up to 30k).
- Reduced default segment length from 5k to 1k.
- Changed default kmer size from 19 to 15.
- Modified wflign to run on all fragments except very small ones (less than 1000 bp).
- Changed filtering logic to use Euclidean distance as an absolute cutoff instead of axis-weighted Euclidean distance, while still ranking based on axis-weighted distance.
These should tend to make wfmash more sensitive at the edges of its performance envelope with minimal costs for easy, low-divergence pangenome alignment problems.
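The interplay between the plain Euclidean cutoff and the axis-weighted ranking can be sketched in Python as follows. The exact weighting used by wfmash is not given in these notes, so the formula here (scaling the off-diagonal component by a fixed factor) and the names axis_weighted_distance, best_successor, and off_diagonal_weight are assumptions for illustration only.

```python
import math

def axis_weighted_distance(dq, dt, off_diagonal_weight=2.0):
    """Distance for an offset (dq, dt) between two mappings.

    The offset is split into an along-diagonal part u and an off-diagonal
    part v; v is scaled by off_diagonal_weight before taking the norm, so
    on-diagonal offsets keep their plain Euclidean length.
    """
    u = (dq + dt) / math.sqrt(2)
    v = (dt - dq) / math.sqrt(2)
    return math.hypot(u, off_diagonal_weight * v)

def best_successor(candidates, chain_gap, off_diagonal_weight=2.0):
    """Filter candidate offsets by plain Euclidean distance (absolute cutoff),
    then rank the survivors by axis-weighted distance."""
    within = [(dq, dt) for dq, dt in candidates if math.hypot(dq, dt) <= chain_gap]
    if not within:
        return None
    return min(within, key=lambda d: axis_weighted_distance(d[0], d[1], off_diagonal_weight))
```

With these definitions, an offset of (1000, 1000) is preferred over (1200, 0) even though the latter is closer in plain Euclidean terms, because the second offset lies entirely off the diagonal.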
chunking and gliding while head tail global patching
Buildable source tarball: wfmash-v0.20.0.tar.gz
Major Changes
New Global Alignment Approach:
- Replaced the previous head and tail patching with a comprehensive global alignment strategy.
- Implemented erode_head and erode_tail functions to remove small, potentially spurious matches at alignment boundaries.
- The alignment now aims to include the entire query sequence, crucial when using the -P option for chunking mappings.
- This change ensures continuity across the entire sequence, especially important when mappings are broken into smaller pieces for easier alignment.
- Switched from a semi-global approach (pinned at one end) to a fully global alignment, improving accuracy across the entire sequence length.
Improved Chaining Algorithm:
- Introduced an axis-weighted Euclidean distance function for more accurate chaining of mappings.
- This new function helps break mappings when encountering large indels, which can be computationally expensive to align.
- Improves detection of large structural variations directly from the mapping stage.
- Reduces spurious chaining in satellite repetitive sequences by considering the diagonal nature of true matches.
- The weighting maintains the original chain gap threshold for on-diagonal matches while effectively shortening the allowed distance for off-diagonal matches.
Mapping and Alignment Improvements:
- Modified the logic for determining cuttable positions in long alignments to avoid breaking alignments in the middle of structural variations (SVs).
- Adjusted the merging of consecutive mappings to be more selective, prioritizing the preservation of potential SV signals.
- Enhanced the handling of complex genomic structures by improving coordination between mapping and alignment stages.
Performance Optimization:
- Temporarily disabled multithreaded FASTA input processing due to thread safety issues with the samtools faidx reader.
- This change addresses memory efficiency concerns and prevents potential errors in multi-threaded environments.
- Future updates may reintroduce multi-threaded processing with improved memory management.
- Optimized the mapping process when not splitting sequences.
- Improved efficiency of long mapping handling, particularly when max mapping length is set to infinity.
Default Changes:
- Changed the default maximum mapping length (-P/--max-mapping-length) to infinity, allowing for longer continuous alignments when appropriate.
Minor Improvements and Bug Fixes
- Enhanced error handling and validation throughout the alignment process.
- Improved coordinate calculations, especially in edge cases involving sequence boundaries and large structural variations.
- Added additional PAF output fields, including a chain identifier for merged mappings.
- Adjusted parameters for more robust alignment in complex regions.
This release significantly improves wfmash's efficiency when handling complex genomic structures (e.g. centromeres) and large-scale variations, particularly when using the -P option to chunk mappings for more efficient alignment. While this option has been left unset by default, we strongly recommend exploring it if you find your alignment times are very slow. A good setting in testing has been -P50k.
Better broken mappings
Buildable source tarball: wfmash-v0.18.0.tar.gz
What's Changed
This release fixes a bunch of small issues with previous updates to the mapping merging and splitting logic.
The main update should improve mapping coverage by correctly calculating the block length of the mapping based on the pre-split mapping. We also correctly organize cuts to be in regions without SVs.
Full Changelog: v0.18.0...v0.19.0
Unfolding
Buildable source tarball: wfmash-v0.18.0.tar.gz
Improving mapping in complex regions, debugging recursive patching, and other fun.
Recursive Inversion Patching:
- Implemented recursive patching for inversions, completing the "multipatch" functionality.
- This allows for more accurate alignment of complex genomic regions with inversions.
SAM Output for Multipatch Alignments:
- Added support for SAM output format for multipatch alignments.
- Ensures consistent representation of complex alignments across different output formats.
Orientation-Consistent Alignments:
- Improved alignment consistency across all orientations of reference-query pairs.
- Enhances reliability and reproducibility of alignment results.
Optimized Inversion Patching:
- Implemented a bound on the maximum score for inverted patches.
- Allows for early termination of alignment when the inverted patch is worse than the forward alignment.
Dynamic Multi-Producer Alignment Module:
- Rewrote the alignment module to support multiple producers filling the work queue.
- Dynamically handles memory issues, improving efficiency and scalability.
Overlap Filtering in Plane Sweep Algorithm:
- Implemented an overlap filter to prevent keeping suboptimal mappings.
- New CLI option: -O, --overlap-threshold <F>
- Allows setting the fraction F for dropping mappings overlapping with higher scoring mappings.
- Default value is 0.5.
Long Mapping Fragmentation:
- Enabled breaking of long mappings into smaller fragments at junction points.
- Junctions are defined by four consecutive segments, allowing for more precise breakpoint detection around structural variations.
- New CLI option: -P, --max-mapping-length <N>
- Sets the maximum length of a single mapping before breaking.
- Default value is 1M (1 million bases).
Improved Handling of Satellite Sequences:
- The combination of overlap filtering, mapping fragmentation, and recursive patching significantly improves wfmash's ability to handle satellite sequences.
- These changes address common performance issues and mapping problems associated with highly repetitive regions.
- Users should expect better accuracy and efficiency when aligning genomes with abundant satellite sequences.
Performance Improvements:
- Various optimizations and code refactoring for better overall performance.
Bug Fixes and Minor Enhancements:
- Multiple bug fixes and small improvements throughout the codebase.
This release significantly enhances wfmash's ability to handle complex genomic structures, including challenging satellite sequences. It improves output consistency and optimizes performance for large-scale alignments. The new features and CLI options provide more accurate and detailed alignment information, particularly for regions with inversions, structural variations, and repetitive elements, while offering users greater control over the alignment process. These improvements make wfmash more robust and efficient for a wider range of genomic analyses, especially those involving highly repetitive or complex regions.
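The overlap filter introduced in this release can be pictured with the following Python sketch. It is a simplified stand-in, not the plane sweep used by wfmash: it runs a greedy quadratic loop over intervals on a single axis, measures overlap relative to the length of the lower-scoring mapping, and uses a hypothetical function name. Only the default threshold of 0.5 is taken from the notes above.

```python
def filter_overlaps(mappings, overlap_threshold=0.5):
    """Greedy filter over (start, end, score) intervals on one axis.

    Visit mappings from best to worst score and drop one if more than
    overlap_threshold of its length is covered by an already kept mapping.
    """
    kept = []
    for start, end, score in sorted(mappings, key=lambda m: -m[2]):
        length = end - start
        dropped = False
        for k_start, k_end, _ in kept:
            overlap = max(0, min(end, k_end) - max(start, k_start))
            if length > 0 and overlap / length > overlap_threshold:
                dropped = True
                break
        if not dropped:
            kept.append((start, end, score))
    return kept
```

In repetitive regions many near-identical mappings pile up on the same interval, so a filter of this kind removes most of them before the more expensive alignment stage.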
What's Changed
Full Changelog: v0.17.0...v0.18.0
Multipatch
Buildable source tarball: wfmash-v0.17.0.tar.gz
This release introduces multipatch alignment capabilities, significantly enhancing wfmash's ability to handle complex genomic structures, particularly inversions and other rearrangements. Multipatching refers to a process in which the initial wflign traceback is patched, we determine that an inverted orientation of the patch is preferable (as introduced in v0.16.0), and (in v0.17.0) we now attempt multiple patching steps to span the gap. Key improvements include:
Multipatch Alignment:
- Implemented a progressive alignment approach that can detect and align multiple patches, including inversions, within a single alignment region.
- Added a new tag patch:Z:true to indicate multipatch alignments in the output.
- Introduced an inv:Z:true/false tag to specify whether a patch is inverted.
Alignment Refinements:
- Implemented trimming of alignments to remove leading and trailing indels, improving alignment quality.
- Added bounds detection for alignments to better handle partial matches.
- Increased the default chain gap to 6x segment length or 30k, allowing for detection of larger variants.
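Trimming leading and trailing indels can be illustrated on a CIGAR string with the Python sketch below. The function name and the string representation are hypothetical, and only the start coordinates are adjusted here; an implementation would also need to pull the end coordinates in by the trimmed trailing operations.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def trim_flanking_indels(cigar, query_start, target_start):
    """Drop leading/trailing I and D operations and shift the start coordinates.

    Returns (trimmed_cigar, new_query_start, new_target_start).
    """
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    while ops and ops[0][1] in "ID":
        n, op = ops.pop(0)
        if op == "I":
            query_start += n   # an insertion consumes query bases only
        else:
            target_start += n  # a deletion consumes target bases only
    while ops and ops[-1][1] in "ID":
        ops.pop()
    return "".join(f"{n}{op}" for n, op in ops), query_start, target_start
```

After trimming, the alignment begins and ends on an aligned base, which is what keeps flanking gaps out of the reported coordinates.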
Output Enhancements:
- Modified the output format to clearly distinguish multipatch alignments.
- Improved logging and debugging output for better insight into the alignment process.
Code Improvements:
- Enhanced the alignment_t class with new accessors for query and target begin/end positions.
- Implemented pruning of overlapping patches to avoid redundant alignments.
- Refactored several core functions for better modularity and readability.
Build System:
- Added libdeflate as a dependency in the Guix build configuration.
This release significantly improves wfmash's ability to handle complex genomic alignments, particularly those involving local inversions and other structural variations. The multipatch approach allows for a more complete representation of genomic relationships in challenging regions than is available in other methods.
Happy aligning with enhanced structural variation breakpoint resolution! 🧬🔍🧮
What's Changed
- add deflate to guix.scm by @AndreaGuarracino in #258
- Multi-patch by @ekg in #259
Full Changelog: v0.16.0...v0.17.0
Inversion patching and mashmap3 index saving
Buildable source tarball: wfmash-v0.16.0.tar.gz
The primary enhancement in this release is the implementation of inversion detection during the alignment patching process. This feature significantly improves the alignment accuracy for sequences containing inversions.
How it works:
- Patching Process: During the wflign high-level trace patching, the algorithm identifies regions that do not align well in the forward orientation.
- Reverse Complement Alignment: For these poorly aligned regions, the algorithm attempts an alignment with the reverse complement of the sequence.
- Score Comparison: The algorithm compares the alignment scores of the forward and reverse complement alignments.
- Selection: If the reverse complement alignment produces a better score, it is selected for that region.
- Output: Reverse complement alignments are reported with an additional SAM tag rc:Z:true.
Key Components:
- New parameter wflign_min_inv_patch_len: Sets the minimum length of an inverted patch to be considered (default: 23).
- calculate_alignment_score function: Computes alignment scores based on the CIGAR string and penalties.
- Modified do_wfa_patch_alignment function: Now handles both forward and reverse complement alignments.
- Updated write_merged_alignment function: Processes and outputs reverse complement alignments.
This feature allows wfmash to accurately align sequences with inversions, improving its utility for complex genomic comparisons.
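The score comparison at the heart of this feature can be sketched in Python. This is not the calculate_alignment_score function from wfmash: the penalty values, the single gap-affine model, and the names alignment_penalty and choose_orientation are assumptions for illustration.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([=XIDM])")

def alignment_penalty(cigar, mismatch=4, gap_open=6, gap_extend=2):
    """Total gap-affine penalty of a CIGAR string (lower is better; matches cost 0)."""
    penalty = 0
    for n, op in _CIGAR_OP.findall(cigar):
        n = int(n)
        if op == "X":
            penalty += mismatch * n
        elif op in "ID":
            penalty += gap_open + gap_extend * n  # one opening plus n extensions
    return penalty

def choose_orientation(forward_cigar, reverse_cigar):
    """Pick the reverse-complement alignment only if it scores strictly better."""
    if alignment_penalty(reverse_cigar) < alignment_penalty(forward_cigar):
        return "reverse", reverse_cigar
    return "forward", forward_cigar
```

A patch chosen in the reverse orientation would then be the one reported with the rc:Z:true tag described above.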
Other Significant Changes
MashMap Index Support:
- Implemented creation and usage of MashMap indexes for faster repeat mapping.
- New CLI options: --mm-index, --create-index-only, --overwrite-mm-index.
Memory Optimization:
- Improved memory usage in the Sketch class.
Kmer Size Calculation:
- Modified to handle edge cases with high-identity alignments.
Alignment Class Improvements:
- Enhanced alignment_t class with proper copy and move semantics.
Index File Handling:
- Improved reading and writing processes with parameter validation.
Detailed Log of Changes
src/align/include/align_parameters.hpp
- Added wflign_min_inv_patch_len parameter to Parameters struct.
src/align/include/computeAlignments.hpp
- Integrated wflign_min_inv_patch_len into WFlign constructor call.
src/common/wflign/src/wflign.cpp and wflign.hpp
- Added min_inversion_length to WFlign constructor and member variables.
- Modified minhash_kmer_size calculation for edge cases.
src/common/wflign/src/wflign_alignment.cpp and wflign_alignment.hpp
- Implemented copy/move constructors and assignment operators for alignment_t.
- Added calculate_alignment_score function.
src/common/wflign/src/wflign_patch.cpp and wflign_patch.hpp
- Modified do_wfa_patch_alignment for reverse complement handling.
- Updated write_merged_alignment for reverse complement output.
- Refined patching process for bidirectional alignment consideration.
src/interface/parse_args.hpp
- Added CLI options for MashMap indexing and wflign_min_inv_patch_len.
src/map/include/map_parameters.hpp
- Added parameters for MashMap indexing support.
src/map/include/parseCmdArgs.hpp
- Updated parsing for new MashMap indexing options.
src/map/include/winSketch.hpp
- Implemented MashMap index functions (create, read, write).
- Added CLI-index file parameter validation.
- Optimized Sketch class memory usage.