Buildable source tarball: wfmash-v0.22.0.tar.gz
Overview
wfmash v0.22.0 represents a significant evolution in our pangenome-scale alignment approach, featuring fundamental algorithmic changes to the mapping and alignment processes and a cleaner command-line interface. It reworks key parts of the mapping and alignment pipeline to improve reliability, sensitivity, and accuracy. This version introduces mutual-best-buddy based mapping chaining, smaller segment sizes for greater SV breakpoint sensitivity, scaffold-mapping based filtering to detect large homology regions, splitting of long mappings (which allows direct biWFA alignment), improved memory management with batch indexing, and a completely rewritten TaskFlow-based execution model that delivers substantial performance improvements.
Alignment Engine Improvements
Direct biWFA Integration
The most substantial change in this release is the transition from WFlign to biWFA as the default alignment algorithm. Previously, wfmash used a complex hierarchical approach with WFlign and several intermediate alignment steps, which sometimes led to inconsistent results and performance issues. The new direct biWFA implementation provides several benefits:
- Simpler, more reliable alignment process with fewer intermediate steps
- More consistent alignment results across diverse sequence types
- Improved handling of complex structural variations
- Better performance through vectorized code for lower divergence sequences
The alignment engine now uses a direct approach to apply biWFA to mappings found in the initial MashMap phase, resulting in cleaner alignments while reducing computational overhead.
Target Padding
We've implemented target padding around mapping boundaries to improve alignment quality. When requested through the new -E/--target-padding parameter, wfmash extends the reference sequence by a specified amount on both sides of the mapped region before alignment. This helps capture sequence context that might be missed with exact boundaries and ensures more complete alignments at sequence edges.
To maintain valid alignments, we employ coordinate swizzling during the alignment process, ensuring that any indels resulting from target padding are always placed at alignment boundaries, even though we're using global alignment with biWFA.
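The padding step can be illustrated with a minimal sketch. The function name `pad_target`, the half-open interval convention, and the clamping to sequence bounds are assumptions made for illustration; this is not wfmash's actual code.

```python
def pad_target(start, end, target_len, padding):
    """Extend a half-open target interval [start, end) by `padding` bases
    on each side, clamped to the sequence bounds [0, target_len].

    Hypothetical helper for illustration only.
    """
    if not (0 <= start <= end <= target_len):
        raise ValueError("interval must satisfy 0 <= start <= end <= target_len")
    if padding < 0:
        raise ValueError("padding must be non-negative")
    padded_start = max(0, start - padding)
    padded_end = min(target_len, end + padding)
    # Also report how much padding was really added on each side, so that
    # alignment coordinates can be mapped back to the original interval.
    left_pad = start - padded_start
    right_pad = padded_end - end
    return padded_start, padded_end, left_pad, right_pad
```

Near a sequence end the padding is truncated, so the amount added on each side can differ; returning both values keeps the later coordinate adjustment explicit.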
Mapping and Filtration Improvements
Fundamental Changes to Mapping and Chaining
One of the most significant architectural changes in this version is the complete rewrite of the mapping and chaining logic. These changes fundamentally alter how wfmash detects and represents sequence homology:
- Segment length defaults to 1kb (versus previous 5kb), allowing detection of much finer-grained homology
- Chain gap parameter now defaults to 2kb (versus previous 30kb), providing more precise control over what constitutes a chain
- Chain selection now uses a more sophisticated distance-based metric that considers both reference and query coordinates
- Chains are now properly tracked with unique identifiers, positions, and lengths throughout the process
- Chain information is preserved in output formats with chain:i:id.pos.length tags
- Improved merging logic respects maximum mapping length while maintaining chain integrity
- Better handling of divergent regions with intelligent chain splitting and statistics computation
- Support for merging chains on either forward or reverse strand with proper coordinate handling
These changes collectively result in much more accurate mapping chains that better represent the underlying biological reality, especially for sequences with complex evolutionary histories.
Scaffold-Based Mapping
A major advancement in this release is the introduction of scaffold-based mapping, which substantially improves our ability to detect and represent structural variations. The scaffolding process works by:
- Creating "super chains" from maximally merged mappings with aggressive gap parameters
- Using a rotated coordinate system to efficiently filter mappings based on their relationship to these scaffolds
- Employing a plane sweep algorithm with interval trees to identify mappings that fall within scaffold envelopes
This approach preserves mappings that contribute to larger structural patterns while filtering out spurious alignments. Users can control this process with the new -S/--scaffolding parameter which accepts gap size, minimum length, and maximum deviation values.
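The idea of filtering in a rotated coordinate system can be sketched as follows. This is a simplified illustration under stated assumptions: mappings and scaffolds are forward-strand tuples `(query_start, query_end, target_start, target_end)`, the envelope is an axis-aligned box in rotated space widened by `max_deviation`, and a plain linear scan stands in for the plane sweep with interval trees. All function names are hypothetical.

```python
import math

def rotate(query_pos, target_pos):
    """Rotate (query, target) by 45 degrees: u runs along the main
    diagonal, v is the perpendicular offset from it."""
    s = math.sqrt(2.0)
    return (query_pos + target_pos) / s, (target_pos - query_pos) / s

def in_envelope(mapping, scaffold, max_deviation):
    """True if the mapping midpoint lies within the scaffold's rotated
    bounding box, widened by max_deviation on every side."""
    mu, mv = rotate((mapping[0] + mapping[1]) / 2, (mapping[2] + mapping[3]) / 2)
    u0, v0 = rotate(scaffold[0], scaffold[2])
    u1, v1 = rotate(scaffold[1], scaffold[3])
    lo_u, hi_u = min(u0, u1), max(u0, u1)
    lo_v, hi_v = min(v0, v1), max(v0, v1)
    return (lo_u - max_deviation <= mu <= hi_u + max_deviation
            and lo_v - max_deviation <= mv <= hi_v + max_deviation)

def filter_by_scaffolds(mappings, scaffolds, max_deviation):
    """Keep mappings that fall inside at least one scaffold envelope.
    Linear scan for clarity; not the interval-tree sweep described above."""
    return [m for m in mappings
            if any(in_envelope(m, s, max_deviation) for s in scaffolds)]
```

In the rotated frame a collinear scaffold has a nearly constant v, so "distance from the scaffold" reduces to a one-dimensional comparison, which is what makes interval-based filtering attractive.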
Improved Chaining
The chaining logic has been completely rewritten to better handle complex genomic arrangements. The new approach:
- Focuses on finding optimal chain pairs based on precise distance metrics
- Uses shorter segment lengths (default 1kb versus previous 5kb) to capture finer-grained homology
- Implements more intelligent and flexible chain gap parameters (default 2kb vs. previous 30kb)
- Respects maximum mapping length constraints when merging chains
These changes allow wfmash to detect smaller structural variants directly in the mapping phase, rather than relying on the more complex and sometimes error-prone WFlign approach used in previous versions.
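To make the role of the chain gap and the maximum mapping length concrete, here is a deliberately simple greedy chaining sketch. It is not the mutual-best-buddy algorithm used by wfmash: it only considers forward-strand mappings, only compares each mapping with the most recent chain, and the `max_mapping_length` default is an arbitrary illustrative value. The 2kb chain gap default comes from the notes above.

```python
def chain_mappings(mappings, chain_gap=2000, max_mapping_length=50000):
    """Greedily chain forward-strand mappings.

    mappings: iterable of (query_start, query_end, target_start, target_end).
    A mapping joins the current chain when its gap to the chain's last
    mapping is within chain_gap on both axes and the merged query span
    stays within max_mapping_length (illustrative default).
    """
    chains = []
    for m in sorted(mappings):
        if chains:
            first, last = chains[-1][0], chains[-1][-1]
            q_gap = m[0] - last[1]
            t_gap = m[2] - last[3]
            merged_len = m[1] - first[0]
            if (0 <= q_gap <= chain_gap
                    and 0 <= t_gap <= chain_gap
                    and merged_len <= max_mapping_length):
                chains[-1].append(m)
                continue
        # Otherwise start a new chain with this mapping.
        chains.append([m])
    return chains
```

With a small gap threshold, a structural variant larger than the gap breaks the chain at the breakpoint, which is how finer segments and tighter gaps surface smaller variants at the mapping stage.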
Memory Optimization and Scalability
Batch Indexing
This release introduces batch indexing for reference sequences, a critical feature for working with very large genomes on memory-constrained systems. With the new -b/--batch parameter, wfmash:
- Partitions reference sequences into batches based on specified size
- Builds and processes indices for each batch independently
- Combines mapping results across all batches for consistent output
This enables processing of reference collections that would otherwise exceed available memory, such as large mammalian pangenome projects spanning terabases of sequence.
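The partitioning step above can be sketched as a greedy split by cumulative sequence length. The function name and the rule that an oversized sequence gets a batch of its own are assumptions for illustration, not a description of wfmash internals.

```python
def partition_into_batches(seq_lengths, batch_size):
    """Split (name, length) pairs into batches whose total length stays
    within batch_size where possible.

    A sequence longer than batch_size is placed in a batch by itself
    (an assumption of this sketch).
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batches, current, current_size = [], [], 0
    for name, length in seq_lengths:
        # Close the current batch if adding this sequence would overflow it.
        if current and current_size + length > batch_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += length
    if current:
        batches.append(current)
    return batches
```

Each batch would then be indexed and mapped independently, so peak memory is bounded by the largest batch rather than by the whole reference collection.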
Parallel Index Building
Index construction is now fully parallelized, with significant performance improvements:
- Parallel k-mer frequency counting across all sequences
- Thread-local processing during index construction
- Improved synchronization between indexing threads
- Better memory utilization through optimized data structures
The result is much faster index building, particularly for large reference collections, with reduced peak memory usage.
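The thread-local pattern mentioned above (each worker counts into its own structure, and the results are merged afterwards) can be illustrated with a short sketch. It uses plain substrings as k-mers and Python's standard thread pool; it shows the pattern only and makes no claim about wfmash's data structures or speed, since CPython threads will not give the same parallel speedup as native code.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_kmers(seq, k):
    """Count k-mers in one sequence into a private (worker-local) Counter."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def parallel_kmer_counts(sequences, k, threads=4):
    """Count k-mers per sequence in worker threads, then merge.

    No shared counter is touched during counting, so no locking is
    needed; the only sequential step is the final merge.
    """
    total = Counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for local in pool.map(lambda s: count_kmers(s, k), sequences):
            total.update(local)
    return total
```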
Performance Enhancements
TaskFlow-Based Execution Model
The execution engine has been completely rewritten using the TaskFlow library, replacing the previous atomic queue-based system. This new model provides:
- More efficient task scheduling with explicit dependencies
- Better load balancing across all available threads
- Improved pipeline parallelism for multi-stage processing
- Reduced thread contention and synchronization overhead
The TaskFlow implementation manages the entire workflow from reading input files to writing output, with appropriate parallelization at each stage.
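Scheduling with explicit dependencies can be sketched generically as below. This is not the TaskFlow API (TaskFlow is a C++ library); it uses Python's standard `graphlib` (Python 3.9+) and a thread pool to show the general idea that a task starts as soon as all of its prerequisites have finished. The function and stage names are hypothetical.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

def run_task_graph(tasks, deps, threads=4):
    """Run zero-argument callables respecting their dependencies.

    tasks: dict name -> callable
    deps:  dict name -> iterable of prerequisite task names
    Returns task names in the order they completed.
    """
    sorter = TopologicalSorter({name: deps.get(name, ()) for name in tasks})
    sorter.prepare()
    running, finished = {}, []
    with ThreadPoolExecutor(max_workers=threads) as pool:
        while sorter.is_active():
            # Submit every task whose prerequisites are all done.
            for name in sorter.get_ready():
                running[pool.submit(tasks[name])] = name
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for fut in done:
                fut.result()  # re-raise any exception from the task
                name = running.pop(fut)
                finished.append(name)
                sorter.done(name)  # may unlock dependent tasks
    return finished
```

A linear pipeline such as read, index, map, align, write is the simplest case; independent stages (for example, mapping different batches) would simply run concurrently.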
Enhanced FASTA I/O Performance
FASTA input/output operations now benefit from several optimizations:
- Integrated thread pooling for FASTA reading with optimized BGZF queue sizes
- Zero-copy sequence view processing with string_view
- Batch reading of input mapping files
- Efficient memory handling with custom allocators
These improvements significantly reduce I/O bottlenecks, especially for highly compressed reference files.
Command Line Interface Improvements
The command-line interface has been thoroughly reorganized for better usability:
- More logical grouping of related parameters
- Short options for commonly used parameters (e.g., -g for alignment scoring)
- Clearer parameter names that better reflect their functionality
- More descriptive help text with improved formatting
- Simplified parameter handling with sensible defaults
New parameters include:
- -S/--scaffolding for controlling scaffold-based mapping
- -E/--target-padding for reference sequence padding
- -b/--batch for controlling batch size in indexing
- -g/--wfa-params for alignment scoring configuration
Default values have been carefully tuned based on extensive testing across diverse genome types, providing good out-of-the-box performance for most use cases.
Output and Reporting Enhancements
Chain Information in PAF Output
The PAF output format now includes a chain:i field that exposes detailed information about mapping chains:
- Chain ID to identify mappings that belong together
- Position within the chain (1-based)
- Total length of the chain
This makes it easier to track related mappings and understand the structure of complex alignments, particularly when processing outputs with downstream tools.
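For downstream processing, the tag can be pulled out of a PAF record with a few lines of code. The sketch below assumes the chain:i:id.pos.length layout described in these notes, that all three components are integers, and that optional tags start after the 12 mandatory PAF columns. The function name is hypothetical.

```python
def parse_chain_tag(paf_line):
    """Return (chain_id, position, chain_length) from a PAF line,
    or None if the line carries no chain tag.

    Assumes the tag looks like chain:i:<id>.<pos>.<length> with
    integer components.
    """
    prefix = "chain:i:"
    # PAF has 12 mandatory columns; optional tags follow them.
    for field in paf_line.rstrip("\n").split("\t")[12:]:
        if field.startswith(prefix):
            chain_id, pos, length = field[len(prefix):].split(".")
            return int(chain_id), int(pos), int(length)
    return None
```

Grouping records by the returned chain ID and sorting by position then reconstructs each chain in order.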
Enhanced Progress Reporting
Progress reporting has been significantly improved:
- More accurate time estimates during long-running operations
- Detailed statistics about reference and query datasets
- Information about sequence groups and average sizes
- Clear reporting of filtering parameters and their effects
- Better error messages with more context and recovery options
These improvements make wfmash more informative during execution and help troubleshoot potential issues.
Other Notable Changes
- HTSlib thread pooling integration for better performance with compressed files
- Improved error handling and validation for all input parameters
- More robust handling of sequence naming and ID management
- Enhanced detection and processing of overlapping mappings
- Optimization of minimum hits calculation for better sensitivity
- Chain field exposed in PAF output for downstream processing
- Improved hypergeometric filtering with configurable parameters
Conclusion
wfmash v0.22.0 represents a substantial step forward in pangenome alignment capability, with fundamental improvements to core algorithms, significantly enhanced performance, and better memory efficiency. These changes enable more accurate alignment of complex genomic regions while making the tool more accessible for large-scale projects on diverse computing environments.
What's Changed
- Fix query end position by @ekg in #271
- Smooth chain by @ekg in #272
- smooth the introduction of max mapping length parameter by @ASLeonard in #269
- fix: invalid paf produced for some patch alignments by @kdm9 in #274
- Fix uint64_t underflow by @bkille in #280
- Map chunk query by @ekg in #277
- Super basic hypergeometric filter by @ekg in #284
- Subindexes in one file and remove frequent kmer filtering by @ekg in #282
- Parallel filter by @ekg in #285
- Log indexing by @ekg in #287
- biWFA it by @ekg in #288
- Flagtastrophe by @ekg in #290
- Tweak hits by @ekg in #292
- Freq filter yes by @ekg in #293
- Fix minmer filt by @ekg in #294
- Let me overlap by @ekg in #295
- Full precision cli by @ekg in #296
- Parallelize k-mer frequency counting and index building by @ekg in #297
- Logging target subset count and average size by @ekg in #298
- Reenable writing index only by @ASLeonard in #300
- Prefilter mappings to save memory in batched mapping by @ekg in #301
- Guix static build for wfmash. Fixing tests and working on guix instructions/scripts by @pjotrp in #302
- wfmash now builds with clang by @pjotrp in #304
- New guix build targets and shells. Fixes libasan and adds profiling. by @pjotrp in #305
- map: update mapping selection logic based on merge and split parameters by @AndreaGuarracino in #306
- apply mapping filter before it is too late by @AndreaGuarracino in #307
- merged mappings has to be <= map_mapping_length by @AndreaGuarracino in #308
- Revert on disabling tests and adding updated regression output by @pjotrp in #309
- update regression outputs by @AndreaGuarracino in #310
- Fix tests by @AndreaGuarracino in #313
- Allow big P when mappings are already computed by @AndreaGuarracino in #314
- fix MD cigar when using biwfa by @AndreaGuarracino in #315
- pad the target when aligning by @ekg in #312
- Group fix by @ekg in #317
- update GitHub Actions workflow to use ubuntu-latest by @AndreaGuarracino in #318
- Fix wfa params by @AndreaGuarracino in #320
- Scaffold mapping by @ekg in #319
- Scaffold maxi by @ekg in #321
- update WFA2-lib by @AndreaGuarracino in #322
- Taskflow map by @ekg in #323
- Taskflow align by @ekg in #324
- Multi index again by @ekg in #326
- Remove unused atomic_queue implementation and includes from aligner by @ekg in #327
- Expose chain:i field in the aligned PAF output by @AndreaGuarracino in #325
New Contributors
- @ASLeonard made their first contribution in #269
- @kdm9 made their first contribution in #274
- @pjotrp made their first contribution in #302
Full Changelog: v0.21.0...v0.22.0