@ekg ekg commented Oct 20, 2025

Summary

Exposes all of AllWave's sparsification strategies through the CLI, following the AllWave naming model instead of our custom simplified interface.

New Sparsification Options

# All pairs (default)
--sparsify none

# Random sampling
--sparsify random:0.5          # 50% of pairs

# Connectivity-based (Erdős-Rényi)
--sparsify connectivity:0.95   # 95% probability that the sampled pair graph is connected

# Tree sampling (intelligent pair selection)
--sparsify tree:3,3,0.1,16     # k_nearest, k_farthest, random_fraction, kmer_size
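
For intuition on the connectivity option, here is a minimal sketch of how a target connectivity probability could map to a per-pair sampling probability. It uses the classical asymptotic Erdős-Rényi result (with p = (ln n + c) / n, the graph is connected with probability approaching exp(-exp(-c))). The function name `er_edge_probability` is hypothetical, and this is not necessarily the formula AllWave applies.

```rust
/// Approximate edge probability p for an Erdős-Rényi graph G(n, p) so that
/// the graph is connected with probability close to `target`.
/// Assumption: asymptotic formula only; AllWave may compute this differently.
fn er_edge_probability(n: usize, target: f64) -> Option<f64> {
    // Need at least two sequences and a probability strictly between 0 and 1.
    if n < 2 || !(target > 0.0 && target < 1.0) {
        return None;
    }
    let n_f = n as f64;
    // Solve exp(-exp(-c)) = target for c, giving c = -ln(-ln(target)).
    let c = -(-target.ln()).ln();
    // p = (ln n + c) / n, clamped because small n can push it above 1.
    let p = (n_f.ln() + c) / n_f;
    Some(p.clamp(0.0, 1.0))
}
```

Under this approximation, 1000 sequences with a 0.95 target need a per-pair probability of just under 1%, which is why connectivity-based sampling can be far sparser than a fixed random fraction on large inputs.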

Tree Sampling Format

tree:K,K2,F,SIZE where:

  • K = k-nearest neighbors per sequence (1 = minimum spanning tree, higher = denser)
  • K2 = k-farthest neighbors (strangers for diversity)
  • F = additional random sampling fraction (0.0-1.0)
  • SIZE = kmer size for Mash distance calculation (default: 16)

Example: tree:3,3,0.1,16 creates:

  • 3 nearest neighbor edges per sequence
  • 3 farthest neighbor edges per sequence
  • 10% additional random pairs
  • Uses 16-mer for distance calculation
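
The format above could be parsed roughly as follows. This is a sketch only, not the code from this PR: `TreeSpec` and `parse_tree_spec` are hypothetical names, and accepting a three-field form that falls back to SIZE = 16 is an assumption based on the documented default.

```rust
/// Parsed form of a `tree:K,K2,F,SIZE` specification (hypothetical type).
#[derive(Debug, PartialEq)]
struct TreeSpec {
    k_nearest: usize,
    k_farthest: usize,
    random_fraction: f64,
    kmer_size: usize,
}

/// Parse `tree:K,K2,F[,SIZE]`. Assumption: SIZE may be omitted, defaulting to 16.
fn parse_tree_spec(spec: &str) -> Result<TreeSpec, String> {
    let body = spec
        .strip_prefix("tree:")
        .ok_or_else(|| format!("expected 'tree:K,K2,F,SIZE', got '{}'", spec))?;
    let parts: Vec<&str> = body.split(',').map(str::trim).collect();
    if parts.len() < 3 || parts.len() > 4 {
        return Err(format!("expected 3 or 4 fields, got {}", parts.len()));
    }
    let k_nearest = parts[0]
        .parse::<usize>()
        .map_err(|e| format!("bad K '{}': {}", parts[0], e))?;
    let k_farthest = parts[1]
        .parse::<usize>()
        .map_err(|e| format!("bad K2 '{}': {}", parts[1], e))?;
    let random_fraction = parts[2]
        .parse::<f64>()
        .map_err(|e| format!("bad F '{}': {}", parts[2], e))?;
    // F is documented as a fraction in 0.0-1.0; this check also rejects NaN.
    if !(0.0..=1.0).contains(&random_fraction) {
        return Err(format!("F must be in [0, 1], got {}", random_fraction));
    }
    let kmer_size = match parts.get(3) {
        Some(s) => s
            .parse::<usize>()
            .map_err(|e| format!("bad SIZE '{}': {}", s, e))?,
        None => 16, // documented default k-mer size
    };
    Ok(TreeSpec { k_nearest, k_farthest, random_fraction, kmer_size })
}
```

Returning a `Result` with a field-specific message keeps malformed specs such as `tree:3,3,1.5,16` from silently falling through to another strategy.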

Iterative Mode Integration

Iterative mode now respects the user's --sparsify setting:

# Use custom tree sampling for iterative alignment
./seqrush -s input.fa -o output.gfa --iterative --sparsify tree:5,5,0.2,16

If no tree strategy is specified, iterative mode defaults to tree:3,3,0.1,16 and prints a note.
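
The fallback can be pictured as a small selection step. The sketch below assumes that any spec not starting with `tree:` triggers the default. `iterative_spec` and `ITERATIVE_DEFAULT` are hypothetical names, not identifiers from the seqrush source.

```rust
/// Default tree sampling for iterative mode, as stated in this PR.
const ITERATIVE_DEFAULT: &str = "tree:3,3,0.1,16";

/// Returns the spec iterative mode would use, plus a flag that is true when
/// the default was substituted (the caller would print the note in that case).
/// Assumption: only specs beginning with "tree:" count as tree sampling.
fn iterative_spec(user_spec: &str) -> (String, bool) {
    if user_spec.starts_with("tree:") {
        (user_spec.to_string(), false)
    } else {
        (ITERATIVE_DEFAULT.to_string(), true)
    }
}
```

Returning the flag instead of printing inside the function keeps the decision testable and leaves the wording of the note to the CLI layer.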

Changes

  • Changed default from '1.0' to 'none' (more intuitive)
  • Added comprehensive parsing for all AllWave strategies
  • Updated help text with examples
  • Iterative mode respects user's sparsification settings
  • Backward compatible: plain floats treated as random factors (with deprecation warning)
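
The backward-compatibility rule might look like the sketch below, which rewrites a legacy plain float into the new naming before the main parser runs. `normalize_legacy_spec` is a hypothetical helper. Mapping 1.0 to `none`, rejecting values outside (0, 1], and flagging every plain float as deprecated are assumptions, not confirmed behavior.

```rust
/// Rewrite a legacy plain-float spec into the new naming.
/// Returns the normalized spec and a flag that is true when the legacy form
/// was used, so the caller can emit the deprecation warning.
fn normalize_legacy_spec(spec: &str) -> Result<(String, bool), String> {
    let trimmed = spec.trim();
    match trimmed.parse::<f64>() {
        // Assumption: 1.0 means "align all pairs", the same as 'none'.
        Ok(f) if f == 1.0 => Ok(("none".to_string(), true)),
        // Any other fraction in (0, 1) becomes a random sampling factor.
        Ok(f) if f > 0.0 && f < 1.0 => Ok((format!("random:{}", f), true)),
        // Parsed as a number but out of range (this also catches NaN and inf).
        Ok(f) => Err(format!("sparsification factor {} is outside (0, 1]", f)),
        // Not a plain float: already a named strategy, pass it through.
        Err(_) => Ok((trimmed.to_string(), false)),
    }
}
```

With this shape, `--sparsify 0.5` and `--sparsify random:0.5` reach the rest of the parser as the same string, and only the former sets the warning flag.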

Testing

✅ Build successful
✅ All tests pass
✅ Help text updated
✅ Iterative mode uses parsed settings

Exposes all AllWave sparsification strategies through the CLI,
following the AllWave naming model:

- 'none' or '1.0' - align all pairs (default)
- 'auto' - automatic sparsification based on sequence count
- 'random:F' - random sampling with probability F (e.g., 'random:0.5')
- 'connectivity:F' - Erdős-Rényi sampling targeting probability F that the pair graph is connected
- 'tree:K,K2,F,SIZE' - tree sampling with:
  * K = k-nearest neighbors
  * K2 = k-farthest neighbors
  * F = additional random fraction
  * SIZE = kmer size for Mash distance (e.g., 'tree:3,3,0.1,16')

Changes:
- Updated --sparsify flag help text with all options
- Changed default from '1.0' to 'none' (more intuitive)
- Added parse_sparsification() method to handle all formats
- Iterative mode now respects user's sparsification setting
- Falls back to tree:3,3,0.1,16 for iterative mode if not specified
- Backward compatible: plain floats treated as random factors (with warning)

Examples:
  --sparsify none                # All pairs
  --sparsify random:0.5          # 50% random sampling
  --sparsify tree:3,3,0.1,16     # Tree sampling (default for iterative)
  --sparsify connectivity:0.95   # Erdős-Rényi for connectivity
@ekg ekg merged commit c785c34 into main Oct 20, 2025
2 checks passed
@ekg ekg deleted the expose-allwave-sparsification branch October 20, 2025 23:00