Please use the impg repository for future development and updates.
GFALace is a Rust tool that combines multiple GFA (Graphical Fragment Assembly) files into a single unified graph. It's designed for working with pangenome graphs that have been split into multiple files, where each file represents a specific pangenomic region.
Requires Rust 2021 edition or later. Install using:
cargo install --git https://github.com/pangenome/gfalace
Or build from source:
git clone https://github.com/pangenome/gfalace
cd gfalace
cargo build --release
Basic usage:
gfalace -g *.gfa -o combined.gfa
or
gfalace -g file1.gfa file2.gfa.gz file3.gfa -o combined.gfa
or from a file list:
gfalace -l gfa_list.txt -o combined.gfa
You can mix compressed (.gfa.gz, .gfa.bgz, .gfa.zst) and uncompressed (.gfa) files in the input.
The input GFA files can be provided in any order, because GFALace uses the coordinate information in the path names (CHROM:START-END) to determine the correct ordering and relationships between sequences.
- -g, --gfa-files: List of input GFA files (space-separated)
- -l, --gfa-list: Text file containing GFA paths (one per line)
- -o, --output: Output GFA file path
- --compress: Output compression format: none, gzip, bgzip, zstd, or auto (default: auto-detect from extension)
- --fill_gaps: Gap filling mode (0 = none [default], 1 = middle gaps only, 2 = all gaps)
- --fasta: FASTA file containing sequences for gap filling
- --temp-dir: Directory for temporary files
- -t, --num-threads: Number of threads (default: 4)
- -h, --help: Show help information
- -V, --version: Show version information
GFALace expects path names in the format:
NAME:START-END
Example: HG002#1#chr20:1000-2000
The tool uses these coordinates to:
- Identify which sequences belong together
- Order the sequences correctly
- Detect and handle overlaps or gaps
Note: NAME can contain ':' characters. When parsing coordinates, GFALace uses the last occurrence of ':' to separate the name from the coordinate range.
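The last-':' rule can be illustrated with a short Rust sketch. This is not taken from the GFALace source: the `PathRange` struct and the `parse_path_name` function are hypothetical names, and rejecting ranges where START is greater than END is an assumption made only for this example.

```rust
/// Parsed form of a path name such as `HG002#1#chr20:1000-2000`.
/// (Hypothetical type, used only for this illustration.)
#[derive(Debug, PartialEq)]
struct PathRange {
    name: String,
    start: u64,
    end: u64,
}

/// Split `NAME:START-END` at the LAST ':' so that NAME may itself contain ':'.
/// Returns None when the coordinate part is missing or malformed.
fn parse_path_name(path_name: &str) -> Option<PathRange> {
    // Everything before the last ':' is the name, everything after is the range.
    let (name, coords) = path_name.rsplit_once(':')?;
    let (start, end) = coords.split_once('-')?;
    let start: u64 = start.parse().ok()?;
    let end: u64 = end.parse().ok()?;
    // Assumption for this sketch: an empty name or an inverted range is invalid.
    if name.is_empty() || start > end {
        return None;
    }
    Some(PathRange {
        name: name.to_string(),
        start,
        end,
    })
}
```

With this sketch, a name such as `a:b:5-10` is split into the name `a:b` and the range 5-10, while a path name without a ':' is reported as unparseable.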
- Combines multiple GFA files while preserving path information
- Parallel processing for improved performance
- Translates node IDs to avoid conflicts
- Creates edges between contiguous path segments
- Handles both contiguous and non-contiguous ranges
- Preserves original sequence and path relationships
- Outputs a standard-compliant GFA 1.0 file
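To make the node ID translation concrete, here is a minimal Rust sketch of one way such a translation could work. It is an assumption for illustration, not GFALace's actual implementation: the `IdTranslator` type and its `translate` method are hypothetical, and the scheme simply hands out fresh sequential IDs keyed by (input file index, original node ID).

```rust
use std::collections::HashMap;

/// Hypothetical translator: maps (input file index, original node ID)
/// to a new ID that is unique across all input files.
struct IdTranslator {
    next_id: u64,
    map: HashMap<(usize, u64), u64>,
}

impl IdTranslator {
    fn new() -> Self {
        // New IDs start at 1 and grow by one for every unseen node.
        Self { next_id: 1, map: HashMap::new() }
    }

    /// Return the new ID for a node, allocating one on first sight.
    fn translate(&mut self, file_index: usize, original_id: u64) -> u64 {
        let next_id = &mut self.next_id;
        *self.map.entry((file_index, original_id)).or_insert_with(|| {
            let id = *next_id;
            *next_id += 1;
            id
        })
    }
}
```

In this scheme, node 5 from the first file and node 5 from the second file receive different new IDs, and asking again for the same (file, node) pair returns the ID allocated earlier.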
After combining the GFA files, the resulting graph will already have compacted node IDs ranging from 1 to the total number of nodes. However, it is strongly recommended to perform post-processing steps using ODGI to unchop and sort the graph.
odgi unchop -i combined.gfa -o - -t 16 | \
odgi sort -i - -o - -p gYs -t 16 | \
odgi view -i - -g > combined.final.gfa
If overlaps were present and were trimmed during merging, it is advisable to run GFAffix before the ODGI pipeline to remove redundant nodes introduced by the overlap trimming.
gfaffix combined.gfa -o combined.fix.gfa &> /dev/null
odgi unchop -i combined.fix.gfa -o - -t 16 | \
odgi sort -i - -o - -p gYs -t 16 | \
odgi view -i - -g > combined.final.gfa
Filling middle gaps with Ns:
gfalace -g *.gfa -o combined.gfa --fill_gaps 1
Filling all gaps with pangenome sequences:
gfalace -g *.gfa -o combined.gfa --fill_gaps 2 --fasta pangenome.fasta
GFALace provides options to fill gaps between graphs based on the specified gap filling mode:
- Mode 0 (Default): No gap filling. Gaps between segments are left unfilled.
- Mode 1: Fills middle gaps between contiguous ranges with N characters or with sequences provided with --fasta. Useful for connecting segments that are not directly connected but belong to the same path.
- Mode 2: Fills all gaps, including start and end gaps, with N characters or with sequences provided with --fasta. To fill end gaps, the FASTA is required.
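The difference between the three modes can be sketched in Rust as a function that lists the gaps each mode would consider. This is a simplified illustration, not GFALace's code: `GapKind`, `Gap`, and `find_gaps` are hypothetical names, ranges are assumed to be 0-based and half-open, and the total sequence length (`seq_len`) is assumed to come from the FASTA file, which is why an end gap is reported only when that length is known.

```rust
/// Where a gap sits relative to the ranges covered by the input graphs.
#[derive(Debug, PartialEq, Clone, Copy)]
enum GapKind {
    Start,
    Middle,
    End,
}

/// A gap covering positions `from..to` (assumed half-open for this sketch).
#[derive(Debug, PartialEq)]
struct Gap {
    kind: GapKind,
    from: u64,
    to: u64,
}

/// List the gaps that a given fill mode would consider for one path name.
/// mode 0: none, mode 2: start, middle and end gaps,
/// any other value is treated like mode 1 (middle gaps only).
fn find_gaps(ranges: &[(u64, u64)], seq_len: Option<u64>, mode: u8) -> Vec<Gap> {
    let mut gaps = Vec::new();
    if mode == 0 || ranges.is_empty() {
        return gaps;
    }
    // Input order does not matter: sort the ranges by start coordinate.
    let mut sorted = ranges.to_vec();
    sorted.sort_unstable();

    // Start gap: from position 0 to the first covered position (mode 2 only).
    if mode == 2 && sorted[0].0 > 0 {
        gaps.push(Gap { kind: GapKind::Start, from: 0, to: sorted[0].0 });
    }
    // Middle gaps: uncovered stretches between consecutive ranges.
    let mut covered_end = sorted[0].1;
    for &(start, end) in &sorted[1..] {
        if start > covered_end {
            gaps.push(Gap { kind: GapKind::Middle, from: covered_end, to: start });
        }
        covered_end = covered_end.max(end);
    }
    // End gap: needs the full sequence length, which only the FASTA can supply.
    if mode == 2 {
        if let Some(len) = seq_len {
            if covered_end < len {
                gaps.push(Gap { kind: GapKind::End, from: covered_end, to: len });
            }
        }
    }
    gaps
}
```

For ranges 100-200 and 300-400 on a sequence of length 500, this sketch reports only the middle gap 200-300 in mode 1, and additionally the start gap 0-100 and the end gap 400-500 in mode 2.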
MIT License - See LICENSE file for details.