marbl/TTT

TTT stands for Trivial Tangle Traverser. This tool generates "not terrible" traversals through repetitive genomic tangles that roughly match the coverage and the read alignments.

For help run ./TTT.py --help

Requires Python ≥ 3.7 and the dataclasses, pulp, ahocorasick, networkx, statistics, and logging Python libraries.

Slides explaining algorithmic details

UNDER CONSTRUCTION!

Example usage:

./TTT.py --graph assembly.gfa --alignment reads.gaf --output results_dir --boundary-nodes boundary_nodes.tsv --quality-threshold 20

Will TTT help with a gap in my scaffold?

Generally there are three main reasons for gaps in a scaffold:

  • Lack of coverage

    TTT searches for the "best" path in the assembly graph that traverses the gap. If there is no such path because of a coverage gap, nothing can be done.

    gap

    Scaffold <utig4-1497[N100000N:scaffold]<utig4-340 — nothing can be done

  • Long homozygous nodes

    Such gaps happen when the reads are shorter than the homozygous nodes. The typical structure is a sequence of "bubbles" of similar length, interlaced with long homozygous nodes. TTT can be run on such tangles. But if those structures are left unresolved in the assembly graph (especially if the homozygous nodes are longer than ~100 kbp, homopolymer-compressed), then the read alignments usually carry no information that helps to traverse the region, and the result will essentially be a random guess.

    diploid_simple_tangle

    Scaffolds <utig4-1225<utig4-1224[N5000N:ambig_bubble]>utig4-1511<utig4-1513 and <utig4-1226<utig4-1224[N5000N:ambig_bubble]>utig4-1511<utig4-1512. Because of the long homozygous nodes utig4-1224 and utig4-1511, there are no long reads connecting utig4-1228/utig4-1227 with utig4-1225/utig4-1226 or utig4-1512/utig4-1513. TTT will make a random guess, but so can you.

  • Complex repeats

    TTT was designed for such cases. However, there can be no more than 2 haplotypes in the tangle (so rDNA tangles connecting multiple chromosomes are usually unresolvable). Also, TTT does not scaffold, so in two-haplotype cases you need to know how to pair the incoming and outgoing nodes.

    haploid tangle

    Gap caused by repeat array

    diploid tangle

    Gap caused by large duplication of homozygous region, present in one of the haplotypes
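The scaffold paths quoted above use oriented-node notation: `>` marks a node traversed in the forward direction and `<` a node traversed in reverse, as in GAF paths. The sketch below splits such a string into nodes and gaps, which can help when checking which nodes flank a gap. It is not part of TTT; the function name `parse_path` is hypothetical, and the gap token format `[N<length>N:<label>]` is inferred only from the examples in this README.

```python
import re

# A token is either an oriented node (">name" or "<name")
# or a gap written as "[N<length>N:<label>]" (format assumed from the README examples).
TOKEN = re.compile(r"([<>])([^<>\[\]]+)|\[N(\d+)N:([^\]]+)\]")


def parse_path(path):
    """Return a list of ('node', orientation, name) and ('gap', length, label) tuples."""
    items = []
    pos = 0
    for match in TOKEN.finditer(path):
        if match.start() != pos:
            # Something between two tokens did not match either pattern.
            raise ValueError(f"unexpected text at offset {pos}: {path[pos:match.start()]!r}")
        pos = match.end()
        if match.group(1):
            items.append(("node", match.group(1), match.group(2)))
        else:
            items.append(("gap", int(match.group(3)), match.group(4)))
    if pos != len(path):
        raise ValueError(f"unexpected trailing text: {path[pos:]!r}")
    return items


scaffold = parse_path("<utig4-1497[N100000N:scaffold]<utig4-340")
for item in scaffold:
    print(item)
```

The nodes immediately before and after a `gap` item are the ones whose connection is missing from the scaffold.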

Required Arguments:

  • --graph: Path to the GFA file with the graph structure
  • --alignment: Path to a file with GraphAligner alignment

Instead of those two options, one can use --verkko-output <verkko output directory>. In this case, the internal verkko files for the HiFi graph, coverage (ONT), and ONT alignments are used.

  • --outdir: Output directory

  • --boundary-nodes <boundary_nodes_file>: locates the tangle. boundary_nodes_file should contain tab-separated pairs of incoming and outgoing boundary nodes, one pair per line. The boundary nodes should be non-repetitive and, for 'diploid' tangles, heterozygous. They should completely separate the tangle from the rest of the graph: after their removal, there should be no path in the remaining graph between tangle nodes and any non-tangle nodes.

Example helo_border

For this tangle, a decent choice of boundary nodes would be
utig1-10326 utig1-2575
utig1-10327 utig1-2574

Currently, TTT does not support tangles with more than 2 traversing paths (i.e. most of the rDNA tangles in human-like genomes).
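The separation requirement on boundary nodes can be checked before running TTT. The following is a minimal sketch, not TTT's own code: it treats the graph as undirected and ignores node orientation (a simplifying assumption), and the helper names `parse_boundary_pairs` and `reachable_without_boundaries`, as well as the toy node names, are hypothetical. Starting from any node known to be inside the tangle, it collects everything reachable without passing through a boundary node; if that set contains nodes you consider outside the tangle, the chosen boundary does not isolate it.

```python
from collections import defaultdict, deque


def parse_boundary_pairs(text):
    """Parse tab-separated (incoming, outgoing) pairs, one pair per line."""
    pairs = []
    for line in text.splitlines():
        if line.strip():
            incoming, outgoing = line.rstrip("\n").split("\t")
            pairs.append((incoming, outgoing))
    return pairs


def reachable_without_boundaries(links, boundary_nodes, seed):
    """Nodes reachable from `seed` when boundary nodes are removed (orientation ignored)."""
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    boundary = set(boundary_nodes)
    seen = {seed}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in boundary and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Toy graph (hypothetical node names): two entries, a two-node tangle, two exits.
links = [
    ("in_a", "t1"), ("in_b", "t1"), ("t1", "t2"), ("t2", "t2"),
    ("t2", "out_a"), ("t2", "out_b"), ("out_a", "rest"), ("out_b", "rest"),
]
pairs = parse_boundary_pairs("in_a\tout_a\nin_b\tout_b\n")
boundary = {node for pair in pairs for node in pair}
tangle = reachable_without_boundaries(links, boundary, "t1")
print(sorted(tangle))
```

In a real run, the links would come from the L lines of the GFA file, and the seed would be any node you know lies inside the tangle.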

Output

TTT writes two files to <outdir>: traversal.multiplicities.csv, with the estimated multiplicities of tangle nodes (can be used with Bandage), and traversal.gaf, with the resulting path. If the graph .gfa file contained node sequences, it also writes traversal.hpc.fasta with a patch sequence. However, when TTT is combined with verkko, this patch is homopolymer-compressed, since verkko's graph is based on homopolymer-compressed sequences. To get the non-hpc sequence you'll need to rerun verkko, providing traversal.gaf with the --path option; see verkko's manual for details.
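For readers unfamiliar with the term, homopolymer compression collapses every run of identical bases into a single base, which is why the patch sequence cannot be used directly as a final assembly sequence. A minimal illustration follows; it is not taken from TTT or verkko, and the function name `homopolymer_compress` is hypothetical.

```python
from itertools import groupby


def homopolymer_compress(sequence):
    """Collapse each run of identical characters to a single character."""
    # groupby yields one (base, run) pair per maximal run of equal bases.
    return "".join(base for base, _run in groupby(sequence))


print(homopolymer_compress("AAACCGTTTTA"))
```

The compression is lossy: run lengths cannot be recovered from the compressed sequence alone, which is why the reads have to be consulted again to produce the uncompressed sequence.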

Verkko's final graph coverage fix

In verkko up to and including v2.2.1, the coverage of short nodes in tangles in the final graph (assembly.homopolymer-compressed.gfa) is deeply flawed. To get an updated coverage file, we suggest running these additional scripts:

./verkko_coverage_fix/utig4_to_utig1.py <assembly_folder> > utig42utig1.gaf

./verkko_coverage_fix/utig4_coverage_updater.py utig42utig1.gaf <assembly_folder>/assembly.homopolymer-compressed.noseq.gfa <assembly_folder>/2-processGraph/unitig-unrolled-hifi-resolved.ont-coverage.csv > utig4_upt.ont-coverage.csv

and then passing utig4_upt.ont-coverage.csv as --coverage-file to the main script.

Alternatively, you can look up how utig4- nodes map to the utig1- graph in utig42utig1.gaf and run TTT.py on the same tangle in the HiFi-only graph (2-processGraph/unitig-unrolled-hifi-resolved.gfa within the verkko output directory). This usually gives better results and does not require realigning ONT reads to the graph.
