Commit 54c5019

Merge pull request #57 from neherlab/fix/issue-56
- fixed issue #56
- pangraph `build` command is now deterministic; a random seed can be set with the `-r` option.
- the `build` and `merge` commands now have a `-t` flag. When set, sanity checks are performed on the graph.
- fasta input files are checked for duplicated record names, and blank lines between records are tolerated
2 parents b560169 + 9c0af7b commit 54c5019

File tree

13 files changed: 552 additions, 315 deletions

CHANGELOG.md

Lines changed: 4 additions & 1 deletion
@@ -1,8 +1,11 @@
 # PanGraph Changelog

-## v0.6.4 (draft)
+## v0.7.0

 - fasta input files are checked for duplicated records, and white lines between records are tolerated, see [#55](https://github.com/neherlab/pangraph/pull/55).
+- PanGraph execution is now deterministic, and same input files always produce the same output, see [#57](https://github.com/neherlab/pangraph/pull/57). For the build command, a random seed can be set with the `-r` flag.
+- introduced the `-t` flag in the `build` and `merge` command. This activates consistency checks to verify that the input genomes can be exactly reconstructed. See [#57](https://github.com/neherlab/pangraph/pull/57).
+- Fixed [#56](https://github.com/neherlab/pangraph/issues/56)

 ## v0.6.3
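
The duplicate-record check mentioned in the changelog entry can be sketched roughly as follows. This is a minimal illustration and not PanGraph's actual implementation: the function name `check_fasta_names` is hypothetical, and taking the record name to be the first whitespace-delimited token of the header line is an assumption.

```julia
# Hypothetical helper (not part of PanGraph): collect FASTA record names,
# tolerate blank lines, and fail on the first duplicated name.
function check_fasta_names(io::IO)
    seen = Set{String}()
    for line in eachline(io)
        isempty(strip(line)) && continue     # blank lines between records are tolerated
        startswith(line, ">") || continue    # sequence line: nothing to check
        fields = split(line[2:end])          # assumption: name = first token of the header
        name = isempty(fields) ? "" : String(first(fields))
        name in seen && error("duplicated record name: '$name'")
        push!(seen, name)
    end
    return seen
end

# Two records separated by a blank line are accepted.
names = check_fasta_names(IOBuffer(">a\nACGT\n\n>b\nTTGA\n"))
```

Calling the same helper on an input that repeats a header, for example `">a\nAC\n>a\nGT\n"`, raises an error instead of returning the set of names.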

Manifest.toml

Lines changed: 2 additions & 2 deletions
@@ -247,9 +247,9 @@ uuid = "efe28fd5-8261-553b-a9e1-b2916fc3738e"
 version = "0.5.5+0"

 [[deps.OrderedCollections]]
-git-tree-sha1 = "85f8e6578bf1f9ee0d11e7bb1b1456435479d47c"
+git-tree-sha1 = "d321bf2de576bf25ec4d3e4360faca399afca282"
 uuid = "bac558e1-5e72-5ebc-8fee-abe8a469f55d"
-version = "1.4.1"
+version = "1.6.0"

 [[deps.PDMats]]
 deps = ["LinearAlgebra", "SparseArrays", "SuiteSparse"]

Project.toml

Lines changed: 3 additions & 2 deletions
@@ -1,7 +1,7 @@
 name = "PanGraph"
 uuid = "0f9f61ca-f32c-45e1-b3bc-00138f4f8814"
 authors = ["Nicholas Noll <[email protected]>"]
-version = "0.6.3"
+version = "0.7.0"

 [deps]
 Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
@@ -12,6 +12,7 @@ JLD2 = "033835bb-8acc-5ee8-8aae-3f567f8a3819"
 JSON = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
 LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
 Logging = "56ddb016-857b-54e1-b83d-db4d58db5568"
+OrderedCollections = "bac558e1-5e72-5ebc-8fee-abe8a469f55d"
 PackageCompiler = "9b87118b-4619-50d2-8e1e-99f35a4d4d9d"
 Pkg = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f"
 ProgressMeter = "92933f4c-e287-5a05-a399-4b506db050ca"
@@ -23,4 +24,4 @@ TreeTools = "62f0eae3-8c0e-4032-a621-7756092209e5"
 minimap2_jll = "d341526d-637d-5003-8fc4-9c6812cd2b55"

 [compat]
-TreeTools = ">= 0.6.2"
+TreeTools = ">= 0.6.2"

docs/src/cli/build.md

Lines changed: 14 additions & 12 deletions
@@ -4,18 +4,20 @@
 Build a multiple sequence alignment pangraph.

 ## Options
-| Name | Type | Short Flag | Long Flag | Description |
-| :------------------- | :------ | :--------- | :--------------- | :-------------------------------------------------------------------------------------------------------- |
-| minimum length | Integer | l | len | minimum block size for alignment graph (in nucleotides) |
-| block junction cost | Float | a | alpha | energy cost for introducing block partitions due to alignment merger |
-| block diversity cost | Float | b | beta | energy cost for interblock diversity due to alignment merger |
-| circular genomes | Boolean | c | circular | toggle if input genomes are circular |
-| pairwise sensitivity | String | s | sensitivity | controls the pairwise genome alignment sensitivity of minimap 2. Currently only accepts "5", "10" or "20" |
-| maximum self-maps | Integer | x | max-self-map | maximum number of iterations to perform block self maps per pairwise graph merger |
-| enforce uppercase | Boolean | u | upper-case | toggle to force genomes to uppercase characters |
-| distance calculator | String | d | distance-backend | only accepts "native" or "mash" |
-| alignment kernel | String | k | alignment-kernel | only accepts "minimap2" or "mmseqs" |
-| kmer length (mmseqs) | Integer | K | kmer-length | kmer length, only used for mmseqs2 alignment kernel. If not specified will use mmseqs default. |
+| Name | Type | Short Flag | Long Flag | Description |
+| :------------------- | :------ | :--------- | :--------------- | :------------------------------------------------------------------------------------------------------------ |
+| minimum length | Integer | l | len | minimum block size for alignment graph (in nucleotides) |
+| block junction cost | Float | a | alpha | energy cost for introducing block partitions due to alignment merger |
+| block diversity cost | Float | b | beta | energy cost for interblock diversity due to alignment merger |
+| circular genomes | Boolean | c | circular | toggle if input genomes are circular |
+| pairwise sensitivity | String | s | sensitivity | controls the pairwise genome alignment sensitivity of minimap 2. Currently only accepts "5", "10" or "20" |
+| maximum self-maps | Integer | x | max-self-map | maximum number of iterations to perform block self maps per pairwise graph merger |
+| enforce uppercase | Boolean | u | upper-case | toggle to force genomes to uppercase characters |
+| distance calculator | String | d | distance-backend | only accepts "native" or "mash" |
+| alignment kernel | String | k | alignment-kernel | only accepts "minimap2" or "mmseqs" |
+| kmer length (mmseqs) | Integer | K | kmer-length | kmer length, only used for mmseqs2 alignment kernel. If not specified will use mmseqs default. |
+| consistency check | Boolean | t | test | toggle to activate consistency check: verifies that input genomes can be exactly reconstructed from the graph |
+| random seed | Int | r | random-seed | random seed for pangraph construction. |

 ## Arguments
 Expects one or more fasta files.

docs/src/cli/marginalize.md

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ Compute all pairwise marginalizations of a multiple sequence alignment pangraph.
 | Output path | String | o | output-path | Path to direcotry where the output of all pairwise mariginalizations will be stored if supplied |
 | Reduce paralogs | Boolean | r | reduce-paralog | Collapses coparallel paths through duplicated blocks. |
 | Projection strains | String | s | Strains | Collapses the graph structure to only blocks and edges contained by the paths of the supplied strain names. comma seperated, no spaces |
+| Consistency check | Boolean | t | test | toggle to activate consistency check: verifies that output genomes are exactly equal to input genomes |

 ## Arguments
 Zero or one pangraph file which must be formatted as a JSON.

src/PanGraph.jl

Lines changed: 2 additions & 0 deletions
@@ -163,6 +163,8 @@ function main(args)
 end

 function julia_main()::Cint
+    # initialize random seed for reproducibility
+    seed!(0)
     try
         main(ARGS)
     catch
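
The effect of seeding at program start can be illustrated with a small standalone sketch. It assumes that random names are drawn with `randstring` from Julia's `Random` standard library (the diff uses `randstring` in a debug line, but how PanGraph names blocks is not shown here); `name_with_seed` is a hypothetical helper.

```julia
using Random

# Hypothetical helper: fix the seed, then draw a 10-character random name.
function name_with_seed(seed::Int)
    Random.seed!(seed)
    return randstring(10)
end

# The same seed gives the same name on every call within one Julia version.
same = name_with_seed(0) == name_with_seed(0)
```

Without the call to `Random.seed!`, two runs of the same program would in general draw different names, which is the kind of run-to-run variation the commit sets out to remove.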

src/align.jl

Lines changed: 79 additions & 17 deletions
@@ -3,6 +3,7 @@ module Align
 using Rematch, Dates
 using LinearAlgebra
 using ProgressMeter
+using Random

 using Base.Threads: @spawn, @threads
@@ -26,28 +27,32 @@ end
 # ------------------------------------------------------------------------
 # guide tree for order of pairwise comparison for multiple alignments

+
 # TODO: distance?
 """
     mutable struct Clade
         name :: String
         parent :: Union{Clade,Nothing}
         left :: Union{Clade,Nothing}
         right :: Union{Clade,Nothing}
-        graph :: Channel{Graph}
+        graph :: Channel{Tuple{Graph,Int}}
     end

 Clade is a node (internal or leaf) of a binary guide tree used to order pairwise alignments
 associated to a multiple genome alignment in progress.
 `name` is only non-empty for leaf nodes.
 `parent` is `nothing` for the root node.
 `graph` is a 0-sized channel that is used as a message passing primitive in alignment.
+It contains the graph and an index used to decide the order of items in a pair in
+pairwise graph merge.
 """
+Message=Tuple{Graph,Int}
 mutable struct Clade
     name :: String
     parent :: Union{Clade,Nothing}
     left :: Union{Clade,Nothing}
     right :: Union{Clade,Nothing}
-    graph :: Channel{Graph}
+    graph :: Channel{Message}
 end

 # ---------------------------
@@ -58,20 +63,20 @@ end

 Generate an empty, disconnected clade.
 """
-Clade() = Clade("",nothing,nothing,nothing,Channel{Graph}(2))
+Clade() = Clade("", nothing, nothing, nothing, Channel{Message}(2))
 """
     Clade(name)

 Generate an empty, disconnected clade with name `name`.
 """
-Clade(name) = Clade(name,nothing,nothing,nothing,Channel{Graph}(2))
+Clade(name) = Clade(name, nothing, nothing, nothing, Channel{Message}(2))
 """
     Clade(left::Clade, right::Clade)

 Generate an nameless clade with `left` and `right` children.
 """
 function Clade(left::Clade, right::Clade)
-    parent = Clade("",nothing,left,right,Channel{Graph}(2))
+    parent = Clade("", nothing, left, right, Channel{Message}(2))

     left.parent = parent
     right.parent = parent
@@ -351,6 +356,31 @@ function preprocess(hits, skip, energy, blocks!)
     return hits
 end

+# DEBUG
+function log_alignment(G₁::Graph, G₂::Graph, hits, fname::String)
+    open(fname, "w") do io
+        for G in (G₁, G₂)
+            write(io, "------------ G ------------\n")
+            PC = pancontigs(G)
+            for (name, seq) in zip(PC.name, PC.sequence)
+                write(io, ">$name\n")
+                write(io, seq, "\n")
+            end
+        end
+        write(io, "------------ hits ------------\n")
+        for h in hits
+            write(io, """
+            .........................
+            qry -> $(h.qry.name) | $(h.qry.start) -> $(h.qry.stop) | $(h.qry.length)
+            ref -> $(h.ref.name) | $(h.ref.start) -> $(h.ref.stop) | $(h.ref.length)
+            len -> $(h.length)
+            strand -> $(h.orientation)
+            cigar -> $(h.cigar)
+            """)
+        end
+    end
+end
+
 function do_align(G₁::Graph, G₂::Graph, energy::Function, aligner::Function)
     hits = if G₁ == G₂
         self = pancontigs(G₁)
@@ -359,6 +389,9 @@ function do_align(G₁::Graph, G₂::Graph, energy::Function, aligner::Function)
         aligner(pancontigs(G₁), pancontigs(G₂))
     end
     sort!(hits; by=energy)
+
+    # DEBUG
+    # log_alignment(G₁, G₂, hits, "issue/minimap/$(randstring(10)).log")

     return hits
 end
@@ -398,7 +431,7 @@ The _lower_ the score, the _better_ the alignment. Only negative energies are co
 `minblock` is the minimum size block that will be produced from the algorithm.
 `maxiter` is maximum number of duplications that will be considered during this alignment.
 """
-function align_self(G₁::Graph, energy::Function, minblock::Int, aligner::Function, verify::Function, verbose::Bool; maxiter=100, sensitivity="asm10")
+function align_self(G₁::Graph, energy::Function, minblock::Int, aligner::Function, verify::Function, verbose::Bool; maxiter=100)
     G₀ = G₁

     for niter in 1:maxiter
@@ -440,6 +473,9 @@ function align_self(G₁::Graph, energy::Function, minblock::Int, aligner::Funct
         detransitive!(G₀)
         purge!(G₀)
         prune!(G₀)
+
+        # verify that isolates are correctly reconstructed (-v flag)
+        verify(G₀, msg="verify align-self $niter")
     end

     return G₀
@@ -573,6 +609,9 @@ function align_pair(G₁::Graph, G₂::Graph, energy::Function, minblock::Int, a
     purge!(G)
     prune!(G)

+    # verify that isolates are correctly reconstructed (-v flag)
+    verify(G, msg="verify align-pair")
+
     return G
 end
@@ -592,8 +631,8 @@ The _lower_ the score, the _better_ the alignment. Only negative energies are co

 `compare` is the function to be used to generate pairwise distances that generate the internal guide tree.
 """
-function align(aligner::Function, Gs::Graph...; compare=Mash.distance, energy=(hit)->(-Inf), minblock=100, reference=nothing, maxiter=100)
-    function verify(graph, msg="")
+function align(aligner::Function, Gs::Graph...; compare=Mash.distance, energy=(hit)->(-Inf), minblock=100, reference=nothing, maxiter=100, verbose=false, rand_seed=0)
+    function verify(graph; msg="")
         if reference !== nothing
             for (name,path) ∈ graph.sequence
                 seq = sequence(path)
@@ -630,7 +669,7 @@ function align(aligner::Function, Gs::Graph...; compare=Mash.distance, energy=(h
                     println("--> insert: $(path.node[i].block.insert[path.node[i]])")
                     println("--> delete: $(path.node[i].block.delete[path.node[i]])")

-                    error("--> isolate '$name' incorrectly reconstructed")
+                    error("$msg\n--> isolate '$name' incorrectly reconstructed")
                 end
             end
         end
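
The signature change from `verify(graph, msg="")` to `verify(graph; msg="")` matters because the new call sites pass `msg` by name. The standalone sketch below shows the difference between a positional optional argument and a keyword argument in Julia; the two function names are hypothetical and a string stands in for the graph.

```julia
# Positional optional argument: `msg` can only be passed by position.
verify_positional(graph, msg="") = "$msg: $graph"

# Keyword argument (note the semicolon): `msg` is passed by name.
verify_keyword(graph; msg="") = "$msg: $graph"

ok = verify_keyword("G", msg="verify align-pair")       # accepted
# verify_positional("G", msg="verify align-pair")       # throws a MethodError
```

With the old positional form, a call such as `verify(G, msg="verify align-pair")` would fail with a `MethodError`, so the semicolon is needed for the calls added in `align_self` and `align_pair`.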
@@ -655,30 +694,53 @@ function align(aligner::Function, Gs::Graph...; compare=Mash.distance, energy=(h
     meter_lock = ReentrantLock()

     G = nothing
-    for clade ∈ postorder(tree)
+    for (n_clade, clade) ∈ enumerate(postorder(tree))
         @spawn try
+
+            # random seed for the thread - to ensure deterministic reproducibility
+            # in block names
+            Random.seed!(rand_seed+n_clade)
+
             if isleaf(clade)
                 close(clade.graph)
-                put!(clade.parent.graph, tips[clade.name])
+                msg = (tips[clade.name], n_clade)
+                put!(clade.parent.graph, msg)
             else
-                Gₗ = take!(clade.graph)
-                Gᵣ = take!(clade.graph)
+                Gₗ, Pₗ = take!(clade.graph)
+                Gᵣ, Pᵣ = take!(clade.graph)
                 close(clade.graph)
-
+                # ensure a consistent ordering of the two graphs,
+                # irrespective of which process is sending the message first.
+                if Pₗ > Pᵣ
+                    Gₗ, Gᵣ = Gᵣ, Gₗ
+                end
+
                 # the lock ensures that at most N=Threads.nthreads() processes are
                 # spawning run(`cmd`) instances at the same time
                 G₀ = lock_semaphore(s) do
-                    G₀ = align_pair(Gₗ, Gᵣ, energy, minblock, aligner, verify, false)
-                    align_self(G₀, energy, minblock, aligner, verify, false)
+                    verbose && log("--> align-pair for clade n. $n_clade")
+                    G₀ = align_pair(Gₗ, Gᵣ, energy, minblock, aligner, verify, verbose)
+                    verbose && log("--> align-self for clade n. $n_clade")
+                    G₀ = align_self(G₀, energy, minblock, aligner, verify, verbose, maxiter=maxiter)
+                    verbose && log("--> graph merging for clade n. $n_clade completed")
+                    G₀
                 end

+                # DEBUG : save graph at each iteration in a file
+                # open("issue/comp/graph_iteration_$(n_clade).json", "w") do io
+                #     finalize!(G₀)
+                #     marshal(io, G₀; fmt=:json)
+                # end
+
+
                 # advance progress bar in a thread-safe way
                 lock(meter_lock) do
                     next!(meter)
                 end

                 if clade.parent !== nothing
-                    put!(clade.parent.graph, G₀)
+                    msg = (G₀, n_clade)
+                    put!(clade.parent.graph, msg)
                 else
                     G = G₀
                     close(error_channel)
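
The two ideas in this hunk, tagging each message with the sender's index and seeding the RNG per task, can be tried in isolation with the sketch below. It is a simplified stand-in rather than PanGraph code: strings play the role of graphs, and `Msg` and `merge_children` are hypothetical names.

```julia
using Random
using Base.Threads: @spawn

# Illustrative stand-in for Tuple{Graph,Int}: a string plays the role of a graph.
const Msg = Tuple{String,Int}

function merge_children(rand_seed::Int)
    inbox = Channel{Msg}(2)
    # Two children send their result in whatever order the scheduler runs them.
    @sync for (n, label) in enumerate(("left", "right"))
        @spawn begin
            Random.seed!(rand_seed + n)      # per-task seed, mirroring rand_seed+n_clade
            put!(inbox, (label * "-" * randstring(4), n))
        end
    end
    a, ia = take!(inbox)
    b, ib = take!(inbox)
    close(inbox)
    if ia > ib                               # restore a fixed order using the index
        a, b = b, a
    end
    return (a, b)
end

result = merge_children(0)
```

Whichever task reaches `put!` first, the pair comes back with the lower index first, and the random suffixes depend only on `rand_seed` and the task index, so repeated calls with the same seed return the same pair.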
