ACEnglish
diff --git a/README.md b/README.md
Lines changed: 2 additions & 2 deletions
diff --git a/docs/Updates.md b/docs/Updates.md
Lines changed: 25 additions & 1 deletion
diff --git a/docs/api/truvari.package.rst b/docs/api/truvari.package.rst
Lines changed: 0 additions & 2 deletions
diff --git a/docs/bench.md b/docs/bench.md
Lines changed: 11 additions & 6 deletions
diff --git a/docs/collapse.md b/docs/collapse.md
Lines changed: 13 additions & 1 deletion
diff --git a/docs/phab.md b/docs/phab.md
Lines changed: 38 additions & 32 deletions
diff --git a/docs/requirements.txt b/docs/requirements.txt
Lines changed: 1 addition & 0 deletions
diff --git a/imgs/coverage.svg b/imgs/coverage.svg
Lines changed: 2 additions & 2 deletions
diff --git a/pyproject.toml b/pyproject.toml
Lines changed: 6 additions & 5 deletions
@@ -2,7 +2,7 @@
 [![pylint](imgs/pylint.svg)](https://github.com/acenglish/truvari/actions/workflows/pylint.yml)
 [![FuncTests](https://github.com/acenglish/truvari/actions/workflows/func_tests.yml/badge.svg?branch=develop&event=push)](https://github.com/acenglish/truvari/actions/workflows/func_tests.yml)
 [![coverage](imgs/coverage.svg)](https://github.com/acenglish/truvari/actions/workflows/func_tests.yml)
-[![develop](https://img.shields.io/github/commits-since/acenglish/truvari/v5.1.1)](https://github.com/ACEnglish/truvari/compare/v5.1.1...develop)
+[![develop](https://img.shields.io/github/commits-since/acenglish/truvari/v5.2.0)](https://github.com/ACEnglish/truvari/compare/v5.2.0...develop)
 [![Downloads](https://static.pepy.tech/badge/truvari)](https://pepy.tech/project/truvari)
 
 ![Logo](https://raw.githubusercontent.com/ACEnglish/truvari/develop/imgs/BoxScale1_DarkBG.png)  
@@ -29,7 +29,7 @@ The current most common Truvari use case is for structural variation benchmarkin
   truvari bench -b base.vcf.gz -c comp.vcf.gz -f reference.fa -o output_dir/
 ```
 
-Find more matches by harmonizing phased varians using refine:
+Find more matches by harmonizing phased variants using refine:
 ```
    truvari refine output_dir/
 ```
 
@@ -1,6 +1,30 @@
-# Truvari 5.1.1
+# Truvari 5.3
 *in progress*
 
+* Fixed FP BNDs being dropped [details](https://github.com/ACEnglish/truvari/discussions/263). 
+* Restore default `--sizemax` - Some callers make SVs that span the entire chromosome, which disrupts truvari's chunking strategy
+* `phab`
+  * Can now harmonize samples' variants across any number of VCFs. This entails a UI change of no more `-b/-c`.
+  * For large jobs, a `--lowmem` flag:
+    * Turns on progress bars and process pool monitoring for better tracking of failed harmonization jobs
+    * Much lower memory usage, and fewer failures
+  * API refactor to programmatically build haplotypes with `phab.VCFtoHaplotypes` and other phab functions
+* `bench`
+  * Can now run on vcfs without SAMPLE columns (e.g. annotation files)
+* Fixed `anno grpaf` header tags. Some tools (e.g. IGV) don't like Type before Number.
+* Fixed edge case when iterating variants not inside regions
+
+# Truvari 5.2.0
+*February 16, 2025*
+
+* The default `--align` method for `phab` and `bench` switched to POA. See [discussion](https://github.com/ACEnglish/truvari/discussions/261) for details.
+* Fix bug in `--pick ac` where FN/FP variants were not being counted/output.
+* Fix `--dup-to-ins` Ticket [#258](https://github.com/ACEnglish/truvari/issues/258)
+* `ga4gh` now also writes a variant count summary json
+
+# Truvari 5.1.1
+*February 5, 2025*
+
 * `bench`
   *  new automatic hook into the refine step via `truvari bench --refine`
 * `refine`
 
@@ -104,8 +104,6 @@ Extra Methods
 
 .. autofunction:: performance_metrics
 
-.. autofunction:: phab
-
 .. autofunction:: reciprocal_overlap
 
 .. autofunction:: restricted_float
 
@@ -149,13 +149,18 @@ Refining bench output
 =====================
 As described in the [[refine wiki|refine]], a limitation of Truvari bench is 1-to-1 variant comparison. However, `truvari refine` can harmonize the variants to give them more consistent representations. A bed file named `candidate.refine.bed` is created by `truvari bench` and holds a set of regions which may benefit from refinement. To use it, simply run
 ```bash
-truvari bench -b base.vcf.gz -c comp.vcf.gz -o result/
-truvari refine --regions result/candidate.refine.bed \
-               --reference reference.fasta \
-               --recount --use-region-coords \
-               result/
+truvari bench -b base.vcf.gz -c comp.vcf.gz --reference reference.fasta  -o result/
+truvari refine result/
+```
+
+Refine has a few parameters detailed in the [[refine wiki|refine]]. But if you'd like to run refinement automatically with defaults, you can simply use
+```bash
+truvari bench -b base.vcf.gz \
+     -c comp.vcf.gz \
+     --reference reference.fasta \
+     -o result/ \
+     --refine
 ```
-See [[refine wiki|refine]] for details.
 
 Comparing Sequences of Variants
 ===============================
 
@@ -76,7 +76,19 @@ For some results, one may not want to collapse variants with conflicting genotyp
 
 --intra
 =======
-When a single sample is run through multiple SV callers, one may wish to consolidate those results. After the `bcftools merge` of the VCFs, there will be one SAMPLE column per-input. With `--intra`, collapse will consolidate the sample information so that only a single sample column is present in the output. Since the multiple callers may have different genotypes or other FORMAT fields with conflicting information, `--intra` takes the first column from the VCF, then second, etc. For example, if we have an entry with:
+When a single sample is run through multiple SV callers, one may wish to consolidate those results. After the `bcftools merge` of the VCFs, there will be one SAMPLE column per-input. With `--intra`, collapse will consolidate the sample information so that only a single sample column is present in the output. This will also add a `FORMAT/SUPP` field which indicates which columns had a present GT (e.g. 0/1 or 1/1). For example, these two calls would collapse:
+
+```
+#Call    FORMAT    S1     S2
+Call1   GT:GQ:AD  1/1    ./.
+Call2   GT:GQ:AD  ./.    1/1
+
+# Into
+#Call    FORMAT         S1
+Call1   GT:GQ:AD:SUPP  1/1:3
+```
+
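The `SUPP` value of `3` in the example above reads naturally as a bit flag over the input sample columns. The sketch below illustrates that reading only. It assumes bit `i` is set when column `i` carries at least one alternate allele and that the first column is the lowest bit; neither detail is stated in this diff, and `has_alt` and `supp_flag` are hypothetical helper names, not Truvari functions.

```python
def has_alt(gt: str) -> bool:
    """True when a GT string such as '0/1' or '1|1' carries an alternate allele."""
    alleles = gt.replace("|", "/").split("/")
    return any(a not in (".", "0") for a in alleles)


def supp_flag(calls):
    """Build a bit flag with bit i set when sample column i supports the call.

    `calls` is a list of rows, each row holding one GT string per sample column.
    Assumption: the first column maps to the lowest bit.
    """
    flag = 0
    for column, gts in enumerate(zip(*calls)):
        if any(has_alt(gt) for gt in gts):
            flag |= 1 << column
    return flag


# Call1 is present only in S1, Call2 only in S2; collapsed together both bits are set.
collapsed = supp_flag([["1/1", "./."], ["./.", "1/1"]])
only_first = supp_flag([["1/1", "./."]])
only_second = supp_flag([["./.", "1/1"]])
print(collapsed, only_first, only_second)
```

Under these assumptions a flag of `1` or `2` would mean that only one caller supported the collapsed call.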
+Since the multiple callers may have different genotypes or other FORMAT fields with conflicting information, `--intra` takes the first column from the VCF, then second, etc. For example, if we have an entry with:
 ```
 FORMAT    RESULT1     RESULT2
 GT:GQ:AD  ./.:.:3,0  1/1:20:0,30
 
@@ -5,17 +5,18 @@ Truvari's comparison engine can match variants using a wide range of thresholds.
 
 This problem is easiest to conceptualize in the case of 'split' variants: imagine a pipeline calls a single 100bp DEL that can also be represented as two 50bp DELs. To match these variants, we would need to loosen our thresholds to `--pick multi --pctsim 0.50 --pctsize 0.50`. Plus, these thresholds leave no margin for error. If the variant caller erroneously deleted an extra base to make a 101bp DEL we would have to lower our thresholds even further. These thresholds are already too low because there's plenty of distinct alleles with >= 50% homology.
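The arithmetic behind the split-variant example can be checked with a short sketch. It assumes size similarity is the ratio of the smaller to the larger variant length, which is consistent with the 0.50 threshold quoted above; `size_similarity` is an illustrative helper, not a call into Truvari.

```python
def size_similarity(size_a: int, size_b: int) -> float:
    """Ratio of the smaller size to the larger one (1.0 means identical sizes)."""
    if size_a == 0 and size_b == 0:
        return 1.0
    return min(size_a, size_b) / max(size_a, size_b)


THRESHOLD = 0.50  # the --pctsize value from the example

# One 100bp DEL compared against one of its two 50bp halves.
exact_split = size_similarity(100, 50)
# The caller deleted one extra base, producing a 101bp DEL.
off_by_one = size_similarity(101, 50)

print(exact_split >= THRESHOLD)  # True
print(off_by_one >= THRESHOLD)   # False
```

A single extra base pushes the ratio just under 0.50, which is why the text says these thresholds leave no margin for error.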
 
-So how do we deal with inconsistent representations? In an ideal world, we would simply get rid of them by harmonizing the variants. This is the aim of `truvari phab` 
+How do we deal with inconsistent representations? In an ideal world, we would simply get rid of them by harmonizing the variants. This is the aim of `truvari phab` (pronounced 'fab')
 
 `truvari phab` is designed to remove variant representation inconsistencies through harmonization. By reconstructing haplotypes from variants, running multiple-sequence alignment of the haplotypes along with the reference, and then recalling variants, we expect to remove discordance between variant representations and simplify the work required to perform variant comparison.
 
-Requirements
-------------
-Since `truvari phab` uses mafft v7.505 via a command-line call, it expects it to be in the environment path. Download mafft and have its executable available in the `$PATH` [mafft](https://mafft.cbrc.jp/alignment/software/)
+Alignment Methods
+-----------------
+By default, `phab` will make the haplotypes and use `abpoa` to perform a multiple sequence alignment between them and the reference to harmonize variants. Note that `abpoa` may be non-deterministic in some cases.  
+
+Another available MSA algorithm is `mafft`, run as an external call, which can have its parameters customized via e.g. `--mafft-params "--maxiterate 1000"`. While `mafft` is often a more accurate alignment technique, it isn't fast. Because this makes an external call to mafft v7.505, phab expects the executable to be in the environment path. Download [mafft](https://mafft.cbrc.jp/alignment/software/) and make its executable available in the `$PATH`. Alternatively, you can use the Truvari [Docker container](Development#docker) which already has mafft ready for use.
 
-Alternatively, you can use the Truvari [Docker container](Development#docker) which already has mafft ready for use.
+If you're willing to sacrifice accuracy for a huge speed increase, you can use `--align wfa`. However, this isn't an MSA technique and instead independently aligns each haplotype, which will not produce the most parsimonious set of variants.
 
-Also, you can use wave front aligner (pyWFA) or partial order alignment (pyabpoa). While wfa is the fastest approach, it will independently align haplotypes and therefore may produce less parsimonous aligments. And while poa is more accurate than wfa and faster than mafft, it is less accurate than mafft.
 
 Example
 -------
@@ -31,13 +32,13 @@ To start, let's use `truvari bench` to see how similar the variant calls are in
 ```bash
 truvari bench --base phab_base.vcf.gz \
 	--comp phab_comp.vcf.gz \
-	--sizemin 1 --sizefilt 1 \
+	--sizemin 0 \
 	--bSample HG002 \
 	--cSample syndip \
 	--no-ref a \
 	--output initial_bench
 ```
-This will compare all variants greater than 1bp ( `-S 1 -s 1` which includes SNPs) from the `HG002` sample to the `syndip` sample. We're also excluding any uncalled or reference homozygous sites with `--no-ref a`. The report in `initial_bench/summary.txt` shows:
+This will compare all variants regardless of size (`-s 0` includes SNPs, `-s 1` would start at 1bp INDELs) from the `HG002` sample to the `syndip` sample. We're also excluding any uncalled or reference homozygous sites with `--no-ref a`. The report in `initial_bench/summary.txt` shows:
 ```json
 {
     "TP-base": 5,
@@ -50,24 +51,23 @@ This will compare all variants greater than 1bp ( `-S 1 -s 1` which includes SNP
 }
 ```
 
-These variants are pretty poorly matched, especially considering the `HG002` and `syndip` samples are using the same sequencing experiment. We can also inspect the `initial_bench/fn.vcf.gz` and see a lot of these discordant calls are concentrated in a 200bp window. Let's use `truvari phab` to harmonize the variants in this region.
+These variants are pretty poorly matched, especially considering the `HG002` and `syndip` samples are using the same sequencing experiment. We can also inspect the `initial_bench/fn.vcf.gz` and see a lot of these discordant calls are concentrated in a 200bp window. 
+
+Let's use `truvari phab` to harmonize the variants in this region.
 ```bash
-truvari phab --base phab_base.vcf.gz \
-	--comp phab_comp.vcf.gz \
-	--bSample HG002 \
-	--cSample syndip \
-	--reference phab_ref.fa \
+truvari phab --reference phab_ref.fa \
 	--region chr1:700-900 \
-	-o harmonized.vcf.gz
+	--samples HG002,syndip \
+	-o harmonized.vcf.gz \
+        phab_base.vcf.gz phab_comp.vcf.gz
 ```
 
 In our `harmonized.vcf.gz` we can see there are now only 9 variants. Let's run `truvari bench` again on the output to see how well the variants match after being harmonized.
 
 ```bash
 truvari bench -b harmonized.vcf.gz \
 	-c harmonized.vcf.gz \
-	-S 1 -s 1 \
-	--no-ref a \
+	-s 0 --no-ref a \
 	--bSample HG002 \
 	--cSample syndip \
 	-o harmonized_bench/
@@ -95,15 +95,16 @@ $ bcftools query -f "[%GT ]\n" phab_result/output.vcf.gz | sort | uniq -c
 ```
 (We can ignore the phasing differences (`0/1` vs. `1/0`). These pipelines reported the parental alleles in a different order)
 
-MSA
----
+MSA as Merging
+--------------
+
+If you read the `truvari phab --help`, you may have noticed that multiple VCFs can be provided and there's a `--samples` parameter. This is by design so that we can harmonize the variants across any number of VCFs, or even within a single multi-sample VCF. By performing a multiple-sequence alignment across samples, we can better represent variation across a population.
 
-If you read the `truvari phab --help` , you may have noticed that the `--comp` VCF is optional. This is by design so that we can also harmonize the variants inside a single VCF. By performing a multiple-sequence alignment across samples, we can better represent variation across a population. To see this in action, let's run `phab` on all 86 samples in the `repo_utils/test_files/phab_base.vcf.gz`
+To see this in action, let's run `phab` on all 86 samples in the `repo_utils/test_files/phab_base.vcf.gz`
 ```bash
-truvari phab -b phab_base.vcf.gz \
-	-f phab_ref.fa \
+truvari phab -f phab_ref.fa \
 	-r chr1:700-900 \
-	-o msa_example.vcf.gz
+	-o msa_example.vcf.gz phab_base.vcf.gz
 ```
 
 As a simple check, we can count the number of variants before/after `phab`:
@@ -115,7 +116,6 @@ The `160` original variants given to `phab` became just `60`.
 
 Better yet, these fewer variants occur on fewer positions:
 ```bash
-
 bcftools query -r chr1:700-900 -f "%POS\n" phab_base.vcf.gz | sort | uniq | wc -l
 bcftools query -r chr1:700-900 -f "%POS\n" msa_example.vcf.gz | sort | uniq | wc -l
 ```
@@ -143,15 +143,21 @@ The allele-count (AC) shows a 15% reduction in singletons and removal of all var
     1 150    1 81
 ```
 
-(TODO: pull the adotto TR region annotations and run this example through `truvari anno trf`. I bet we'll get a nice spectrum of copy-diff of the same motif in the `phab` calls.)
-
-`--align`
-=========
-By default, `phab` will make the haplotypes and use an external call `mafft` to perform a multiple sequence alignment between them and the reference to harmonize the variants. While this is the most accurate alignment technique, it isn't fast. If you're willing to sacrifice some accuracy for a huge speed increase, you can use `--align wfa`, which also doesn't require an external tool. Another option is `--align poa` which performs a partial order alignment which is faster than mafft but less accurate and slower than wfa but more accurate. However, `poa` appears to be non-deterministic which is not ideal for some benchmarking purposes.
-
 Limitations
 -----------
 * Creating and aligning haplotypes is impractical for very long sequences and maybe practically impossible for entire human chromosomes. Therefore, `truvari phab` is recommended to only be run on sub-regions.
-* By giving the variants new representations, variant counts will likely change. 
-* Early testing on `phab` is on phased variants. While it can run on unphased variants, we can't yet recommend it. If regions contain unphased Hets or overlapping variants, it becomes more difficult to build a consensus sequence. So you can try out unphased variants, but proceed with caution.
+* By giving the variants new representations, variant counts will likely change and all original variant metadata (e.g. `FORMAT/DP`) is lost. 
+* `phab` should only be used on phased variants unless you really know what you're doing. While it can run on unphased variants, it isn't recommended. If regions contain unphased Hets or overlapping variants, it becomes more difficult to build an accurate haplotype from the variants, and in many cases the final constructed haplotype will not represent any sequence observed in a sample, which will create entirely false variants. So you can try out unphased variants, but proceed with caution.
+
+Failed Jobs
+-----------
+Sometimes the haplotype alignment job can fail outside of python's control. For example, `abpoa` can log an error on jobs with a large number of haplotypes.
+```
+[SIMDMalloc] posix_memalign fail!
+Size: 274877906944, Error: ENOMEM
+```
+Phab will monitor these processes and catch their errors. Therefore, it is possible for phab to have an exit status of 0, even though some (or even all) of the harmonization tasks did not complete. The variants inside these failed regions will not be in the output vcf. 
 
+`--no-dedup`
+------------
+By default, phab will only harmonize one representation of each observed haplotype. This gives huge performance boosts and produces identical results for `--align wfa/poa`. However, `mafft` results are not identical when deduplicating haplotypes because it considers the relative weight of sequences based on their frequency.
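The idea behind haplotype deduplication can be sketched in a few lines. This illustrates the concept described above and is not phab's implementation; the haplotype names, the sequences, and the `dedup_haplotypes` helper are all made up for the example.

```python
from collections import defaultdict


def dedup_haplotypes(haplotypes):
    """Group haplotype names by identical sequence.

    Returns a dict mapping each distinct sequence to the names that share it,
    so only the distinct sequences need to be sent to the aligner.
    """
    groups = defaultdict(list)
    for name, sequence in haplotypes.items():
        groups[sequence].append(name)
    return dict(groups)


# Hypothetical haplotypes for two samples over one region.
haps = {
    "HG002_1": "ACGTACGT",
    "HG002_2": "ACGTTCGT",
    "syndip_1": "ACGTACGT",
    "syndip_2": "ACGTACGT",
}

unique = dedup_haplotypes(haps)
print(len(haps), "haplotypes,", len(unique), "distinct sequences")
for sequence, names in unique.items():
    print(sequence, "shared by", names)
```

Aligning only the distinct sequences and mapping the result back to every name in each group gives the same answer whenever the aligner ignores how often a sequence occurs. An aligner that weights sequences by frequency loses that information after grouping, which matches the `mafft` caveat above.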
@@ -1,3 +1,4 @@
+psutil>=7.0.0
 pywfa>=0.5.1
 sphinx>=7.2
 sphinx_rtd_theme>=2
 
@@ -13,17 +13,18 @@ license = { text = "MIT" }
 dynamic = ["version"]
 requires-python = ">=3.8"
 dependencies = [
-    "pywfa>=0.5.1",
-    "rich>=12.5.1",
+    "bwapy>=0.1.4",
     "edlib>=1.3.9",
-    "pysam>=0.22",
     "intervaltree>=3.1",
     "joblib>=1.2.0",
     "numpy>=1.24.4",
-    "pytabix>=0.1",
-    "bwapy>=0.1.4",
+    "rich>=12.5.1",
     "pandas>=1.5.3",
+    "psutil>=7.0.0",
     "pyabpoa>=1.4.3",
+    "pysam>=0.22",
+    "pytabix>=0.1",
+    "pywfa>=0.5.1",
 ]
 
 [project.scripts]