bmesuere commented Feb 2, 2026

This PR prepares for a set of potential performance improvements by adding a validation script and a benchmark script. Both scripts run on the 3 files in the example directory.

Performance baseline on my macbook:

Benchmark: Short reads (NC_000913-454.fna)
Benchmark 1: /Users/bart/Code/FragGeneScanRs/target/release/FragGeneScanRs -s /Users/bart/Code/FragGeneScanRs/example/NC_000913-454.fna -t 454_10 -w 0 -o /var/folders/j3/38fskpy159v07np8syk3p_2m0000gn/T/tmp.nzJs6Qr9D1/NC_000913-454
  Time (mean ± σ):     657.4 ms ±   5.8 ms    [User: 640.4 ms, System: 11.7 ms]
  Range (min … max):   652.1 ms … 672.0 ms    20 runs


Benchmark: Complete genome (NC_000913.fna)
Benchmark 1: /Users/bart/Code/FragGeneScanRs/target/release/FragGeneScanRs -s /Users/bart/Code/FragGeneScanRs/example/NC_000913.fna -t complete -w 1 -o /var/folders/j3/38fskpy159v07np8syk3p_2m0000gn/T/tmp.nzJs6Qr9D1/NC_000913
  Time (mean ± σ):     971.7 ms ±  12.8 ms    [User: 847.1 ms, System: 114.7 ms]
  Range (min … max):   958.8 ms … 1018.3 ms    20 runs


Benchmark: Long reads (contigs.fna)
Benchmark 1: /Users/bart/Code/FragGeneScanRs/target/release/FragGeneScanRs -s /Users/bart/Code/FragGeneScanRs/example/contigs.fna -t complete -w 1 -o /var/folders/j3/38fskpy159v07np8syk3p_2m0000gn/T/tmp.nzJs6Qr9D1/contigs
  Time (mean ± σ):      6.619 s ±  0.023 s    [User: 6.530 s, System: 0.059 s]
  Range (min … max):    6.598 s …  6.672 s    10 runs

| PR | Short reads | Complete genome | Long reads |
| --- | --- | --- | --- |
| Baseline | 657.4 ± 5.8 ms (1.000×) | 971.7 ± 12.8 ms (1.000×) | 6.619 ± 0.023 s (1.000×) |
| #18 bitwise operations | 532.2 ± 4.8 ms (1.235×) | 695.9 ± 8.3 ms (1.396×) | 4.488 ± 0.015 s (1.475×) |
| #19 vector initialization | 450.7 ± 7.4 ms (1.459×) | 626.8 ± 3.9 ms (1.550×) | 3.937 ± 0.009 s (1.681×) |
| #20 string formatting | 440.7 ± 6.7 ms (1.492×) | 642.2 ± 24.9 ms (1.513×) | 3.931 ± 0.009 s (1.684×) |
| #21 precompute penalties | 393.5 ± 3.8 ms (1.671×) | 620.2 ± 6.8 ms (1.567×) | 3.861 ± 0.021 s (1.714×) |

  • Between #19 and #20 there was a small regression on the complete genome. This was not caused by #20, but by tweaks to #18 and #19 after which I didn't run the benchmark again.
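
For readers who want to check the speedup factors in the table, they appear to be the baseline mean divided by the mean of each PR. A minimal sketch, assuming that definition (the helper name `speedup` is made up for illustration):

```bash
#!/usr/bin/env bash

# speedup BASELINE TIME: print BASELINE / TIME rounded to three decimals.
# Both values must use the same unit (ms or s).
speedup() {
    LC_ALL=C awk -v base="$1" -v t="$2" 'BEGIN { printf "%.3f\n", base / t }'
}

speedup 657.4 532.2   # short reads, baseline vs. #18  -> 1.235
speedup 6.619 3.861   # long reads, baseline vs. #21   -> 1.714
```

The factors ignore the standard deviations, so differences of a few thousandths between rows (such as #19 vs. #20 on the long reads) are well within the noise.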

Other lessons learned

  • Optimization effectiveness can differ (a lot) depending on the machine/CPU architecture. #19 (replace manual vector initialization) had a big effect on my macbook, but @ninewise didn't see any improvement. A SIMD change gave a big improvement on an x86 VM, but doubled the execution time on my macbook.
  • The vectors in #19 (replace manual vector initialization) are recreated each time. It seems that reusing them would be faster because of the reduced memory pressure. However, on my macbook this was noticeably slower, possibly because of the value reset that is then needed and because initialisation is very fast on Apple Silicon.
  • Using get_unchecked would shave off 10%, but I didn't create a PR for this.

Comment on lines +105 to +119
# Parse arguments
MODE="check"
if [[ $# -gt 0 ]]; then
    case "$1" in
        --baseline)
            MODE="baseline"
            ;;
        --check)
            MODE="check"
            ;;
        *)
            usage
            ;;
    esac
fi

Feels weird to combine the check and validate in here to deduplicate the 3 example calls, and not do the same for the benchmark. I'd merge all three.
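
One way the merged script could select its mode is sketched below. This is only an illustration of the suggestion: the `--benchmark` flag and the `parse_mode` helper are assumptions and do not exist in the PR.

```bash
#!/usr/bin/env bash

# parse_mode [FLAG]: print the mode name for FLAG, defaulting to "check"
# when no flag is given. Returns non-zero for an unknown flag.
# Assumption: --benchmark is a hypothetical third mode, not part of the PR.
parse_mode() {
    case "${1:---check}" in
        --baseline)  echo "baseline" ;;
        --check)     echo "check" ;;
        --benchmark) echo "benchmark" ;;
        *)           return 1 ;;
    esac
}

# Example: resolve the mode for a given flag.
parse_mode --benchmark
```

In a real script the call site would fall back to `usage` when `parse_mode` fails, so that the three example invocations only need to be listed once.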

@@ -0,0 +1,131 @@
#!/bin/bash

Use env, i.e. #!/usr/bin/env bash.

Comment on lines +52 to +54
IFS=':' read -r input train whole name <<< "$example"
echo " Processing $name..."
run_example "$input" "$train" "$whole" "$BASELINE_DIR/$name"

Rather than putting the examples in a string array, splitting and naming them, then naming them again in the run method, I'd rather write three methods:

NC454() { "$BINARY" -s example/NC_000913-454.fna -t 454_10 -w 0 -o NC_000913-454; }
...
examples=(NC454 ...)

And loop through the methods to call them directly.
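
A fuller sketch of this suggestion follows. The arguments are taken from the benchmark commands in the PR description; the function names `NC913` and `contigs`, the `run_examples` helper, the output-directory parameter, and the `echo` stand-in for the binary are assumptions made so the sketch runs without a build.

```bash
#!/usr/bin/env bash

# Assumption: BINARY defaults to `echo` so the sketch only prints the commands;
# the real script points it at target/release/FragGeneScanRs.
BINARY="${BINARY:-echo}"

# One function per example; $1 is the directory that receives the output.
NC454()   { "$BINARY" -s example/NC_000913-454.fna -t 454_10 -w 0 -o "$1/NC_000913-454"; }
NC913()   { "$BINARY" -s example/NC_000913.fna -t complete -w 1 -o "$1/NC_000913"; }
contigs() { "$BINARY" -s example/contigs.fna -t complete -w 1 -o "$1/contigs"; }

examples=(NC454 NC913 contigs)

# run_examples OUTDIR: call every example function with the same output directory.
run_examples() {
    local outdir="$1" example
    for example in "${examples[@]}"; do
        echo "  Processing $example..."
        "$example" "$outdir"
    done
}

run_examples "${TMPDIR:-/tmp}"
```

Because each example is a plain function, the baseline, check, and benchmark paths could all loop over the same `examples` array.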

BINARY="$PROJECT_ROOT/target/release/FragGeneScanRs"

# Check for hyperfine
if ! command -v hyperfine &> /dev/null; then

Just use ! hyperfine -V, no need for command.
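
The two checks side by side, as a rough sketch (the helper names are made up for illustration). The first only looks the name up; the second actually runs the tool and so also catches an installation that is on the PATH but broken.

```bash
#!/usr/bin/env bash

# Variant used in the script: resolve the name without executing it.
has_via_command() { command -v "$1" > /dev/null 2>&1; }

# Variant from the review: run the tool with -V and test the exit status.
# Assumption: the tool accepts -V, as hyperfine does.
has_via_version() { "$1" -V > /dev/null 2>&1; }

for check in has_via_command has_via_version; do
    if "$check" hyperfine; then
        echo "$check: hyperfine found"
    else
        echo "$check: hyperfine missing"
    fi
done
```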

Comment on lines +83 to +84
echo " Warning: Baseline file $baseline_file not found"
continue

I'd rather be defensive and have this fail if there is no baseline to be found.
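
A minimal sketch of the defensive variant, assuming the check is factored into a helper (the name `require_baseline` is hypothetical) that reports the problem on stderr and returns a failing status:

```bash
#!/usr/bin/env bash

# require_baseline FILE: succeed only if the baseline file exists.
require_baseline() {
    local baseline_file="$1"
    if [[ ! -f "$baseline_file" ]]; then
        echo "  Error: Baseline file $baseline_file not found" >&2
        return 1
    fi
}

# Demo with a temporary directory standing in for the baseline directory.
demo_dir="$(mktemp -d)"
touch "$demo_dir/present.out"
require_baseline "$demo_dir/present.out" && echo "present.out: ok"
require_baseline "$demo_dir/missing.out" || echo "missing.out: the script would abort here"
rm -rf "$demo_dir"
```

In the validation loop the call would then read `require_baseline "$baseline_file" || exit 1` instead of printing a warning and continuing.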


bmesuere commented Feb 3, 2026

I'll stop putting any effort into these branches. I'm sure there are large performance gains possible, but they are too dependent on CPU architecture to reliably benchmark. Every optimization I made on Apple Silicon had the opposite effect on @ninewise's machine.

On x86, the biggest gains are in restructuring memory and operations to make use of additional vectorisation and AVX512 instructions, but these are not available on Apple Silicon. In addition, memory pressure can be reduced by reusing alpha and path instead of allocating memory each time. On Apple Silicon, however, this is slower because allocating zeroed memory is extremely fast there thanks to "zero pages".
