
Index loading #10

Closed
keiranmraine opened this issue Apr 27, 2022 · 14 comments

Comments

@keiranmraine

keiranmraine commented Apr 27, 2022

I'm finding it takes ~25 minutes to load the various components of the indexes; without -7 it takes only a couple of minutes.

The loading of the core reference files up to the following message runs at ~100% CPU:

* Reading reference genome..
* Binary seq file = /home/kr525/rds/hpc-work/data/ref/Homo_sapiens_assembly38.fasta.0123
* Reference genome size: 6434693834 bp
* Done reading reference genome !!

The following section runs at 5-15% CPU, indicating disk wait:

[M::memoryAllocLearned::LEARNED] Reading kmer SA index File to memory
[M::memoryAllocLearned::LEARNED] Reading ref2sa index File to memory

Is there anything obvious relating to the file reading that could account for this?

I expect this is unrelated, but I did notice that ref.suffixarray_uint64 can be compressed with gzip -1 for a ~50% reduction in size.
The decompression cost at level 1 is negligible compared to the disk latency (and will be more cost-effective for systems with IOPS accounting).
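
As a rough sketch of what I tried (gzip -k keeps the original file; the ~50% figure is simply what I observed on this particular file):

# one-off: compress the suffix array at the fastest gzip level
gzip -1 -k ref.suffixarray_uint64
# stream-decompression is cheap relative to the disk reads it saves
gzip -cd ref.suffixarray_uint64.gz | wc -c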

@quito418
Collaborator

quito418 commented Apr 27, 2022

This is a good point. It is true that loading all the indexes takes a long time when disk I/O is slow.

I assume you might be loading the indexes from an HDD, where the read speed seems to be around ~100MB/sec.

The problem is that the files are just too big, so they take time to load. With an SSD (~500MB/s), they can be loaded within ~4 minutes.

We have thought about solutions to this problem before; below are the ones we have so far.

1) Load only the pos_packed file (~29GB) and build possa_packed (78GB) and inverse suffix array (29GB) at startup.

  • We tested this option before, and it turns out the speed is similar to loading from an SSD (building at startup is fast).
  • It is an in-memory operation and does not require extra memory space.
  • With your current disk read speed, it might still take ~8-9 minutes (reading the 29GB file + building the other indexes).
  • It is not enabled in the current codebase. (Edit: enabled as the default index loading in the master branch and the v1.0.4 release.)

2) Rely on the OS cache (the OS caches files that are read or written, using free memory).

  • This scenario works when the user has a lot of memory and reruns BWA-MEME repeatedly (a pre-warming sketch is below).

3) Depending on the workload and disk type, use mode 2 or mode 1, which can be faster.

4) Store the index files somewhere with higher disk I/O.

If you want to try solution 1), I can provide it within a few (2-3) days.
(Edit: this option is enabled as the default index loading in the master branch since v1.0.4.)
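
For option 2), a rough sketch of pre-warming the page cache before the first run (the file glob is illustrative and assumes enough free RAM; vmtouch is an optional external tool, not part of BWA-MEME):

# read the index files once so the OS page cache holds them
cat Homo_sapiens_assembly38.fasta.* > /dev/null
# or, if vmtouch is installed, load them into memory explicitly
vmtouch -t Homo_sapiens_assembly38.fasta.*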

@keiranmraine
Author

I think (1) is the best option for cloud solutions, where you have two factors against you:

  1. IOPS throttling: heavy blocks of reads exhaust the quota.
  2. Instances are generally destroyed on completion of a job

I'm currently running this on a university cluster, so I was expecting better disk performance than 100MB/s. At Sanger, Lustre was exceptionally fast; I'll follow up with UoC to find out if there's something I'm missing here.

@quito418
Collaborator

quito418 commented Apr 28, 2022

I'd like to share the numbers I got with option 1.

Tested on a 150MB/s HDD, Intel(R) Xeon(R) Silver 4208 CPU @ 2.10GHz:

RMI models: 54 sec (~10GB)
pos_packed file: 191 sec (29GB)
Build index in-memory: 86 sec (32 threads)
Total: 5 min 30 sec

Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz:

Build index in-memory: 43 sec (48 threads)

If you want to test it right now, the code is available in the dev branch
(you don't need to rebuild the index, just recompile the code):

# recompile the code
make -j 32
# run alignment with runtime index building
bwa-meme mem -7 -t 32 <reference> <fasta> 

At startup, bwa-meme will read the .pos_packed file and build the other indexes from it.
The in-memory index building is an alternative to reading 77GB of indexes.

By the way, I found that BWA-MEME alignment speed seems to be moderately higher using mimalloc, even with a small number of threads, while bwa-mem2 improves only marginally.
I will share the numbers here soon.

@keiranmraine
Author

keiranmraine commented Apr 28, 2022

This looks good. Rerunning with some sample data (2x human genome, ~5.2GB of fastq input), I now get close to the run time of bwa-mem2 (550s vs 580s). Previously the best I could do was ~170s longer (best of 5; I suspect network congestion plays a part, as runs range from 700s to 2000s).

Note that, excluding the loading of the reference, the sum of the "Processed 86958 reads in 3.702 CPU sec" values indicates good performance, so the impact of mimalloc will be interesting to compare too:

  • bwa mem - 7013s (legacy)
  • bwa-mem2 - 4349s
  • bwa-meme - 2342s

(grep 'reads in' job.out | perl -ne 'm/(\d+\.\d{3})/;$s+=$1;END{print qq{$s\n};};')

The real sec component sum:

  • bwa mem - 245s (legacy)
  • bwa-mem2 - 170s
  • bwa-meme - 97s

(grep 'reads in' job.out | perl -ne 'm/,\s+(\d+\.\d{3})/;$s+=$1;END{print qq{$s\n};};')

@keiranmraine
Author

(Edited the previous comment to add legacy bwa counts and fix the regexps for longer times.)

@quito418
Collaborator

quito418 commented May 2, 2022

I ran some benchmarks regarding mimalloc and below are the results.

Env: Intel(R) Xeon(R) Silver 4208 CPU @ 2.10GHz, 16 physical cores (32 vCPUs)
Dataset: single-end ERR194147_1.fastq
Config: 32 threads

Findings:

  • Mimalloc does not require much extra memory.
  • Performance can be affected even with a small number of threads (32 in this case). On another machine with 48 physical cores, performance was not degraded much when using 48 threads.
  • It would be better to use mimalloc in general.
Results:

Alignment without disk I/O (sec)   Mimalloc   No mimalloc
BWA-MEME                           2518       3122
BWA-MEM2                           3533.25    3829.96

Seeding (sec)                      Mimalloc   No mimalloc
BWA-MEME                           949        1428
BWA-MEM2                           1970       2117

Memory (GB)                        Mimalloc   No mimalloc
BWA-MEME                           156.8      152.5
BWA-MEM2                           49.1       45.5

BSW (sec)                          Mimalloc   No mimalloc
BWA-MEME                           1143       1207
BWA-MEM2                           1161       1230

  • Max resident memory measured with /usr/bin/time -v.

@keiranmraine
Author

This looks like a great addition again. Do you have an idea when it will be possible to get this pushed out to conda?

I'm currently looking at pipelines incorporating the various flavours of bwa-mem[2,e]?, and I'm pretty sure you'll need an instance of mbuffer between bwa-meme and samtools fixmate | samtools sort, as the sort is already close to flat out with bwa-mem2 on 32 threads (a sketch is below). It will be worth noting this in the README: at the speeds being achieved it isn't simply a drop-in replacement (unless no pipes are used). If people want to use bwa-postalt.js, it's better to write to disk (lz4 compressed; gzip is too slow and results in blocking) and start a lower-resource job.
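
A minimal sketch of where the buffer would sit (the 4G buffer size is an arbitrary assumption to be tuned; reference and fastq arguments are placeholders):

bwa-meme mem -7 -K 100000000 -t 32 <reference> <r1.fastq.gz> <r2.fastq.gz> \
 | mbuffer -q -m 4G \
 | samtools fixmate -m --output-fmt bam,level=0 -@ 1 - - \
 | samtools sort -m 1G --output-fmt bam,level=1 -T ./sorttmp -@ 32 - > sorted.bam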

@quito418
Collaborator

quito418 commented May 4, 2022

We will update mimalloc and the README within a few days! I'll note it here as soon as it's updated.

That's an interesting point, thank you for sharing your tips.

@quito418
Collaborator

quito418 commented May 6, 2022

My PR to the Bioconda package was merged a few hours ago; BWA-MEME now uses mimalloc by default (also in the master branch).

@keiranmraine
Author

I'll run some tests and give findings on streaming data into samtools etc.

@keiranmraine
Author

v1.0.4, Intel Skylake, 32 threads, 200GB RAM allocated via the scheduler. 30X WGS from paired fastq; bwa processing includes the presence of *.fasta.alt.

Real world use case:

#!/bin/bash -ue
rm -f sorttmp*
set -o pipefail
bwa-meme mem -7 -K 100000000 -t 32 \
 -R '@RG\tID:1\tLB:NA24385_lib\tPL:ILLUMINA\tSM:NA24385\tPU:R1_L1' \
 Homo_sapiens_assembly38.fasta NA24385-30X_1.fastq.gz NA24385-30X_2.fastq.gz \
 | samtools fixmate -m --output-fmt bam,level=0 -@ 1 - - \
 | samtools reheader -c 'grep -v ^@SQ > tmp-head && cat Homo_sapiens_assembly38.dict tmp-head' - \
 | samtools sort -m 1G --output-fmt bam,level=1 -T ./sorttmp -@ 32 - > sorted.bam
ALG     CPU(s)        WALL(s)     REAL          %CPU      peak_rss   peak_vmem
bwa     305546.745    9966.037    3h 16m 15s    2655.9%   43.5 GB    48.3 GB
mem2    184099.304    6197.363    2h 6m 13s     2520.8%   60.3 GB    78.4 GB
meme    137476.289    5272.686    1h 53m 43s    2147.1%   164 GB     181.5 GB
  • CPU(s) and WALL(s) use the pattern matches from the comment above.
  • REAL is the wall time for the full pipeline above, as are the peak_* values.

As you can see, due to the larger fraction of the runtime now attributable to the sort process described above, it is worth considering splitting the process to minimise CPU wastage, introducing additional steps into the workflow:

Map with rapid compression:

#!/bin/bash -ue
set -o pipefail
bwa-meme mem -7 -K 100000000 -t 32 \
 -R '@RG\tID:1\tLB:NA24385_lib\tPL:ILLUMINA\tSM:NA24385\tPU:R1_L1' \
 Homo_sapiens_assembly38.fasta NA24385-30X_1.fastq.gz NA24385-30X_2.fastq.gz \
 | lz4 -c1 > mapped.sam.lz4

Correct and sort (4 CPUs, 8GB):

#!/bin/bash -ue
set -o pipefail
lz4 -cd mapped.sam.lz4 \
 | samtools fixmate -m --output-fmt bam,level=0 -@ 1 - - \
 | samtools reheader -c 'grep -v ^@SQ > tmp-head && cat Homo_sapiens_assembly38.dict tmp-head' - \
 | samtools sort -m 1G --output-fmt bam,level=1 -T ./sorttmp -@ 4 - > sorted.bam

This would be essential should you want to introduce bwa-postalt.js to the flow (it consumes bwa output at ~1/3 of the required speed); a sketch of that variant is below. If not, the initial example is most likely sufficient: the ~15 min of extra run time on a 32-CPU instance for 30x is mostly negated by the overhead of the additional I/O, starting a new job with lower resource requirements, and the additional 40-60 minutes of run time.
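
For completeness, a sketch of how the lower-resource job could feed bwa-postalt.js from the lz4 output (bwa-postalt.js runs under the k8 interpreter, as in bwakit; the reheader step is omitted for brevity):

#!/bin/bash -ue
set -o pipefail
lz4 -cd mapped.sam.lz4 \
 | k8 bwa-postalt.js Homo_sapiens_assembly38.fasta.alt \
 | samtools fixmate -m --output-fmt bam,level=0 -@ 1 - - \
 | samtools sort -m 1G --output-fmt bam,level=1 -T ./sorttmp -@ 4 - > sorted.bam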

@quito418
Collaborator

quito418 commented May 12, 2022

Thank you for your valuable tips and suggestions! We really appreciate it.

To summarize,

  1. A pipeline of (alignment, fixmate, reheader, sort) results in low CPU utilization (the bottleneck is the sorting operation).
  2. An alternative is to run alignment alone (with the CPU fully utilized) and hand the compressed alignment output to a low-resource executor (i.e. a container with 4 CPUs, 8GB). But this results in a longer overall time, because of disk I/O + container startup time.

I think the first option, pipelining the whole thing, is the best option in general. I will note this result in the README within a few days.

  • Improving sorting throughput looks necessary for the bwa-postalt.js pipeline. I will look into it and may come up with a solution in the future.

@keiranmraine
Author

FYI, bwa-meme has a peak CPU efficiency of ~2500% when using 32 cores piped as bwa-meme ... | lz4 -c > mapped.sam.lz4.

bwa-postalt.js consumes the raw sam output from bwa.

@quito418
Collaborator

quito418 commented Oct 4, 2022

Edit: samtools sort has higher throughput in general; using mbuffer (with the same amount of buffer as is given to samtools sort) would resolve the write-hang issue described below.

Hi, I recently looked into the samtools sort code and found something that might improve the bottleneck in the sorting step.

I found there is a "write hang" problem when using a large memory buffer for samtools sort pipelined with BWA-MEME.

Samtools sort works in two phases: phase 1 runs concurrently with BWA-MEME, and phase 2 executes after BWA-MEME alignment has finished.

Phase 1. Prepare small temporary bins. For a given memory buffer (set by the -m and -@ arguments), samtools sort reads input and fills the buffer. When the buffer is full, it is divided among the threads; each thread sorts its share and writes a temporary file (the last processed bins are kept in memory without being written to files).

Phase 2. Open all temporary bins (files from disk plus the in-memory bins) and merge-sort them using a heap (repeatedly write output, read a temp file, heap-sort) in a single thread (compression is multi-threaded).

The "write hang" happens in BWA-MEME while samtools sort is writing temporary files (phase 1). During its writing stage, samtools sort does not read its input, which causes the BWA-MEME write function to hang. In particular, when a large memory buffer is used in samtools sort (e.g. 20GB = -m 1G x -@ 20), BWA-MEME periodically waits for the samtools sort process to flush its entire memory buffer to temporary files (compressing and writing 20GB to disk).

A simple solution is to use a small memory buffer (-m 40m -@ 10 on my machine), but this can affect the time in phase 2, where a large memory buffer can reduce the disk I/O. However, it can still improve the overall CPU usage in certain cases.

Below are my experiment results (32-thread BWA-MEME, 20x paired-read alignment, machine with an HDD at ~160MB/sec disk I/O).

  1. Using a small memory buffer mitigates the "write hang" problem; this hides samtools sort phase 1 time inside the BWA-MEME pipeline.
  2. A large memory buffer can benefit samtools sort phase 2.
                           Disk I/O time in MEME and      Samtools sort
                           sort phase 1 (sec)             phase 2 (sec)
meme only                  151.78
meme + Sort -m 2G -@ 10    1195.08                        3047
meme + Sort -m 40m -@ 10   308.26                         2954
meme + Sort -m 2G -@ 20    1035.83                        2276
meme + Sort -m 40m -@ 20   402.19                         3091

  • Command used: bwa-meme mem -t 32 -7 ... | samtools sort -m 2G -@ 20 -l1 -o ~/tmp/out.bam -
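
Following the edit note above, a rough sketch of the mbuffer variant (mbuffer sized to roughly match what sort would otherwise buffer; the 20G figure is an assumption to tune per machine):

bwa-meme mem -t 32 -7 ... \
 | mbuffer -q -m 20G \
 | samtools sort -m 1G -@ 20 -l1 -o ~/tmp/out.bam -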
