Index loading #10
This is a good point. It does take a long time to load all the indexes when disk I/O is limited. I assume you are loading the indexes from an HDD, where the read speed is around ~100 MB/s. The problem is that the file is simply too big, so it takes time to load. With an SSD (~500 MB/s) it can be loaded within ~4 minutes. We thought about some solutions to this problem before, and these are the options we have so far:

1) Load only the pos_packed file (~29 GB) and build possa_packed (78 GB) and the inverse suffix array (29 GB) at startup.
2) Rely on the OS page cache (the OS caches files that are read or written, using free memory).
3) Depending on the workload and disk type, use mode 2 or mode 1, whichever is faster.
4) Store the index files somewhere with higher disk I/O.
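Option 2) can also be helped along manually: a sequential read of each index file pulls it into the OS page cache before the aligner starts, so the subsequent load is served mostly from memory. A minimal sketch, assuming the index-file suffixes mentioned in this thread and an illustrative reference prefix:

```shell
#!/bin/bash -ue
# Sketch: pre-warm the OS page cache for the BWA-MEME index files.
# REF and the suffix list are illustrative -- adjust to your own index set.
REF=Homo_sapiens_assembly38.fasta

for suffix in .pos_packed .possa_packed .suffixarray_uint64; do
  f="$REF$suffix"
  if [ -f "$f" ]; then
    # A sequential read fills the page cache (as far as free RAM allows).
    cat "$f" > /dev/null
  fi
done
```

This only helps on machines with enough free RAM to hold the files; otherwise the cache is evicted before the aligner gets to them.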
---
I think (1) is the best option to allow cloud solutions to work, where you have two factors against you:
I'm currently running this on a university cluster, so I was expecting better disk performance than 100 MB/s. At Sanger, Lustre was exceptionally fast; I'll follow up with UoC to find out if there's something I'm missing here.
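One quick way to check what a cluster filesystem actually delivers is a raw sequential-read test with `dd`. A sketch (file name and sizes are illustrative):

```shell
#!/bin/bash -ue
# Sketch: rough sequential I/O check for the filesystem holding the index.
# Write a test file, then read it back; dd reports throughput on stderr.
# Caveat: the read-back may be served from the page cache -- for a realistic
# number use a file larger than RAM, or read an existing cold file instead.
dd if=/dev/zero of=dd_testfile bs=1M count=1024 conv=fsync
dd if=dd_testfile of=/dev/null bs=1M
rm -f dd_testfile
```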
I'd like to share the numbers I got with option 1). Tested on a 150 MB/s HDD:

Intel(R) Xeon(R) Silver 4208 CPU @ 2.10GHz
---------------------------------------
RMI models: 54 sec (~10 GB)
pos_packed file: 191 sec (29 GB)
Build index in-memory: 86 sec (32 threads)
Total: 5 min 30 sec

Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz
---------------------------------------
Build index in-memory: 43 sec (48 threads)

If you want to test it right now, the code is available in

```shell
# recompile the code
make -j 32
# run alignment with runtime-build
bwa-meme mem -7 -t 32 <reference> <fasta>
```

At startup, bwa-meme will read the .pos_packed file and build the other indexes from it. By the way, I found that
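As a sanity check, the 191 s load of the 29 GB pos_packed file lines up with simple bandwidth arithmetic (time ≈ size / throughput):

```shell
# Back-of-envelope: sequential load time = file size / disk throughput.
size_mb=$((29 * 1024))   # pos_packed, ~29 GB expressed in MB
bw_mb_s=150              # HDD throughput quoted above, in MB/s
echo "expected load time: $((size_mb / bw_mb_s)) sec"
# prints "expected load time: 197 sec" -- close to the 191 sec measured
```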
This looks good. Rerunning this with some sample data (2x human genome, ~5.2 GB of fastq input) I now get close to the run time of bwa-mem2 (550 s vs 580 s). Previously the best I could do was ~170 s longer (best of 5; I suspect network congestion plays a part, as runs range from 700 s to 2000 s). I will note that, excluding the loading of the reference, the sum of the

---
(Edited previous comment to add legacy bwa counts and fix regexps for longer times.)

---
I ran some benchmarks regarding mimalloc; below are the results. Env: Intel(R) Xeon(R) Silver 4208 CPU @ 2.10GHz, 16 physical cores (32 vCPU). Findings:
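For anyone who wants to A/B the allocator themselves before a release lands, mimalloc documents an `LD_PRELOAD` mechanism that works without recompiling. A sketch (the library path, reference, and read files are assumptions; adjust to your install):

```shell
#!/bin/bash -ue
# Sketch: run bwa-meme under mimalloc via LD_PRELOAD, without recompiling.
# The .so path is an assumption -- point it at your mimalloc install.
MIMALLOC=/usr/lib/libmimalloc.so

if [ -f "$MIMALLOC" ]; then
  LD_PRELOAD="$MIMALLOC" bwa-meme mem -7 -t 32 ref.fasta reads.fastq > out.sam
else
  echo "mimalloc not found at $MIMALLOC; running with the default allocator" >&2
  bwa-meme mem -7 -t 32 ref.fasta reads.fastq > out.sam
fi
```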
---
This looks like a great addition again. Do you have an idea when it will be possible to get this pushed out to conda? I'm currently looking at pipelines incorporating various flavours of

---
We will update mimalloc and the README within a few days! I'll note it here as soon as it's updated. That's an interesting point, thank you for sharing your tips.

---
My PR to the Bioconda package was merged a few hours ago; BWA-MEME now uses mimalloc by default (on the master branch as well).

---
I'll run some tests and report findings on streaming data into samtools etc.

---
Version 1.0.4, Intel Skylake, 32 threads, 200 GB RAM allocated via the scheduler. 30X WGS from paired fastq; bwa processing includes the presence of

Real-world use case:

```shell
#!/bin/bash -ue
rm -f sorttmp*
set -o pipefail
bwa-meme mem -7 -K 100000000 -t 32 \
  -R '@RG\tID:1\tLB:NA24385_lib\tPL:ILLUMINA\tSM:NA24385\tPU:R1_L1' \
  Homo_sapiens_assembly38.fasta NA24385-30X_1.fastq.gz NA24385-30X_2.fastq.gz \
  | samtools fixmate -m --output-fmt bam,level=0 -@ 1 - - \
  | samtools reheader -c 'grep -v ^@SQ > tmp-head && cat Homo_sapiens_assembly38.dict tmp-head' - \
  | samtools sort -m 1G --output-fmt bam,level=1 -T ./sorttmp -@ 32 - > sorted.bam
```
As you can see, given the larger fraction of runtime specific to the sort process described above, it is worth considering changing the process to minimise CPU wastage by introducing additional steps in the workflow.

Map with rapid compression:

```shell
#!/bin/bash -ue
set -o pipefail
bwa-meme mem -7 -K 100000000 -t 32 \
  -R '@RG\tID:1\tLB:NA24385_lib\tPL:ILLUMINA\tSM:NA24385\tPU:R1_L1' \
  Homo_sapiens_assembly38.fasta NA24385-30X_1.fastq.gz NA24385-30X_2.fastq.gz \
  | lz4 -c1 > mapped.sam.lz4
```

Correct and sort (4 CPU, 8 GB):

```shell
#!/bin/bash -ue
set -o pipefail
lz4 -cd mapped.sam.lz4 \
  | samtools fixmate -m --output-fmt bam,level=0 -@ 1 - - \
  | samtools reheader -c 'grep -v ^@SQ > tmp-head && cat Homo_sapiens_assembly38.dict tmp-head' - \
  | samtools sort -m 1G --output-fmt bam,level=1 -T ./sorttmp -@ 4 - > sorted.bam
```

This would be essential should you want to introduce

---
Thank you for your valuable tips and suggestions! We really appreciate it. To summarize,
I think the first option
---
FYI, bwa-meme has a peak efficiency of ~2500 when using 32 cores piped into
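Assuming the ~2500 figure above refers to percent-CPU (i.e. roughly 25 of 32 cores kept busy), one way to measure it for a single pipeline stage is GNU time's `%P` format specifier. A sketch (the aligner invocation is illustrative):

```shell
# Sketch: report wall time and %CPU for one pipeline stage with GNU time.
# %P is CPU usage relative to a single core, so 2500% means ~25 cores busy.
/usr/bin/time -f 'elapsed=%es cpu=%P' \
  bash -c 'bwa-meme mem -7 -t 32 ref.fasta reads.fastq > /dev/null'
```

Note this requires GNU time (`/usr/bin/time`), not the bash built-in `time`, which does not support `-f`.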
---
Edit: samtools sort has higher throughput in general; using mbuffer (with the same amount of buffer samtools sort would use) would resolve the

Hi, I recently looked into the samtools sort code and found something that might improve the bottleneck in the sorting step. There is a "write hang" problem when a large memory buffer is used for samtools sort pipelined with BWA-MEME.

samtools sort works in two phases: phase 1 runs concurrently with BWA-MEME, and phase 2 runs after the BWA-MEME alignment has finished.

Phase 1. Prepare small temporary bins. For the given memory buffer (set by the -m and -@ arguments), samtools sort reads input and fills the buffer. When the buffer is full, it is divided by the number of threads, each part is sorted, and each thread writes a temporary file (the last processed bins are kept in memory without being written to files).

Phase 2. Open all temporary bins (the files on disk plus the in-memory bins) and merge-sort them using a heap (repeatedly write output, read a temp file, heap-sort) in a single thread (compression is multi-threaded).

The "write hang" happens in BWA-MEME while samtools sort is writing temporary files (phase 1). During this writing stage samtools sort does not read its input, which makes the BWA-MEME write call hang. In particular, when a large memory buffer is used in samtools sort (e.g. 20 GB = -m 1G x -@ 20), BWA-MEME periodically waits for the samtools sort process to flush its whole memory buffer to temporary files (compressing and writing 20 GB to disk).

A simple solution is to use a small memory buffer (-m 40m -@ 10 on my machine), but this can affect the time spent in phase 2, where a large buffer reduces disk I/O. It can still improve overall CPU usage in certain cases, however. Below are my experiment results (32-thread BWA-MEME, 20x paired-read alignment, machine with an HDD at ~160 MB/s disk I/O).
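The mbuffer idea mentioned in the edit above could look something like this: an elastic buffer between the two processes absorbs the aligner's output while samtools sort is busy flushing, so neither side blocks the other. A sketch (buffer sizes, reference, and read files are illustrative):

```shell
#!/bin/bash -ue
set -o pipefail
# Sketch: decouple BWA-MEME from samtools sort's flush stalls with mbuffer.
# -m 20G mirrors the large sort buffer discussed above (-m 1G x -@ 20);
# samtools sort itself then runs with a small buffer so its flushes stay short.
bwa-meme mem -7 -K 100000000 -t 32 ref.fasta r1.fastq.gz r2.fastq.gz \
  | mbuffer -q -m 20G \
  | samtools sort -m 40m -@ 10 --output-fmt bam,level=1 -T ./sorttmp - > sorted.bam
```

The trade-off noted above still applies: the small sort buffer means more temporary bins and more disk I/O in phase 2.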
---
I'm finding it takes ~25 minutes to load the various components of the indexes; without `-7` this is only a couple of minutes. The loading of the core reference files up to the following message runs at ~100% CPU:

The section that follows runs at 5-15% CPU, indicating disk-wait:

Is there anything obvious relating to the file reading that could account for this?
I expect it's unrelated, but I did notice that `ref.suffixarray_uint64` can be compressed with `gzip -1` for a 50% reduction in size. Decompression cost at level 1 is negligible compared to the disk latency (and will be more cost-effective for systems with IOPS accounting).
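The compression observation can be checked directly. A sketch (the file name is the one mentioned above; `-k` keeps the original for the size comparison):

```shell
#!/bin/bash -ue
# Sketch: measure gzip -1 compression of the suffix-array file and verify
# that on-the-fly decompression reproduces it exactly.
f=ref.suffixarray_uint64          # path from this thread; adjust as needed
gzip -1 -k "$f"                   # fast compression, keep the original
ls -l "$f" "$f.gz"                # compare sizes (~50% reduction reported above)
gzip -dc "$f.gz" | cmp - "$f"     # byte-for-byte roundtrip check
```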