I took a large pod5 file produced by MinKNOW (after skipping catch-up basecalling from a P2 solo run). I tried basecalling using dorado 0.8.2 and 0.8.1 on both a Linux and a Windows system. The GPUs are a bit different, so this may also complicate figuring out the root cause.
Steps to reproduce the issue:
Prepare libraries with SQK-RBK114-24
Acquire data in MinKNOW with basecalling. Skip any basecalling that didn't finish.
Grab any of the pod5 files labeled skip. They're about 64 GB each. Here I'll grab _5.
Run dorado on it, e.g.: dorado.exe basecaller -v --emit-fastq -b 96 --kit-name SQK-RBK114-24 --output-dir basecalled/ sup pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5
On Linux, dorado uses all the available VRAM (on a 3090 Ti, that was 24 GB) and stays there. On Windows, memory use keeps growing without stopping; it never once decreased.
System memory keeps increasing until 64 GB of system memory is used up in addition to the 16 GB of VRAM on the RTX 5000 Ada (80 GB total, which is the maximum Windows allows as "total GPU memory" on my system). When the system finally runs out of memory, dorado prints an error about CUDA gemm functions failing to allocate, says it is trying to clear the CUDA cache and retry, but instead dies or locks up horribly.
The behavior persists with --no-trim
The behavior persists in the Linux x64 version run in WSL.
The behavior persists with BAM output (without --emit-fastq).
The behavior DOES NOT EXIST with 'hac'. Only 'sup' is affected.
Run environment:
Dorado version: 0.8.1 and 0.8.2. (0.8.0 crashes silently after producing ~70 MB of output, no matter which batch-size parameters are chosen. It does not run out of RAM; it just crashes with no error, even with -vv.)
Dorado command: dorado.exe basecaller -v --emit-fastq -b 32 --kit-name SQK-RBK114-24 --output-dir basecalled/ sup pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5 (also run with -b 96, as shown in the log below)
Operating system: Windows 11 23H2
Hardware (CPUs, Memory, GPUs):
CPU
13th Gen Intel(R) Core(TM) i9-13950HX
Base speed: 2.20 GHz
Sockets: 1
Cores: 24
Logical processors: 32
Virtualization: Enabled
L1 cache: 2.1 MB
L2 cache: 32.0 MB
L3 cache: 36.0 MB
Utilization 11%
Speed 1.73 GHz
Up time 0:00:37:40
Processes 309
Threads 5179
Handles 133275
Memory
128 GB
Speed: 3600 MT/s
Slots used: 4 of 4
Form factor: SODIMM
Hardware reserved: 344 MB
Available 101 GB
Cached 19.9 GB
Committed 43/136 GB
Paged pool 822 MB
Non-paged pool 1.3 GB
In use (Compressed) 26.0 GB (0.2 MB)
GPU 1
NVIDIA RTX 5000 Ada Generation Laptop GPU
Driver version: 32.0.15.6094
Driver date: 8/14/2024
DirectX version: 12 (FL 12.1)
Physical location: PCI bus 1, device 0, function 0
Utilization 100%
Dedicated GPU memory 15.5/16.0 GB
Shared GPU memory 38.2/63.8 GB
GPU Memory 53.7/79.8 GB
(Before the run, GPU memory is basically unused: just 0.5/16.0 GB dedicated and 0.2/63.8 GB shared.)
Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance):
pod5 from MinKnow (bla_bla_skipped_5.pod5)
Source data location (on device or networked drive - NFS, etc.):
Local SSD
Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB):
enzyme 8.2.1, kit 14 (latest chemistry and kit and pore), read lengths N50 ~6mb, unknown total read number; total pod5 size: 64GB. (There are multiple 64GB pod5 files, but I'm trying one at a time).
Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):
Cannot reproduce on a small pod5. In fact, that seems to be the problem: it runs fine right up until it runs out of RAM. I cannot split the pod5 due to apparent bugs in the pod5 tooling that make it impractical to split a large pod5 into smaller chunks [edit: see below; I try this anyway and it works, but a folder of the split files does not]. This is the original data from MinKNOW, not something finagled by me or converted from other formats, so there is no option to "regenerate" the pod5 files using different splitting criteria. (I had instructed MinKNOW to split by number of reads, but that apparently doesn't apply to the _skip files, only the basecalled ones.)
Logs
Please provide output trace of dorado (run dorado with -v, or -vv on a small subset)
Log with -v provided from 0.8.2 (0.8.1 produces a very similar log).
dorado basecaller -v --emit-fastq -b 96 --kit-name SQK-RBK114-24 --output-dir basecalled/ sup pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5
[2024-10-26 14:40:29.669] [info] Running: "basecaller" "-v" "--emit-fastq" "-b" "96" "--kit-name" "SQK-RBK114-24" "--output-dir" "basecalled/" "sup" "pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5"
[2024-10-26 14:40:30.061] [info] - Note: FASTQ output is not recommended as not all data can be preserved.
[2024-10-26 14:40:30.206] [info] - downloading [email protected] with httplib
[2024-10-26 14:40:43.499] [info] > Creating basecall pipeline
[2024-10-26 14:40:43.513] [debug] CRFModelConfig { qscale:1.050000 qbias:1.300000 stride:6 bias:1 clamp:0 out_features:4096 state_len:5 outsize:4096 blank_score:0.000000 scale:1.000000 num_features:1 sample_rate:5000 sample_type:DNA mean_qscore_start_pos:60 SignalNormalisationParams { strategy:pa StandardisationScalingParams { standardise:1 mean:93.692398 stdev:23.506744}} BasecallerParams { chunk_size:12288 overlap:600 batch_size:96} convs: { 0: ConvParams { insize:1 size:64 winlen:5 stride:1 activation:swish} 1: ConvParams { insize:64 size:64 winlen:5 stride:1 activation:swish} 2: ConvParams { insize:64 size:128 winlen:9 stride:3 activation:swish} 3: ConvParams { insize:128 size:128 winlen:9 stride:2 activation:swish} 4: ConvParams { insize:128 size:512 winlen:5 stride:2 activation:swish}} model_type: tx { crf_encoder: CRFEncoderParams { insize:512 n_base:4 state_len:5 scale:5.000000 blank_score:2.000000 expand_blanks:1 permute:1} transformer: TxEncoderParams { d_model:512 nhead:8 depth:18 dim_feedforward:2048 deepnorm_alpha:2.449490}}}
[2024-10-26 14:40:44.257] [debug] TxEncoderStack: use_koi_tiled false.
[2024-10-26 14:40:46.339] [debug] cuda:0 memory available: 15.55GB
[2024-10-26 14:40:46.339] [debug] cuda:0 memory limit 14.55GB
[2024-10-26 14:40:46.339] [debug] cuda:0 maximum safe estimated batch size at chunk size 12288 is 160
[2024-10-26 14:40:46.339] [debug] cuda:0 maximum safe estimated batch size at chunk size 6144 is 352
[2024-10-26 14:40:46.339] [info] cuda:0 using chunk size 12288, batch size 96
[2024-10-26 14:40:46.339] [debug] cuda:0 Model memory 6.85GB
[2024-10-26 14:40:46.339] [debug] cuda:0 Decode memory 0.83GB
[2024-10-26 14:40:48.518] [info] cuda:0 using chunk size 6144, batch size 96
[2024-10-26 14:40:48.518] [debug] cuda:0 Model memory 3.43GB
[2024-10-26 14:40:48.518] [debug] cuda:0 Decode memory 0.42GB
[2024-10-26 14:40:48.897] [debug] BasecallerNode chunk size 12288
[2024-10-26 14:40:48.897] [debug] BasecallerNode chunk size 6144
[2024-10-26 14:40:48.943] [debug] Load reads from file pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5
[2024-10-26 14:40:49.664] [debug] > Kits to evaluate: 1
[2024-10-26 14:41:38.731] [debug] Invalid trim interval for read id 9b72d1ed-67b1-4e59-a6a1-7bf8d0fb9762: 117-117. Trimming will be skipped.
[2024-10-26 14:43:21.296] [debug] Invalid trim interval for read id 40d2e852-4b32-46e5-8de0-238af8076f28: 118-113. Trimming will be skipped.
[2024-10-26 14:44:42.855] [debug] Invalid trim interval for read id c642511c-7b9f-477c-82ba-7df1b07bc42c: 115-112. Trimming will be skipped.
...
At the very end, as memory is completely exhausted (80 GB of VRAM + shared VRAM used), it prints something like "CUDA kernel couldn't allocate for gemm_something...", then the display completely locks up (hard freeze).
Is there a way to split a 64 GB pod5 file produced by MinKNOW? (pod5 subset is a bit broken.) I can probably work around this glitch if I can just split this thing into a few hundred parts (each just small enough to fit under the 80 GB VRAM limit without crashing). My best run so far produced a 350 MB fastq output (with -b 32), but dorado won't accept smaller values of -b.
[Update]
Currently trying pod5 subset anyway.
Here's how.
Basecall using hac. I know it's wasteful, but I need some way to split. The test basecalled file is "test.fq".
printf "read_id\tbarcode\n" > map.tsv; sed -n '1~4p' test.fq | grep -F 'barcode' | sed 's/\t.*_barcode/\tbarcode/' | sed 's/\tDS.*//' | cut -c2- >> map.tsv <-- this is a tsv mapping file pod5 subset expects.
pod5 subset ../pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5 --columns barcode --table map.tsv --threads 1 --missing-ok <-- this spits out a pod5 file per barcode.
This is slow but at least performs some subsetting. If any barcode is still too big, I'll try to add a new column trivially ( awk -F'\t' 'BEGIN{OFS=FS} {print $0, int(NR/1000)}' ) to split it into 1000-record chunks and call subset on that column.
All-in-one to make a tsv where you can split on barcode, raw batch, or batch-within-barcode ("barcodeBatch") -- note this relies on gawk's three-argument match(): awk -F'\t' 'BEGIN{print "read_id\tbarcode\tchunk\tbarcodeBatch"; OFS=FS} NR%4==1 && /barcode/ {sub(/^@/,"",$1); match($3,/barcode[0-9]+/,m); print $1,m[0],int(++c/1000),m[0] "-" int(b[m[0]]/1000); ++b[m[0]]}' test.fq > map.tsv
But if it works, this is a workaround. The bug is that 'sup' doesn't know how to evict context or old stuff from its memory. This is a modern Ada generation GPU with 16GB of VRAM. It's one of the most common modern GPUs in mobile workstations for AI/ML. It is also among the most performant and efficient. It would make sense to support it.
[Update 2]
Another update:
Splitting them (into bunches of 10,000 reads per barcode) and then calling dorado sup on each individually works like a charm. Notably, when run this way, the RAM does not continue to rise; it fluctuates nicely around a constant value, as expected, for the entire duration of the run. RAM use at the end of the run (~10 minutes) is the same as at the beginning (+10 seconds).
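For completeness, the per-file loop was nothing fancier than something like this (directory layout and names are placeholders, not my exact paths):
# basecall each split pod5 on its own, so no single run ever sees a huge total read count
for f in split/*.pod5; do
    out="basecalled/$(basename "$f" .pod5)"   # one output directory per input file (placeholder layout)
    mkdir -p "$out"
    dorado basecaller -v --kit-name SQK-RBK114-24 --output-dir "$out" sup "$f"
done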
Also notably, when dorado is pointed to a directory containing all of the pod5 files, the aberrant behavior immediately recurs. Dorado adds the contents of subsequent files in alphabetical order, but even before the second file is added, the RAM has already climbed shockingly high (much more so than when run on the first file alone).
Most striking of all, the behavior is noticeable within the first minute of runtime -- the RAM starts to climb at a constant rate. Even starting from the same file, the dorado run on the "whole directory" of split files immediately misbehaves, showing the unmistakable RAM-sinkhole behavior within 1-2 minutes, whereas no such behavior occurs even 5 minutes into the same file when run on its own. Two completely different behaviors are seen with the same settings, parameters, and input reads, depending only on whether the total number of reads is large, even when the reads processed "so far" in a run are exactly the same.
This suggests there is simply a problem counting items. When the input is large, something immediately overflows (RAM climbs without limit). When the input is small, there is no overflow, and RAM remains constant throughout the entire run.
The sinkhole occurs even when dorado has seen exactly the same reads in exactly the same order as in the single subset file, implying that a parameter governing future behavior (total reads, total allowable padding, something dependent on the file size) is what's messing things up. Perhaps there is a "context window" used by the transformer that is pre-initialized to the entire sequence space and needs to be reined in, or an int32 is used where an int64 is needed in the CUDA defaults on Windows, resulting in arithmetic underflow. Etc.
Hopefully these observations will help you fix the bug.
GabeAl changed the title from "Dorado >= 0.8.1 exhausts all VRAM, then all system memory, even with tiny -b, on Windows (large pod5+barcodes)" to "Dorado 'sup' RAM overflow on Windows, only with large pod5 files (weird bug)" on Oct 26, 2024.
Hi @GabeAl,
Thanks for the very detailed and clear description of the issue. It's very much appreciated.
We'll investigate and get back to you if we have more questions and we'll keep an eye out for more updates.
Regarding subsetting in pod5: you can generate the subsetting mapping using pod5 view reads/ --include "read_id, channel" and subset into ~2k files, or extend this further by also including mux, which will generate ~4x2k files.
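Roughly along these lines (the output filenames are placeholders, and the --output flags are from memory -- check pod5 view --help and pod5 subset --help for the exact option names):
# build a read_id -> channel map from the pod5 directory, then split into roughly one file per channel
# (--output flags assumed; channel_map.tsv and by_channel/ are placeholder names)
pod5 view reads/ --include "read_id, channel" --output channel_map.tsv
pod5 subset pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5 --columns channel --table channel_map.tsv --output by_channel/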
Another unrelated CUDA adventure with block strides, where batches that don't divide the stride evenly produce a bunch of detritus, led me to a new hypothesis: what if dorado is similarly hypersensitive to the block size, with no cleanup of the padding?
And lo and behold, across every system tested (2 Windows systems, 2 Linux systems) and 3 different GPUs, one trend holds:
At block size 64, all runs are perfect (if the GPU has at least ~8GB of VRAM)
At block size 256, all runs are perfect (if the GPU has at least ~20GB of VRAM)
All other block sizes I tried (including the sizes autoselected by dorado) produce RAM-chewing effects. The Linux systems handle this more gracefully but still show the RAM-leak behavior when files are large enough (or when run on a combined folder, as before). With -b 64 and -b 256, the faulty behavior disappears.
I think a workaround for now would be to have dorado limit block sizes to a tested few.
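Concretely, on the 16 GB RTX 5000 Ada the well-behaved run is just my earlier command with the batch size pinned to one of the tested values:
# same command as before, with -b pinned to 64 (-b 256 needs ~20 GB+ of VRAM)
dorado.exe basecaller -v --emit-fastq -b 64 --kit-name SQK-RBK114-24 --output-dir basecalled/ sup pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5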