Description
Hello,
I am currently refining a basecalling model and am running into memory-management issues during the basecalling step that generates the CTC training data.
Specifically, I am passing a folder of pod5 files totalling ~24 GB to `bonito basecaller` as follows:

```
bonito basecaller [email protected] --save-ctc --min-accuracy-save-ctc 0.9 -v --alignment-threads 10 --device 'cuda' --reference ~/Documents/genomes/T7_V01146.1.fasta ./T7/pod5s/ > ./T7/bonito_mapped_hac_ctc/basecalls_ctc.bam
```
However, after the initial basecalling completes, the process is killed because it exhausts the available RAM on my machine:
```
> reading pod5
> outputting aligned bam
> loading model [email protected]
> model basecaller params: {'batchsize': 512, 'chunksize': 10000, 'overlap': 500, 'quantize': None}
> loading reference
> calling: 1290710 reads [59:34, 361.04 reads/s]Killed
```
For now, I am subsetting the input data, but this is obviously not ideal, as it discards potentially useful signal that could otherwise feed into the training step. As far as I can tell, `bonito train` only accepts a single `--directory`, so splitting the basecalling up by pod5 file (or by batches of files) and training on the pieces would not work either. Is there an alternate approach?
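In case it clarifies what I am after, here is the workaround I was considering: basecall each subset separately with `--save-ctc`, then merge the resulting training arrays into one directory for `bonito train`. This is only a sketch; it assumes each run writes `chunks.npy`, `references.npy`, and `reference_lengths.npy` (I have not verified these names against the current bonito code), and the `subset_*` directory layout is hypothetical.

```python
import numpy as np
from pathlib import Path

# Hypothetical layout: one --save-ctc output directory per pod5 subset.
subset_dirs = sorted(Path("./T7/ctc_subsets").glob("subset_*"))
out_dir = Path("./T7/ctc_merged")
out_dir.mkdir(parents=True, exist_ok=True)

chunks, refs, ref_lens = [], [], []
for d in subset_dirs:
    chunks.append(np.load(d / "chunks.npy"))
    refs.append(np.load(d / "references.npy"))
    ref_lens.append(np.load(d / "reference_lengths.npy"))

# chunks.npy rows should all share one width as long as every run used
# the same chunksize; references.npy, however, is presumably zero-padded
# to the longest reference seen in each run, so pad to a common width first.
max_len = max(r.shape[1] for r in refs)
refs = [np.pad(r, ((0, 0), (0, max_len - r.shape[1]))) for r in refs]

np.save(out_dir / "chunks.npy", np.concatenate(chunks))
np.save(out_dir / "references.npy", np.concatenate(refs))
np.save(out_dir / "reference_lengths.npy", np.concatenate(ref_lens))
```

If that merge is sound, the combined directory could then be passed as the single `--directory` argument, but I would appreciate confirmation that the files can safely be concatenated this way.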
Thanks in advance for your input.
All the best,
Falko Noé