
bonito basecall model refinement preprocessing memory issues #361

Closed
@CodingKaiser

Description


Hello,

I am currently refining a basecalling model and am running into memory-management issues during the pre-processing basecalling step.

Specifically, I have a folder of pod5 files totalling ~24 GB, which I am passing to bonito basecaller as follows:
bonito basecaller [email protected] --save-ctc --min-accuracy-save-ctc 0.9 -v --alignment-threads 10 --device 'cuda' --reference ~/Documents/genomes/T7_V01146.1.fasta ./T7/pod5s/ > ./T7/bonito_mapped_hac_ctc/basecalls_ctc.bam

However, after the initial basecalling, the process is killed because it exhausts the available RAM on my machine.

> reading pod5
> outputting aligned bam
> loading model [email protected]
> model basecaller params: {'batchsize': 512, 'chunksize': 10000, 'overlap': 500, 'quantize': None}
> loading reference
> calling: 1290710 reads [59:34, 361.04 reads/s]Killed
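
Memory use seems to grow with the number of reads processed in a single run, so one workaround I am considering is basecalling in batches. Here is a minimal sketch of what I have in mind (the batch size is a placeholder, and I am assuming the --save-ctc .npy arrays are written to the working directory, which is why each batch runs from its own directory):

```python
import subprocess
from pathlib import Path

pod5_dir = Path("T7/pod5s").resolve()
out_root = Path("T7/bonito_mapped_hac_ctc")
reference = Path("~/Documents/genomes/T7_V01146.1.fasta").expanduser()
batch_size = 50  # placeholder: number of pod5 files per basecalling run

files = sorted(pod5_dir.glob("*.pod5"))
for b, start in enumerate(range(0, len(files), batch_size)):
    batch_dir = out_root / f"batch_{b}"
    pod5_batch = batch_dir / "pod5s"
    pod5_batch.mkdir(parents=True, exist_ok=True)
    # symlink instead of copying so the ~24 GB of signal is not duplicated
    for f in files[start:start + batch_size]:
        link = pod5_batch / f.name
        if not link.exists():
            link.symlink_to(f)
    with open(batch_dir / "basecalls_ctc.bam", "wb") as bam:
        # run from batch_dir so the --save-ctc output (assumed to land in
        # the working directory) is kept separate per batch
        subprocess.run(
            ["bonito", "basecaller", "[email protected]",
             "--save-ctc", "--min-accuracy-save-ctc", "0.9",
             "--device", "cuda", "--reference", str(reference),
             str(pod5_batch.resolve())],
            stdout=bam, cwd=batch_dir, check=True)
```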

For now, I am instead attempting to subset the initial data, but that is obviously not ideal, as it discards signal that could be useful in the training step. And since bonito train appears to accept only a single --directory, batching as sketched above would leave me with several CTC datasets that I cannot feed to training together. Would merging the per-batch arrays, along the lines below, be a reasonable approach, or is there a better alternative?
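
This is the merging step I have in mind, assuming --save-ctc produces chunks.npy, references.npy, and reference_lengths.npy, with the references zero-padded to each file's longest target (a sketch, not tested):

```python
import numpy as np
from pathlib import Path

batch_dirs = sorted(Path("T7/bonito_mapped_hac_ctc").glob("batch_*"))
parts = [(np.load(d / "chunks.npy"),
          np.load(d / "references.npy"),
          np.load(d / "reference_lengths.npy")) for d in batch_dirs]

# references.npy is assumed padded to each batch's longest target,
# so pad every batch to the global maximum width before stacking
width = max(refs.shape[1] for _, refs, _ in parts)

merged = Path("T7/ctc_merged")
merged.mkdir(exist_ok=True)
np.save(merged / "chunks.npy",
        np.concatenate([chunks for chunks, _, _ in parts]))
np.save(merged / "references.npy",
        np.concatenate([np.pad(refs, ((0, 0), (0, width - refs.shape[1])))
                        for _, refs, _ in parts]))
np.save(merged / "reference_lengths.npy",
        np.concatenate([lens for _, _, lens in parts]))
```

If that layout matches what the loader expects, the merged directory could then be passed as the single --directory to bonito train.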

Thanks in advance for your input.

All the best,
Falko Noé
