Description
Hello,
I am currently refining a basecalling model and am running into memory-management issues during the basecalling step that generates the CTC training data.
Specifically, I am passing a folder of pod5 files totalling ~24 GB to `bonito basecaller` as follows:

```
bonito basecaller [email protected] --save-ctc --min-accuracy-save-ctc 0.9 -v --alignment-threads 10 --device 'cuda' --reference ~/Documents/genomes/T7_V01146.1.fasta ./T7/pod5s/ > ./T7/bonito_mapped_hac_ctc/basecalls_ctc.bam
```
However, after the initial basecalling completes, the process is killed because it exhausts the available RAM on my machine:
```
> reading pod5
> outputting aligned bam
> loading model [email protected]
> model basecaller params: {'batchsize': 512, 'chunksize': 10000, 'overlap': 500, 'quantize': None}
> loading reference
> calling: 1290710 reads [59:34, 361.04 reads/s]Killed
```
For now, I am subsetting the input data, but this is obviously not ideal, as it discards potentially useful signal that could otherwise feed into the training step. As far as I can tell, `bonito train` only accepts a single `--directory`, so splitting the basecalling up by pod5 file (or by batches of files) and training on the pieces would not work either. Is there an alternate approach?
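In case it clarifies what I am after, here is the workaround I was considering: basecall each subset separately with `--save-ctc`, then merge the resulting training arrays into one directory for `bonito train`. This is only a sketch; it assumes each run writes `chunks.npy`, `references.npy`, and `reference_lengths.npy` (I have not verified these names against the current bonito code), and the `subset_*` directory layout is hypothetical.

```python
import numpy as np
from pathlib import Path

# Hypothetical layout: one --save-ctc output directory per pod5 subset.
subset_dirs = sorted(Path("./T7/ctc_subsets").glob("subset_*"))
out_dir = Path("./T7/ctc_merged")
out_dir.mkdir(parents=True, exist_ok=True)

chunks, refs, ref_lens = [], [], []
for d in subset_dirs:
    chunks.append(np.load(d / "chunks.npy"))
    refs.append(np.load(d / "references.npy"))
    ref_lens.append(np.load(d / "reference_lengths.npy"))

# chunks.npy rows should all share one width as long as every run used
# the same chunksize; references.npy, however, is presumably zero-padded
# to the longest reference seen in each run, so pad to a common width first.
max_len = max(r.shape[1] for r in refs)
refs = [np.pad(r, ((0, 0), (0, max_len - r.shape[1]))) for r in refs]

np.save(out_dir / "chunks.npy", np.concatenate(chunks))
np.save(out_dir / "references.npy", np.concatenate(refs))
np.save(out_dir / "reference_lengths.npy", np.concatenate(ref_lens))
```

If that merge is sound, the combined directory could then be passed as the single `--directory` argument, but I would appreciate confirmation that the files can safely be concatenated this way.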
Thanks in advance for your input.
All the best,
Falko Noé