[Feature] SCDL: Ability to load data from multiple files as one instance #1086

@mlgill

Description

Problem & Motivation

As datasets grow in size, it will become more common for them to be split across multiple files. The Tahoe dataset is a good example: each plate's data is saved to its own file. There are times when it would be useful to access all of this data as a unified set.

BioNeMo Framework Version

v2.6.3

Category

API/Interface

Proposed Solution

This could be implemented at the level of SingleCellMemMapDataset, or it could be a higher-level class, e.g. SingleCellMemMapCollection, that chains together multiple instances of SingleCellMemMapDataset. The latter appears to be what scDataset does. Note that PyTorch does have native capability to chain datasets together, e.g. ConcatDataset for map-style datasets and ChainDataset for IterableDataset instances, so it's also possible that this could be a good way to start.

Expected Benefits

Usability

Code Example
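
A minimal sketch of the higher-level collection idea, assuming each underlying dataset supports `len()` and integer indexing. The class body here is hypothetical and not an existing BioNeMo API; plain Python lists stand in for per-plate SingleCellMemMapDataset instances, and the index lookup follows the same cumulative-size approach that PyTorch's ConcatDataset uses.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import Any, Sequence


class SingleCellMemMapCollection:
    """Hypothetical wrapper that exposes several datasets as one sequence."""

    def __init__(self, datasets: Sequence[Sequence[Any]]) -> None:
        if not datasets:
            raise ValueError("at least one dataset is required")
        self.datasets = list(datasets)
        # cumulative_sizes[i] is the total number of rows in datasets[0..i].
        self.cumulative_sizes = list(accumulate(len(d) for d in self.datasets))

    def __len__(self) -> int:
        return self.cumulative_sizes[-1]

    def __getitem__(self, index: int) -> Any:
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("index out of range")
        # Locate the first dataset whose cumulative size exceeds the index.
        dataset_idx = bisect_right(self.cumulative_sizes, index)
        offset = self.cumulative_sizes[dataset_idx - 1] if dataset_idx > 0 else 0
        return self.datasets[dataset_idx][index - offset]


# Plain lists stand in for per-plate SingleCellMemMapDataset instances.
plate_1 = ["p1_cell0", "p1_cell1", "p1_cell2"]
plate_2 = ["p2_cell0", "p2_cell1"]
collection = SingleCellMemMapCollection([plate_1, plate_2])
```

With this layout, `collection[3]` resolves to the first row of the second plate, and `len(collection)` is the sum of the per-plate lengths. A real implementation would also need to decide how to handle files whose feature sets differ, which this sketch ignores.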
