Recommendations about chunk sizes #22

Open
@jeromekelleher

We currently say nothing at all about chunk sizes, but I think we will need to provide some rules/guidance in order to make processing arrays efficient. For example, it really does help a lot if call-level arrays all have the same chunking (in the variants and samples dimensions) so that code can read in (say) genotypes and DP values chunk-by-chunk in the same loop.
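To make the "same loop" point concrete, here is a minimal pure-Python sketch. It is only an illustration under stated assumptions: the toy arrays, the helper names `chunk_slices` and `mean_depth_of_non_ref_calls`, and the chunk size of 4 are all hypothetical, and genotypes are flattened to one value per call rather than carrying a ploidy dimension. Real code would read chunks from a Zarr store instead of slicing nested lists.

```python
def chunk_slices(length, chunk_size):
    """Yield one slice per chunk along a dimension of the given length."""
    for start in range(0, length, chunk_size):
        yield slice(start, min(start + chunk_size, length))


# Hypothetical toy data: 10 variants x 4 samples, stored as nested lists.
num_variants, num_samples = 10, 4
call_genotype = [
    [(v + s) % 3 for s in range(num_samples)] for v in range(num_variants)
]
call_DP = [[10 + v for _ in range(num_samples)] for v in range(num_variants)]

# Assumption: both arrays share this chunk size in the variants dimension.
variants_chunk_size = 4


def mean_depth_of_non_ref_calls(genotypes, depths, chunk_size):
    """Mean DP over calls whose (simplified) genotype value is non-zero."""
    total = 0
    count = 0
    # Because the chunking is identical, one slice addresses the
    # corresponding chunk of both arrays in a single loop iteration.
    for rows in chunk_slices(len(genotypes), chunk_size):
        gt_chunk = genotypes[rows]
        dp_chunk = depths[rows]
        for gt_row, dp_row in zip(gt_chunk, dp_chunk):
            for gt, dp in zip(gt_row, dp_row):
                if gt != 0:
                    total += dp
                    count += 1
    return total / count if count else 0.0


result = mean_depth_of_non_ref_calls(call_genotype, call_DP, variants_chunk_size)
```

If the two arrays were chunked differently, each iteration would have to either read partial chunks or buffer data across chunk boundaries, which is the inefficiency that shared chunking avoids.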

Currently vcf2zarr enforces a uniform chunk size across dimensions, so that we have one variants_chunk_size. While this is a useful simplification, it does have some drawbacks, particularly when we want to read in all of a low-dimensional array at once (e.g., `variant_position`). See #21 for discussion and some benchmarks on this point.
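As a rough illustration of that drawback, the arithmetic can be sketched as below. The numbers are made up for the example (they are not taken from #21 or from vcf2zarr defaults), and `num_chunks` and `chunk_nbytes` are hypothetical helper names.

```python
def num_chunks(length, chunk_size):
    """Number of chunks needed to cover `length` items (ceiling division)."""
    return -(-length // chunk_size)


def chunk_nbytes(chunk_size, itemsize):
    """Uncompressed size in bytes of one full chunk of a 1D array."""
    return chunk_size * itemsize


# Hypothetical values, chosen only to show the shape of the problem.
num_variants = 1_000_000
variants_chunk_size = 10_000
position_itemsize = 4  # assuming a 4-byte integer per position

chunks_to_read = num_chunks(num_variants, variants_chunk_size)
bytes_per_chunk = chunk_nbytes(variants_chunk_size, position_itemsize)
```

Under these assumed numbers, reading the whole 1D array means fetching 100 separate chunks of only 40,000 uncompressed bytes each, so per-chunk overhead may dominate. A larger chunk size for 1D arrays would reduce the number of reads, at the cost of no longer lining up with the call-level arrays.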

This would need some feedback from a variety of implementations and use-cases, I think.
