consider outputting extra info in pangenome databases #14

@ctb

we have two problems with our current pangenome databases:

first, they present as "regular" sourmash sketches, with abundances. This could lead to misuse/mistakes.

second, it is annoying to track extra information (e.g. lineage counts as in #13) in a separate file.

there is an analogous issue over in sourmash, sourmash-bio/sourmash#2216, that talks about including taxonomy files in zip databases: the idea is that we can provide various standard lineage files in the actual .zip file databases, and then switch between them using CLI options (--gtdb and --ncbi, etc.)
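a rough sketch of what "switch between lineage files inside the zip" could look like, using only the Python standard library. The paths under `lineages/` and the `read_lineages` helper are hypothetical, made up for illustration; this is not how sourmash does it today, and the demo CSV contents are placeholders.

```python
import io
import zipfile

# Hypothetical layout: one lineage CSV per taxonomy, stored inside the database zip.
LINEAGE_PATHS = {
    "gtdb": "lineages/gtdb.csv",
    "ncbi": "lineages/ncbi.csv",
}


def read_lineages(zip_source, taxonomy):
    """Return the text of the bundled lineage file for the chosen taxonomy."""
    if taxonomy not in LINEAGE_PATHS:
        raise ValueError(f"unknown taxonomy: {taxonomy!r}")
    with zipfile.ZipFile(zip_source) as zf:
        return zf.read(LINEAGE_PATHS[taxonomy]).decode("utf-8")


def build_demo_zip():
    """Build an in-memory zip with placeholder lineage files (demo data only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("lineages/gtdb.csv", "ident,lineage\nexample,d__Bacteria\n")
        zf.writestr("lineages/ncbi.csv", "ident,lineage\nexample,Bacteria\n")
    buf.seek(0)
    return buf
```

a CLI flag such as `--gtdb` or `--ncbi` would then just pick the key passed to `read_lineages`.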

so one idea here would be to produce the pangenome zip file full of sketches, and then add an extra file or two that indicate it's a pangenome database. This wouldn't necessarily prevent misuse (item 1 above) unless we adopted more metadata-in-zip-files in sourmash generally, but would help a great deal with carting around extra files (item 2). and the extra files would help with debugging, potentially.
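to make the "extra file or two" idea concrete, here is a minimal sketch using the standard `zipfile` module. The marker name `PANGENOME.json` and its fields are assumptions invented for this example, not an existing sourmash convention, and whether sourmash ignores unknown files in a zip would need to be checked.

```python
import io
import json
import zipfile

# Hypothetical marker file name; nothing in sourmash defines this today.
MARKER = "PANGENOME.json"


def add_pangenome_marker(zip_target, info):
    """Append a small JSON marker to an existing zip of sketches."""
    with zipfile.ZipFile(zip_target, "a") as zf:
        zf.writestr(MARKER, json.dumps(info, indent=2))


def read_pangenome_marker(zip_source):
    """Return the marker contents, or None if the zip is a regular database."""
    with zipfile.ZipFile(zip_source) as zf:
        if MARKER not in zf.namelist():
            return None
        return json.loads(zf.read(MARKER))
```

a loader could call `read_pangenome_marker` first and warn (or refuse) when a pangenome database is passed to a command that expects regular abundance sketches; extra files such as lineage counts could be listed in the marker so they travel with the zip.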

it is kinda interesting to think about how to add more metadata generally; this is the closest thing we have over in sourmash-land: sourmash-bio/sourmash#2180
