
Add validation of metadata #1896

@huddlej

Context

Most Nextstrain phylogenetic workflows assume that the user has curated a FASTA file of sequences and a TSV file of metadata with matching records and the correct metadata fields. However, it is possible (maybe even common) for metadata to have structural issues that cause workflows to fail silently or to fail loudly in confusing ways. For example, nextstrain/seasonal-flu#257 occurred when the strain name for a record differed between the sequences and the metadata. The flu workflow outer-joined the metadata with Nextclade annotations (using augur merge), so the resulting metadata gained a new record from the Nextclade annotations that was completely missing metadata values. The user only noticed something was wrong with the metadata when a downstream step in the workflow tried to parse the date field and crashed with a confusing error.

Description

Ideally, we could provide a way for users and workflows to automatically validate metadata files against minimum requirements like the presence of specific columns, the presence of specific values in those columns, and potentially even the correct format for specific types of columns (e.g., dates are the most obvious candidate for standard metadata, but Nextclade fields with integer and float types could also be useful).
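The checks described above could look something like the following rough Python sketch, which uses only the standard library. Everything named here is an assumption for illustration and not an existing Augur API: the function name `validate_metadata`, the required columns (`strain` and `date`), the example strain names, and the accepted date pattern (`YYYY-MM-DD` with `XX` allowed for an unknown month or day).

```python
import csv
import io
import re

# Hypothetical requirements; a real schema would be bundled with Augur.
REQUIRED_COLUMNS = ("strain", "date")
# Assumed date format: YYYY-MM-DD, with XX allowed for unknown month or day.
DATE_PATTERN = re.compile(r"^\d{4}-(\d{2}|XX)-(\d{2}|XX)$")


def validate_metadata(handle):
    """Return a list of human-readable error strings for a metadata TSV."""
    reader = csv.DictReader(handle, delimiter="\t")
    columns = reader.fieldnames or []
    errors = [
        f"missing required column '{column}'"
        for column in REQUIRED_COLUMNS
        if column not in columns
    ]
    if errors:
        # Row-level checks are meaningless without the required columns.
        return errors

    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        strain = (row["strain"] or "").strip()
        date = (row["date"] or "").strip()
        if not strain:
            errors.append(f"line {line_number}: empty 'strain' value")
        if not date:
            errors.append(f"line {line_number}: empty 'date' value")
        elif not DATE_PATTERN.match(date):
            errors.append(f"line {line_number}: invalid date '{date}'")
    return errors


# Made-up records: one valid, one with a missing date, one with a bad format.
example = io.StringIO(
    "strain\tdate\tcountry\n"
    "A/Example/1/2024\t2024-01-15\tUSA\n"
    "A/Example/2/2024\t\tUSA\n"
    "A/Example/3/2024\t15 Jan 2024\tUSA\n"
)
for error in validate_metadata(example):
    print(error)
```

Reporting the line number with each error is a deliberate choice here: a record with empty values, like the one produced by the outer join in the flu issue, would be flagged at the validation step instead of surfacing later as a date parsing crash.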

Examples

The validation could occur through a new subcommand of augur validate. Using the metadata example in the seasonal flu workflow issue linked above, the workflow might run the following command after augur merge:

augur validate metadata \
    --metadata builds/h1n1pdm/metadata_with_nextclade.tsv 2> builds/h1n1pdm/metadata_validation.log

This command would validate the metadata TSV against a default schema that we bundle with Augur (like we do for other validation subcommands), print validation errors to stderr, and exit with a nonzero exit code if validation fails. The validation log written by the command above should provide enough details for users to resolve all issues with the metadata.
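The reporting behavior described above (errors to stderr, nonzero exit code on failure) could be sketched as follows. The `report` function and the `ERROR:` prefix are hypothetical; this is not how any existing `augur validate` subcommand is implemented.

```python
import sys


def report(errors, stream=sys.stderr):
    """Print each validation error to the stream and return an exit code."""
    for error in errors:
        print(f"ERROR: {error}", file=stream)
    # Zero signals success; any validation error makes the exit code nonzero.
    return 1 if errors else 0


# Placeholder error message; a real command would collect these from the metadata
# and pass the returned code to sys.exit().
exit_code = report(["line 3: empty 'date' value"])
```

Because the errors go to stderr, the `2>` redirect in the command above would capture them in the log file while a workflow manager sees the failure through the exit code.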

Note that Snakemake provides a similar validation for TSV-based configuration files that could be a useful example.

The ncov workflow's guide to preparing metadata has a nice summary of required columns and their expected value formats. We may want to provide a way to pass a custom schema to the command that overrides the defaults with workflow-specific requirements like the way ncov requires the virus column (even if this probably should not be required).
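One possible shape for a custom schema that overrides the defaults is a per-column merge, sketched below in Python. The schema layout (column name mapped to a dictionary of rules), the `required` and `format` keys, and the helper names are all assumptions made for this example; the issue does not specify a schema format.

```python
import copy

# Hypothetical default schema: column name -> validation rules.
DEFAULT_SCHEMA = {
    "strain": {"required": True},
    "date": {"required": True, "format": "date"},
}


def merge_schemas(default, custom):
    """Return a new schema where custom rules override defaults per column."""
    merged = copy.deepcopy(default)  # Leave the bundled defaults untouched.
    for column, rules in custom.items():
        merged.setdefault(column, {}).update(rules)
    return merged


def required_columns(schema):
    """List the columns that the schema marks as required, sorted by name."""
    return sorted(name for name, rules in schema.items() if rules.get("required"))


# Workflow-specific override, such as ncov requiring the virus column.
ncov_schema = merge_schemas(DEFAULT_SCHEMA, {"virus": {"required": True}})
print(required_columns(ncov_schema))  # ['date', 'strain', 'virus']
```

Merging per column means a workflow only has to state what differs from the defaults, and it can also relax a default rule by overriding a single key.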

Related issues
