Skip to content

Conversation

@corneliusroemer
Copy link
Member

Description of proposed changes

We don't warn very explicitly when there's a mismatch between metadata and sequence ids. I helped @aleskunder debug a situation where a more explicit warning would have helped him save quite some time. This PR adds such a warning. If a timetree is done and less than half of sequences have a corresponding row in the metadata, an informative warning is printed.

Checklist

  • Automated checks pass
  • Check if you need to add a changelog message
  • Check if you need to add tests
  • Check if you need to update docs

corneliusroemer and others added 2 commits October 10, 2025 16:31
When building a time tree, warn users if sequence IDs in the tree don't match
the metadata IDs. This helps identify when --metadata-id-columns needs to be set
explicitly, especially when the strain column exists but contains different values
than the sequence IDs, causing time tree inference to fail.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
corneliusroemer and others added 3 commits October 10, 2025 16:53
- Unified the warning message for both 0% and <50% match cases
- Improved message clarity: "For X sequence IDs, only Y corresponding
  metadata rows could be matched"
- Added functional CRAM test to verify the warning appears with
  mismatched IDs and disappears when using --metadata-id-columns
- Added test metadata file with mismatched IDs in strain column

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Instead of adding a new metadata_mismatch.tsv file, generate the
mismatched metadata inline in the CRAM test using awk. This reduces
the number of new files and makes the test self-contained.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
CRAM requires blank lines in expected output to have exactly 2 spaces
of indentation. Updated the test to properly match the actual stderr
output including the TreeTime error messages.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@codecov
Copy link

codecov bot commented Oct 10, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.26%. Comparing base (4a31ada) to head (5adc31b).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1902      +/-   ##
==========================================
+ Coverage   74.09%   74.26%   +0.16%     
==========================================
  Files          82       82              
  Lines        8972     8980       +8     
  Branches     1823     1824       +1     
==========================================
+ Hits         6648     6669      +21     
+ Misses       2019     2011       -8     
+ Partials      305      300       -5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.


## __NEXT__

* `augur refine` will now warn when building a time tree if sequence IDs in the tree don't match metadata IDs, suggesting the use of `--metadata-id-columns` to explicitly set the correct ID column. [#1902][] (@corneliusroemer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest we make more explicit, something like "if more than half of the tip names are missing corresponding metadata"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Otherwise PR looks good!)

terminal_names = {n.name for n in T.get_terminals()}
matched_ids = terminal_names & set(dates.keys())

if len(matched_ids) < len(terminal_names) * 0.5:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: warn on any missing metadata. That would match the expectation for typical Augur workflows. augur filter drops data on either side if it's missing on the other side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants