Skip to content

Conversation

@victorlin
Copy link
Member

Description of proposed changes

This PR adds a new command that applies lower and/or upper bounds to values in an existing date column. More details in #1494. Tests provide an overview of the functionality, which I've tried to make flexible to support various use cases.

Related issue(s)

Closes #1494

Checklist

This is the value used by augur curate format-dates to represent a lack
of date information. Conceptually, this should be represent a maximally
wide date interval.

Previously, this value went through AmbiguousDate.range() which returned
`[0001-01-01, <today>]`. The lower end is a limitation of the ISO 8601
format, and the upper limit is a somewhat opinionated default behavior.
Since dates are stored numerically, the maximally wide date interval is
(-∞,∞).

The motivation for making this change is for the upcoming
apply-date-bounds command which will need to check whether ends of an
interval are defined. It doesn't make sense to check against
`0001-01-01` or `<today>` since those are values that can be used to
define an interval.
This is a new command that applies lower and/or upper bounds to values
in an existing date column.
@victorlin victorlin self-assigned this Jun 11, 2025
@codecov
Copy link

codecov bot commented Jun 11, 2025

Codecov Report

Attention: Patch coverage is 88.88889% with 14 lines in your changes missing coverage. Please review.

Project coverage is 73.85%. Comparing base (18422e7) to head (b450545).

Files with missing lines Patch % Lines
augur/curate/apply_date_bounds.py 88.42% 8 Missing and 6 partials ⚠️
Additional details and impacted files
@@                           Coverage Diff                            @@
##           victorlin/support-iso-time-intervals    #1828      +/-   ##
========================================================================
+ Coverage                                 73.64%   73.85%   +0.21%     
========================================================================
  Files                                        81       82       +1     
  Lines                                      8650     8775     +125     
  Branches                                   1767     1798      +31     
========================================================================
+ Hits                                       6370     6481     +111     
- Misses                                     1977     1985       +8     
- Partials                                    303      309       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking in my first round of review, will continue tomorrow.

# special handling.

if RE_AUGUR_UNKNOWN_DATE.match(value):
return (float("-inf"), float("inf"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing this is not handled in treetime which is leading to the errors in a couple of the pathogen-repo-ci tests?

Comment on lines +142 to +147
original_date = self.data.get(date_field)

if original_date is None:
self.raise_data_error(
f"Missing date field {date_field!r}."
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this reminded me of the issue with null values getting converted to None that needs to get fixed throughout augur curate.

Comment on lines +20 to +21
# An exact date is converted to an interval for explicitness.
("2020-01-15" , "2020-01-10" , "2020-01-20" , "2020-01-15/2020-01-15"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'd expect an exact date to just be returned as-is without changing format, but maybe others will disagree here.

Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing the work of converting this feature to a new command! I flagged some small potential errors, but you can consider my other comments as non-blocking.

help=first_line(__doc__))

required = parser.add_argument_group(title="REQUIRED")
required.add_argument("--date-field", metavar="NAME",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be explicitly marked as required

Suggested change
required.add_argument("--date-field", metavar="NAME",
required.add_argument("--date-field", metavar="NAME", required=True,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking

Just wanted to walk through a slightly more complicated use-case to get people's thoughts. Definitely not saying we need to support in this with apply-date-bounds.

Say we are curating SARS-CoV-2 data with unknown collection dates, but we know the clades and the sequence release date. I want to set the lower bound of the collection date to be the first detection date of the clade. I think the steps to do this now is

  1. Create a clade_detection_date.tsv with the columns clade and first_detection_date
  2. Join the metadata.tsv with clade_detection_date.tsv
    csvtk join -t -f "clade" --left-join \
        metadata.tsv \
        clade_detection_date.tsv \
        > metadata-with-detection-date.tsv
    
  3. Then apply date bounds
    augur curate apply-date-bounds \
        --date-field date \
        --lower-bound first_detection_date \
        --upper-bound release_date \
        --metadata metadata-with-detection-date.tsv
    

Would it be overcomplicated to pass the clade_detection_date.tsv directly to the apply-date-bounds command?

Comment on lines +116 to +128
# Error if start or end are out of bounds.
if lower_bound and start < lower_bound and end < lower_bound:
self.raise_data_error(
f"{args.date_field!r}={self.data[args.date_field]!r} "
f"is earlier than the lower bound of "
f"{args.lower_bound!r}={self.data[args.lower_bound]!r}"
)
if upper_bound and start > upper_bound and end > upper_bound:
self.raise_data_error(
f"{args.date_field!r}={self.data[args.date_field]!r} "
f"is later than the upper bound of "
f"{args.upper_bound!r}={self.data[args.upper_bound]!r}"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.data[args.lower_bound] would lead to a KeyError if the args.lower_bound was just a date string (same for the args.upper_bound.

Comment on lines +138 to +139
# ISO 8601 interval in <start>/<end> format
return f"{datestring_from_numeric(start)}/{datestring_from_numeric(end)}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think ISO 8601 explicitly supports or prohibits negative dates, but should they be allowed here?

$ echo '{"date": "-200.0"}' | ./bin/augur curate apply-date-bounds --date-field date --lower-bound -300.0 
--upper-bound 2025-01-01
{"date": "-200-01-01/-200-01-01"}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants