-
Notifications
You must be signed in to change notification settings - Fork 136
New command: augur curate apply-date-bounds
#1828
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: victorlin/support-iso-time-intervals
Are you sure you want to change the base?
New command: augur curate apply-date-bounds
#1828
Conversation
This is the value used by augur curate format-dates to represent a lack of date information. Conceptually, this should be represent a maximally wide date interval. Previously, this value went through AmbiguousDate.range() which returned `[0001-01-01, <today>]`. The lower end is a limitation of the ISO 8601 format, and the upper limit is a somewhat opinionated default behavior. Since dates are stored numerically, the maximally wide date interval is (-∞,∞). The motivation for making this change is for the upcoming apply-date-bounds command which will need to check whether ends of an interval are defined. It doesn't make sense to check against `0001-01-01` or `<today>` since those are values that can be used to define an interval.
This is a new command that applies lower and/or upper bounds to values in an existing date column.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## victorlin/support-iso-time-intervals #1828 +/- ##
========================================================================
+ Coverage 73.64% 73.85% +0.21%
========================================================================
Files 81 82 +1
Lines 8650 8775 +125
Branches 1767 1798 +31
========================================================================
+ Hits 6370 6481 +111
- Misses 1977 1985 +8
- Partials 303 309 +6 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
joverlee521
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking in my first round of review, will continue tomorrow.
| # special handling. | ||
|
|
||
| if RE_AUGUR_UNKNOWN_DATE.match(value): | ||
| return (float("-inf"), float("inf")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing this is not handled in treetime which is leading to the errors in a couple of the pathogen-repo-ci tests?
| original_date = self.data.get(date_field) | ||
|
|
||
| if original_date is None: | ||
| self.raise_data_error( | ||
| f"Missing date field {date_field!r}." | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, this reminded me of the issue with null values getting converted to None that needs to get fixed throughout augur curate.
| # An exact date is converted to an interval for explicitness. | ||
| ("2020-01-15" , "2020-01-10" , "2020-01-20" , "2020-01-15/2020-01-15"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I'd expect an exact date to just be returned as-is without changing format, but maybe others will disagree here.
joverlee521
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing the work of converting this feature to a new command! I flagged some small potential errors, but you can consider my other comments as non-blocking.
| help=first_line(__doc__)) | ||
|
|
||
| required = parser.add_argument_group(title="REQUIRED") | ||
| required.add_argument("--date-field", metavar="NAME", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be explicitly marked as required
| required.add_argument("--date-field", metavar="NAME", | |
| required.add_argument("--date-field", metavar="NAME", required=True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking
Just wanted to walk through a slightly more complicated use-case to get people's thoughts. Definitely not saying we need to support in this with apply-date-bounds.
Say we are curating SARS-CoV-2 data with unknown collection dates, but we know the clades and the sequence release date. I want to set the lower bound of the collection date to be the first detection date of the clade. I think the steps to do this now is
- Create a
clade_detection_date.tsvwith the columnscladeandfirst_detection_date - Join the
metadata.tsvwithclade_detection_date.tsvcsvtk join -t -f "clade" --left-join \ metadata.tsv \ clade_detection_date.tsv \ > metadata-with-detection-date.tsv - Then apply date bounds
augur curate apply-date-bounds \ --date-field date \ --lower-bound first_detection_date \ --upper-bound release_date \ --metadata metadata-with-detection-date.tsv
Would it be overcomplicated to pass the clade_detection_date.tsv directly to the apply-date-bounds command?
| # Error if start or end are out of bounds. | ||
| if lower_bound and start < lower_bound and end < lower_bound: | ||
| self.raise_data_error( | ||
| f"{args.date_field!r}={self.data[args.date_field]!r} " | ||
| f"is earlier than the lower bound of " | ||
| f"{args.lower_bound!r}={self.data[args.lower_bound]!r}" | ||
| ) | ||
| if upper_bound and start > upper_bound and end > upper_bound: | ||
| self.raise_data_error( | ||
| f"{args.date_field!r}={self.data[args.date_field]!r} " | ||
| f"is later than the upper bound of " | ||
| f"{args.upper_bound!r}={self.data[args.upper_bound]!r}" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.data[args.lower_bound] would lead to a KeyError if the args.lower_bound was just a date string (same for the args.upper_bound.
| # ISO 8601 interval in <start>/<end> format | ||
| return f"{datestring_from_numeric(start)}/{datestring_from_numeric(end)}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think ISO 8601 explicitly supports or prohibits negative dates, but should they be allowed here?
$ echo '{"date": "-200.0"}' | ./bin/augur curate apply-date-bounds --date-field date --lower-bound -300.0
--upper-bound 2025-01-01
{"date": "-200-01-01/-200-01-01"}
Description of proposed changes
This PR adds a new command that applies lower and/or upper bounds to values in an existing date column. More details in #1494. Tests provide an overview of the functionality, which I've tried to make flexible to support various use cases.
Related issue(s)
Closes #1494
Checklist