Scope
filter, frequencies, subsample
Description
Take this example dataset:
echo -e 'strain\tdate
SEQ1\t2019-01-01
SEQ2\t2020-01-01
SEQ3\t2020-12-31
SEQ4\t2021-01-01
SEQ5\t2021-12-31
SEQ6\t2022-01-01
' > metadata.tsv
--min-date is inclusive. With --min-date 2020, both 2020-01-01 and 2020-12-31 pass as expected:
augur filter \
--metadata metadata.tsv \
--min-date 2020 \
--output-metadata filtered.tsv
# strain date
# SEQ3 2020-12-31
# SEQ4 2021-01-01
# SEQ2 2020-01-01
# SEQ6 2022-01-01
# SEQ5 2021-12-31
However, --max-date is not inclusive. With --max-date 2021, both 2021-01-01 and 2021-12-31 are expected to pass, but instead they get filtered out:
augur filter \
--metadata metadata.tsv \
--max-date 2021 \
--output-metadata filtered.tsv
# strain date
# SEQ1 2019-01-01
# SEQ3 2020-12-31
# SEQ2 2020-01-01
Reason
In --max-date 2021, the value 2021 gets evaluated as 2021.0 by the type converter function numeric_date:
Lines 30 to 32 in c264580
    # date is numeric
    try:
        return float(date)
and that value is used as max_date here:
Lines 332 to 333 in c264580
    if max_date:
        filtered = {s for s in filtered if (np.isscalar(dates[s]) or all(dates[s])) and np.min(dates[s]) <= max_date}
This means the <= max_date check is effectively < 2021: even the earliest date in 2021, 2021-01-01, converts to roughly 2021.001, which is greater than 2021.0, so every date in 2021 is excluded.
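The comparison can be reproduced without augur. The sketch below is only an illustration: to_numeric is a hypothetical helper that approximates a decimal year from the midpoint of the day, and is not augur's actual conversion. The metadata and the float("2021") cutoff mirror the example above.

```python
from datetime import date


def to_numeric(d: date) -> float:
    """Approximate a calendar date as a decimal year (midpoint of the day).

    Hypothetical helper for illustration; augur's own conversion may differ
    slightly in the fractional part.
    """
    start = date(d.year, 1, 1)
    days_in_year = (date(d.year + 1, 1, 1) - start).days
    return d.year + ((d - start).days + 0.5) / days_in_year


# Same rows as metadata.tsv in the example above.
metadata = {
    "SEQ1": date(2019, 1, 1),
    "SEQ2": date(2020, 1, 1),
    "SEQ3": date(2020, 12, 31),
    "SEQ4": date(2021, 1, 1),
    "SEQ5": date(2021, 12, 31),
    "SEQ6": date(2022, 1, 1),
}

# "--max-date 2021" is converted with float(), giving 2021.0.
max_date = float("2021")

# Every date in 2021 is numerically greater than 2021.0, so none pass.
kept = sorted(s for s, d in metadata.items() if to_numeric(d) <= max_date)
print(kept)  # ['SEQ1', 'SEQ2', 'SEQ3']
```

The kept strains match the filtered output reported above: nothing dated in 2021 survives the cutoff.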
Possible solution:
This has already been solved in #854. Two parts:
- Treat 2021 as 2021-XX-XX:
  Lines 123 to 133 in 110af66
      # Absolute date in numeric format.
      if RE_NUMERIC_DATE.match(date_in):
          return float(date_in)
      # Absolute date in potentially incomplete/ambiguous ISO 8601 date format.
      if (RE_ISO_8601_DATE.match(date_in) or
          RE_AMBIGUOUS_ISO_8601_DATE.match(date_in) or
          RE_AMBIGUOUS_ISO_8601_DATE_YEAR_MONTH.match(date_in) or
          RE_YEAR_ONLY.match(date_in)
      ):
          return iso_to_numeric(date_in, ambiguity_resolver)
- Use different type converters for --min-date and --max-date, taking minimum of ambiguity for --min-date and maximum for --max-date:
  Lines 22 to 23 in 110af66
      metadata_filter_group.add_argument('--min-date', type=any_to_numeric_type_min, help="minimal cutoff for date, the cutoff date is inclusive; may be specified as an Augur-style numeric date (with the year as the integer part) or YYYY-MM-DD")
      metadata_filter_group.add_argument('--max-date', type=any_to_numeric_type_max, help="maximal cutoff for date, the cutoff date is inclusive; may be specified as an Augur-style numeric date (with the year as the integer part) or YYYY-MM-DD")
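The two parts above can be sketched together in a standalone script. This is a rough illustration under stated assumptions, not the code from #854: min_date_type, max_date_type, resolve, to_numeric and this RE_YEAR_ONLY pattern are all hypothetical, and only the year-only case of ambiguity is handled (no 2021-XX-XX or 2021-05-XX forms).

```python
import argparse
import re
from datetime import date

# Hypothetical pattern: a bare four-digit year such as "2021".
RE_YEAR_ONLY = re.compile(r"^\d{4}$")


def to_numeric(d: date) -> float:
    """Approximate a date as a decimal year (illustrative, not augur's formula)."""
    start = date(d.year, 1, 1)
    days_in_year = (date(d.year + 1, 1, 1) - start).days
    return d.year + ((d - start).days + 0.5) / days_in_year


def resolve(value: str, pick_latest: bool) -> float:
    """Convert a cutoff string to a numeric date, resolving a bare year
    to its first day (min cutoff) or its last day (max cutoff)."""
    if RE_YEAR_ONLY.match(value):
        year = int(value)
        return to_numeric(date(year, 12, 31) if pick_latest else date(year, 1, 1))
    try:
        return float(value)  # already a numeric date such as 2021.5
    except ValueError:
        return to_numeric(date.fromisoformat(value))  # exact YYYY-MM-DD


def min_date_type(value: str) -> float:
    return resolve(value, pick_latest=False)


def max_date_type(value: str) -> float:
    return resolve(value, pick_latest=True)


parser = argparse.ArgumentParser()
parser.add_argument("--min-date", type=min_date_type)
parser.add_argument("--max-date", type=max_date_type)
args = parser.parse_args(["--min-date", "2020", "--max-date", "2021"])

metadata = {
    "SEQ1": date(2019, 1, 1), "SEQ2": date(2020, 1, 1), "SEQ3": date(2020, 12, 31),
    "SEQ4": date(2021, 1, 1), "SEQ5": date(2021, 12, 31), "SEQ6": date(2022, 1, 1),
}
kept = sorted(s for s, d in metadata.items()
              if args.min_date <= to_numeric(d) <= args.max_date)
print(kept)  # ['SEQ2', 'SEQ3', 'SEQ4', 'SEQ5']
```

With separate converters, both cutoffs are inclusive for a bare year: 2020-01-01 passes --min-date 2020 and 2021-12-31 passes --max-date 2021.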
Your environment: if running Nextstrain locally
- Version: augur 15.0.0