--max-date with year only is not inclusive #893

@victorlin

Scope

filter, frequencies, subsample

Description

Take this example dataset:

echo -e 'strain\tdate
SEQ1\t2019-01-01
SEQ2\t2020-01-01
SEQ3\t2020-12-31
SEQ4\t2021-01-01
SEQ5\t2021-12-31
SEQ6\t2022-01-01
' > metadata.tsv

--min-date is inclusive. With --min-date 2020, both 2020-01-01 and 2020-12-31 pass as expected:

augur filter \
  --metadata metadata.tsv \
  --min-date 2020 \
  --output-metadata filtered.tsv
# strain	date
# SEQ3	2020-12-31
# SEQ4	2021-01-01
# SEQ2	2020-01-01
# SEQ6	2022-01-01
# SEQ5	2021-12-31

However, --max-date with a year-only value is not inclusive. With --max-date 2021, both 2021-01-01 and 2021-12-31 are expected to pass, but instead they get filtered out:

augur filter \
  --metadata metadata.tsv \
  --max-date 2021 \
  --output-metadata filtered.tsv
# strain	date
# SEQ1	2019-01-01
# SEQ3	2020-12-31
# SEQ2	2020-01-01

Reason

In --max-date 2021, the value 2021 gets evaluated as 2021.0 by the type converter function numeric_date:

augur/augur/dates.py

Lines 30 to 32 in c264580

# date is numeric
try:
    return float(date)

and that value is used as max_date here:

augur/augur/filter.py

Lines 332 to 333 in c264580

if max_date:
    filtered = {s for s in filtered if (np.isscalar(dates[s]) or all(dates[s])) and np.min(dates[s]) <= max_date}

This means the <= max_date comparison is effectively < 2021, since even the earliest ISO date in 2021, 2021-01-01, converts to roughly 2021.001, which is greater than 2021.0.
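The comparison can be reproduced outside of augur with a minimal sketch. The helper `to_numeric` below is hypothetical and only approximates an Augur-style numeric date (year plus the fraction of the year elapsed at midday); it is not the function augur itself calls, but it yields the ~2021.001 value mentioned above.

```python
import calendar
from datetime import date


def to_numeric(d: date) -> float:
    """Hypothetical helper: year plus the fraction of the year elapsed at midday."""
    days_in_year = 366 if calendar.isleap(d.year) else 365
    return d.year + (d.timetuple().tm_yday - 0.5) / days_in_year


# A year-only --max-date is currently converted with float(), giving 2021.0.
max_date = float("2021")

dates = {
    "SEQ3": date(2020, 12, 31),
    "SEQ4": date(2021, 1, 1),
    "SEQ5": date(2021, 12, 31),
}

# Same shape of check as the filter: keep a strain if its date <= max_date.
kept = sorted(s for s, d in dates.items() if to_numeric(d) <= max_date)
print(kept)
```

Under these assumptions only SEQ3 is kept: every date within 2021 converts to a value strictly above 2021.0.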

Possible solution:

This has already been solved in #854. Two parts:

  1. Treat 2021 as 2021-XX-XX:

    augur/augur/dates.py

    Lines 123 to 133 in 110af66

    # Absolute date in numeric format.
    if RE_NUMERIC_DATE.match(date_in):
        return float(date_in)
    # Absolute date in potentially incomplete/ambiguous ISO 8601 date format.
    if (RE_ISO_8601_DATE.match(date_in) or
        RE_AMBIGUOUS_ISO_8601_DATE.match(date_in) or
        RE_AMBIGUOUS_ISO_8601_DATE_YEAR_MONTH.match(date_in) or
        RE_YEAR_ONLY.match(date_in)
        ):
        return iso_to_numeric(date_in, ambiguity_resolver)

  2. Use different type converters for --min-date and --max-date, resolving an ambiguous date to its earliest possible value for --min-date and to its latest possible value for --max-date:

    augur/augur/filter.py

    Lines 22 to 23 in 110af66

    metadata_filter_group.add_argument('--min-date', type=any_to_numeric_type_min, help="minimal cutoff for date, the cutoff date is inclusive; may be specified as an Augur-style numeric date (with the year as the integer part) or YYYY-MM-DD")
    metadata_filter_group.add_argument('--max-date', type=any_to_numeric_type_max, help="maximal cutoff for date, the cutoff date is inclusive; may be specified as an Augur-style numeric date (with the year as the integer part) or YYYY-MM-DD")
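The two parts above can be sketched in isolation as follows. This is a simplified illustration and not the code from #854: `YEAR_ONLY`, `to_numeric`, and `year_bound` are hypothetical names, and only year-only input is handled. The idea is that a year-only minimum resolves to January 1 and a year-only maximum resolves to December 31, so both cutoffs become inclusive.

```python
import calendar
import re
from datetime import date

# Hypothetical pattern for a year-only value such as "2021".
YEAR_ONLY = re.compile(r"^\d{4}$")


def to_numeric(d: date) -> float:
    """Hypothetical helper: year plus the fraction of the year elapsed at midday."""
    days_in_year = 366 if calendar.isleap(d.year) else 365
    return d.year + (d.timetuple().tm_yday - 0.5) / days_in_year


def year_bound(value: str, resolver: str) -> float:
    """Resolve a year-only string to its first ("min") or last ("max") day."""
    if not YEAR_ONLY.match(value):
        raise ValueError(f"not a year-only date: {value!r}")
    if resolver not in ("min", "max"):
        raise ValueError(f"unknown resolver: {resolver!r}")
    year = int(value)
    d = date(year, 1, 1) if resolver == "min" else date(year, 12, 31)
    return to_numeric(d)


min_cutoff = year_bound("2020", "min")  # stands in for --min-date 2020
max_cutoff = year_bound("2021", "max")  # stands in for --max-date 2021

dates = {
    "SEQ1": date(2019, 1, 1),
    "SEQ2": date(2020, 1, 1),
    "SEQ3": date(2020, 12, 31),
    "SEQ4": date(2021, 1, 1),
    "SEQ5": date(2021, 12, 31),
    "SEQ6": date(2022, 1, 1),
}

kept = sorted(s for s, d in dates.items()
              if min_cutoff <= to_numeric(d) <= max_cutoff)
print(kept)
```

With the example dataset, this keeps SEQ2 through SEQ5, which is the inclusive behavior the issue expects for both options.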

Your environment: if running Nextstrain locally

  • Version: augur 15.0.0

Labels

bug (Something isn't working)
