Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace use of internal, non-public pandas API in parse.fix_dates #1670

Open
corneliusroemer opened this issue Nov 10, 2024 · 2 comments
Open

Comments

@corneliusroemer
Copy link
Member

corneliusroemer commented Nov 10, 2024

Current Behavior

We're currently using parts of the private pandas API in parse.fix_dates, specifically the parsing.parse_time_string function which was renamed to parsing.parse_datetime_string_with_reso in
pandas v2:

augur/augur/parse.py

Lines 36 to 42 in 8731851

from pandas.core.tools.datetimes import parsing
try:
# pandas 2.x
results = parsing.parse_datetime_string_with_reso(d, dayfirst=dayfirst)
except AttributeError:
# pandas 1.x
results = parsing.parse_time_string(d, dayfirst=dayfirst)

Now, in pandas v2.2 (which we don't yet support, not even v2 but I was looking into doing that right now) the import suddenly breaks - no surprise there, it's not in the public API so that can happen anytime.

While we can change the import for this to continue working, it's probably a good idea to move away from using an internal function like this.

However, it's hard to find a replacement. I've searched for quite a bit but haven't found something that is a drop in replacement, sadly.

Some things I looked into:

For now we can hope that pandas doesn't completely remove the function, but the moving around is a warning that we cannot rely on this indefinitely. I also don't think we promise total reliability of the date parsing, so maybe a bit of breaking change here would be ok.

Would be good to unit test this a little more in any case - it's a perfect example of something that's very well unit testable. Here's what we currently have:

def test_fix_dates(self, capsys):
full_date = "4-5-2020"
assert parse.fix_dates(full_date) == "2020-05-04"
assert parse.fix_dates(full_date, dayfirst=False) == "2020-04-05"
partial_date_no_month = "2020-04"
assert parse.fix_dates(partial_date_no_month) == "2020-04-XX"
assert parse.fix_dates(partial_date_no_month, dayfirst=False) == "2020-04-XX"
partial_date_only_year = "2020"
assert parse.fix_dates(partial_date_only_year) == "2020-XX-XX"
# We should allow valid ambiguous dates to pass through the function
# without a warning to the user.
valid_ambiguous_date = "2020-XX-XX"
assert parse.fix_dates(valid_ambiguous_date) == valid_ambiguous_date
assert "WARNING" not in capsys.readouterr().err
# We should warn the user when their dates cannot be parsed by pandas or
# as a valid ambiguous date.
malformed_date = "Aaasd123AS"
assert(parse.fix_dates(malformed_date)) == malformed_date
assert "WARNING" in capsys.readouterr().err

@victorlin
Copy link
Member

I think all the date formats suppported by augur parse are also supported in other subcommands such as augur filter, yet there are no calls to the pandas API there.

Maybe we can address this in #936?

@corneliusroemer
Copy link
Member Author

There's a case to be made that augur parse should be more lenient because it's supposed to take a broad range of input files and then normalize them to something more standard.

We could see however how often augur parse is actually used in practice. I have a feeling it's not used that often anymore, at least not in our workflows.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants