-
-
Notifications
You must be signed in to change notification settings - Fork 285
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
handle dates and timedeltas in aggregators and DescribeSheet #2380
base: develop
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for putting this together, @midichef. I've been looking at it the past few days and I have a few comments and questions. I think you have a keen understanding of what we would want the behavior to be, and I'd like to take this opportunity to think deeply and clean up some things around aggregators, so that we don't make them a lot messier to work with and reason about.
@@ -107,13 +108,48 @@ def _funcRows(col, rows): # wrap builtins so they can have a .type | |||
def mean(vals): | |||
vals = list(vals) | |||
if vals: | |||
return float(sum(vals))/len(vals) | |||
if type(vals[0]) is date: | |||
vals = [d.timestamp() for d in vals] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think these special cases belong in the aggregators. The idea is that people can define their own aggregators with 'obvious' implementations and they should work reasonably. (and regardless, where these checks are, they should use isinstance
with the Python stdlib datetime as you're doing below).
To that end, the visidata date
and datedelta
classes both already have __float__
methods which convert them to float as you're doing here. So maybe the aggregator callsite can call aggregator.intype
on each value, and for these "more numeric" aggregators would set their intype to float
. Then they would need to convert the float result back into some result type, which as you've noted below can be tricky for date formats.
return sum(vals, start=type(vals[0] if len(vals) else 0)()) #1996 | ||
if vals: | ||
if type(vals[0]) is date: | ||
vd.error('dates cannot be summed') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be a vd.fail
as it's more of a user error. But it should also be outside the aggregation function.
if vals and len(vals) >= 2: | ||
if type(vals[0]) is date: | ||
vals = [d.timestamp() for d in vals] | ||
return datetime.timedelta(seconds=statistics.stdev(vals)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very interesting, I wonder how we can codify that the output type of stdev for both date and datedelta are datedelta. It's not quite as simple as making the result type datedelta
and passing it the float from the generic aggregations, since Python's timedelta (which datedelta is a subclass of) has a constructor with a first parameter of 'days' instead of 'seconds'.
vd.aggregator('avg', mean, 'arithmetic mean of values', type=float) | ||
vd.aggregator('mean', mean, 'arithmetic mean of values', type=float) | ||
vd.aggregator('median', statistics.median, 'median of values') | ||
vd.aggregator('min', min, 'minimum value', type=anytype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are min/max forcibly set to anytype
instead of letting them take the type of their source column?
I understand why avg/mean/median want to be more generic; maybe this is where we have both an 'input type' which remains float
for these, and a 'result type' which is more variable.
for agg_choice in agg_choices: | ||
agg = vd.aggregators.get(agg_choice) | ||
if not agg: continue | ||
aggs = agg if isinstance(agg, list) else [agg] | ||
for agg in aggs: | ||
aggval = agg(col, rows) | ||
typedval = wrapply(agg.type or col.type, aggval) | ||
dispval = col.format(typedval) | ||
if agg.name == 'stdev' and (col.type is date): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to figure out how to not have a very special type check here.
@@ -44,10 +45,6 @@ class DescribeSheet(ColumnsSheet): | |||
DescribeColumn('nulls', type=vlen), | |||
DescribeColumn('distinct',type=vlen), | |||
DescribeColumn('mode', type=str), | |||
DescribeColumn('min', type=str), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any functional reason to moving these into describe_aggrs? I think this must be why min/max types were changed to anytype
(and maybe they should not be str
here in the original code), so maybe this is not necessary.
For a column of dates with no rows, the sum was shown as "Dec 31, 1969".
Fixes the display of mean of a column of dates.
The sum aggregator can sum timedeltas. The sum() function could not.
Fixes type-related errors when using the aggregators in DescribeSheet.
92e80cf
to
ca2fc92
Compare
I just force-pushed, but these commits are just the original set of commits, moved to the latest develop branch, fixing some merge conflicts. I didn't change anything substantial. I'll respond to saulpw's review comments soon. |
This PR is makes the current aggregators (mean, median, sum) work with dates and timedeltas. It's a continuation of the date-handling in #1996.
One notable change is that the
stdev
of adate
is atimedelta
, not adate
. The benefits of this are, it formats more readably in the describe sheet (such as '200 days'). And it simplifies doing arithmetic with the result ofstdev
, like expressing another timedelta as a multiple of the stdev. The downside is that it breaks the assumption that an aggregator's result is the same type as the column being aggregated. That leads to some special case handling:visidata/visidata/aggregators.py
Line 249 in 5183411
And note that the
stdev
of atimedelta
is still atimedelta
.In order to make the aggregators and the DescribeSheet handle
date
andtimedeltas
, I've had to change the type of several aggregators toanytype
. Is this acceptable? Are there drawbacks?I've submitted this as a series of small commits for easier review. If this all looks okay, let me know, I'll squash it to a smaller number of commits before deploying.