Feature request
Today, we have these ways to aggregate the values of a single nested column:
- `nf.reduce(np.mean, "lc.mag")` - good, but not cheap, and requires joining the output back to the frame
- `nf.eval("lc.mag.groupby(by=lc.mag.index).mean()")` - expensive and not intuitive
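For illustration, here is a minimal sketch of what the `groupby` variant computes, using plain pandas rather than nested-pandas. It assumes the flat values of `lc.mag` are a `Series` whose index repeats the parent row label once per nested element; the data are made up.

```python
import pandas as pd

# Made-up flat values for "lc.mag": parent row 0 has 3 points, row 1 has 2.
flat_mag = pd.Series(
    [1.0, 2.0, 3.0, 10.0, 20.0],
    index=[0, 0, 0, 1, 1],
    name="mag",
)

# Aggregating the flat values directly gives one number for the whole column.
overall = flat_mag.mean()

# Grouping by the repeated index gives one mean per parent row.
per_row = flat_mag.groupby(level=0).mean()

print(overall)
print(per_row)
```

The per-row result is what we actually want from a nested aggregation; the single flat mean is what a plain `.mean()` call returns.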
It would be nice if we could develop an easier way of doing such aggregations. Options I see:
- Currently, we can do `nf.eval("lc.mag.mean()")` / `nf["lc.mag"].mean()`, but it would output the aggregation over all the flat values, which is, especially in the first case, not intuitive. We can redefine it.
- Add a special interface for nested aggregations with the `.nest` accessor, e.g. `nf.lc.nest.mean()` would return `nf.shape[0]` mean values.
- Add special methods which would work in the `eval`/`query` environment only, e.g. `nf.eval("lc.mag.nest_mean()")`
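A rough sketch of how the accessor option could feel from the user side, using only pandas' public accessor registration. The accessor name `nest_demo` and the class `NestDemoAccessor` are hypothetical and chosen to avoid clashing with any real `.nest` accessor; the sketch works on a flat `Series` with a repeated parent index, not on an actual nested column.

```python
import pandas as pd


# Hypothetical accessor, registered under a made-up name for illustration only.
@pd.api.extensions.register_series_accessor("nest_demo")
class NestDemoAccessor:
    def __init__(self, series: pd.Series) -> None:
        self._series = series

    def mean(self) -> pd.Series:
        # One mean per distinct parent index label, in order of first appearance.
        return self._series.groupby(level=0, sort=False).mean()


# Made-up flat values: parent row 0 has 3 points, row 1 has 2.
flat_mag = pd.Series([1.0, 2.0, 3.0, 10.0, 20.0], index=[0, 0, 0, 1, 1])

per_row = flat_mag.nest_demo.mean()
print(per_row)
```

This still goes through `groupby`, so it only demonstrates the interface, not a faster implementation.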
However, I'm not sure how we'd make all of these performant; it looks like `pyarrow` provides almost zero tooling for that. Maybe we can use things like `numpy.ufunc.reduceat` and `scipy.ndimage.mean`.
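As a sketch of the `reduceat` idea: if we have the flat values and the list offsets as NumPy arrays (roughly what an Arrow list array stores), per-row means can be computed without a Python-level loop. The function name `row_means` and the data are made up, and the sketch assumes every row has at least one element, since `np.add.reduceat` does not return 0 for empty segments.

```python
import numpy as np


def row_means(flat: np.ndarray, offsets: np.ndarray) -> np.ndarray:
    """Mean of each segment flat[offsets[i]:offsets[i + 1]].

    Assumes no empty segments (offsets strictly increasing).
    """
    starts = offsets[:-1]
    counts = np.diff(offsets)
    # Sum each segment in one vectorized call, then divide by its length.
    sums = np.add.reduceat(flat, starts)
    return sums / counts


# Made-up data: three rows with 3, 2 and 1 nested elements.
flat = np.array([1.0, 2.0, 3.0, 10.0, 20.0, 5.0])
offsets = np.array([0, 3, 5, 6])

means = row_means(flat, offsets)
print(means)
```

`scipy.ndimage.mean(flat, labels=..., index=...)` could probably play the same role with per-element row labels instead of offsets, and might be easier to make robust to empty rows, but I have not benchmarked either.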
Before submitting
Please check the following:
- I have described the purpose of the suggested change, specifying what I need the enhancement to accomplish, i.e. what problem it solves.
- I have included any relevant links, screenshots, environment information, and data relevant to implementing the requested feature, as well as pseudocode for how I want to access the new functionality.
- If I have ideas for how the new feature could be implemented, I have provided explanations and/or pseudocode and/or task lists for the steps.