How to get the last value in a time-binned aggregation? #4079

slmg · 2025-03-26T19:46:12Z

Hey everyone,

I'm working with a 100GB+ timeseries dataset where records come in every few milliseconds, and I'm trying to downsample it into 5-minute bins to reduce the data size. My approach was to create a new column like this:

df = df.with_column("time_bin", daft.col("timestamp").dt.truncate("5 minute"))

So far, so good. The next logical step for me was to group by time_bin and get the last value of a column within each 5-minute bin, something like:

df.groupby("time_bin").agg(df["value"].last())

However, I couldn't find any built-in last(), first(), or even arg_max("other_column") aggregation methods. Is there an existing way to achieve this in Daft?

I’d rather not use .mean(), since I just need a snapshot of the last recorded value in each bin. Did I miss something obvious?

I'm new to Daft, but it has given me a great first impression! I'd love to understand if there’s a recommended approach for this, or if there’s a reason these aggregation methods aren’t implemented (maybe due to distributed processing constraints?).

Thanks in advance! 🙏

colin-ho · 2025-04-16T20:47:09Z

Maintainer

One idea that could work is df.groupby("time_bin").agg(daft.col("value").agg_list().list.get(-1))

Essentially we concat the values into a list and then use list expressions to get the last element.

Also apologies for the late reply, we missed this.
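
For readers who want to see what the list-then-take-last idea computes, here is a minimal sketch in plain Python (standard library only, not Daft). The names `BIN`, `truncate`, and `last_per_bin` and the sample rows are made up for illustration. It assumes the rows are already sorted by timestamp; the thread does not say whether agg_list preserves row order within a group, so it may be safest to sort by timestamp first and to verify the result on a small sample.

```python
from collections import defaultdict
from datetime import datetime, timedelta

BIN = timedelta(minutes=5)  # hypothetical bin width, matching the question


def truncate(ts: datetime, width: timedelta = BIN) -> datetime:
    """Floor a timestamp to the start of its bin (bins aligned to the Unix epoch)."""
    epoch = datetime(1970, 1, 1)
    return ts - (ts - epoch) % width


def last_per_bin(rows):
    """rows: iterable of (timestamp, value) pairs, assumed sorted by timestamp."""
    grouped = defaultdict(list)
    for ts, value in rows:
        # Collect every value of a bin into one list (the "agg_list" step).
        grouped[truncate(ts)].append(value)
    # Take the final element of each list (the "list.get(-1)" step).
    return {bin_start: values[-1] for bin_start, values in grouped.items()}


# Made-up sample data: two rows fall in the 19:45 bin, one in the 19:50 bin.
rows = [
    (datetime(2025, 3, 26, 19, 46, 12), 1.0),
    (datetime(2025, 3, 26, 19, 49, 59), 2.0),
    (datetime(2025, 3, 26, 19, 50, 0), 3.0),
]
result = last_per_bin(rows)
```

Here `result` maps the 19:45 bin to 2.0 and the 19:50 bin to 3.0. Note that this approach holds every value of a bin in memory before discarding all but one, which may matter for millisecond-resolution data at the 100GB+ scale mentioned in the question.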
