Dropping duplicates based on single column #4263

mikeprince4 · 2025-04-28T18:37:26Z

Hi,

I am trying to drop duplicates based only on the "time" column, rather than on the entire row matching. I was able to get this to work with daft 4.11 by using a window function.

Is there a way to get around having to order_by("time")? It's unnecessary for our purpose, but the window function won't work otherwise.

Or perhaps there's an entirely different, better way to do this?

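import daft
from daft import Window
from daft.functions import row_number


def drop_time_duplicates(
    df: daft.DataFrame,
) -> daft.DataFrame:
    # Number the rows within each "time" partition, then keep only row 1.
    window_spec = Window().partition_by("time").order_by("time")
    result_df = df.with_column("row_num", row_number().over(window_spec))
    return result_df.filter(daft.col("row_num") == 1).exclude("row_num")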

Thanks!

rchowell · 2025-04-28T18:51:18Z

Maintainer

AFAIK your options are to use GROUP BY or a WINDOW function. Think of each set of rows that share a "time" value as a "group"; you then need to select one row from each group. In your case, you've used a window function ordered by time and taken the first row, and that is how the engine chose a row out of each group.
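To make the "group, then pick one row" idea concrete, here is a minimal plain-Python model of what partitioning by "time" and keeping row number 1 amounts to. It is not Daft's API or implementation; the helper name `keep_first_per_key` and the sample rows are made up for illustration.

```python
from typing import Any, Hashable


def keep_first_per_key(rows: list[dict[str, Any]], key: str) -> list[dict[str, Any]]:
    """Keep one row per distinct value of `key`: the first one encountered."""
    seen: dict[Hashable, dict[str, Any]] = {}
    for row in rows:
        # setdefault stores the row only if this key value has not been seen yet.
        seen.setdefault(row[key], row)
    return list(seen.values())


# Hypothetical sample data, two rows per "time" value.
rows = [
    {"time": 1, "v": "a"},
    {"time": 1, "v": "b"},
    {"time": 2, "v": "x"},
    {"time": 2, "v": "y"},
]
deduped = keep_first_per_key(rows, "time")
print(deduped)
```

Whole rows survive here, so the other columns are left untouched. Which row survives depends entirely on the order in which the rows are visited.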

If you use a GROUP BY, the remaining columns are not carried along the way they are in the window version. You must tell the engine what to return for each of them, for example with aggregation functions like min/max, albeit that's a bit strange.

SELECT time, agg(col1), agg(col2), ... FROM T GROUP BY time

This would only work if your additional columns make sense with aggregation functions, otherwise WINDOW function is the way to do this.
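One reason per-column aggregation may not fit: each column is aggregated independently, so the output row can combine values from different input rows. The sketch below models that in plain Python (not Daft or SQL); the column names `a` and `b` and the data are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample data: two rows sharing the same "time" value.
rows = [
    {"time": 1, "a": 10, "b": "z"},
    {"time": 1, "a": 20, "b": "y"},
]

# Collect the rows of each "time" group.
groups = defaultdict(list)
for row in rows:
    groups[row["time"]].append(row)

# Equivalent in spirit to: SELECT time, min(a), min(b) FROM T GROUP BY time
aggregated = [
    {"time": t, "a": min(r["a"] for r in g), "b": min(r["b"] for r in g)}
    for t, g in groups.items()
]
print(aggregated)
```

The result takes `a` from the first row and `b` from the second, so it matches neither original row. That is fine when the columns are independent measurements, and wrong when a row has to stay intact.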

2 replies

mikeprince4 Apr 28, 2025
Author

Thanks -- yeah, we don't want to aggregate the values in the other columns, so we need to stick with the WINDOW function.

But we don't actually care which row is selected, so the sort is not necessary in our case.

rchowell Apr 28, 2025
Maintainer

Unfortunately I don't believe there is any way to express this, and it would be undefined behavior. I'm curious whether other systems have something? Groupings are by definition unordered in the absence of some explicit ordering, and there's no concept of "take one" from an unordered collection. So you take one via order+limit, which is the same as your row_num == 1.
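The order+limit idea can be sketched in plain Python as below. This is only a model of the semantics, not how Daft executes a window; `first_per_key` and the sample rows are hypothetical. It also shows that ordering by the partition key alone leaves every row in a group tied, so the pick falls back to whatever order the rows arrived in.

```python
from itertools import groupby
from operator import itemgetter


def first_per_key(rows, key, order_by):
    """Sort by (key, order_by), then take the first row of each key group."""
    ordered = sorted(rows, key=itemgetter(key, order_by))
    return [next(group) for _, group in groupby(ordered, key=itemgetter(key))]


# Hypothetical sample data.
rows = [
    {"time": 1, "v": "b"},
    {"time": 1, "v": "a"},
    {"time": 2, "v": "y"},
    {"time": 2, "v": "x"},
]

# Ordering by the partition key alone leaves ties; Python's stable sort
# keeps input order, so the chosen row changes when the input order changes.
by_time_only = first_per_key(rows, "time", "time")
by_time_only_reversed = first_per_key(rows[::-1], "time", "time")

# Ordering by a column that distinguishes the rows gives the same answer
# regardless of input order.
by_v = first_per_key(rows, "time", "v")
by_v_reversed = first_per_key(rows[::-1], "time", "v")

print(by_time_only, by_time_only_reversed)
print(by_v, by_v_reversed)
```

If any row will do, the first variant is enough. If reruns must agree, the ordering column has to break ties within each group.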

This piqued my interest, and I thought it might be possible with a JOIN. You can do this with a JOIN, but (1) daft does not have a lateral join, (2) it's likely worse perf than window+order, and (3) it's undefined behavior that could produce different results on consecutive runs.

-- tested on postgres 17

CREATE TABLE T (v text, ts int);

INSERT INTO T VALUES
    ('a', 1),
    ('b', 1),
    ('c', 1),
    ('x', 2),
    ('y', 2),
    ('z', 2);

SELECT * FROM
    (SELECT DISTINCT ts FROM T AS lhs) AS lhs,
    LATERAL
    (SELECT v FROM T AS rhs WHERE rhs.ts = lhs.ts LIMIT 1);  -- !! limit without order by "takes" one

ts	v
2	x
1	a

rchowell · 2025-05-02T16:49:54Z

Maintainer

What about groupby with any_value? I just came across this... I have a SQL background, so this is new to me 😄

In [1]: import daft

In [2]: df = daft.from_pylist([
   ...:     { "v": "a", "t": 1 },
   ...:     { "v": "b", "t": 1 },
   ...:     { "v": "c", "t": 1 },
   ...:     { "v": "x", "t": 2 },
   ...:     { "v": "y", "t": 2 },
   ...:     { "v": "z", "t": 2 },
   ...: ])

In [3]: df.groupby("t").any_value("v").show()
╭───────┬──────╮
│ t     ┆ v    │
│ ---   ┆ ---  │
│ Int64 ┆ Utf8 │
╞═══════╪══════╡
│ 1     ┆ a    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2     ┆ x    │
╰───────┴──────╯

(Showing first 2 of 2 rows)

0 replies
