[aggr-] allow ranking rows by key column #2417

midichef · 2024-05-31T06:54:34Z

This PR adds a rank aggregator that returns a list, and a command addcol-rank, which adds a new column with the rank of each row. Ranks are calculated by comparing key columns.

It also fixes a bug in memo-aggregate where long output takes an extremely long time to show up in the statusbar.
For example, run seq 1222333 | vd -, then z+ list. After the list is calculated, VisiData gets stuck for many seconds showing processing…, because running format() on a long sequence is very slow.
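
One way to keep the statusbar responsive is to format only a bounded prefix of a long result. The sketch below is not the code in this PR; format_preview and max_items are hypothetical names used only for illustration.

```python
from itertools import islice

def format_preview(values, max_items=10):
    """Format at most max_items elements of a possibly very long iterable."""
    # Pull one extra element so we can tell whether anything was cut off.
    head = list(islice(values, max_items + 1))
    shown = ', '.join(str(v) for v in head[:max_items])
    suffix = ', …' if len(head) > max_items else ''
    return f'[{shown}{suffix}]'

big = range(1, 1222334)  # same length as the output of `seq 1222333`
preview = format_preview(big)
print(preview)
```

The cost of formatting then depends on max_items rather than on the length of the list.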

I think it's worth having an aggregator for rank, and the need for a simpler solution than the current method has come up before. On the other hand, I know part of Visidata philosophy is that it's not a spreadsheet. How do people feel about having a rank aggregator?

Also, in its current form, the rank aggregator will give errors when comparing key columns with different types across 2 rows:

File "/home/midichef/.local/lib/python3.10/site-packages/visidata/aggregators.py", line 169, in rank
    keys_sorted = sorted(((rowkey, i) for i, rowkey in enumerate(keys)), key=_key_progress(prog))
TypeError: '<' not supported between instances of 'float' and 'list'

What's the standard way to handle sorting mixed types for Visidata?

saulpw · 2024-06-06T00:04:06Z

> What's the standard way to handle sorting mixed types for Visidata?

The standard way is to convert the column into a known type, and then anything that can't be converted (errors and nulls) become TypedWrappers which are sortable with any type. Does that work acceptably here too?
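
The idea of "convert to a known type, and wrap whatever fails in something sortable against any type" can be sketched in plain Python. This does not reproduce VisiData's TypedWrapper; the Unconvertible class and to_float helper are hypothetical, and placing wrapped values before real values is an assumption made for the example.

```python
class Unconvertible:
    """Hypothetical wrapper for a cell that could not be converted."""
    def __init__(self, raw):
        self.raw = raw

    def __lt__(self, other):
        # A wrapper sorts before any real value, and ties with other wrappers.
        return not isinstance(other, Unconvertible)

    def __gt__(self, other):
        # Used as the reflected operation when Python evaluates `3.5 < wrapper`.
        return False

def to_float(cell):
    """Convert a cell to float, wrapping it if conversion fails."""
    try:
        return float(cell)
    except (TypeError, ValueError):
        return Unconvertible(cell)

cells = ['3.5', [1, 2], '1', None, '20']
ordered = sorted(to_float(c) for c in cells)
numbers = [x for x in ordered if not isinstance(x, Unconvertible)]
print(numbers)
```

Sorting the raw cells would raise a TypeError like the one in the traceback above; after conversion, every pair of values is comparable.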

midichef · 2024-06-22T08:17:37Z

Yes, that seems like it should work. Should the rank aggregator pick the known type, and if so, which one? Or is it the user who should convert the column?

saulpw · 2024-07-01T05:58:51Z

Since it's not obvious which type to pick, the user can convert the column.

saulpw

I love what this is adding, and I think with a few tweaks it would be even more powerful!

visidata/aggregators.py

midichef · 2024-07-20T07:49:09Z

There are two kinds of ranking operations people may want.

keycol-based rank within sheet: what is the rank of this row, vs. all rows in the sheet, ranking by the value of its key columns? In this example, the key column is keycol, and the current column col is ignored:

keycol	col	keycol_sheetrank
1	10	1
1	20	1

2	60	2
2	50	2
2	30	2

column-based rank within group: when grouping the rows by key columns, what is the rank of this row, within its group? The current column determines the rank. In this example, the current column is col:

keycol	col	col_grouprank
1	10	1
1	20	2

2	60	3
2	50	2
2	30	1
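
The two operations above can be sketched in plain Python to reproduce both example tables. This is an illustration, not the PR's implementation: the function names are hypothetical, rows are modeled as (keycol, col) tuples, and giving tied values the same rank inside a group is an assumption, since the tables contain no ties within a group.

```python
from collections import defaultdict

def dense_ranks(values):
    """Map each value to its 1-based position among the sorted distinct values."""
    rank_of = {v: n for n, v in enumerate(sorted(set(values)), start=1)}
    return [rank_of[v] for v in values]

def sheet_rank(rows, key):
    """Rank every row against all rows in the sheet, by key only."""
    return dense_ranks([key(r) for r in rows])

def group_rank(rows, key, value):
    """Rank every row against the rows sharing its key, by value."""
    groups = defaultdict(list)          # key -> row indexes in that group
    for i, r in enumerate(rows):
        groups[key(r)].append(i)
    ranks = [None] * len(rows)
    for indexes in groups.values():
        group_values = [value(rows[i]) for i in indexes]
        for i, rank in zip(indexes, dense_ranks(group_values)):
            ranks[i] = rank
    return ranks

rows = [(1, 10), (1, 20), (2, 60), (2, 50), (2, 30)]
keycol = lambda r: r[0]
col = lambda r: r[1]
sheet_result = sheet_rank(rows, keycol)
group_result = group_rank(rows, keycol, col)
print(sheet_result)
print(group_result)
```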

What is a good name for these two aggregators? sheetrank and grouprank? Or maybe rank_key and rank_col?
Any suggestions?

midichef · 2024-07-29T04:38:21Z

Okay, I implemented a command that adds a column and applies an aggregator to rows after grouping them by key columns. It's addcol-aggregate.

To get this to work with list aggregators, I made a new class ListAggregator for aggregators that return lists. Their most common use would be with addcol-aggregate. Right now the only two ListAggregators are list and rank.

I also tried making a sheetrank aggregator, but it's too different from normal aggregators. Normal aggregators apply to a column, but sheetrank is more for the sheet. So I broke it out into a separate command, addcol-sheetrank.

I'm a bit unsure about the new behavior of the list aggregator when used with addcol-aggregate. Right now, if the input column has cells with Exceptions, they show up in the new column. But the error text is shown directly in the cell instead of being hidden behind an error note (!). So I could use guidance on a couple of issues here:

Should these Exceptions be passed through by the list aggregator, or should they be translated to null?
If they should be passed through the aggregator, how do I make them look/behave like the original cell with an exception?
The relevant code is here:

visidata/visidata/aggregators.py

Line 120 in c6c608e

vals = [ col.getTypedValue(r) for r in row_group ]

To see it in action, vd sample_data/test.jsonl, then addcol-aggregate list. The key1 and key1_list columns ought to look the same, but the fourth cell in key1_list reads Expecting ':' delimiter: line 1 column 34 (char 33) instead of empty.

midichef · 2024-07-29T04:54:09Z

There is one detail about the grouping in addcol-aggregate. If the key column holds multiple cells with null, all nulls are grouped together as one group of rows. But if the key column holds multiple error cells, each error cell forms its own unique group of 1 row, even if all the errors have the same traceback text. I didn't design this, it's just how it behaved on sorting. Does that error cell treatment sound reasonable?
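
A plain-Python analogy shows how this kind of asymmetry can arise. It is only an illustration and makes no claim about where the behavior comes from in VisiData: None compares equal to None, while two exception instances with identical text are distinct objects that do not compare equal.

```python
from collections import defaultdict

# Key values for five rows: two nulls, two errors with the same text, one number.
keys = [None, ValueError('bad'), None, ValueError('bad'), 1]

groups = defaultdict(list)              # key value -> row numbers
for rownum, k in enumerate(keys):
    groups[k].append(rownum)

group_sizes = sorted(len(members) for members in groups.values())
print(group_sizes)
```

Here the two nulls share one group, and each error lands in a group of its own.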

saulpw

Can we move most of this into features/addcol_sheetrank.py? It's a fair amount of code and I'd like to make it self-contained as much as possible.

midichef · 2024-10-08T04:48:24Z

Okay, I moved the addcol-sheetrank code into a features/ file.

I had to move the RankAggregator class there too, because it relies on functions that moved with addcol-sheetrank. Because the new file is not just for the addcol-sheetrank functionality, I named it features/rank.py instead of features/addcol_sheetrank.py.

One notable consequence of this move is that the new features/rank.py file modifies vd.aggregators outside of aggregators.py:

visidata/visidata/features/rank.py

Line 41 in ec28446

vd.aggregators['rank'] = RankAggregator('rank', anytype, helpstr='list of ranks, when grouping by key columns', listtype=int)

saulpw · 2025-01-11T23:45:29Z

visidata/aggregators.py

+        for aggr in aggrs:
+            rows = aggregate_groups(sheet, col, sheet.rows, aggr)
+            if isinstance(aggr, ListAggregator):
+                t = aggr.listtype or col.type


Do we need a separate listtype? Seems like we could just use the same aggr.type in both cases, and remove this isinstance (which is usually a code smell for me).

midichef · 2025-01-17

The way list aggregators work now, there is a need for two distinct types, type and listtype. type is for the result of the aggregator. For example, this is used by memo-aggregate. That's why type is anytype for ListAggregators. This type would be used whenever we want to hold the entire result (a list) in a cell.

But we also need a separate type for the elements of the list. This is for when the aggregator result goes in a column, like for addcol-aggregate, where each cell holds not the result itself, but an element of the list result.

If I try to get rid of either type, I run into problems. For RankAggregator, if I drop listtype=int and switch to type=int, I get an error in the statusbar for z+ rank:
'''
text_rank=int() argument must be a string, a bytes-like object or a real number, not 'list'
'''
But if instead I make RankAggregator use type=anytype, the column added by addcol-aggregate rank does not get the type int.
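
The two roles can be sketched as follows. This is a simplified stand-in, not the class in the PR: ListAggregatorSketch, rank_list, and this anytype are hypothetical, and only the attribute names type and listtype come from the discussion above.

```python
def anytype(value):
    """Pass the value through unchanged."""
    return value

def rank_list(values):
    """Rank each value by its 1-based position in ascending order."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

class ListAggregatorSketch:
    def __init__(self, name, func, type=anytype, listtype=None):
        self.name = name
        self.func = func
        self.type = type          # applied to the whole result (one cell holds the list)
        self.listtype = listtype  # applied to each element (one cell per row)

rank = ListAggregatorSketch('rank', rank_list, type=anytype, listtype=int)
vals = [60, 50, 30]

memo_cell = rank.type(rank.func(vals))                       # whole list in one cell
column_cells = [rank.listtype(x) for x in rank.func(vals)]   # one typed value per row

# Using int for the whole result fails, because the result is a list.
try:
    int(rank.func(vals))
    whole_result_error = None
except TypeError as e:
    whole_result_error = e
print(memo_cell, column_cells, whole_result_error)
```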

The need for two types is awkward. And I see your point about isinstance being a code smell. (That is a helpful heuristic, and I'll use it in the future.) It's accurately pointing out strain in the design: most aggregators produce a single value, list aggregators produce a list.

Maybe rank should not be an aggregator. It's unlikely people want a list object holding the ranks. Most people want a column holding the ranks. What if we replace addcol-aggregate+rank with an equivalent command addcol-grouprank (in addition to the existing addcol-sheetrank)? And we would reserve addcol-aggregate for finding group values like sum, mean, median, as you suggested earlier. What do you think?
(I would also consider changing the name addcol-aggregate. Maybe to addcol-group-aggregate.)

saulpw · 2025-01-12T00:28:04Z

Okay, this one is quite old and quite big at this point! I gave it another pass through, and my remaining questions have to do with aggr.listtype (in review comment), and the vd.aggregate_list function, which seems like it might be an unnecessary and confusing API function. I've asked @anjakefala to look over the behavior too. It'll be nice to finally get this one merged; thanks for your patience on this!

midichef · 2025-01-12T02:04:56Z

Let me look into why I added a custom stdev function with b41afba and then removed it in ec28446. That's probably an oversight by me.

anjakefala · 2025-01-12T06:33:47Z

@midichef I think your PR successfully meets your goals. This is such thoughtfully considered work.

Once you have your final changes in, I am going to add an update to the guide in this PR, with documentation.

I have one question for my own clarity: what is rank's intended behaviour when there are multiple key columns?

Say

keycol	keycol_2	Item
1	10	Pen
1	20	Pencil
2	60	Book
2	50	Pen
2	30	Book

midichef · 2025-01-17T04:14:57Z

Okay I submitted two minor changes, 20ba354 (which corrects an accidental reversion of a line from a previous commit) and 9ed9f4b (which adds a few errors/warnings as guardrails, since I expect these operations to be difficult for users to understand at first).
Once these are approved, can you squash them into d35738b ?
I would do it myself now, but I want to make it easy to see that I have only made minor changes today.

anjakefala · 2025-01-17T18:50:38Z

@midichef Yes, this is why I was going to update the docs as part of the merging of this PR. I think it's really hard to understand what these commands are solely from their name and one-line description (which isn't your fault, these are inherently complicated concepts).

But I will mull over whether I have better names at hand... Maybe a good question for the VisiData Discord.

midichef · 2025-05-11T01:47:05Z

I fixed some merge conflicts with the current develop branch, and rebased all previous fixes onto the latest develop. I also did the squashing that I had requested above. These new commits don't change the substance of any code from the previous commits.

midichef · 2025-05-11T01:54:49Z

Now I've added f5210e5.
When doing within-group rankings via addcol-aggregate rank, ranks were previously assigned after sorting the column's elements in ascending order. This commit instead uses the sort order set on the column: if it's descending, the elements are ranked in descending order.

I believe it's the behavior users will expect when the column is already sorted in descending order. That's how I realized I wanted this feature. I had a column sorted descending, and ranked it, and was surprised when it was ranked ascending instead.
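
The effect of honoring the sort direction can be sketched with a reverse flag. This is a minimal illustration, not the code from f5210e5; rank_values and its reverse parameter are hypothetical.

```python
def rank_values(values, reverse=False):
    """Rank values by position among the sorted distinct values.

    reverse=False ranks the smallest value 1; reverse=True ranks the largest 1.
    """
    distinct = sorted(set(values), reverse=reverse)
    rank_of = {v: n for n, v in enumerate(distinct, start=1)}
    return [rank_of[v] for v in values]

col = [60, 50, 30]  # a column already sorted in descending order
ascending_ranks = rank_values(col)
descending_ranks = rank_values(col, reverse=True)
print(ascending_ranks, descending_ranks)
```

With the descending ranking, the first row of a descending-sorted column gets rank 1, which matches the expectation described above.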

midichef · 2025-05-17T18:18:57Z

I still have a little more to add to this PR. Now that the rank operator uses the sort order of the current column, for consistency I'll make addcol-sheetrank also use the sort order of the key columns. And I need to add Progress updates while ranking is underway.

midichef · 2025-05-20T00:49:24Z

I've added the pending features. This PR is done in terms of implementing features.

What remains to do is to come up with intuitive names and explanations for addcol-aggregate rank vs addcol-sheetrank. I don't have any ideas on how to better do that.

Also, I'm not sure why the ci-build tests are failing: the failing test tests/load-http.vd passes on my system but fails here on GitHub.

anjakefala · 2025-05-20T06:05:53Z

@midichef It might be a one-off CI blunder. Though, the test output is supposed to show when the error code is 1 😭 I'll have to investigate that.

midichef · 2025-06-10T07:10:14Z

I've got one more small change queued up to add to this PR: better error handling if the sort fails during ranking. It's a minor change and shouldn't require much review.

anjakefala · 2025-06-13T05:19:54Z

I changed addcol-sheetrank to addcol-rank-sheet to better fit the naming scheme. =)

I do have a guide I'm working on, but my brain has been a bit cloudy, and I don't want to block this PR on me finishing it. So I'll merge for now, and push the guide when it's ready.

Thanks for all your work @midichef!

saulpw requested changes Jul 1, 2024

midichef force-pushed the aggr_rank branch 2 times, most recently from 8078fb6 to c6c608e Compare July 29, 2024 04:20

midichef requested a review from saulpw July 29, 2024 04:38

midichef force-pushed the aggr_rank branch from c6c608e to 3ef6238 Compare July 29, 2024 04:53

midichef mentioned this pull request Sep 9, 2024

Should addcol-window pad first list with None to indicate no rows above first row? #2279

Closed

anjakefala added 3.1 waiting on maintainer labels Sep 22, 2024

saulpw requested changes Oct 4, 2024

anjakefala added waiting on contributor and removed waiting on maintainer labels Oct 4, 2024

midichef force-pushed the aggr_rank branch from 3ef6238 to ec28446 Compare October 8, 2024 04:39

midichef requested a review from saulpw October 8, 2024 04:49

anjakefala added waiting on maintainer and removed waiting on contributor labels Oct 12, 2024

saulpw removed the 3.1 label Oct 15, 2024

saulpw reviewed Jan 11, 2025

anjakefala self-requested a review January 12, 2025 03:49

anjakefala added waiting on contributor and removed waiting on maintainer labels Jan 12, 2025

anjakefala removed the waiting on contributor label Mar 14, 2025

midichef force-pushed the aggr_rank branch from 9ed9f4b to be98fa2 Compare May 11, 2025 01:44

anjakefala added waiting on contributor and removed confirm-fix labels May 18, 2025

midichef added 6 commits May 19, 2025 17:29

[aggr-] cap runtime when formatting memo status

97e1812

[aggr-] fix chooser lacking aggs starting with 'p'

5417f19

[aggr-] display stdev error note for lists of size 1

bfffb47

It is needed now that addcol-aggregate can apply stdev to groups, which may include lists of size 1.

[aggr-] add rank aggregator, cmds addcol-aggregate/sheetrank

3ff8104

[rank-] make rank and sheetrank use column sort orderings

149370a

The 'rank' aggregator uses the sort direction of the current column. addcol-sheetrank uses the sort order and directions of keycolumns only.

[rank-] add Progress indicators to rank and sheetrank

a06ace7

midichef force-pushed the aggr_rank branch from 82b338d to a06ace7 Compare May 20, 2025 00:29

anjakefala added waiting on maintainer and removed waiting on contributor labels May 20, 2025

saulpw added confirm-fix and removed waiting on maintainer labels Jun 9, 2025

saulpw approved these changes Jun 9, 2025

saulpw and others added 2 commits June 8, 2025 19:55

remove extra import

351796f

Merge branch 'develop' into aggr_rank

3d44bca

[rank-] rename to addcol-rank-sheet

ef9f383

anjakefala force-pushed the aggr_rank branch from ecfadb8 to ef9f383 Compare June 13, 2025 05:12

anjakefala merged commit 2c4dbbc into saulpw:develop Jun 13, 2025
14 checks passed
