-
-
Notifications
You must be signed in to change notification settings - Fork 312
[aggr-] allow ranking rows by key column #2417
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The standard way is to convert the column into a known type, and then anything that can't be converted (errors and nulls) become |
Yes, that seems like it should work. Should the rank aggregator pick the known type, and if so, which one? Or is it the user who should convert the column? |
Since it's not obvious which type to pick, the user can convert the column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love what this is adding, and I think with a few tweaks it would be even more powerful!
There are two kinds of ranking operations people may want.
What is a good name for these two aggregators? |
8078fb6
to
c6c608e
Compare
Okay, I implemented a command that adds a column and applies an aggregator to rows after grouping them by key columns. It's To get this to work with list aggregators, I made a new class I also tried making a I'm a bit unsure about the new behavior of the
|
There is one detail about the grouping in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move most of this into features/addcol_sheetrank.py
? It's a fair amount of code and I'd like to make it self-contained as much as possible.
Okay, I moved the I had to move the One notable consequence of this move is that the new visidata/visidata/features/rank.py Line 41 in ec28446
|
for aggr in aggrs: | ||
rows = aggregate_groups(sheet, col, sheet.rows, aggr) | ||
if isinstance(aggr, ListAggregator): | ||
t = aggr.listtype or col.type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a separate listtype
? Seems like we could just use the same aggr.type
in both cases, and remove this isinstance
(which is usually a code smell for me).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way list aggregators work now, there is a need for two distinct types, type
and listtype
. type
is for the result of the aggregator. For example, this is used by memo-aggregate
. That's why type
is anytype
for ListAggregators
. This type would be used whenever we want to hold the entire result (a list) in a cell.
But we also need a separate type for the elements of the list. This is for when the aggregator result goes in a column, like for addcol-aggregate
, where each cell holds not the result itself, but an element of the list result.
If I try to get rid of one or the other types, I run into problems. For RankAggregator
, if I get rid of the listtype=int
switch to type=int
, I get an error in the statusbar for z+
rank
:
'''
text_rank=int() argument must be a string, a bytes-like object or a real number, not 'list'
'''
But if instead I make RankAggregator
use type=anytype
, the column added by addcol-aggregate
rank
does not get the type int
.
The need for two types is awkward. And I see your point about isinstance
being a code smell. (That is a helpful heuristic, and I'll use it in the future.) It's accurately pointing out strain in the design: most aggregators produce a single value, list aggregators produce a list.
Maybe rank
should not be an aggregator. It's unlikely people want a list object holding the ranks. Most people want a column holding the ranks. What if we replace addcol-aggregate
+rank
with an equivalent command addcol-grouprank
(in addition to the existing addcol-sheetrank
)? And we would reserve addcol-aggregate
for finding group values like sum
, mean
, median
, as you suggested earlier. What do you think?
(I would also consider changing the name addcol-aggregate
. Maybe to addcol-group-aggregate
.)
Okay, this one is quite old and quite big at this point! I gave it another pass through, and my remaining questions have to do with |
@midichef I think your PR successfully meets your goals. This is such thoughtfully considered work. Once you have your fnal changes in, I am going to add an update to the guide in this PR, with documentation. I have one question for my own clarity: what is rank's intended behaviour when there are multiple key columns? Say keycol keycol_2 Item |
Okay I submitted two minor changes, 20ba354 (which corrects an accidental reversion of a line from a previous commit) and 9ed9f4b (which adds a few errors/warnings as guardrails, since I expect these operations to be difficult for users to understand at first). |
@midichef Yes, this is why I was going to update the docs as part of the merging of this PR. I think it's really hard to understand what these commands are solely from their name and one-line description (which isn't your fault, these are inherently complicated concepts). But, I will mull over if I have better names at hand....Maybe a good question for the VisiData discord. |
I fixed some merge conflicts with the current develop branch, and rebased all previous fixes onto the latest develop. I also did the squashing that I had requested above. These new commits don't change the substance of any code from the previous commits. |
Now I've added f5210e5. I believe it's the behavior users will expect when the column is already sorted in descending order. That's how I realized I wanted this feature. I had a column sorted descending, and ranked it, and was surprised when it was ranked ascending instead. |
I still have a little more to add to this PR. Now that the |
It is needed now that addcol-aggregate can apply stdev to groups, which may include lists of size 1.
The 'rank' aggregator uses the sort direction of the current column. addcol-sheetrank uses the sort order and directions of keycolumns only.
I've added the pending features. This PR is done in terms of implementing features. What remains to do is to come up with intuitive names and explanations for Also, I'm not sure why the ci-build tests are failing, as the failing test |
@midichef It might be a one-off CI blunder. Though, the test output is supposed to show when the error code is 1 😭 I'll have to investigate that. |
I've got one more small change queued up to add to this PR: better error handling if the sort fails during ranking. It's a minor change and shouldn't require much review. |
I changed I do have a guide I'm working on, but my brain has been a bit cloudy, and I don't want to block this PR on me finishing it. So I'll merge for now, and push the guide when it's ready. Thanks for all your work @midichef! |
This PR adds a
rank
aggregator that returns a list, and a commandaddcol-rank
, which adds a new column with the rank of each row. Ranks are calculated by comparing key columns.It also fixes a bug in
memo-aggregate
where long output takes an extremely long time to show up in the statusbar.For example:
seq 1222333 |vd -
, thenz+
list
. After the list is calculated, visidata will get stuck for many seconds showingprocessing…
, because it's very slow to runformat()
on a long sequence.I think it's worth having an aggregator for rank, and the need for a simpler solution than the current method has come up before. On the other hand, I know part of Visidata philosophy is that it's not a spreadsheet. How do people feel about having a rank aggregator?
Also, in its current form, the rank aggregator will give errors when comparing key columns with different types across 2 rows:
What's the standard way to handle sorting mixed types for Visidata?