-
Notifications
You must be signed in to change notification settings - Fork 25
Description
The problem
There is a .iat missing in the definition of percentile that is used by .ww.describe() (see here).
Reproducible example, using woodwork==0.31.0
This demonstrates the bug:
import pandas as pd
df = pd.DataFrame({'a':[1, 2, 3, 4, 5]})
df['new_index'] = [3, 4, 5, 6, 7] # Note that for this example, the 1st index value can't be 1. Use any other integer.
df.set_index('new_index', inplace=True)
import woodwork as ww
df.ww.init()
df.ww.describe()
The error is KeyError: 1 because it's trying to look up index 1 in the a Series. (Here I've set it to 3).
The bug comes from the woodwork definition of percentile for a pandas series:
import math
def percentile(N, percent, count):
"""
Source: https://stackoverflow.com/questions/2374640/how-do-i-calculate-percentiles-with-python-nump
Find the percentile of a list of values.
Args:
N (pd.Series): Series is a list of values. Note N MUST BE already sorted.
percent (float): float value from 0.0 to 1.0.
count (int): Count of values in series
Returns:
Value at percentile.
"""
k = (count - 1) * percent
f = math.floor(k)
c = math.ceil(k)
if f == c:
return N[int(k)] # <-- the stack overflow question was written for lists, and this line needs to be modified for pd.Series.
d0 = N.iat[int(f)] * (c - k)
d1 = N.iat[int(c)] * (k - f)
return d0 + d1
Run this example by setting series = df['a'] and running percentile() directly.
series = df['a']
percentile(N=series, percent=0.25, count=len(series))
Result:
In [3]: series = df['a']
...: percentile(N=series, percent=0.25, count=len(series))
...:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/.pyenv/versions/3.11.9/envs/240912_woodwork_bug_minimal_example/lib/python3.11/site-packages/pandas/core/indexes/base.py:3653, in Index.get_loc(self, key)
3652 try:
-> 3653 return self._engine.get_loc(casted_key)
3654 except KeyError as err:
File ~/.pyenv/versions/3.11.9/envs/240912_woodwork_bug_minimal_example/lib/python3.11/site-packages/pandas/_libs/index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()
File ~/.pyenv/versions/3.11.9/envs/240912_woodwork_bug_minimal_example/lib/python3.11/site-packages/pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:2606, in pandas._libs.hashtable.Int64HashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:2630, in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: 1
Note that in cases when int(k) happens to be a value in the index it will give you an answer, but there is no reason it will actually correspond to the intended position!
If I'd had
df2 = pd.DataFrame({'a':[1, 2, 3, 4, 5]})
df2['new_index'] = [3, 4, 5, 6, 1]
df2.set_index('new_index', inplace=True)
then percentile(N=df2['a'], percent=0.25, count=len(series)) gives 5 for the 25th percentile 😱
In [9]: df2 = pd.DataFrame({'a':[1, 2, 3, 4, 5]})
...: df2['new_index'] = [3, 4, 5, 6, 1]
...: df2.set_index('new_index', inplace=True)
In [10]: df2
Out[10]:
a
new_index
3 1
4 2
5 3
6 4
1 5
In [11]: percentile(N=df2['a'], percent=0.25, count=len(series))
Out[11]: 5
The fix
The fix is quite simple. Add a .iat like the lines below have.
import math
# Proposed fix (add .iat to the if f==c case):
def percentile(N, percent, count):
"""
Source: https://stackoverflow.com/questions/2374640/how-do-i-calculate-percentiles-with-python-nump
Find the percentile of a list of values.
Args:
N (pd.Series): Series is a list of values. Note N MUST BE already sorted.
percent (float): float value from 0.0 to 1.0.
count (int): Count of values in series
Returns:
Value at percentile.
"""
k = (count - 1) * percent
f = math.floor(k)
c = math.ceil(k)
if f == c:
return N.iat[int(k)] # Fix is just addint .iat
d0 = N.iat[int(f)] * (c - k)
d1 = N.iat[int(c)] * (k - f)
return d0 + d1
Now it works:
In [5]: series = df['a']
...: percentile(N=series, percent=0.25, count=len(series))
Out[5]: 2