potentially relevant usage patterns / targets for a developer-focused API #71
Comments
Other libraries that were suggested as candidates to look into: Xarray, cuDF (utilities), PyJanitor (cleaning functionality).
Hey, very cool initiative; it would be great to be more agnostic to dataframe libraries. I wanted to flag that seaborn is in the midst of a very extensive internal refactor, which means that the survey of pandas usage in the library is likely to be out of date after future releases. But there's an upside: it's a perfect time to be revisiting how the pandas API is used in seaborn and to proactively think about working with a more general dataframe interface. I could see the ongoing work evolving in parallel with this project (hopefully in a way that's mutually beneficial). Let me know if I can be helpful here!
Scikit-learn mostly treats a DataFrame as a "2D ndarray with column names". When scikit-learn's models start returning DataFrames, it will depend on the fact that there is a zero-copy round-trip from numpy: pandas-dev/pandas#27211. In detail: scikit-learn requires that 2d ndarray -> DataFrame -> 2d ndarray not make any copies, so that no additional memory is allocated.
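As a rough illustration (a sketch, not scikit-learn's actual code), the round-trip in question can be inspected with `numpy.shares_memory`; whether the final step actually copies depends on the pandas version, dtypes, and Copy-on-Write settings:

```python
import numpy as np
import pandas as pd

# Illustrative check of the ndarray -> DataFrame -> ndarray round-trip
# that scikit-learn relies on; the variable names are made up.
arr = np.arange(6, dtype="float64").reshape(3, 2)
df = pd.DataFrame(arr, columns=["a", "b"], copy=False)
back = df.to_numpy()

# The values always survive the round-trip...
assert (back == arr).all()
# ...but zero-copy is not guaranteed by the pandas API; inspect per version:
print(np.shares_memory(arr, back))
```

Running this on a given pandas version shows directly whether the guarantee currently holds for homogeneous float data.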
Interesting, thanks for sharing @thomasjpfan.
The answers from the Pandas devs there are along the lines of what I'd expect: this isn't necessarily guaranteed in the future. That's more of a "labeled array" use case, which is Xarray-like. Did anything change after that 2019 discussion @thomasjpfan, or is it more "fingers crossed that Pandas doesn't change this"?

I think pragmatically there is likely to always be a way for Pandas to do this; scikit-learn is probably important enough that it could even get its own method for this if needed. Conceptually it's not a nice fit for a standardized dataframe behavior though: it only works for a subset of supported dtypes, and it would need support for a constructor which accepts 2-D arrays to begin with.
It's fingers crossed. I've seen a proposal for a 2D extension array, but I think there is a lot more momentum for 1d extension arrays & a columnar store. I want to add: there are certain models, such as …
Looks like they only really use … The trickier part is this decorator, which also uses …
It looks to me like there are two separate things in PyJanitor: (1) a collection of data-cleaning functions, and (2) registering those functions as methods on pd.DataFrame (via pandas_flavor).

(2) looks motivated only by UX reasons (I could well be wrong here, not being an active user): dataframe users tend to like methods over functions. It seems unhealthy to me, because one library monkeypatching another library is a big no-no in library design.

It's actually an interesting question whether (2) should be allowed through a registration mechanism, or whether it should be discouraged. I'd lean towards the latter, but then again I'm coming from a domain where a functional programming style is preferred over an object-oriented one. If dataframe library authors prefer the former, then a well-defined extension mechanism seems useful, even for PyJanitor + Pandas only.
OK true, their methods do work as functions too:

```python
In [2]: from janitor.functions.clean_names import clean_names

In [3]: df = pd.DataFrame({'A ': [1, 2, 3]})

In [4]: df
Out[4]:
   A
0  1
1  2
2  3

In [5]: clean_names(df)
Out[5]:
   a_
0  1
1  2
2  3
```

So, perhaps that's the part which the standard can target. It might be worthwhile to try taking a handful of functions from them, say: …
Then try implementing the standard for each dataframe library, seeing if it's sufficient, and whether this would let pyjanitor "just work" on all of them if it were rewritten to use the standard API.
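For reference, a minimal standalone sketch of what such a `clean_names` could look like when written against plain column-renaming operations (an illustration only, not pyjanitor's actual implementation):

```python
import re

import pandas as pd


def clean_names(df: pd.DataFrame) -> pd.DataFrame:
    # Lowercase labels and replace runs of non-word characters with
    # underscores, mimicking the behaviour shown above ('A ' -> 'a_').
    return df.rename(
        columns={c: re.sub(r"\W+", "_", str(c).lower()) for c in df.columns}
    )


df = pd.DataFrame({"A ": [1, 2, 3]})
print(clean_names(df).columns.tolist())  # ['a_']
```

Because it only needs a rename-by-mapping operation, a function like this is the kind of thing a standard API could plausibly cover across libraries.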
FWIW, for pandas itself this already exists (https://pandas.pydata.org/docs/dev/development/extending.html#registering-custom-accessors), and this is also what pyjanitor / pandas_flavor use under the hood. Whether this would also be useful for a DataFrame standard is of course a different question. I think if our goal is to provide a developer-oriented standard API, this is much less needed.
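For illustration, pandas' registration mechanism looks like this (the `"cleaner"` accessor name and its method are made up for the example):

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("cleaner")
class CleanerAccessor:
    # Hypothetical accessor: adds df.cleaner.strip_columns() to every DataFrame.
    def __init__(self, pandas_obj: pd.DataFrame):
        self._df = pandas_obj

    def strip_columns(self) -> pd.DataFrame:
        # Return a copy with surrounding whitespace stripped from column labels.
        return self._df.rename(columns=lambda c: str(c).strip())


df = pd.DataFrame({" a ": [1, 2]})
print(df.cleaner.strip_columns().columns.tolist())  # ['a']
```

This is an opt-in mechanism pandas provides, rather than monkeypatching, which is part of why the "should a standard allow registration?" question above is interesting.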
Other tools which have been mentioned as potential targets: …
This one would be a good candidate, namely because they already support both pandas and polars: https://github.com/Kanaries/pygwalker |
Well this is encouraging:
altair has added support for polars by using the interchange protocol: https://github.com/altair-viz/altair. pyarrow is required as a dependency for this to work though; with the standard, they could potentially support polars (and many others) without requiring extra deps? One to look into. EDIT: I don't think altair is a good candidate, see #133
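The interchange-based approach can be sketched roughly like this (a simplified illustration, not altair's code; `pandas.api.interchange.from_dataframe` needs pandas >= 1.5, and converting some non-pandas inputs may require pyarrow):

```python
import pandas as pd
from pandas.api.interchange import from_dataframe


def coerce_to_pandas(df):
    # Accept any object implementing __dataframe__ (polars, pyarrow, cudf, ...)
    # and convert it to pandas; pass pandas frames through untouched.
    if isinstance(df, pd.DataFrame):
        return df
    return from_dataframe(df)


print(coerce_to_pandas(pd.DataFrame({"a": [1, 2]})).shape)  # (2, 1)
```

The point raised above is that a consumer written against a standard API could skip the conversion (and the pyarrow dependency) entirely.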
Dropping Dask for now, as they've said this wouldn't solve an actual pain-point of theirs. Anyway, https://github.com/feature-engine/feature_engine looks like a good candidate, and exactly the kind of library where this might be useful!
Here's a really good one: they literally have

```python
if isinstance(self.dataframe, pl.DataFrame):
    # polars-specific logic
    ...
elif isinstance(self.dataframe, pd.DataFrame):
    # pandas-specific logic
    ...
else:
    raise ...
```

So yeah, really solid candidate here.
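One common way to keep such per-backend branches maintainable (shown here as a generic sketch, not feature_engine's actual code) is `functools.singledispatch`, which lets the polars branch be registered only when polars is importable:

```python
from functools import singledispatch

import pandas as pd


@singledispatch
def n_rows(df):
    # Fallback for unsupported types, replacing the bare `raise` above.
    raise TypeError(f"unsupported dataframe type: {type(df).__name__}")


@n_rows.register
def _(df: pd.DataFrame) -> int:
    # pandas-specific logic
    return len(df)


try:
    import polars as pl

    @n_rows.register
    def _(df: pl.DataFrame) -> int:
        # polars-specific logic
        return df.height

except ImportError:
    pass  # polars branch simply isn't registered

print(n_rows(pd.DataFrame({"a": [1, 2, 3]})))  # 3
```

A standard API would remove the need for the per-backend branches altogether, but dispatch at least keeps them out of the hot path of every function body.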
Another one, where they've already said that their objective is to support multiple dataframe backends: https://github.com/skrub-data/skrub. Others: …
Hi all! Not sure how far along this project is, but I would love to get some tips on how to design the polars validation backend as described in this mini-roadmap: unionai-oss/pandera#1064 (comment). I was planning on forging ahead with polars-specific implementations for various things that pandera does during the validation pipeline (see anywhere there's a …).
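Until such a backend exists, some checks can already be written backend-agnostically by relying only on attributes both libraries share; a hypothetical column-presence check, for example (illustrative, not pandera's design):

```python
def validate_required_columns(df, required):
    # Duck-typed: relies only on a `columns` attribute, which both pandas
    # and polars DataFrames expose (as an Index / list of strings).
    missing = [c for c in required if c not in list(df.columns)]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return df


import pandas as pd

print(validate_required_columns(pd.DataFrame({"a": [1], "b": [2]}), ["a"]).shape)  # (1, 2)
```

Checks that need per-element semantics (nulls, dtypes, aggregations) are where the backends genuinely diverge and where a standard API would help most.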
In other issues we find some detailed analyses of how the pandas API is used today, e.g. gh-3 (on Kaggle notebooks) and https://github.com/data-apis/python-record-api/tree/master/data/api (for a set of well-known packages). That data is either not relevant for a developer-focused API, or is so detailed that it's hard to get a good feel for what's important. So I thought it'd be useful to revisit the topic. I used https://libraries.io/pypi/pandas and looked at some of the top repos that declare a dependency on `pandas`. Top 10 listed:

### Seaborn

Perhaps the most interesting pandas usage. It's a hard dependency, is used a fair amount and for more than just data access; however, it all still seems fairly standard and common, so it may be a reasonable target to make work with multiple libraries. Uses a lot of `isinstance` checks (on `pd.DataFrame`, `pd.Series`).

- `seaborn/_core.py`: `Series`, `to_numeric`
- `seaborn/matrix.py`: `DataFrame`, `isnull`, `.index.equals`, `.column.equals`
- `seaborn/utils.py`: `DataFrame`, `Categorical`, `notnull`
- `seaborn/regression.py`: only `pd.notnull`
- `seaborn/distributions.py`: `.values`, `.copy`, `.iloc`, `.loc`, `.reset_index`, `.index`, `set_index`, `MultiIndex.from_arrays`, `Index`, `Series`, `concat`, `merge`
- `seaborn/relational.py`: `DataFrame`, `merge`, `.rename`
- `seaborn/categorical.py`: `DataFrame`, `iteritems`, `Series`, `notnull`, `option_context`, `isnull`, `groupby`, `get_group`
- `seaborn/_statistics.py`: only `Series`

### Folium

Just a single non-test usage, in `pd.py`.

### PyJanitor

Interesting/unusual common pattern, which extends `pd.DataFrame` through pandas_flavor with either accessors or methods. E.g. from [janitor/biology.py](https://github.com/pyjanitor-devs/pyjanitor/blob/a6832d47d2cc86b0aef101bfbdf03404bba01f3e/janitor/biology.py).

### Statsmodels

A huge amount of usage, covering a large API surface in a messy way; not easy to do anything with or draw conclusions from.

### NetworkX

Mostly just conversions to support pandas dataframes as input/output values, e.g. in `convert.py` and `convert_matrix.py`, plus use of the `.drop` method in `group.py`.

### Perspective

A multi-language (streaming) viz and analytics library. The Python version uses pandas in `core/pd.py`. It uses a small but nontrivial amount of the API, including `MultiIndex`, `CategoricalDtype`, and time series functionality.

### Scikit-learn

TODO: the usage of pandas in scikit-learn is very much in flux, and more support for "dataframe in, dataframe out" is being added. So it did not seem to make much sense to just look at the code; rather, it makes sense to have a chat with the people doing the work there.

### Matplotlib

Added because it comes up a lot. Matplotlib uses just a "dictionary of array-likes" approach, with no direct dependence on pandas. So it will work today with other dataframe libraries as well, as long as their columns can convert to a numpy array.
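That "dictionary of array-likes" contract can be sketched as follows (an illustration of the idea, not matplotlib's internal code): any mapping whose values convert via `numpy.asarray` qualifies, which is why a dataframe's columns work too.

```python
import numpy as np


def resolve_column(data, key):
    # Mimics the idea behind matplotlib's `data` kwarg: look the string up
    # in the mapping and coerce the result to a numpy array.
    return np.asarray(data[key])


data = {"x": [0, 1, 2], "y": [0.0, 1.0, 4.0]}  # a pandas DataFrame works here too
print(resolve_column(data, "y"))  # [0. 1. 4.]
```

Because the only requirement is `data[key]` plus array conversion, matplotlib stays library-agnostic without depending on pandas at all.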