There are more and more dataframe implementations (which can be justified by the fact that they couple to different runtimes / database backends). It would be useful to define a dataframe protocol to facilitate transparent exchange across libraries.
Goal: get dataframe libraries to agree on providing a method for easy conversions across the implementations
Why: easier exchange across libraries
An example use case: plotting with seaborn
Given a dataframe-like, the user should be able to call seaborn plotting functions on it transparently.
Currently, seaborn tests whether inputs are pandas dataframes. It could instead call a
__dataframe_interface__ method (or any other name) on the input, if that method exists, to convert to its preferred internal representation. Seaborn would then accept any dataframe-like object, and transparently support basic features such as column names.
Bigger picture: programming to an interface
In the bigger picture, the goal is to enable library designers (such as those of seaborn, plotly, statsmodels, scikit-learn) to program against a well-defined interface, rather than a specific implementation.
Edit: this was badly formulated, as it evoked the idea of a computational language / interface; that’s not the goal here. The goal is to have an API to pass around dataframes.
What: common subset, rather than aligning world-view
The success of such an endeavor is tightly bound to its ability to enable working with many dataframe implementations. For this, it will probably require accepting the loss of some of the descriptive power and flexibility of certain dataframe implementations. Yet, there is a reason that these dataframe implementations differ (different assumptions on the runtime, different tradeoffs), so finding commonalities across them is by construction a reductionist endeavor.
The benefit of such a proposal is that it can be put into practice by the ecosystem in a reasonably short time frame. A more ambitious proposal, such as consistency of dataframe APIs across implementations or multi-dispatch patterns, can be envisaged in the long run, but it will take much longer. In the meantime, libraries that operate on dataframes or that output dataframes struggle with the multiplicity of implementations and are likely to fall back on the most popular one: pandas.
Calling something like “pd.asdataframe” on an input dataframe to cast it to a pandas DataFrame, rather than calling its methods (e.g. groupby), implies that the optimizations of the dataframe engine will be lost, for instance out-of-core computation. This shortcoming should be documented and acknowledged. Yet, a library such as seaborn already supports only pandas DataFrames, hence users would be better off, not worse off, with the current proposal. The goal would be to reduce (fight?) the use of
isinstance(data, pd.DataFrame), which is widespread in the pydata codebase.
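To make the trade-off concrete, here is a hedged before/after sketch. `__dataframe_interface__` remains a placeholder name, and the eager conversion in the second function illustrates exactly the loss of engine optimizations mentioned above:

```python
import pandas as pd


# Before: hard-coded implementation check, rejects all non-pandas dataframes.
def summarize_strict(data):
    if not isinstance(data, pd.DataFrame):
        raise TypeError("expected a pandas DataFrame")
    return data.describe()


# After: accept anything exposing the (hypothetical) exchange method.
# Trade-off: converting eagerly to pandas discards the source engine's
# optimizations, such as out-of-core or lazy evaluation.
def summarize(data):
    if not isinstance(data, pd.DataFrame):
        data = pd.DataFrame(data.__dataframe_interface__())
    return data.describe()
```

The second form is no worse than the first for pandas users, and strictly better for users of other implementations.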
The buffer protocol as an inspiration
This proposal is inspired by the buffer protocol, which serves as an API in Python to exchange contiguous memory layouts of homogeneously typed data. This protocol is at the core of the numpy array interface, which has greatly helped expose data from C libraries to the numerical-computing world.
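As a concrete reminder of how the buffer protocol works, a minimal stdlib-only sketch: a C-backed `array.array` hands its memory to a `memoryview` without copying, and numpy's `np.frombuffer` consumes the same protocol to wrap such buffers as ndarrays:

```python
import array

# array.array stores doubles in one contiguous C buffer; memoryview
# gives zero-copy access to that buffer via the buffer protocol.
data = array.array("d", [1.0, 2.0, 3.0])
view = memoryview(data)

print(view.format)      # 'd' -> C double
print(view.contiguous)  # True: a single contiguous block
print(view.tolist())    # [1.0, 2.0, 3.0], read without copying
```

The dataframe protocol proposed here aims for an analogous role, one level up: exchanging columnar, possibly heterogeneously typed data rather than a single typed buffer.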
Related topic: what is the dataframe abstraction
This topic is related to the dataframe data model topic on this forum; however, I thought it was better to open a new discussion thread, as the goals here are slightly different.