A dataframe protocol for the PyData ecosystem

There are more and more dataframe implementations (which can be justified by the fact that they couple to different runtimes / database backends). It would be useful to define a dataframe protocol to facilitate transparent exchange across libraries.

Goal: get dataframe libraries to agree on providing a method for easy conversions across the implementations

Why: easier exchange across libraries

An example use case: plotting with seaborn

Given a dataframe-like, the user should be able to call seaborn plotting functions on it transparently.

Currently, seaborn tests whether inputs are pandas dataframes. It could be changed to call the __dataframe_interface__ method (or any other name) on the input, if that method exists, to convert the input to its favorite internal representation. Hence, seaborn would accept any dataframe-like and transparently support basic features such as column names.

Bigger picture: programming to an interface

In the bigger picture, the goal is to enable library designers (such as seaborn, plotly, statsmodels, scikit-learn) to accept a well-defined interface, rather than a specific implementation.

Edit: this was badly formulated, as it evoked the idea of a computational language / interface; that’s not the goal here. The goal is to have an API to pass dataframes around.

What: common subset, rather than aligning world-view

The success of such an endeavor is tightly bound to its ability to enable working with many dataframe implementations. For this, it will probably require accepting the loss of some of the descriptive power and flexibility of certain implementations. Yet, there is a reason these dataframe implementations differ (different assumptions about the runtime, different tradeoffs), so finding commonalities across them is by construction a reductionist endeavor.

The benefit of such a proposal is that it can be put into practice by the ecosystem in a reasonably short time frame. A more ambitious proposal, such as consistency of dataframe APIs across implementations or multi-dispatch patterns, can be envisaged in the long run, but it will take much longer. In the meantime, libraries that operate on dataframes or that output dataframes struggle with the multiplicity of implementations and are likely to fall back on the most popular one: pandas.


Calling something like “pd.asdataframe” on an input dataframe to cast it to a pandas DataFrame, rather than calling its methods (e.g. groupby), implies that the optimizations of the dataframe engine will be lost, for instance out-of-core computation. This shortcoming should be documented and acknowledged. Yet, a library such as seaborn already supports only pandas DataFrames, so its users would be better off, not worse off, with the current proposal. The goal would be to reduce (fight?) the use of isinstance(data, pd.DataFrame), which is widespread in the PyData codebase.
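A minimal sketch of what such a coercion helper could look like. Everything here is hypothetical: `asdataframe` is not a pandas function, and `__dataframe_interface__` is the protocol method proposed in this thread, assumed to return a mapping of column names to 1D array-convertible values.

```python
import pandas as pd


def asdataframe(obj):
    """Coerce any dataframe-like into a pandas DataFrame.

    Hypothetical helper: ``__dataframe_interface__`` is the method
    proposed in this thread, not part of any released library.
    """
    if isinstance(obj, pd.DataFrame):
        return obj
    interface = getattr(obj, "__dataframe_interface__", None)
    if interface is not None:
        # Assumed contract: a mapping of name -> 1D array-convertible.
        return pd.DataFrame(dict(interface()))
    # Fall back to the plain pandas constructor (dicts, records, ...).
    return pd.DataFrame(obj)
```

With such a helper, a library like seaborn would call `asdataframe(data)` once at its entry points instead of scattering isinstance checks.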

Related material

The buffer protocol and __array_interface__

This proposal is inspired by the buffer protocol, which serves as an API in Python to exchange contiguous memory layouts of homogeneously typed data. This protocol is at the core of the numpy array interface, which has greatly helped expose data from C libraries to the numerical-computing world.

Topic on what is the dataframe abstraction

This topic is related to the dataframe data model topic on this Discourse; however, I thought it was better to open a new discussion thread, as the goals here are slightly different.


One thing I would ask is that the protocol be explicit about when evaluation happens, rather than implicit.

Force downstream users of it to call an execute method or something, so that they can build up a number of expressions before forcing evaluation. This leaves the door open to backends that support full-program optimization and compilation, like ibis or dask.
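To make the idea concrete, here is a toy sketch of a deferred-evaluation exchange object. All names (`LazyFrame`, `select`, `execute`) are made up for illustration; a real backend like ibis or dask would rewrite and optimize the queued expression before materializing anything.

```python
class LazyFrame:
    """Toy deferred-evaluation frame: operations queue up and nothing
    runs until execute() is called explicitly."""

    def __init__(self, columns):
        self._columns = columns  # name -> list of values
        self._ops = []           # queued column transformations

    def select(self, names):
        # Record the operation instead of performing it.
        self._ops.append(lambda cols: {n: cols[n] for n in names})
        return self

    def execute(self):
        # Evaluation happens only here; an optimizing backend could
        # inspect and fuse self._ops before running anything.
        cols = self._columns
        for op in self._ops:
            cols = op(cols)
        return cols
```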


I agree that this is well motivated.

In the spirit of looking for a minimal complexity solution, let me throw out a concrete proposal: The return value of __dataframe_interface__() should be a Python dict with values that are each convertible to a 1D array, all of which must have the same length.
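On the producer side, the contract could look like the following sketch. `ToyFrame` is a hypothetical container invented here; only the dict-of-equal-length-columns return value reflects the actual proposal.

```python
import numpy as np


class ToyFrame:
    """Minimal columnar container implementing the proposed method."""

    def __init__(self, **columns):
        self._columns = {name: np.asarray(col)
                         for name, col in columns.items()}
        # The proposal requires all columns to have the same length.
        if len({len(c) for c in self._columns.values()}) > 1:
            raise ValueError("all columns must have the same length")

    def __dataframe_interface__(self):
        # A plain dict of equal-length, 1D array-convertible values.
        return dict(self._columns)
```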

Array convertible

Exactly what “array convertible” means should be defined more carefully. One necessary property is that array convertible objects must satisfy np.asarray(obj).ndim == 1. But we probably should be more restrictive than that, to avoid over-specialization on quirks of NumPy.

Here are some examples of possibly “array convertible” values that are worth considering:

  • Unnested Python list objects with “scalar” values.
  • 1D NumPy arrays.
  • Objects that support Python’s buffer protocol with ndim == 1.
  • Objects that support NumPy’s __array_interface__ or __array__ (and that return 1D arrays).

Not every Python Sequence should be array convertible. For example, str definitely is not. tuple is a bit of a gray area: NumPy and pandas often interpret tuples as lists, but in idiomatic Python tuples correspond to heterogeneous collections, so it’s a little strange to treat them as arrays.
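The rules above could be approximated by a predicate along these lines. This is a rough sketch only; pinning down the exact definition is precisely what needs to be agreed on.

```python
import numpy as np


def is_array_convertible(value):
    """Rough check for the 'array convertible' rules sketched above."""
    # Strings are Sequences but should never count as columns.
    if isinstance(value, (str, bytes)):
        return False
    try:
        arr = np.asarray(value)
    except Exception:
        # Ragged or otherwise non-convertible input.
        return False
    # Accepts 1D ndarrays, flat lists of scalars, and objects exposing
    # the buffer protocol or __array_interface__/__array__; rejects
    # scalars and nested (>= 2D) structures.
    return arr.ndim == 1
```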

Example uses

A plotting library like seaborn should be able to sanitize a dataframe into a type-safe format with something like the following:

import numpy as np
import pandas as pd

def convert_to_dataframe(obj):
    # Fall back to the pandas constructor for objects that do not
    # implement the proposed protocol (this assumes pandas itself
    # would implement __dataframe_interface__).
    if not hasattr(obj, "__dataframe_interface__"):
        obj = pd.DataFrame(obj)
    df = obj.__dataframe_interface__()
    return {str(k): np.asarray(v) for k, v in df.items()}

A more sophisticated computational library like scikit-learn might choose to check for more expansive protocols for working with array data, like those from NEP 18 or NEP 37, before falling back to np.asarray(). This might allow manipulating dataframe data much more efficiently, e.g., using a distributed backend like Dask, or on a GPU.

Implementation notes

  • I chose dict for simplicity, but perhaps it would be better to allow for any object that satisfies the Mapping interface from collections.abc, e.g., to allow for immutability. If so, we should be clear that the full Mapping interface must be satisfied, and the semantics of operations should not deviate from those of dict.
  • The protocol itself might need to consider immutability, so users can choose whether they want to copy data or use views.
  • My proposal intentionally omits row labels, because I don’t think those are a core part of the dataframe abstraction.

Something else worth noting is that “things that can be converted into a DataFrame” is a bigger category than just “DataFrame” objects. Good examples might include patsy.DesignMatrix and xarray.Dataset.

This is a challenging problem because the “protocol” is intimately related to the in-memory data model of the returned data – not only type metadata but the byte/bit-level layout.

Per Stephan’s comment:

Even what is meant by “1D array” must be analyzed. NumPy arrays have been shown (can add references) to be inadequate as a memory model (in general) for structured data analytics and for very large datasets. Some reasons:

  • Lack of coherent/consistent missing value representation (pandas does its own bespoke things)
  • Strings are an afterthought relative to numeric data
  • Lack of support for nested types (values that are themselves arrays, structs, or unions)

Of course, I’ve spent the last 4+ years working on Apache Arrow, which seeks to be a universal, language-agnostic data frame memory model, but I’m not necessarily going to advocate for adopting that (even though it’s what I want everyone to be using). Note that Arrow just adopted a C Data Interface inspired by the Python buffer protocol to greatly simplify adding Arrow import/export to third-party libraries.

If the scope of what you are trying to accomplish is narrow, namely:

  • The amount of data you intend to interchange is not large
  • You don’t care much about strings and nested data
  • You don’t care much about serialization / data structure conversion performance
  • You don’t care much about interop with the extra-Python ecosystem

then I think a dictionary of NumPy arrays is fine.

Some other notes:

  • It’s actually very difficult to zero-copy construct a pandas.DataFrame, so whatever the input format, going back to pandas is probably going to copy / serialize. You at least want to make this as fast as possible
  • The benefits of a memory-mappable data frame representation are significant

This is a bit self-serving, but I agree with @wesm that Apache Arrow would be a great start. I think strings alone are important enough to design with them in mind from the beginning.


About Seaborn:

Due to the author’s refusal to expose even basic calculated parameters for review, I don’t think it’s appropriate to regard Seaborn as anything more than pretty default charts.

For visualization-driven analyses where calculated parameters are more important than pretty charts, I suggest something like Yellowbrick instead (which I also have no stake in):

Yellowbrick extends the Scikit-Learn API to make model selection and hyperparameter tuning easier. Under the hood, it’s using Matplotlib.


Refocusing discussion to the OT:

That being said, what are the features we need in a .to_dataframe() interface that [pydata ecosystem] tools can easily call, so that we can limit further fragmentation due to performance optimizations?

Do we have a list of all of the different DataFrame API implementations?

https://pandas.pydata.org/pandas-docs/stable/ecosystem.html#out-of-core lists a few.

DataFrame APIs:

Data selection and manipulation abstractions:

Though the name “dataframe” comes from R, an obj.to_dataframe() could call obj.__dataframe_interface__(); but IDK what/which API we would expect such an object to implement (or how to indicate whether that will make a copy)?


This may be your opinion, but factually it is not true (does R have row indexing?). Go back and look at pandas 0.1 (still on PyPI) – these were distinct visions from day one.

This sounds like an ad hominem. Can we avoid such comments on this forum?


Part of my motivation for starting the discussion on the data model was to have a principled place to start a discussion like this. It is tempting to jump in and start discussing protocols and APIs, but there is not even a consensus on what a dataframe is, and that will make a protocol definition impossible. It’s also made more complicated because no project wants to be labeled “not a dataframe” or “dataframe-like” because it does not conform to some standard data model/API (think SQL and NoSQL).

As I said in the other thread, being inclusive of projects that self-brand as dataframes will be to the detriment of whatever protocol or API that gets developed, and thus the user. I do not think it useful to tie future APIs to the past APIs, but I do think it’s important to tie future semantics to past semantics.

If we focus on the common subset of all dataframe projects, we will land on relational algebra/SQL, which I think is uninteresting to the majority of the PyData community and solved from an academic perspective. There would be little to discuss in this context.

I agree with the above that Apache Arrow is a good start for the standard format. There are some things I’d like supported for it to replace compute kernels completely (transpose, for example). I’m not convinced that an API/protocol must be tied to a standard memory format; the caller just needs a standard way of consuming the result of the API/protocol call. This can be decoupled from the memory layout, especially if the caller is just going to iterate over the values.

Even though this scope is narrow, I still think it would be extremely helpful. On one hand the benefits of going in the direction of Arrow are apparent, but on the other hand they have been a long time coming (because it is a hard problem!).

Having a named, library-independent way to ask “can I view you as a Mapping of equal-length 1D arrays?” would be awesome from a downstream point of view.

In my day-to-day work I see a lot of problems that do fit in a laptop / workstation and we should not lose sight of the long tail of “small data” problems while focusing on getting things right at scale for “big data”. [1]

In many use cases (e.g. plotting), if you have nested structured data you are hosed anyway, because the tools that would consume this interface don’t know how to deal with it.

I read this proposal as orthogonal to the extra-Python interop story, as this is about intra-Python interop.

This is very close to the API required for the data kwarg in Matplotlib, which we have had since 1.5, but I would be in favor of not requiring a literal dict so that pd.DataFrame and friends can return themselves. I think abc.Mapping is sufficient and should be assumed immutable by default.
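For reference, the data-kwarg lookup needs very little from the container. Something like this sketch (simplified, not Matplotlib’s actual code) works with a dict, a pandas DataFrame, or any protocol object supporting column indexing by name:

```python
def resolve(data, key):
    """Return data[key] if the container supports it; otherwise treat
    key as the value itself (the Matplotlib data-kwarg convention,
    simplified)."""
    try:
        return data[key]
    except (KeyError, TypeError):
        return key
```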

I enthusiastically endorse this idea in general and @shoyer’s concrete suggestion

[1] In many fields of science if you look at the aggregate data size it looks like “big data”, but it is split up into so many small data sets that you never have to grapple with it “at scale”. What we really have is a truck load of small-to-medium sized data problems.


I think it’s important to carefully consider the things that would be traded away by continuing to build on top of NumPy in 2020:

  • Missing values (NumPy doesn’t have this, except via numpy.ma which few people use)
  • Native strings (people use heap-allocated PyString/PyBytes, since fixed-size strings are too limited; these are slow to process)
  • Nested data (people use PyObject arrays of dicts, lists, tuples, etc., which use a lot of memory and are slow to process)
  • Memory mapping (for all data types – including strings – not just numbers)
  • Categorical data (pandas implements synthetic categorical data on top of NumPy)

If many projects adopt a dataframe interface / protocol that does not have these things as first-class citizens, in the long run it will create a lot of problems and ultimately fragment use cases (the ones that can live without these features, and the ones that cannot).

I have the feeling that by layering the API, we could split the problem. At the highest level, a SQL-like API would provide a well-understood abstraction that can be plugged into all the implementations, with obvious performance issues; but if we agree that small data is still pervasive and that we mostly split and aggregate, this high-level API is still useful for accessing data in a portable way.
The second level would be more Pandas-like, with unknown complexity for some operations and some magic behind the scenes, but nothing too unreasonable. Pandas returns views for slices but not for fancy indexing; well, OK. The doc is clear, even if all the beginners stumble on it. Pandas can wrap an HDF5 container, and you pay a price, but the trade-offs are described in the doc.
The lowest level would expose a hopefully reasonable set of low-level layouts and implementations. Relying on raw numpy arrays would be one, but chunked arrays could be another (for Dask and PyArrow), and probably a few others.
Application code could decide to use any level and, if it does not understand the third level, revert to the second or the first.
Still, while the semantics of tables are well understood, xarray is different and does not map easily to tables. Should multidimensional tables with named axes be supported?
The format of Python strings is a huge issue in my experience, since they cannot be memory-mapped. There are workarounds, but they cost quite a lot. That might be something to address later, though.

The point of view that I (and I assume @GaelVaroquaux) am coming from is that there already exist libraries that build on top of numpy. Currently we have APIs that look like

df = DataFrame(..)
ret = do_some_work(df['by'], df['ripping'], df['the'], df['df'], df['apart']) 

At this point we are already assuming numpy arrays of numbers (more-or-less). This proposal gives us a path to:

df = DataFrame(...)
ret = do_work_smarter(df)

and is almost completely orthogonal to whether this is literally built on numpy, how it is encoded at the memory level, or the philosophical underpinnings of what a DataFrame is.

This is a way for those of us at the next level closer to the users to be able to exploit all of the work y’all are doing without going all-in on any implementation. If in the long run we start talking about how numpy arrays duck-type as an Arrow column or the other way around :woman_shrugging:, in the short term this gets us a path to start using this stuff natively ASAP.

Part of the Matplotlib CZI grant is supporting @story645 to think about how Matplotlib wants to interact with structured data in a consistent way.


To me this seems to be calling for a middleware / wrapper layer that inserts an abstraction between the developer-consumer (e.g. matplotlib) and the data representation (which could be pandas, Arrow, or something else). So rather than protocol_df[col] returning its naked data representation, you would call a method protocol_df[col].to_numpy() to ask for the data to be returned to you as a NumPy array. Then it would be up to libraries “that provide DataFrames” to implement and export this middleware API.

pandas for example could implement the middleware API and provide methods for the data to be exported to you in the format you require. So then matplotlib codes against the middleware API and not pandas directly.

One of the challenges with this will be implementation-specific serialization options (for example, how to coerce the underlying representation into the desired output data structure – NumPy datetime64 types spring to mind)


Yes, that seems fair. col.to_numpy() seems to overlap with what __array_protocol__ is trying to do?

To first order, failing seems OK to me. Same with columns of nested data: if a column can’t be cleanly cast to an ndarray (I am also interested in columns of (n-1)D arrays), that is a good sign that the consumer could not cope with it anyway.

I also suspect that this will be the first of a number of increasingly rich interfaces/middle layers:

  1. “I am a mapping of array-like things (or things you can get array-like things from)”
  2. “… and I support groupby”
  3. “… and I support row labels”

At the end of the day, I would like to be able to ask the object I get “do you support the feature I need?” and then decide to cast it to some type that does, or raise.
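That feature query could be as simple as graded hasattr checks. This sketch uses hypothetical names (`__dataframe_protocol__`, `groupby`, `row_labels`) matching the levels listed above; none of them are established API.

```python
def protocol_level(obj):
    """Report the richest interface level an object claims to support.

    Levels follow the list above: 0 = no protocol at all,
    1 = mapping of array-likes, 2 = adds groupby, 3 = adds row labels.
    """
    if not hasattr(obj, "__dataframe_protocol__"):
        return 0
    frame = obj.__dataframe_protocol__()
    level = 1
    if hasattr(frame, "groupby"):
        level = 2
        if hasattr(frame, "row_labels"):
            level = 3
    return level
```

A consumer could then decide, per call site, whether to proceed, downcast, or raise.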

I want to stress again: even level 0 (however it is spelled) would be a major step forward, and does not require boiling the ocean to get there.

Sure, but def to_numpy(self, ...) might have some additional options that cannot be passed with __array_protocol__. So it would look like this:

# "Protoframe" implementation for Library

class LibraryProtoframe(protoframe.DataFrame):

    def __getitem__(self, key):
        ...  # return a LibraryColumnWrapper for the named column

class LibraryColumnWrapper(protoframe.Column):

    def __array_protocol__(self, *args, **kwargs):
        return self.to_numpy()

    def to_numpy(self, null_sentinel=DEFAULT_PLACEHOLDER):
        ...  # serialize the column's data to a NumPy array

class LibraryFrame:

    def __dataframe_protocol__(self):
        return LibraryProtoframe(self)

So in matplotlib you would write

def function(data, attr='x', **kwargs):
    if hasattr(data, '__dataframe_protocol__'):
        protoframe_data = data.__dataframe_protocol__()
        attr = np.asarray(protoframe_data[attr])
    # and so on

For what it’s worth, Modin has this “middleware abstraction” agnostic to the API layer, which multiple groups are using to program against. In Modin, this abstraction is the QueryCompiler interface, and is not tied to any specific API or implementation. An abstraction like this probably does belong in pandas, and I have offered to contribute the parts of it that make sense to pandas once it’s more mature.

I’ll be interested to take a closer look at this.

If I’m understanding some of the motivations for this discussion, many projects want to accept “dataframe-like data” at call sites but want to avoid:

  • Having a hard dependency on the pandas library
  • Requiring users (or other third party libraries they want to interoperate with) to coerce their data into pandas

I agree with both of these motivations.


Just to clarify, it doesn’t need to literally be a NumPy array, just something that can be cast to a NumPy array, or better yet supports NumPy semantics and duck typing for a key set of operations that haven’t totally been pinned down yet and might intersect with @devin-petersohn’s work (the mpl middleware is also a projection-query model).

Also, by “returned as a NumPy array”, do you mean a view of the data or a copy?
