A dataframe protocol for the PyData ecosystem

I think adding Categorical to that list is also important.

If we want to add some __dataframe_interface__ (or similar) method, what that would look like needs further discussion. There’s agreement here from a couple of people, but I’m not convinced myself that a dictionary is the best way for downstream applications to consume dataframe data. Constructing a dictionary column by column is going to be very expensive for many implementations, so it likely won’t be adopted.

@tacaswell Why aren’t we discussing a to_pyarrow for downstream consumption? This would be much better than a dictionary, and it’s a standard format that many systems are already using/implementing. I don’t see why we need to discuss a completely new interchange format when we have pyarrow, which can interchange even between languages. Is there a reason we should not be considering Arrow? If types are the issue, there are ways to handle that.
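
For illustration, here is a minimal sketch of the kind of Arrow-based handoff I mean (it assumes pyarrow and pandas, and is purely illustrative rather than a proposed API):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Producer side: hand the data over as an Arrow Table. This copies here
# because the source is pandas; an Arrow-backed producer could hand over
# its buffers without copying.
table = pa.Table.from_pandas(df)

# Consumer side: materialize in whatever representation the consumer prefers.
roundtripped = table.to_pandas()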

@wesm Modin has this ABC notion in the dataframe query compiler interface. It does need some work and better documentation, but I think it is worth having a solid place to start that is grounded in theory. This interface is currently being implemented by multiple groups in academia, larger companies, and startups, so if adoption is a major concern there’s already headway here. I don’t think many people here are familiar with Modin’s architecture, so I will outline the interface here.

The QueryCompiler interface is designed to be a smaller API than pandas while still capturing all its unique behaviors, such that all operators only have one way of being written. We place importance on the data model discussed in this other discussion. The interface provides a uniform way for the dataframe to be consumed, whether by an API layer or by some other application.

I think it is better to let implementations decide which parts they will implement or how they will implement them (similar to some historical SQL systems) to keep things generic. If a downstream application calls df.index, it is up to the implementation how that is handled (throwing an error, returning column(s), etc.).

This spec can be moved into a separate project, it is rather easy to do since it’s just an interface anyway. This interface does solve the “common data structure” problem.

Yes, definitely.

I don’t think that the dictionary representation should be leaked in the API.

I don’t think that the interface/API should be prescriptive about the underlying representation specific to a particular implementation of that API. If the underlying representation uses Arrow then to_arrow() would be zero-copy. Analogously with to_numpy().

If you require a particular internal representation then you will force a serialization in some implementations that is not necessary (for example if both the producer and consumer are NumPy-based).

I’m interested in this. Without polluting my mind too much with someone else’s work, I am going to draft an abstract API that captures the essence of what I’ve stated in my other comments, and then we should compare that with Modin’s API to see if they do the same things. (To be clear, I don’t really care if any work that I do ends up in the final specification, but I think it could be helpful for capturing requirements from the applications I’ve worked on.)

Here’s my attempt at a barebones abstract API for data interchange

There is an example implementation where the data is stored as a dict of NumPy arrays, e.g.

In [2]: paste                                                                  
    data = {
        'a': np.array([1, 2, 3, 4, 5], dtype='int64'),
        'b': np.array(['a', 'b', 'c', 'd', 'e']),
        'c': np.array([True, False, True, False, True])
    }
    names = ['a', 'b', 'c']

In [3]: df = DictDataFrame(data, names)                                        

In [4]: df.column_names                                                        
Out[4]: ['a', 'b', 'c']

In [5]: df['a']                                                                
Out[5]: <__main__.NumPyColumn at 0x7f9cae23ed10>

In [6]: df['a'].type                                                           
Out[6]: int64

In [7]: df['a'].to_numpy()                                                     
Out[7]: array([1, 2, 3, 4, 5])
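
For reference, here is a minimal sketch of what such a DictDataFrame / NumPyColumn pair could look like (a paraphrase for illustration, not the actual strawman code):

import numpy as np

class NumPyColumn:
    """A single column backed by a NumPy array."""
    def __init__(self, array):
        self._array = array

    @property
    def type(self):
        return self._array.dtype

    def to_numpy(self):
        # Zero-copy: the backing array is handed back as-is.
        return self._array


class DictDataFrame:
    """A dataframe backed by a dict of equal-length NumPy arrays."""
    def __init__(self, data, names):
        self._data = data
        self._names = list(names)

    @property
    def column_names(self):
        return self._names

    def __getitem__(self, name):
        return NumPyColumn(self._data[name])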

This is a very interesting thread.

+1 to what @wesm says here. There are several topics mixed in this thread, at least:

  1. A method to interchange different types of dataframes.
  2. A common dataframe API
  3. Execution of that API, potentially lazily

In addition, all the goals and example code for use cases are quite implicit or missing. I would suggest it’s very valuable to be explicit about that. For example, the Seaborn case that @GaelVaroquaux brought up:

A typical example of data parameter parsing in Seaborn accepts pandas.DataFrame, numpy.ndarray, flat lists and nested lists in an establish_variables method. The if isinstance(data, pd.DataFrame): check there would be replaced by

if hasattr(data, '__dataframe__'):
    data = pd.DataFrame(data)

allowing Seaborn to support Modin, Vaex, etc. dataframe objects without any other changes (besides some documentation).

It seems to me that without a minimal “dataframe API standard” for computation (topic 2), there’s little else one can do with __dataframe__ without choosing one concrete implementation. That will mostly be pandas for now, but of course it allows a way forward for libraries to choose Vaex/Modin/etc. and convert pd.DataFrame input to that internal representation.

I would be interested in concrete usage examples others have in mind once libraries implement a __dataframe__.


https://github.com/pandas-dev/pandas/issues/28409 proposes moving rarely used I/O connectors to third party modules. Maybe the interchange protocol could be used in this context.

Nitpick: pandas would likely also support the dataframe protocol.

if hasattr(data, '__dataframe__') and not isinstance(data, pd.DataFrame):
    data = pd.DataFrame(data)

I’ve started to play with roundtripping a pandas DataFrame through Wes’ strawman API and depending on the final spec, this is likely to be lossy.

Update: Sorry, the above code sample doesn’t actually make sense as is, since the pandas DataFrame constructor would just return self. (Too many edits!)

For an application that already supports pandas, given a data object that supports the dataframe protocol (which may be a pandas object), accessing the data through the API would be something like

if isinstance(data, pd.DataFrame):
    ...  # work with pandas directly
elif hasattr(data, '__dataframe__'):
    ...  # use the __dataframe__ protocol

If most applications just do

if hasattr(data, '__dataframe__'):
    data = pd.DataFrame(data)

then maybe it would make more sense for the dataframe protocol to specify returning a pandas DataFrame.

Just a summary, with more questions than answers, hoping it helps focus the discussion. Hopefully it doesn’t bias the conversation; let me know if something worth discussing is missing here.

I think there is agreement that different projects aiming for compatibility with dataframe consumers will implement a method __dataframe__ (not necessarily this name).

First question to me would be:

What about memory copy?

From the discussion, I think some people are happy to start by solving the small-data problem, where a copy is returned, while others would prefer to return “a pointer” to the original data and avoid copies. For big data a copy is surely not an option.

I guess the key here is to identify the use cases. Some that come to my mind:

A plotting library that currently expects a pandas dataframe and would like to also receive data from other libraries in a transparent way. I’d say the data should be small, since it needs to be plotted, so no big data and a copy is ok. I guess returning the data as a pure-Python structure should be enough.

Machine learning libraries (e.g. scikit-learn) receiving dataframes from any library to train models, without having to worry about compatibility. The data is likely to be big (e.g. 1M rows), and we surely don’t want an intermediate Python representation for the data (I guess it is ok for the metadata). So the question is how each column should be accessed. Some questions:

  • Is the buffer protocol appropriate for this? (see the sketch after this list)
  • My understanding is that Arrow solves this problem, would simply returning an Arrow RecordBatch be reasonable?
  • If we support both (I think @wesm’s wrapper exposes both), are we assuming that the data is likely to be stored using one of them, and we would copy memory if the consumer is accessing the other?
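
On the buffer protocol question, here is a small sketch of why it works well for fixed-width numeric columns (it says nothing about strings, categoricals or null masks, which is where Arrow helps):

import numpy as np

# A fixed-width numeric column exposed through the buffer protocol can be
# consumed by NumPy without copying.
source = np.arange(1_000_000, dtype="int64")
buf = memoryview(source)                      # buffer-protocol view, no copy
column = np.frombuffer(buf, dtype="int64")    # zero-copy NumPy view

assert np.shares_memory(column, source)       # same underlying memory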

Another use case would be to transform data from one project to another (like converting a pandas dataframe to a Vaex dataframe). I guess this case would be similar to the previous.

Should we define a computation API?

If the consumer of a dataframe is going to perform computations on the data (reductions, aggregations,…), how is this going to happen? I guess the main options are:

  • The consumer implements them, and it’s out of scope for this discussion
  • The consumer reloads the data in their library of choice (e.g. the multi-dataframe plotting library calling pandas.DataFrame(df.__dataframe__()) on the received df of unknown type). Computations would also be out of scope
  • We agree on a computation API, and consumers can use it no matter what project implements them

If we try to define a computation standard, I guess some considerations are:

  • How big should this API be? Can we find a minimal set of atomic operations that are good enough?
  • Do we implement both eager and lazy modes?
  • Should this API be exposed to end users, or just internal?

Which existing projects are we thinking should implement this?

  • pandas?
  • Single host alternatives? (Vaex, cuDF)
  • Distributed dataframes? (Dask, Modin, Koalas)
  • Database table wrappers? (Ibis, SQLAlchemy)
  • Others?

Should we define just one “standard”, or does it make sense to support more than one?

The following code illustrates what I mean:

def __dataframe__(self, format):
    if format == 'pandas':
        return pandas.DataFrame(self)
    elif format == 'arrow':
        # schematic; a real implementation would use a RecordBatch.from_* factory
        return pyarrow.RecordBatch.from_pandas(pandas.DataFrame(self))
    elif format == 'dict_of_numpy':
        return {col_name: numpy.array(col_data)
                for col_name, col_data in self.items()}
    ...

Consumers would then be able to choose the most convenient format for their needs. And dataframe libraries would be expected to implement a small subset of formats.

I guess supporting more than one format has the advantage that some of them are immediate to implement, and it makes it possible to experiment with more than one approach. Of course, that comes at the cost of added complexity and possibly less standardization (some projects only implementing a subset), and consumers making suboptimal choices.


My argument throughout this process (as expressed more concretely in https://github.com/wesm/dataframe-protocol/pull/1) is that we should provide APIs on the object returned that allow the consumer to request the data be returned to them in the format they want, and defer any necessary serialization / copying until the moment in which the consumer has made their request clear. One of the most common requests would obviously be as a NumPy array.

I don’t think it would be a good idea to force producers to convert to a particular intermediate memory model – that’s one of the motivations for Gael’s initial proposal. In the worst case that could cause two serializations (copies), once by the producer and then again by the consumer. I think we need to avoid this, providing for either zero or one serialization to occur, but never two.

My position is that if we can’t agree on an interchange API, we probably won’t be able to agree on a computation API either. I also see the problems as mostly orthogonal to each other. If there are groups that want to talk about a computation API that should probably be split off from this discussion into a new topic.

I think the interface will need to provide for multiple “consumer” formats. A helper library can provide some utilities to simplify the implementation of this. For example, if you know how to convert a single column to a NumPy array, then the default implementation of “dict_of_numpy” can be provided by the helper library.

The problem with the example API that you’ve proposed is that it may do “too much serialization”. A consumer might only need one column out of 1000. I think if an invocation of __dataframe__ causes a serialization of the entire dataset, including the unused portions, that’s a bad thing. I definitely think that avoiding unnecessary serialization / copying is essential. So the API needs to provide some kind of “middleware” object with a known API where the user can make a further selection of rows and columns and then convert to their target data structure.
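
A rough sketch of the kind of “middleware” object I mean (the names here are made up for illustration, not the actual proposal):

import numpy as np

class ExchangeFrame:
    """Lazy wrapper: nothing is serialized until the consumer asks for it."""

    def __init__(self, columns):
        # columns: dict mapping column name -> NumPy array (or whatever
        # array-like the producer already holds)
        self._columns = columns

    @property
    def column_names(self):
        return list(self._columns)

    def select_columns(self, names):
        # Narrow the selection without touching the underlying data.
        return ExchangeFrame({name: self._columns[name] for name in names})

    def to_dict_of_numpy(self):
        # Serialization / copying happens only here, and only for the
        # columns that survived the selection.
        return {name: np.asarray(col) for name, col in self._columns.items()}


# A consumer that needs one column out of many only pays for that column:
frame = ExchangeFrame({"a": np.arange(5), "b": np.zeros(5), "c": np.ones(5)})
subset = frame.select_columns(["a"]).to_dict_of_numpy()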


I just opened a PR to pandas which lets downstream libraries override how the DataFrame constructor operates on their objects, using a singledispatch function.

If this is merged, then scikit-learn can simply call pandas.DataFrame on their input dataframe-like types to see if they can be converted. Other libraries, like Modin or Dask, can register their own functions so their dataframes can be properly converted.

It does not begin to address any of the larger questions here around supporting “dataframe-like” objects, but it is a concrete option for the short term. I am not sure if it’s useful or not; I think that would depend on whether scikit-learn wants to be able to ingest things like Dask dataframes by simply eagerly converting them to pandas dataframes. It obviously doesn’t scale to out-of-core sizes, but neither does requiring a conversion to NumPy arrays or buffers.
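
For context, the general singledispatch mechanism looks roughly like this; the function name and the MyLazyFrame class below are hypothetical stand-ins, not the actual hook in the PR:

from functools import singledispatch

import pandas as pd

@singledispatch
def coerce_to_pandas(obj):
    # Default behaviour: just try the normal constructor.
    return pd.DataFrame(obj)

class MyLazyFrame:
    """Stand-in for a third-party dataframe (e.g. a distributed one)."""
    def __init__(self, data):
        self._data = data

# The third-party library registers its own conversion; pandas itself
# doesn't need to know the type exists.
@coerce_to_pandas.register(MyLazyFrame)
def _(obj):
    return pd.DataFrame(obj._data)

result = coerce_to_pandas(MyLazyFrame({"a": [1, 2, 3]}))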

@simonjayhawkins thanks for the I/O connectors example, that makes perfect sense. I imagine other dataframe libraries may have optimized versions for some of the key formats, but in general this helps get broad I/O format support quickly.

Re your code examples: you’ve already partially corrected them, but all the isinstance usages are undesired and unnecessary. The above is better written as

if hasattr(data, '__dataframe__'):
    # this is a no-op if `data` already is a pd.DataFrame, potentially a copy otherwise
    data = pd.DataFrame(data)

do_something(data)

No, @wesm and others have already argued why they don’t want that. The protocol is now library-agnostic, so you can just as easily do:

import modin.pandas as pd

if hasattr(data, '__dataframe__'):
    # this is a no-op if `data` already is a Modin dataframe, potentially a copy otherwise
    data = pd.DataFrame(data)

do_something(data)

The do_something here is using (a subset of) the pandas API? So where I said __dataframe__ returns a pandas DataFrame, maybe I could have said it returns an object that conforms to the pandas API?

I think your point about concrete usage examples is key to the discussion.

My initial thought is that providers could be classified as those that can be converted to pandas DataFrames and those that cannot. (Or maybe those that can appear to be pandas DataFrames, like Modin, and those that cannot, don’t want to, or haven’t implemented it yet.)

Those that cannot be converted to pandas DataFrames are likely to be out-of-core or distributed providers. For these, the interchange protocol would be most beneficial, but on the other hand consumers would need to change more than one line of code to support them, and without a computation API this could be non-trivial (unless they use Modin).

For those that can be converted to pandas DataFrames, for me, the interchange protocol does not make so much sense, since these are likely to be in-memory and therefore the conversion to a pandas DataFrame may be better done by the provider. In some cases this could even be zero-copy? Of course, this is not library agnostic, so having concrete cases where a consumer uses an in-memory DataFrame but does not want to use pandas would be useful (pandas provides a computation API that the interchange protocol currently lacks).

IIUC, the architecture of Modin, with the user-facing API, the computation API (or query processor), the execution engine, and the storage all separate, means that consumers could use Modin to achieve the goals set out in this issue. So the computation API is not required as part of the protocol, since Modin would be used? This is currently a subset of the pandas API, but in time the (leaner and simpler) Modin API could replace that.

If the protocol requires a computation API to avoid unnecessary materialisation, would the protocol effectively mirror the Modin API, since IIUC that uses the minimum algebra to describe operations on dataframe-like structures?

Maybe a one-size-fits-all solution is not appropriate: consumers working with “small data” are happy to use pandas, and consumers tackling “big data” could use Modin.

For Seaborn, yes; in general, no: it conforms to the API of whichever dataframe library is used internally in the package containing this code.

No. Aligning on a single API for dataframe objects is a separate topic (it was topic (2) in my first post), which is probably two orders of magnitude more work than simply an interchange format. That’s why @wesm already requested that it be discussed in a separate thread. I’m actually skeptical that a Discourse thread on a common API is going to get very far without a serious amount of pre-work, such as a detailed RFC-style document including an overview of what’s common and different between the various libraries.

I think it is appropriate and desirable. It’d be very nice not to have to assume. For problems that fit in memory, one may still (e.g.) want to use Vaex for more performance, and for larger problems (e.g.) dask.DataFrame instead of Modin.


It should be possible to write:

obj.__dataframe__().to_pandas()

The producer is free to implement to_pandas in whichever way they want. If the underlying representation is pandas, then this simply returns the internal representation.
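
A sketch of what that freedom could look like in practice, using made-up wrapper classes:

import pandas as pd

class PandasBackedFrame:
    """Producer whose internal representation already is a pandas DataFrame."""
    def __init__(self, df):
        self._df = df

    def to_pandas(self):
        # Nothing to convert: hand back the internal object.
        return self._df


class DictBackedFrame:
    """Producer storing columns in a plain dict; conversion happens on demand."""
    def __init__(self, columns):
        self._columns = columns

    def to_pandas(self):
        # The copy/conversion cost is paid only when the consumer asks for it.
        return pd.DataFrame(self._columns)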

It’s important to remember that conversion to pandas is not free, and generally requires memory doubling.

Well, one obvious example is data frame data represented using the Arrow format. To be able to pass Arrow tables into scikit-learn without forcing the user to copy into some intermediate data structure would be highly beneficial.

Other projects deal with dicts of NumPy arrays. pd.DataFrame(dict_of_numpy) is almost never zero-copy.

I 100% agree with Ralf on this. I don’t think we should hold an immensely useful interchange API hostage over the desire to develop standardized data frame computation APIs. We can do both, and one does not need to be coupled to the other.


Good to know. That addresses some of the concerns I had.

I assume the same applies to the column representation. As well as to_numpy and to_arrow, could to_series be an option?

I started an implementation of __dataframe__ for pandas in https://github.com/pandas-dev/pandas/pull/32908

I agree that an interchange API would be immensely useful.

My understanding of the OP was that the protocol would allow programming to an interface. I therefore assumed that

if hasattr(data, '__dataframe__'):
    data = pd.DataFrame(data)

would be an anti-pattern, and that if the protocol were just used for converting one dataframe format to another, the optimizations of the dataframe engine would be lost.

Since a little time has passed I will probably soon (within the next several days) make another iteration on my PR to incorporate feedback.

My understanding of the OP was that the protocol would allow programming to an interface.

OK, my wording was maybe a bit too strong! The code that you are writing above would achieve my goals.


I just updated the PR again, feedback welcome

As a high-level comment, there seem to be a variety of new requirements that people are raising.

Personally, I am only interested in creating unambiguous data export APIs to concrete data / memory representations that can be used to make physical data available at library boundaries. These APIs should not require that a particular intermediate data / memory representation (like pandas.DataFrame) be used in the implementation.

Other considerations seem auxiliary and can be addressed later. If general consensus can’t be reached on data export APIs (or if people don’t think they are important) then I’m going to step away from this project and let others pursue it. Would it be helpful to take one or two steps back and write a requirements document (separate from any actual Python code) for the project to enumerate what problems are being solved and what problems aren’t being solved in the first iteration?

I am one of the aforementioned people trying to raise new requirements, and I apologize. A lot of my comments were directed towards making the data / memory representations non-container-specific, i.e., instead of using NumPy we should use the __array_interface__ or Python buffer protocol to represent the memory buffers for array data.

Our use case is surrounding using GPUs in the ecosystem and giving paths to allow GPU-accelerated libraries natural integration points into the ecosystem to allow them to be plugged in as seamlessly as possible.