A dataframe protocol for the PyData ecosystem

It would be a view if the underlying data representation supports it.

For example, Arrow’s to_numpy returns a view/is zero-copy for numeric data without nulls:

In [6]: import pyarrow as pa

In [7]: arr = pa.array([1, 2, 3, 4], type='float32')

# This is a view / zero-copy
In [9]: arr.to_numpy()
Out[9]: array([1., 2., 3., 4.], dtype=float32)

In [10]: arr = pa.array([1, 2, None, 4], type='float32')

In [11]: arr.to_numpy()
ArrowInvalid: Needed to copy 1 chunks with 1 nulls, but zero_copy_only was True

# This copies
In [12]: arr.to_numpy(zero_copy_only=False)
Out[12]: array([ 1.,  2., nan,  4.], dtype=float32)
1 Like

Interesting discussion, and it's ideal that we now have a place to discuss this.
Maybe it is useful to split the discussion into a few use cases:

Small data, uses all rows
I like @shoyer's idea of a Mapping with array-convertible values: simple and easy.

I think for libraries such as matplotlib, where you can expect the data to fit in memory, this should be good enough; most libraries could probably already do this. I do consider this subset a solvable problem.
It's true that, with Arrow in mind, some columns would not be convertible to a NumPy array, but I guess you cannot plot them either, so failing there should be fine (and only the columns that are actually used should be converted).
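
As a hedged illustration (the dunder name and the return shape here are placeholders, nothing is agreed yet), a consumer of such a Mapping could look roughly like this:

    # Minimal sketch of the "Mapping with array-convertible values" idea for
    # the small-data case. The method name is an assumption.
    import numpy as np

    def get_plot_data(df_like, x_col, y_col):
        columns = df_like.__dataframe_interface__()   # hypothetical dunder
        # Only the two requested columns are converted; other columns
        # (possibly of exotic types) are never touched.
        x = np.asarray(columns[x_col])
        y = np.asarray(columns[y_col])
        return x, y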

Large data, needs aggregation
When data becomes larger, an API for aggregation becomes more important, because you don't want to visualize/show a billion individual rows; you want to group by a particular column (or maybe binby in vaex's terms). For libraries such as plotly express (also seaborn?), it would be extremely useful to be able to pass in a vaex/cuDF dataframe and expect it to work.

Large data, but need to pass over all rows
In cases where you don't want aggregation but need to pass over all the data, you want to ask a dataframe for chunks of data at a time. In vaex we've been hammering on this API for quite a while in order to pass all the data to scikit-learn's incremental learners in an efficient/parallel way (it will, for instance, submit the evaluation of the next chunk before it yields the current one, so that single-threaded consumers don't have to wait for the next chunk).

Also, writing Arrow or Parquet in chunks, where you cannot hold the whole dataframe in memory (and some columns aren’t even materialized, they are ‘virtual columns’ in vaex) is a common use case for dealing with a dataframe.
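
A rough sketch of that chunked-writing use case. The chunk_iterator method and the dict-of-arrays chunk shape are assumptions (vaex-style naming); the pyarrow ParquetWriter part is standard pyarrow:

    import pyarrow as pa
    import pyarrow.parquet as pq

    def write_parquet_in_chunks(df, path, chunk_size=1_000_000):
        """Stream a dataframe to Parquet without materializing it in memory."""
        writer = None
        for chunk in df.chunk_iterator(chunk_size=chunk_size):  # hypothetical API
            table = pa.table(chunk)      # chunk assumed to be dict-of-arrays-like
            if writer is None:
                writer = pq.ParquetWriter(path, table.schema)
            writer.write_table(table)
        if writer is not None:
            writer.close()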

Viewing a dataframe as matrix
As I mentioned here: Dataframe Data Model discussion: What is this popular abstraction?
I really like that the Modin paper started naming things. For instance, a matrix dataframe (all columns of the same type, int or float) is something that can (in theory) be passed to sklearn as a matrix. We are working on implementing NEP 13/18 on top of a matrix dataframe for vaex (https://github.com/vaexio/vaex/pull/415), but if we had a more common API, this might be able to live outside of vaex. I think having a common name helps, if only for consistent error messages (e.g. 'This dataframe is not a matrix dataframe, column X is of type QXY'), and having an API to check for these conditions might be useful.
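
To make the idea concrete, here is a small hypothetical helper (not vaex's actual implementation) that refuses to build a matrix unless every column is numeric, using the kind of error message mentioned above:

    import numpy as np

    def as_matrix(columns):
        """Hypothetical helper: treat a mapping of column name ->
        array-convertible values as a 'matrix dataframe' only if every
        column is numeric."""
        arrays = {name: np.asarray(values) for name, values in columns.items()}
        for name, arr in arrays.items():
            if arr.dtype.kind not in "iuf":      # signed/unsigned int, float
                raise TypeError(
                    f"This dataframe is not a matrix dataframe, "
                    f"column {name!r} is of type {arr.dtype}"
                )
        return np.column_stack(list(arrays.values()))

    as_matrix({"a": [1, 2], "b": [0.5, 1.5]})    # works: shape (2, 2)
    # as_matrix({"a": [1, 2], "b": ["x", "y"]})  # raises TypeError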

To summarize, I fully agree that a dataframe protocol/API would be really useful, but I think different use cases require different features/APIs.

Possible path forward
I could imagine a similar story to https://promisesaplus.com/, where instead of trying to build a library that would beat all JavaScript Promise implementations, they laid down a spec with a test suite (https://github.com/domenic/promise-tests, https://www.npmjs.com/package/promises-aplus-tests). What I like about it is that the spec is formalized by the unit tests, meaning that if you pass the tests, you follow the spec, no questions asked (see https://www.youtube.com/watch?v=V2Q13hzTGmA for background on this).

5 Likes

Note that all Arrow types can be converted to NumPy arrays, though some conversions (e.g. integers with nulls) can be lossy.

You mean having a common API for computation? That seems out of scope to me. @GaelVaroquaux's proposal is strictly about interchange, to achieve looser coupling between libraries – or at least so that everyone isn't using pandas as the lowest common denominator.

It seems like the dataframe protocol API should recommend that serialization / conversion be deferred until it is actually needed. For example:

proto_df = foreign_object.__dataframe_protocol__()
proto_chunk = proto_df.slice(start_index, end_index)

# Work doesn't happen until this point
proto_chunk[col].to_numpy()

Of course the semantics of materialization to the target data structure could be different for each protocol implementation.

I agree that having a standard test harness to verify protocol implementations would be very useful.

2 Likes

Just FYI, pandas.Series has a to_numpy method: https://pandas.pydata.org/docs/reference/api/pandas.Series.to_numpy.html#pandas.Series.to_numpy. The docstring there has some discussion of the issues around lossy conversion, and the keywords give an idea of what kinds of controls are commonly necessary (the desired dtype and the scalar value to use for representing missing values).
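
For illustration (assuming pandas >= 1.0, where the na_value keyword exists), the keywords let the caller decide how a lossy conversion happens:

    import numpy as np
    import pandas as pd

    s = pd.Series([1, 2, None], dtype="Int64")   # nullable integer column

    # The default conversion falls back to an object array holding pd.NA
    s.to_numpy()
    # array([1, 2, <NA>], dtype=object)

    # dtype / na_value control the (lossy) conversion explicitly
    s.to_numpy(dtype="float64", na_value=np.nan)
    # array([ 1.,  2., nan])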

It's unclear to me whether this is in or out of scope for this proposal. I believe the two headings @GaelVaroquaux gave in the introduction, “An example usecase: plotting with seaborn” and “Bigger picture: programming to an interface”, possibly conflict.

The first, which suggests “It could be changed to call the __dataframe_interface__ method (or any other name) on the input, if that method exists, to convert to its favorite internal representation”, seems like a relatively tightly scoped proposal to add a convention for a dunder method that returns an actual instance of a pandas.DataFrame, if possible, much like the existing __bool__ or __str__ methods.

The second, “the bigger picture: the goal is to enable library designers (such as seaborn, plotly, statsmodels, scikit-learn) to accept a well-defined interface, rather than a specific implementation”, makes the scope a bit bigger and could include @maartenbreddels's suggestion that we want to push aggregations (and likely other operations) to the backend compute instead of computing them on a pandas.DataFrame object.


As @GaelVaroquaux said in the first post, the dunder method wouldn’t fix all problems but would be a short term win: “This shortcoming should be documented and acknowledged. Yet, a library such as seaborn already supports only pandas DataFrames. Hence the users would be better off, not worse off, with the current proposal.”

However, it seems like many people in this thread are discussing the second, broader solution, like @story645:

Just to clarify, it doesn’t need to literally be a Numpy array, just something that can be cast to a numpy array or better yet supports Numpy semantics & duck typing for a key set of operations that haven’t totally been pinned down yet

I guess one question we could ask here is “Is it worth it to try to get large-scale support for a __dataframe_interface__ protocol, or should we instead aim a bit higher and support computations on things that act like dataframes but are not pandas dataframes?”

For the first option, one nice way to define it precisely is with a typing.Protocol:

import typing
import pandas

@typing.runtime_checkable
class PandasDataframeProtocol(typing.Protocol):
    def __pandas_dataframe__(self) -> pandas.DataFrame:
        ...

# ...

def process_df(df_like):
    if isinstance(df_like, PandasDataframeProtocol):
        # The object knows how to convert itself to a pandas.DataFrame
        df = df_like.__pandas_dataframe__()
    elif isinstance(df_like, pandas.DataFrame):
        df = df_like
    else:
        raise ValueError(f"{type(df_like)!r} is not dataframe-like")

If we do pursue the second, I would like to see how we could adopt a solution that lets us push as much computation as possible to the underlying backend. That would mean putting off returning an actual np.ndarray or pd.DataFrame for as long as possible in your stack; you would instead use the to-be-determined dataframe/NumPy-like object to build up some representation of the computation you would like to do.
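
A toy sketch of that idea, with entirely made-up names, just to show where the deferral happens:

    # The wrapper only records what was asked for; a real backend could push
    # `select` (and aggregations, filters, ...) into its own engine instead.
    import numpy as np

    class LazyFrame:
        def __init__(self, columns):
            self._columns = columns          # name -> array-convertible
            self._selected = None

        def select(self, names):
            new = LazyFrame(self._columns)
            new._selected = list(names)      # nothing is computed yet
            return new

        def to_numpy(self):
            # Work happens only here, when the consumer actually needs data
            names = self._selected or list(self._columns)
            return np.column_stack([np.asarray(self._columns[n]) for n in names])

    frame = LazyFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})
    mat = frame.select(["x", "y"]).to_numpy()   # materialization deferred to here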

I have been exploring this route a bit and have some ideas here if this is the right place to discuss them.

2 Likes

To be clear, my goal was to enable compatibility, i.e. conversions, potentially lossy and with memory copies, rather than full-blown operations that keep the performance of the underlying implementation. In the terms of the post above by @saulshanabrook: the first, narrower, solution. The second, broader one will take more iterations to get right, and in the meantime libraries are stuck and default to supporting only pandas dataframes.

While I think it is useful to separately discuss common APIs for computation, I think the lack of an interchange / conversion protocol for dataframe-like data is actively hurting projects, because they are using pandas as the go-between, which can be very expensive (creating a pandas.DataFrame can be surprisingly costly). If we can't build consensus around a narrowly scoped interchange protocol, we are unlikely to be able to agree on much of anything else. The interchange protocol needs to be able to address basic questions about the data (a minimal sketch follows the list below):

  • The names of the columns
  • Getting one or more columns as NumPy arrays
  • Selecting a subset of rows (either a slice or a boolean array)
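
A minimal sketch of an object answering exactly those three questions (all names are placeholders, not a proposal):

    import numpy as np

    class InterchangeFrame:
        """Toy reference implementation backed by a dict of 1-D arrays."""

        def __init__(self, columns):
            self._columns = {name: np.asarray(col) for name, col in columns.items()}

        def column_names(self):
            return list(self._columns)

        def to_numpy(self, name):
            return self._columns[name]

        def select_rows(self, rows):
            # `rows` may be a slice or a boolean mask; NumPy indexing covers both
            return InterchangeFrame(
                {name: col[rows] for name, col in self._columns.items()}
            )

    frame = InterchangeFrame({"x": [1, 2, 3, 4], "y": [10.0, 20.0, 30.0, 40.0]})
    frame.column_names()                           # ['x', 'y']
    frame.select_rows(slice(0, 2)).to_numpy("y")   # array([10., 20.])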

I just created the following GitHub repository

I can write up a brief strawman proposal document to state goals (and non-goals) as a starting point to dig into the fine details. If it turns out I've grossly mischaracterized the goals of what @GaelVaroquaux is looking for, I'll gladly let others step in to lead the design process.

5 Likes

I am using “hearts” in this discussion to mark the comments that I think characterize my proposal well, rather than polluting the thread with “I agree” messages. Your message above expresses my thoughts well.

I feel that I am not the right person to lead the technical discussion on the implementation. I would rather only reflect on some use cases.

2 Likes

In line with what @shoyer, @tacaswell et al. propose, I think it's not difficult to start with an MVP (minimum viable product) that already adds a lot of value to the community, while postponing more complex problems like avoiding memory copies, dealing with big data, etc. This is better illustrated with an example of what I would do:

    def __dataframe__(self):
        """
        Standard representation of a dataframe in Python.

        Libraries interested in being compatible with downstream
        code that receives dataframes as input should implement this
        ``__dataframe__`` method, which should return a dataframe
        in this format:

        >>> {"column_data": {"<col1>": iterable, protocol buffer object...
        ...                  "<col2>": same,
        ...                  ...},
        ...  "index": iterable, protocol buffer object...}
        """
        # just showing that this is not expected to support big data problems
        if len(self) > 100_000:
            raise RuntimeError('Data is too big for a Python dataframe representation')

        column_data = {}
        for column_name in self.columns:
            column_data[column_name] = iter(self[column_name])
        return {'column_data': column_data,
                'index': iter(self.index)}

    def plot_from_standard_dataframe(df, x_col, y_col):
        """
        Example downstream code compatible with any dataframe
        implementation.
        """
        from matplotlib import pyplot

        dataframe_data = df.__dataframe__()
        x = list(dataframe_data['column_data'][x_col])
        y = list(dataframe_data['column_data'][y_col])
        pyplot.plot(x, y)

To run the full example:

    import pandas
    pandas.DataFrame.__dataframe__ = __dataframe__
    gdp = pandas.DataFrame({'year': [2016, 2017, 2018],
                            'value': [18_715, 19_519, 20_580]})
    plot_from_standard_dataframe(gdp, 'year', 'value')

My understanding is that the main disagreement in the thread is about what the type of each column's data should be. I think an iterator is a reasonable start (surely not solving all use cases, but it's simple, pure Python, and handles small-data problems decently).

To be more efficient in the future, we could probably implement a library to be used by downstream projects. It would wrap dataframes of different kinds (any supporting the protocol) and could be updated as the standard representation grows in a compatible way. This should also help with compatibility via dependency management (requiring version X of the library would imply supporting version Y of the standard).

Then, with time, downstream projects could be using:

    # user will provide a dataframe compatible object
    df = dataframe(pandas_df)
    df = dataframe(dask_df)
    df = dataframe(vaex_df)
    df = dataframe(modin_df)
    df = dataframe(sqlalchemy_table)
    ...

    # Initially we can live with iterators, but as the complexity grows, more functionality can be made available to downstream projects
    df[col_name].to_list()
    df[col_name].to_numpy()
    df[col_name].to_chunks()
    df[col_name].to_dataframe_matrix()
    ...

I think this is much better than nothing, can be implemented immediately (in the next release of each project), and should be somewhat scalable, as more decisions are made on how to deal with the more complex problems.

1 Like

I would much rather have things that support the array protocol, __len__, and slicing. Iterators are nice, but they mean you have to keep going back to the well if you want to look at things a second time. Further, while good memory management is not a goal here, using iterators guarantees that we have to do at least one copy.

I prefer @shoyer’s flat proposal. Can the index be promoted (demoted?) to be a normal column? Maybe with the key None (everyone has access to it, it is a singleton at the Python level and if you have a column labeled None I have other questions for you…) or some other well-known sentinel?

1 Like

My idea in suggesting an iterator instead of a list-like is that if a dataframe contains lots of columns but just two are going to be used, you don't need to copy or fetch that data. But I'm surely fine with your proposal too. Defining the type of that object is the trickier part, and surely requires more discussion, but if we agree on the general idea that's a good step forward.

I like the idea of returning the index as a regular column. I'm not sure whether using None as the label could be a problem with a multi-index (it will be, if we use a dict to represent the data). And we can consider whether the row indices should be part of the standard or just a pandas thing. But I'm sure we can find a good solution. Maybe the return could be something like (with the fields that make sense):

def __dataframe__(self):
    return [{'name': 'numbers',
             'is_index': True,
             'type': int,
             'data': [1, 2, 3]},
            {'name': 'letters',
             'is_index': False,
             'type': str,
             'data': ['a', 'b', 'c']}]

I have gone ahead and implemented the start of a test suite for @shoyer's proposal: https://github.com/tacaswell/dataframe_spec. I went with np.asarray and slicing working on the columns to mean "array-like".

Happy to move this some place more neutral than my personal account and/or give push access to anyone who wants it.

It seems like it would be useful to specify and document an ABC (Abstract Base Class) for the GenericDataFrame concept along with any helper ABC interfaces (e.g. GenericColumn). This would include method stubs (that raise NotImplementedError) along with clear docstrings. This would help us capture all the requirements of different consumers while also making sure that we have well-defined encapsulation of knowledge about the “dataframe producer”. Thoughts?
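
Something along these lines, perhaps (method names are placeholders, only meant to show the NotImplementedError-stub shape mentioned above):

    import abc

    class GenericColumn(abc.ABC):
        @property
        @abc.abstractmethod
        def name(self):
            """Label of the column."""
            raise NotImplementedError

        @abc.abstractmethod
        def to_numpy(self):
            """Materialize the column as a NumPy array (possibly copying)."""
            raise NotImplementedError

    class GenericDataFrame(abc.ABC):
        @abc.abstractmethod
        def column_names(self):
            """Return the column labels, in order."""
            raise NotImplementedError

        @abc.abstractmethod
        def column(self, name) -> GenericColumn:
            """Return a single column wrapper; no data should be copied yet."""
            raise NotImplementedError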

There are quite a lot of details to work out. For example, this interface should probably provide a minimal set of type metadata objects that are not coupled to NumPy dtypes or similar implementation-specific metadata. As rationale, you might use different options when calling to_numpy for integers vs. for strings.

The basic data type objects you need are (a minimal sketch follows the list below):

  • Boolean
  • The 8 signed and unsigned integer types (8-, 16-, 32-, and 64-bit)
  • 32/64 bit floating point
  • Some date and time types (e.g. you could pick a common subset of ones available in NumPy, pandas, and Arrow)
  • Binary (bytes)
  • String (unicode)
  • Any (Python object)
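
One possible shape for such implementation-agnostic type metadata (purely illustrative, names invented):

    import enum

    class DataType(enum.Enum):
        BOOL = "bool"
        INT8 = "int8"
        INT16 = "int16"
        INT32 = "int32"
        INT64 = "int64"
        UINT8 = "uint8"
        UINT16 = "uint16"
        UINT32 = "uint32"
        UINT64 = "uint64"
        FLOAT32 = "float32"
        FLOAT64 = "float64"
        DATE = "date"            # common subset of NumPy / pandas / Arrow
        TIMESTAMP = "timestamp"
        BINARY = "binary"        # bytes
        STRING = "string"        # unicode
        OBJECT = "object"        # any Python object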

I'd be hesitant initially to import pandas's row index concept into this interface. It seems like some consumers would more likely want to be able to reference the index by its name as another data column. If a simple OrderedDict-like interface is successfully adopted, we could always add it later.

2 Likes

The protocol / spec approach (i.e. don't build a library, build a spec and a test suite) makes a lot of sense to me. I'd like to see a minimal protocol / API / spec created with a test suite that other dataframe libraries can build on.

I’d like to see the same thing for NumPy as well.

This is different from a common data structure. A common data structure is helpful for some cases, but I think it's even more helpful to have a common way to talk about your data structure (type systems and interfaces).

1 Like

I agree that such a protocol is a good and worthy goal. I think a similar thing for data-types and ndarrays (tensors) is also useful.

I had not seen the C Data interface before. This is much closer to what I’ve been looking for and what motivated XND. Thanks for sharing.

Thanks, I did not know that!

Yes, at least groupby/aggregation and basic math (multiplication). I prefer to aim high.

I agree (and this is what vaex already does), but this API would not be enough. Let's imagine you want to get all chunks of a dataframe backed by Apache Arrow Flight, or some other remote dataset (like vaex's remote dataframe). When you are processing chunk N, you'd like the library to fetch chunk N+1 asynchronously. E.g.:

for chunk in df.chunk_iterator(chunk_size=1_000_000): # or `async for`
    # in the background the next chunk should be fetched from network/compute/GPU
    process(chunk)   # chunk can be another dataframe, or numpy 'matrix', in the back
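
For illustration, here is one way a producer could implement that prefetching iterator with a background thread. df.get_chunk(start, stop) is a hypothetical method returning rows [start, stop), clamped at the end of the dataframe:

    from concurrent.futures import ThreadPoolExecutor

    def chunk_iterator(df, chunk_size=1_000_000):
        with ThreadPoolExecutor(max_workers=1) as pool:
            future = pool.submit(df.get_chunk, 0, chunk_size)
            start = chunk_size
            while future is not None:
                chunk = future.result()
                if start < len(df):
                    # Kick off the next fetch before handing back the current
                    # chunk, so the consumer never waits on network/compute.
                    future = pool.submit(df.get_chunk, start, start + chunk_size)
                    start += chunk_size
                else:
                    future = None
                yield chunk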

I would really like to aim higher. In vaex we've come a long way in supporting a pandas-like API and having columns act like numpy arrays, all while not copying data. An example of this is https://github.com/vaexio/vaex/pull/415, where we are able to pass vaex dataframes to sklearn transformers without making any changes to sklearn. So my point is: Apache Arrow/@wesm aimed high (no memory copies, mmappable, active community) and seems to have hit the target, and I think in terms of dataframe API/interoperability we should not aim lower. With vaex I try to stretch the limits, so I see/know what is possible, so why settle for less?

I guess you wanted to tag me? Excellent work! I'll look into it. Do you think we should have CI running on that repo, or should the library be used/imported by others to check compatibility?

Thanks, I think it really would help compatibility and interoperability. For instance, vaex tries to mimic 1d (expressions) and 2d numpy arrays (dataframes), and if I were able to plug NumPy's tests into that, it would be really useful. I think the same would apply to Apache Arrow as well, or QuantStack's xtensor.

1 Like

I don't think this has to be in the dataframe protocol. The dataframe protocol only allows you to interpret each individual chunk the library gives you. APIs for async fetching etc. seem to be a separate, library-specific plumbing issue.

Similarly, just because the Python buffer protocol doesn't have any notion of async fetching doesn't mean it can't be used in an async fetching context.

1 Like

It's not that I think we shouldn't do this, but it's a sequencing question. If we try to solve too many problems simultaneously, it will be difficult to finish and deliver some preliminary things that have a lot of value. The interchange problem is relatively small and narrowly scoped, and we should be able to come up with a specification (basically a helper library defining ABCs and some helper classes for metadata) for this and deliver it fairly quickly. If this is well received, the next stage of work would be to add certain computation APIs to these ABCs. Does that make sense?

You're more than welcome to work on a proposal for computation APIs and add it to the discussion, but I think it would be a bad idea (risking analysis paralysis) to couple interchange APIs together with computation APIs in terms of getting community acceptance. Once there is community investment in something, incrementally adding new APIs will be easier.

Conversely, if we fail to achieve a consensus solution on interchange, the entire effort is kaput. So we limit our collective time investment.

2 Likes

I probably should have tagged you as well, sorry @maartenbreddels. I implemented Maarten's suggestion of a test suite testing Stephan's definition. [I also discovered I'm still a "new user", so it would not have let me mention two people!]

Travis is currently running on the repo (https://travis-ci.com/github/tacaswell/dataframe_spec), or it can also easily be embedded in a tool's own test suite (assuming it uses pytest).

I agree!

I also strongly agree with Antoine and Wes about this being a sequencing issue. There are going to be lots of devils in the details that need to be worked out. It is better to confront those details with a limited proposal, while a major re-think is still possible. Further, the lighter-weight the thing we start with, the easier it is to get wrappers / buy-in from libraries, so we can get some real-world usage.

2 Likes