A dataframe protocol for the PyData ecosystem

I agree that this is well motivated.

In the spirit of looking for a minimal complexity solution, let me throw out a concrete proposal: The return value of __dataframe_interface__() should be a Python dict with values that are each convertible to a 1D array, all of which must have the same length.
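For concreteness, here is roughly what a minimal producer could look like; this is just a sketch of the idea, and the class name and internal attribute are hypothetical:

class MyDataFrame:
    """Toy columnar container used only to illustrate the proposal."""

    def __init__(self, columns):
        # columns: dict mapping column name -> array convertible value
        lengths = {len(v) for v in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self._columns = dict(columns)

    def __dataframe_interface__(self):
        # Return a plain dict of array convertible values, per the proposal.
        return dict(self._columns)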

Array convertible

Exactly what “array convertible” means should be defined more carefully. One necessary property is that array convertible objects must satisfy np.asarray(obj).ndim == 1. But we probably should be more restrictive than that, to avoid over-specialization on quirks of NumPy.

Here are some examples of possibly “array convertible” values that are worth considering:

  • Unnested Python list objects with “scalar” values.
  • 1D NumPy arrays.
  • Objects that support Python’s buffer protocol with ndim == 1.
  • Objects that support NumPy’s __array_interface__ or __array__ (and that return 1D arrays).

Not every Python Sequence should be array convertible. For example, str definitely is not. tuple is a bit of a gray area: NumPy and pandas often interpret tuples as lists, but in idiomatic Python tuples correspond to heterogeneous collections, so it’s a little strange to treat them as arrays.
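As a rough illustration of those criteria, a consumer could gate values on something like the following sketch (the precise rules are exactly what the protocol text would need to pin down):

import numpy as np

def is_array_convertible(value):
    """Heuristic check for the 'array convertible' criteria listed above."""
    # Strings are Sequences but should not be treated as arrays (see above).
    if isinstance(value, (str, bytes)):
        return False
    # Tuples are deliberately left out here (the gray area discussed above).
    if isinstance(value, tuple):
        return False
    # Plain lists, ndarrays, buffer-protocol objects, and objects exposing
    # __array_interface__ / __array__ are accepted if they yield a 1D array.
    if isinstance(value, (list, memoryview, np.ndarray)) \
            or hasattr(value, "__array_interface__") \
            or hasattr(value, "__array__"):
        try:
            return np.asarray(value).ndim == 1
        except Exception:
            return False
    return False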

Example uses

A plotting library like seaborn should be able to sanitize a dataframe into a type-safe format with something like the following:

import numpy as np
import pandas as pd

def convert_to_dataframe(obj):
    if not hasattr(obj, "__dataframe_interface__"):
        obj = pd.DataFrame(obj)
    df = obj.__dataframe_interface__()
    return {str(k): np.asarray(v) for k, v in df.items()}

A more sophisticated computational library like scikit-learn might choose to check for more expansive protocols for working with array data, like those from NEP 18 or NEP 37, before falling back to using np.asarray(). This might allow for manipulating DataFrame data in a much more efficient manner, e.g., using a distributed backend like Dask or running on a GPU.
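A hedged sketch of that idea: keep any value that already implements NEP 18's __array_function__ (so duck arrays such as Dask or CuPy arrays pass through untouched) and only call np.asarray() on everything else. The function name is illustrative, not part of the proposal:

import numpy as np

def coerce_columns(df_dict):
    """Pass duck arrays through; coerce everything else to NumPy."""
    out = {}
    for name, value in df_dict.items():
        if hasattr(value, "__array_function__"):
            # NEP 18 duck array (e.g. dask.array, cupy): keep as-is so the
            # backend can stay distributed / on-device.
            out[str(name)] = value
        else:
            out[str(name)] = np.asarray(value)
    return out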

Implementation notes

  • I chose dict for simplicity, but perhaps it would be better to allow for any object that satisfies the Mapping interface from collections.abc, e.g., to allow for immutability. If so, we should be clear that the full Mapping interface must be satisfied, and the semantics of operations should not deviate from those of dict (a read-only Mapping option is sketched after this list).
  • The protocol itself might need to consider immutability, so users can choose whether they want to copy data or use views.
  • My proposal intentionally omits row labels, because I don’t think those are a core part of the dataframe abstraction.
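As one low-tech option for the immutability point above (a sketch only, using the standard library's MappingProxyType; the class here is hypothetical), a producer could hand out a read-only Mapping so consumers cannot mutate its internal column dict:

from types import MappingProxyType

class ImmutableFrame:
    """Toy producer that returns a read-only Mapping of its columns."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __dataframe_interface__(self):
        # MappingProxyType satisfies collections.abc.Mapping but rejects
        # item assignment, so consumers cannot mutate the producer's dict.
        return MappingProxyType(self._columns)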

Something else worth noting is that “things that can be converted into a DataFrame” is a bigger category than just “DataFrame” objects. Good examples might include patsy.DesignMatrix and xarray.Dataset.

This is a challenging problem because the “protocol” is intimately related to the in-memory data model of the returned data – not only type metadata but the byte/bit-level layout.

Per Stephan’s comment:

Even what is meant by “1D array” must be analyzed. NumPy arrays have been shown (can add references) to be inadequate as a memory model (in general) for structured data analytics and for very large datasets. Some reasons:

  • Lack of coherent/consistent missing value representation (pandas does its own bespoke things)
  • Strings are an afterthought relative to numeric data
  • Lack of support for nested types (values that are themselves arrays, structs, or unions)

Of course, I’ve spent the last 4+ years working on Apache Arrow which seeks to be a universal language-agnostic data frame memory model, but I’m not going to advocate necessarily for adopting that (even though that’s what I want everyone to be using). Note that Arrow just adopted a C Data Interface inspired by the Python buffer protocol to greatly simplify adding Arrow import/export to third party libraries.

If the scope of what you are trying to accomplish is narrow, namely:

  • The amount of data you intend to interchange is not large
  • You don’t care much about strings and nested data
  • You don’t care much about serialization / data structure conversion performance
  • You don’t care much about interop with the extra-Python ecosystem

then I think a dictionary of NumPy arrays is fine.

Some other notes:

  • It’s actually very difficult to zero-copy construct a pandas.DataFrame, so whatever input format is used to go back to pandas is probably going to copy / serialize. You at least want to make this as fast as possible (a quick memory-sharing probe is sketched after this list)
  • The benefits of a memory-mappable data frame representation are significant
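One way to probe whether a given pandas version copies on construction is to check memory sharing directly; the result depends on the pandas version, dtypes, and number of columns, so treat this as a probe rather than a guarantee:

import numpy as np
import pandas as pd

arr = np.arange(1_000_000, dtype="float64")
df = pd.DataFrame({"x": arr})

# True only if pandas kept a view of the original buffer.
print(np.shares_memory(arr, df["x"].to_numpy()))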

This is a bit self-serving, but I agree with @wesm that Apache Arrow would be a great start, and I believe in it. I think strings alone are important enough to design with them in mind from the beginning.


About Seaborn:

Due to the author’s refusal to expose even basic calculated parameters for review, I don’t think it’s appropriate to regard Seaborn as anything more than pretty default charts.

For visualization-driven analyses where calculated parameters are more important than pretty charts, I suggest something like Yellowbrick instead (which I also have no stake in):

Yellowbrick extends the Scikit-Learn API to make model selection and hyperparameter tuning easier. Under the hood, it’s using Matplotlib.

https://www.scikit-yb.org/en/latest/

Refocusing the discussion on the original topic:

That being said, what are the features we need in a .to_dataframe() interface that [pydata ecosystem] tools can easily call, so that we can limit further fragmentation due to performance optimizations?

Do we have a list of all of the different DataFrame API implementations?

https://pandas.pydata.org/pandas-docs/stable/ecosystem.html#out-of-core lists a few.


Though the name “dataframe” comes from R, an obj.to_dataframe() could call obj.__dataframe_interface__(); but I don’t know which API we would expect such an object to implement (or how to indicate whether that will make a copy).


This may be your opinion, but factually it is not true (does R have row indexing?). Go back and look at pandas 0.1 (still on PyPI) – these were distinct visions from day one.

This sounds like an ad hominem. Can we avoid such comments on this forum?


Part of my motivation for starting the discussion on the data model was to have a principled place to start a discussion like this. It is tempting to jump in and start discussing protocols and APIs, but there’s not even a consensus on what a dataframe is, which will make a protocol definition impossible. It’s also made more complicated because no project wants to be labeled “not a dataframe” or “dataframe-like” because it does not conform to some standard data model/API (think SQL and NoSQL).

As I said in the other thread, being inclusive of projects that self-brand as dataframes will be to the detriment of whatever protocol or API gets developed, and thus of the user. I do not think it useful to tie future APIs to past APIs, but I do think it’s important to tie future semantics to past semantics.

If we focus on the common subset of all dataframe projects, we will land on relational algebra/SQL, which I think is uninteresting for the majority of the PyData community and solved from an academic perspective. There would be little to discuss in this context.

I agree with the above that Apache Arrow is a good start for the standard format. There are some things I’d like to see supported for it to replace compute kernels completely (transpose, for example). I’m not convinced that an API/protocol must be tied to a standard memory format; the caller just needs a standard way of consuming the result of the API/protocol call. This can be decoupled from the memory layout, especially if the caller is just going to iterate over the values.

Even though this scope is narrow, I still think it would be extremely helpful. On the one hand the benefits of going in the direction of Arrow are apparent, but on the other hand they have been a long time coming (because it is a hard problem!).

Having a name and a library-independent way to ask “can I view you as a Mapping of equal-length 1D arrays?” would be awesome from a downstream point of view.

In my day-to-day work I see a lot of problems that do fit on a laptop / workstation, and we should not lose sight of the long tail of “small data” problems while focusing on getting things right at scale for “big data”. [1]

In many use cases (e.g. plotting), if you have nested structured data you are hosed anyway, because the tools that would consume this interface don’t know how to deal with it.

I read this proposal as orthogonal to the extra-Python interop story, as this is about intra-Python interop.

This is very close to the required API for the data kwarg in Matplotlib that we have had since 1.5, but I would be in favor of not requiring a literal dict so that pd.DataFrame and friends can return themselves. I think abc.Mapping is sufficient and should be assumed immutable by default.
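For reference, the data kwarg usage looks like this from the caller’s side: Matplotlib simply indexes data[key] for each string argument, which is why any Mapping-like object with array convertible values already works here.

import matplotlib.pyplot as plt
import numpy as np

data = {"t": np.linspace(0, 1, 50), "v": np.random.rand(50)}

fig, ax = plt.subplots()
# Matplotlib resolves the string keys by doing data["t"] / data["v"].
ax.plot("t", "v", data=data)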

I enthusiastically endorse this idea in general, and @shoyer’s concrete suggestion in particular.

[1] In many fields of science, if you look at the aggregate data size it looks like “big data”, but it is split up into so many small data sets that you never have to grapple with it “at scale”. What we really have is a truckload of small-to-medium sized data problems.


I think it’s important to carefully consider the things that would be traded away by continuing to build on top of NumPy in 2020:

  • Missing values (NumPy doesn’t have this, except via numpy.ma which few people use)
  • Native strings (people use heap PyString/PyBytes, which are slow to process; fixed-size strings are too limited)
  • Nested data (people use PyObject arrays of dicts, lists, tuples, etc., which use a lot of memory and are slow to process)
  • Memory mapping (for all data types – including strings – not just numbers)
  • Categorical data (pandas implements synthetic categorical data on top of NumPy)

If many projects adopt a data frame interface / protocol that does not have these things as first-class citizens, in the long run it will create a lot of problems and ultimately fragment use cases (the ones that can live without these features, and the ones that cannot).

I have the feeling that by layering the API, we could split the problem.

At the higher level, a SQL-like API would provide a well-understood abstraction that can be plugged into all the implementations, with obvious performance issues; but if we agree that small data is still pervasive and that we mostly split and aggregate anyway, this high-level API is still useful for accessing data in a portable way.

The second level would be more pandas-like, with unknown complexity for some operations and magic behind the scenes, but not too unreasonable. Pandas returns views for slices but not for fancy indexing; well, OK. The doc is clear, even if all the beginners stumble on it. Pandas can wrap an HDF5 container, and you pay a price, but the trade-offs are described in the doc.

The lower level would expose a hopefully reasonable set of low-level layouts and implementations. Relying on raw NumPy arrays would be one; chunked arrays could be another (for Dask and PyArrow), and there are probably a few others.

Application code could decide to use any level and, if it does not understand the third level, fall back to the second or the first.

Still, the semantics of tables are well understood, but xarray is different and does not map easily to tables. Should multidimensional tables with named axes be supported?

The format of Python strings is a huge issue in my experience, since they cannot be memory mapped. There are workarounds, but they cost quite a lot. That might be something to address later, though.

The point of view that I (and I assume @GaelVaroquaux) am coming from is that there already exist libraries that build on top of NumPy. Currently we have APIs that look like

df = DataFrame(...)
ret = do_some_work(df['by'], df['ripping'], df['the'], df['df'], df['apart'])

At this point we are already assuming numpy arrays of numbers (more-or-less). This proposal gives us a path to:

df = DataFrame(...)
ret = do_work_smarter(df)

and is almost completely orthogonal to whether this is literally built on NumPy, how it is encoded at the memory level, or the philosophical underpinnings of what a DataFrame is.

This is a way for those of us at the next level closer to the users to be able to exploit all of the work y’all are doing without going all-in on any implementation. Whether in the long run we start talking about how NumPy arrays duck-type as an Arrow column or the other way around :woman_shrugging:, in the short term this gets us a path to start using this stuff natively ASAP.

Part of the Matplotlib CZI grant is supporting @story645 to think about how Matplotlib wants to interact with structured data in a consistent way.


To me this seems to be calling for a middleware / wrapper layer that inserts an abstraction between the developer-consumer (e.g. matplotlib) and the data representation (which could be pandas, Arrow, or something else). So rather than protocol_df[col] returning its naked data representation, you would call a method protocol_df[col].to_numpy() to ask for the data to be returned to you as a NumPy array. Then it would be up to libraries “that provide DataFrames” to implement and export this middleware API.

pandas for example could implement the middleware API and provide methods for the data to be exported to you in the format you require. So then matplotlib codes against the middleware API and not pandas directly.

One of the challenges with this will be implementation-specific serialization options (for example, how to coerce the underlying representation into the desired output data structure – NumPy datetime64 types spring to mind).


Yes, that seems fair. col.to_numpy() seems to overlap with what __array_protocol__ is trying to do?

To first order, failing seems OK to me. Same with columns of nested data: if it can’t be cleanly cast to an ndarray (I am also interested in columns of (n-1)-D arrays), then that is a good sign that the consumer could not cope with it anyway.

I also suspect that this will be the first of a number of increasingly rich interfaces/middle layers:

  1. “I am a mapping of array like things (or things you can get array like things from)”
  2. “… and I support groupby”
  3. “… and I support row labels”

At the end of the day I would like to be able to ask the object I am handed “do you support the feature I need?” and then decide to cast it to some type that does, or raise.
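As a rough sketch of that negotiation (the feature spellings below are entirely made up for illustration):

def supports(obj, feature):
    """Illustrative capability probe; all of these spellings are hypothetical."""
    required_attr = {
        "columns": "__dataframe_interface__",   # level 1: mapping of array-likes
        "groupby": "__dataframe_groupby__",     # level 2: aggregation support
        "row_labels": "__dataframe_labels__",   # level 3: row labels
    }[feature]
    return hasattr(obj, required_attr)

# A consumer can then decide to proceed, convert, or raise:
# if not supports(df, "groupby"):
#     raise TypeError("this consumer needs groupby support")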

I want to stress again that even level 0 (however it is spelled) would be a major step forward and does not require boiling the ocean to get there.

Sure, but def to_numpy(self, ...) might have some additional options that cannot be passed with __array_protocol__. So it would look like this:

# "Protoframe" implementation for Library

class LibraryProtoframe(protoframe.DataFrame):

    def __getitem__(self, key):
        # implementation


class LibraryColumnWrapper(protoframe.Column):

    def __array_protocol__(self, ...):
        return self.to_numpy()

    def to_numpy(self, null_sentinel=DEFAULT_PLACEHOLDER):
        # implementation
        ...

class LibraryFrame:

    def __dataframe_protocol__(self):
        return LibraryProtoframe(self)

So in matplotlib you would write

import numpy as np

def function(data, attr='x', **kwargs):
    if hasattr(data, '__dataframe_protocol__'):
        protoframe_data = data.__dataframe_protocol__()
        attr = np.asarray(protoframe_data[attr])
    # and so on
    ...

For what it’s worth, Modin has this “middleware abstraction” agnostic to the API layer, which multiple groups are using to program against. In Modin, this abstraction is the QueryCompiler interface, and is not tied to any specific API or implementation. An abstraction like this probably does belong in pandas, and I have offered to contribute the parts of it that make sense to pandas once it’s more mature.

I’ll be interested to take a closer look at this.

If I’m understanding some of the motivations for this discussion, many projects want to accept “dataframe-like data” at call sites but want to avoid:

  • Having a hard dependency on the pandas library
  • Requiring users (or other third party libraries they want to interoperate with) to coerce their data into pandas

I agree with both of these motivations.


Just to clarify, it doesn’t need to literally be a NumPy array, just something that can be cast to a NumPy array, or better yet supports NumPy semantics and duck typing for a key set of operations that haven’t totally been pinned down yet and might intersect with @devin-petersohn’s work (the mpl middleware is also a projection-query model).

Also, by “returned as a NumPy array”, do you mean a view of the data or a copy?


It would be a view if the underlying data representation supports it.

For example, Arrow’s to_numpy returns a view/is zero-copy for numeric data without nulls:

In [6]: import pyarrow as pa

In [7]: arr = pa.array([1, 2, 3, 4], type='float32')

# This is a view / zero-copy
In [9]: arr.to_numpy()
Out[9]: array([1., 2., 3., 4.], dtype=float32)

In [10]: arr = pa.array([1, 2, None, 4], type='float32')

In [11]: arr.to_numpy()
ArrowInvalid: Needed to copy 1 chunks with 1 nulls, but zero_copy_only was True

# This copies
In [12]: arr.to_numpy(zero_copy_only=False)
Out[12]: array([ 1.,  2., nan,  4.], dtype=float32)

Interesting discussion, and it’s great that we now have a place to discuss this. Maybe it is useful to split the discussion into a few use cases:

Small data, uses all rows
I like @shoyer’s idea of a Mapping with array convertible values: simple and easy.

I think for libraries such as matplotlib, where you should expect the data to fit in memory, this should be good enough. I think actually most libraries could already do this. But I do consider this subset a solvable problem.
It’s true that, with Arrow in mind, some columns would not be convertible to a NumPy array, but I guess you cannot plot them either, so failing there should be fine (and only the columns actually used should be converted).

Large data, needs aggregation
When data becomes larger, an API for aggregation becomes more important, because you don’t want to visualize/show a billion individual rows; you want to group by a particular column (or maybe binby, in vaex’s terms). For libraries such as plotly express (also seaborn?), it would be extremely useful to be able to pass in a vaex/cudf dataframe and expect it to work.

Large data, but need to pass over all rows
In cases where you don’t want aggregation, but you need to pass over all the data, you want to ask a dataframe for chunks of data at a time. In vaex we’ve been hammering on this API for quite a bit in order to pass all the data to scikit-learn incremental learners in an efficient/parallel way (it will, for instance, submit the evaluation of the next chunk before it yields the current one) such that single-threaded consumers don’t have to wait for the next chunk.

Also, writing Arrow or Parquet in chunks, where you cannot hold the whole dataframe in memory (and some columns aren’t even materialized, they are ‘virtual columns’ in vaex) is a common use case for dealing with a dataframe.
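As a sketch of what chunk-wise consumption could look like from the consumer’s side, here is how chunks from such an API might be fed to a scikit-learn incremental learner. The iter_chunks method and chunk_size argument are hypothetical spellings, not an existing vaex API:

import numpy as np
from sklearn.linear_model import SGDClassifier

def fit_incrementally(df, feature_names, target_name, classes):
    """Train chunk by chunk; each chunk is assumed to be a dict of 1D arrays."""
    model = SGDClassifier()
    for chunk in df.iter_chunks(chunk_size=100_000):   # hypothetical chunking API
        X = np.column_stack([np.asarray(chunk[name]) for name in feature_names])
        y = np.asarray(chunk[target_name])
        model.partial_fit(X, y, classes=classes)        # incremental learning
    return model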

Viewing a dataframe as a matrix
As I mentioned here: Dataframe Data Model discussion: What is this popular abstraction?
I really like that the Modin paper started naming things. For instance, a matrix dataframe (all the same type, int or float) is something that can (in theory) be passed to sklearn as a matrix. We are working on implementing NEP 13/18 on top of a matrix dataframe for vaex (https://github.com/vaexio/vaex/pull/415), but if we had a more common API, this might be able to live outside of Vaex. I think having a common name helps, if only for consistent error messages (e.g. ‘This dataframe is not a matrix dataframe, column X is of type QXY’); having an API to ask whether these conditions hold might also be useful.
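As a minimal sketch of how that question could be asked through the dict-of-columns protocol discussed earlier (the helper below is illustrative, not a vaex or Modin API):

import numpy as np

def as_matrix(df_dict):
    """Return a 2D array if every column is numeric, else raise."""
    arrays = []
    for name, value in df_dict.items():
        arr = np.asarray(value)
        if not np.issubdtype(arr.dtype, np.number):
            raise TypeError(
                f"This dataframe is not a matrix dataframe, column {name!r} "
                f"is of type {arr.dtype}"
            )
        arrays.append(arr)
    return np.column_stack(arrays)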

To summarize, I fully agree that a dataframe protocol/API would be really useful, but I think different use cases require different features/APIs.

Possible path forward
I could imagine a story similar to https://promisesaplus.com/, where instead of trying to build a library that would beat all JavaScript Promise implementations, they laid down a spec with a test suite (https://github.com/domenic/promise-tests, https://www.npmjs.com/package/promises-aplus-tests). What I like about it is that the spec is formalized by the unit tests, meaning that if you pass the tests, you follow the spec, no questions asked. (See https://www.youtube.com/watch?v=V2Q13hzTGmA for background on this.)
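In that spirit, a compliance suite for the dict-of-columns protocol sketched in this thread could start as a handful of pytest-style checks; a rough sketch, assuming the implementation under test provides a frame fixture:

import numpy as np

def test_returns_mapping_of_equal_length_1d_columns(frame):
    columns = frame.__dataframe_interface__()
    # Must behave like a Mapping of name -> array convertible value.
    assert hasattr(columns, "items")
    lengths = set()
    for name, value in columns.items():
        arr = np.asarray(value)
        assert arr.ndim == 1
        lengths.add(len(arr))
    # All columns share a single length (or the frame is empty).
    assert len(lengths) <= 1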
