I think adding Categorical to that list is also important.
If we want to add some __dataframe_interface__ (or similar) method, it should be further discussed what that would look like. There’s agreement here from a couple of people, but I’m not convinced myself that a dictionary is the best way for downstream applications to consume dataframe data. Constructing a dictionary column-by-column is going to be very expensive for many implementations so it likely won’t be adopted.
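To make the cost concern concrete, here is a minimal sketch, assuming a hypothetical row-oriented implementation. `RowStoreFrame` and `to_column_dict` are names invented for illustration and do not come from any real library. For a layout like this, producing a column dictionary means visiting and copying every value:

```python
class RowStoreFrame:
    """Hypothetical dataframe that stores its data as a list of row tuples."""

    def __init__(self, names, rows):
        self.names = list(names)
        self.rows = [tuple(row) for row in rows]

    def to_column_dict(self):
        # Each column is rebuilt by walking every row, so the export touches
        # all rows * columns values and copies them into new lists.
        return {
            name: [row[i] for row in self.rows]
            for i, name in enumerate(self.names)
        }


frame = RowStoreFrame(["a", "b"], [(1, "x"), (2, "y"), (3, "z")])
as_dict = frame.to_column_dict()
```

A columnar implementation could hand over existing buffers instead, which is why the expense depends so heavily on the storage layout.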
@tacaswell Why aren’t we discussing a to_pyarrow for downstream consumption? This would be much better than a dictionary and it’s a standard format that many systems already are using/implementing. I don’t see why we need to discuss a completely new interchange when we have pyarrow which can interchange even between languages. Is there a reason we should not be considering Arrow? If types are the issue, there are ways to handle that.
@wesm Modin has this ABC notion in the dataframe query compiler interface. It does need some work and better documentation, but I think it is worth having a solid place to start that is grounded in theory. This interface is currently being implemented by multiple groups in academia, larger companies, and startups, so if adoption is a major concern there’s already headway here. I don’t think many people here are familiar with Modin’s architecture, so I will outline the interface here.
The QueryCompiler interface is designed to be a smaller API than pandas while still capturing all its unique behaviors, such that all operators only have one way of being written. We place importance on the data model discussed in this other discussion. The interface provides a uniform way for the dataframe to be consumed, whether by an API layer or by some other application.
I think it is better to let implementations decide which parts they will implement or how they will implement them (similar to some historical SQL systems) to keep things generic. If a downstream application calls df.index it is up to the implementation how that is handled (throwing error, getting column(s), etc.).
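As a rough sketch of that idea, assuming nothing about Modin's actual QueryCompiler API: a small abstract base class can define the required operations and leave optional ones such as index to each implementation. Every name below (`DataFrameInterface`, `LabeledFrame`, `UnlabeledFrame`, `describe`) is hypothetical and only meant to show the shape of the design.

```python
from abc import ABC, abstractmethod


class DataFrameInterface(ABC):
    """Hypothetical minimal interface; not Modin's real QueryCompiler."""

    @abstractmethod
    def column_names(self):
        """Return the column names in order."""

    @abstractmethod
    def get_column(self, name):
        """Return the values of one column as a list."""

    def index(self):
        # Optional operation: the default is to refuse, and an
        # implementation may override this however it sees fit.
        raise NotImplementedError(f"{type(self).__name__} has no row labels")


class UnlabeledFrame(DataFrameInterface):
    """Implementation that chooses not to support row labels."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def column_names(self):
        return list(self._columns)

    def get_column(self, name):
        return list(self._columns[name])


class LabeledFrame(UnlabeledFrame):
    """Implementation that chooses to expose row labels."""

    def __init__(self, columns, labels):
        super().__init__(columns)
        self._labels = list(labels)

    def index(self):
        return list(self._labels)


def describe(frame):
    """Downstream consumer that tolerates either choice."""
    try:
        labels = frame.index()
    except NotImplementedError:
        labels = None
    return {"columns": frame.column_names(), "labels": labels}


labeled = describe(LabeledFrame({"a": [1, 2]}, ["x", "y"]))
unlabeled = describe(UnlabeledFrame({"a": [1, 2]}))
```

The consumer only depends on the abstract methods, so it keeps working whether an implementation raises, returns labels, or maps index onto column(s).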
This spec can be moved into a separate project, which is rather easy to do since it’s just an interface anyway. This interface does solve the “common data structure” problem.