I want to start a discussion around dataframes and how we can think about the dataframe space moving forward. I hope this can be a discussion/debate about the merits and data model of the dataframe. Please do not use this as a way to market your dataframe implementation, the discussion should be focused on the data model.
The lack of formalism around the dataframe is a problem because of the difficulty in optimization, implementation, and scalability.
- Concretely, what is a dataframe? Is it a table? Is it a matrix? Is it something else?
- Is it important to preserve the semantics of older implementations (S, R, pandas), or should some new definition emerge?
- Does a formal definition of the dataframe matter to you?
To start the conversation, I would like to share my take on the subject. In January, a group of us at UC Berkeley submitted a pre-print with a formal definition for the dataframe (there is also a section in the paper with the origin story of the dataframe). Dataframes in the classical sense are not tables, nor are they matrices. We explain this in the paper: A dataframe can be viewed in two equivalent ways. From a relational viewpoint, dataframes are ordered relations with [explicitly] named rows and a lazily-induced schema. From a linear algebra viewpoint, dataframes are heterogeneous [lazily typed] matrices with added labels for rows and columns. Dataframes can do operations from both linear and relational algebra, and I think it’s important that these semantics are preserved.
Definitions will rule out some existing implementations by their nature, but it is important to define this formalism so we can rationalize about scaling these properties. Historically the way systems have scaled the dataframe is to ignore or remove dataframe properties or operators that are difficult to scale, rather than try to formalize them. For example, by this definition Spark DataFrames are not dataframes because Spark is entirely relational in its implementation.
I do not think that we should try to be as inclusive as possible with a definition because it defeats the purpose of the formalism, if everything is a dataframe we cannot optimize it. At the same time, it’s important to define the dataframe and clarify the distinctions from the relational table so that semantics (not necessarily API) are preserved. If dataframes are tables then database researchers will say that everything should just be done with SQL and ORMs (and indeed many already do )
I’m interested to hear other people’s takes on the subject.