Dataframe Data Model discussion: What is this popular abstraction?

I want to start a discussion around dataframes and how we can think about the dataframe space moving forward. I hope this can be a discussion/debate about the merits and data model of the dataframe. Please do not use this as a way to market your dataframe implementation, the discussion should be focused on the data model.

The lack of formalism around the dataframe is a problem because it makes optimization, implementation, and scaling difficult.

Discussion Questions:

  • Concretely, what is a dataframe? Is it a table? Is it a matrix? Is it something else?
  • Is it important to preserve the semantics of older implementations (S, R, pandas), or should some new definition emerge?
  • Does a formal definition of the dataframe matter to you?

To start the conversation, I would like to share my take on the subject. In January, a group of us at UC Berkeley submitted a pre-print with a formal definition for the dataframe (there is also a section in the paper with the origin story of the dataframe). Dataframes in the classical sense are not tables, nor are they matrices. We explain this in the paper: A dataframe can be viewed in two equivalent ways. From a relational viewpoint, dataframes are ordered relations with [explicitly] named rows and a lazily-induced schema. From a linear algebra viewpoint, dataframes are heterogeneous [lazily typed] matrices with added labels for rows and columns. Dataframes can do operations from both linear and relational algebra, and I think it’s important that these semantics are preserved.
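To make the two views concrete, here is a minimal pandas sketch (the row labels and values are made up for illustration, and pandas is used only as a familiar stand-in, not as the definition itself): explicit row labels, a column whose type is only induced lazily, and a transpose that swaps row and column labels.

```python
import pandas as pd

# Relational view: an ordered relation with explicitly named rows.
df = pd.DataFrame(
    {"population": [8.4e6, 2.7e6], "founded": [1624, "unknown"]},
    index=["NYC", "Chicago"],   # row labels, not just positional offsets
)

# Lazily induced schema: the "founded" column mixes an int and a string,
# so pandas stores it with dtype=object until a type is actually needed.
print(df["founded"].dtype)             # object

# Linear-algebra view: transpose swaps the row and column labels.
print(df.T.loc["population", "NYC"])   # 8400000.0
```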

Any definition will, by its nature, rule out some existing implementations, but it is important to establish this formalism so we can reason about scaling these properties. Historically, the way systems have scaled the dataframe is to ignore or remove dataframe properties or operators that are difficult to scale, rather than try to formalize them. For example, by this definition Spark DataFrames are not dataframes, because Spark is entirely relational in its implementation.

I do not think we should try to be as inclusive as possible with a definition, because that defeats the purpose of the formalism: if everything is a dataframe, we cannot optimize it. At the same time, it’s important to define the dataframe and clarify the distinctions from the relational table so that the semantics (not necessarily the API) are preserved. If dataframes are just tables, then database researchers will say that everything should simply be done with SQL and ORMs (and indeed many already do :smile:).

I’m interested to hear other people’s takes on the subject.

So I’ve been thinking a lot about generalizing Codd’s algebra into more or less a catch-all for the Matplotlib data model, so I’m very curious about the details of the limits you’re hitting when describing dataframes.

Are you working on Lara by chance? We cite the Lara works in our paper. It has a very nice and concise representation, and it is almost certainly possible to extend the Lara data model to support the dataframe data model. There are challenges with indexing/masking, but I think that should be possible to work through. I think hypothetically the lazy schema can be integrated into the Lara data model as well. If you are working on Lara I would love to chat.

Our preprint was meant to introduce dataframes to the database community, because many in that community still believe dataframes to be relations. Much of the way we describe things in the preprint is to draw connections and distinctions with the relational data model. There are still challenges with describing the lazy schema component (where not all data in a column must conform to the same type), and the algebra can be further reduced.
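As a small illustration of the lazy-schema point (again using pandas as a stand-in, with made-up values): a column can hold values of several types, and a concrete type is only induced when an operation demands one.

```python
import pandas as pd

# A column with no single enforced type: the schema is not resolved yet.
s = pd.Series([1, 2.5, "three", None])
print(s.dtype)                          # object

# A type is induced only when an operation demands one; values that do
# not conform surface at that point (here they are coerced to NaN).
print(pd.to_numeric(s, errors="coerce"))
```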

A glass shattering just went off in my head

I’m not sure row labels belong in the formal dataframe data model. This strikes me as more of a quirk of pandas than an intrinsic dataframe feature. In R, row names tend to be frowned upon and aren’t supported at all in one of the most popular dataframe implementations. In most cases, I think it’s more useful to think of row labels as just a special type of column(s).
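For what it’s worth, in pandas terms that view corresponds to a lossless round trip between an ordinary column and the row labels (a toy example, with invented column names):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NYC", "Chicago"], "population": [8.4e6, 2.7e6]})

# Promote an ordinary column to the row labels...
labeled = df.set_index("city")
print(labeled.loc["NYC", "population"])   # label-based lookup: 8400000.0

# ...and demote the labels back to an ordinary column, losing nothing.
print(labeled.reset_index())
```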

I’ve posted a bunch of links in this other thread, A dataframe protocol for the PyData ecosystem: [pydata] DataFrame implementations (Pandas, Dask, Vaex, xarray, pyspark, modin) and tools that use said interfaces, the Arrow docs (why Arrow is not another DataFrame API), and the W3C Model for Tabular Data and Metadata on the Web (https://www.w3.org/TR/tabular-data-model/#model). As is a common issue with normalization/denormalization into tabular datasets with references, I’m not sure where the demarcation point is between these two discussions.

Is it useful to have DataFrame columns with an overly-abstract “object” type, or missing data, or piecewise exclusions due to such schema variability?

DataFrames are obviously a useful data abstraction. Thinking in terms of iterable streams of data may be a better basis for developing adaptive systems that continue to learn as new data is discovered from new sensors with new schema.

I think it may just be that DataFrame APIs with object methods are easier to work with than ORM-less SQL. In general, relational algebra systems like SQL are far more rigorously and completely defined, but parametrization is hard and dangerous and SQL docs tend to suck compared to Python API docs.
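A toy comparison of the two styles (the table and column names are invented, and sqlite3 is used only to keep the sketch self-contained):

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"sku": ["a", "b", "c"], "price": [5, 50, 500]})

# Dataframe API: composable object methods, no string templating.
cheap = df[df["price"] < 100]

# ORM-less SQL: the same filter needs a connection, a query string,
# and careful parameter binding.
con = sqlite3.connect(":memory:")
df.to_sql("items", con, index=False)
cheap_sql = pd.read_sql("SELECT * FROM items WHERE price < ?", con, params=(100,))

print(cheap)
print(cheap_sql)
```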

All abstractions are leaky. Object-relational impedance mismatch is real. Schema evolution is hard; especially when there’s a DBA, multiple applications, a data warehouse, and a data lake. Data quality may be even harder; especially without objects that do client-side data validation. Dataframes are often good enough; until you start working with neural nets, tensors, graphs, EAV/RDF, and RDF*.

I’m not sure row labels belong in the formal dataframe data model.

@shoyer That’s an interesting point; we also considered making it syntactic sugar for a “special column”, but ultimately decided against it. That said, it can be implemented that way; there’s no rule against it :slight_smile:. Removing row labels altogether would effectively make a dataframe non-transposable, which I believe is fundamental to the optimization/definition of the dataframe.

Let’s take a common example: pivot. In SQL systems pivot is very annoying to implement because the output schema must be declared up front. Pivot elevates data into the metadata of the dataframe and performs a transpose + groupby. In this model, we can define an operator that moves data to/from the labels, solving the elevation problem. We can also define a transpose operator and a groupby operator, so pivot becomes a clean combination of those operators instead of inheriting the challenges of SQL systems. We talk more about the expressiveness in the paper I linked above.
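A rough pandas sketch of that decomposition (the column names are invented, and pandas’ set_index/unstack merely stand in for the label-movement and reshape operators of the algebra):

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["east", "east", "west", "west"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [10, 20, 30, 40],
})

# Built-in pivot: the values of "quarter" are elevated into the column labels.
wide = sales.pivot(index="region", columns="quarter", values="revenue")

# The same result from smaller pieces: move data into the row labels
# (set_index), then move one level of those labels up into the column
# labels (unstack): a promotion-to-metadata step plus a reshape.
same = sales.set_index(["region", "quarter"])["revenue"].unstack("quarter")

assert wide.equals(same)
```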

This strikes me as more of a quirk of pandas than an intrinsic dataframe feature.

This is a slight digression, but I hear this argument against pandas a lot: that it is quirky. Row labels are one of the most commonly used abstractions in pandas, so I don’t think they should be dismissed because other systems have chosen to frown upon their use or remove them. This is what I meant above by systems removing properties that are hard to scale or support efficiently. Users are not the ones advocating for removing properties from pandas; in fact, those who find the API “just right” outnumber those who don’t by roughly two to one.

I feel that developers and engineers often misidentify the problems users face. Users like pandas. If our goal is to solve problems for users, we cannot remove a property they find useful just because it is considered “bad practice” in other systems; that is exactly why people are not flocking from pandas to those systems. Instead, I feel we should treat it as a set of new challenges moving forward: supporting the semantics (not necessarily the API) users have grown to depend on, whether or not we yet know how to scale or optimize them.

Hi Devin,

your paper was a great read; I really liked that you started naming things. I think I’ll adopt the terms ‘homogeneous dataframe’ and ‘matrix dataframe’ for vaex as well, since we can treat those dataframes as a ‘numpy matrix’.

I agree with this. In vaex I’ve actually avoided adding row labels because I don’t agree with the design. I don’t consider it bad per se, just too limited. I don’t think there needs to be a ‘special’ column; I’m much more in favor of SQL’s distinction between primary key/unique columns and indexed columns (fast lookups). It’s a balance between usability and flexibility, I guess.

Row labels are fantastically useful for ways of organizing data, e.g., for wide dataframes and 3D “panel” datasets (which I know were some of Wes’ original motivating use-cases) and also for the scientific datasets for which we created xarray.

I am not advocating for removing row labels from pandas; they are absolutely a core part of it. But given that they are not present in many alternative dataframe models, let’s think carefully about whether they’re a core part of the generic data model.

In my mind, a notion of “indexed” columns feels much more useful than “row labels.” Even in pandas, there has been interest in moving the API in this direction.

Interesting that both of you are saying the same thing. I think it makes sense to have indexed column(s), and the data model isn’t really restricting that use case; it’s just a bit more general in this sense. It gets a bit strange when you start having null or missing values in this index column (to support missing labels), but the data model doesn’t prevent an implementation from requiring unique values for each row. The data model requires the interchangeability of column and row metadata (labels and types) because of the transposability of the structure. In the case of “no labels”, null values or positional notation can be used in the labels’ place (kind of like the default index in pandas); this would also probably be implementation-specific.
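A small pandas-flavored sketch of both points (these are implementation choices, not requirements of the data model): positional labels as the default, and labels that can themselves contain a missing value.

```python
import pandas as pd

# With no labels given, positional labels stand in (a RangeIndex in pandas).
df = pd.DataFrame({"x": [10, 20, 30]})
print(df.index)                   # RangeIndex(start=0, stop=3, step=1)

# The labels themselves are data-like: a missing label is representable,
# even if a given implementation chooses to forbid it or require uniqueness.
labeled = df.set_index(pd.Index(["a", None, "c"]))
print(labeled.loc["a", "x"])      # 10
```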

I guess it makes sense at this point to discuss whether dataframes should be transposable. In the definition and algebra linked above, transpose can usually be rewritten away internally, so it does not need to be materialized for most queries (think of UDFs defined along either the column or the row axis). I think it’s an important distinguishing feature of dataframes, and I’m interested to hear why others might disagree.
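For instance, here is a pandas-flavored sketch of the kind of rewrite I mean (the UDF and data are invented): a row-wise UDF over the transpose gives the same answer as a column-wise UDF over the original, so the transpose never has to be built.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

def spread(col):
    """UDF: range of the values along one axis."""
    return col.max() - col.min()

# Naive plan: materialize the transpose, then apply the UDF to each of its rows.
naive = df.T.apply(spread, axis=1)

# Rewritten plan: apply the UDF column-wise to the original frame instead;
# the answer is identical and the transpose is never materialized.
rewritten = df.apply(spread, axis=0)

assert naive.equals(rewritten)
```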

The Julia data ecosystem has also avoided row labels. We do have some table types that can have an index column. As far as I can tell, row labels really are mostly a pandas thing.
