This is a challenging problem because the “protocol” is intimately related to the in-memory data model of the returned data – not only type metadata but the byte/bit-level layout.
Per Stephan’s comment:
Even what is meant by “1D array” needs to be pinned down. NumPy arrays have been shown (I can add references) to be inadequate, in general, as a memory model for structured data analytics and for very large datasets. Some reasons:
- Lack of a coherent/consistent missing-value representation (pandas does its own bespoke things)
- Strings are an afterthought relative to numeric data
- Lack of support for nested types (values that are themselves arrays, structs, or unions)
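For illustration, here is a minimal sketch of those three limitations using plain NumPy (nothing here is specific to any proposal):

```python
import numpy as np

# 1. Missing values: NumPy has no integer NA, so introducing a missing
#    value silently promotes the whole array to float64 (NaN as sentinel).
with_missing = np.array([1, 2, np.nan])
assert with_missing.dtype == np.float64

# 2. Strings: variable-width text typically falls back to object arrays
#    holding boxed Python str objects (pointer-chasing, no contiguity).
strings = np.array(["a", "bb", "ccc"], dtype=object)
assert strings.dtype == object

# 3. Nested types: list- or struct-valued elements likewise become
#    object arrays, one Python object per element.
nested = np.array([[1, 2], [3]], dtype=object)
assert nested.dtype == object
```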
Of course, I’ve spent the last 4+ years working on Apache Arrow, which seeks to be a universal, language-agnostic data frame memory model, but I’m not necessarily going to advocate for adopting it (even though that’s what I want everyone to be using). Note that Arrow just adopted a C Data Interface, inspired by the Python buffer protocol, to greatly simplify adding Arrow import/export to third-party libraries.
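For reference, the C Data Interface boils down to two small C structs (`ArrowSchema` and `ArrowArray`) that a producer fills in and a consumer reads. Here is a hedged `ctypes` sketch of the `ArrowSchema` layout (field order per the Arrow spec; the callback member is simplified to a void pointer here):

```python
import ctypes

class ArrowSchema(ctypes.Structure):
    """Sketch of the ArrowSchema struct from the Arrow C Data Interface."""

# _fields_ assigned after the class body so the struct can reference itself
ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),     # type encoded as a format string, e.g. b"l" = int64
    ("name", ctypes.c_char_p),       # optional field name
    ("metadata", ctypes.c_char_p),   # binary-encoded key/value metadata
    ("flags", ctypes.c_int64),       # e.g. nullability
    ("n_children", ctypes.c_int64),  # nested types carry child schemas
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", ctypes.c_void_p),    # producer-supplied destructor (simplified type)
    ("private_data", ctypes.c_void_p),
]

schema = ArrowSchema()
schema.format = b"l"  # int64, per the spec's format-string grammar
```

Like the buffer protocol, ownership is explicit: the consumer calls `release` when done, so producer and consumer need not share any library code.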
If the scope of what you are trying to accomplish is narrow, namely:
- The amount of data you intend to interchange is not large
- You don’t care much about strings and nested data
- You don’t care much about serialization / data structure conversion performance
- You don’t care much about interop with the extra-Python ecosystem
then I think a dictionary of NumPy arrays is fine.
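Under those narrowing assumptions, the interchange format could be as simple as the following sketch (the function name is purely illustrative):

```python
import numpy as np

def to_interchange(columns):
    """Hypothetical export: a plain dict of 1-D NumPy arrays, one per column."""
    return {name: np.asarray(values) for name, values in columns.items()}

data = to_interchange({
    "x": [1, 2, 3],
    "y": [1.5, 2.5, 3.5],
})
assert set(data) == {"x", "y"}
assert all(arr.ndim == 1 for arr in data.values())
```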
Some other notes:
- It’s actually very difficult to construct a pandas.DataFrame zero-copy, so whatever input format goes back to pandas is probably going to copy / serialize; you at least want to make that as fast as possible
- The benefits of a memory-mappable data frame representation are significant
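Both notes can be illustrated with plain NumPy (a sketch; pandas’ actual internals differ, and the file path below is arbitrary):

```python
import numpy as np
import tempfile, os

# Note 1: consolidating separate 1-D columns into one contiguous 2-D block
# (roughly what pandas' BlockManager does) necessarily copies the data.
x = np.arange(3)
y = np.arange(3, 6)
block = np.column_stack([x, y])
block[0, 0] = 99
assert x[0] == 0  # the block does not alias the source column

# Note 2: a memory-mappable layout lets a consumer open a column without
# reading (or even fitting) the whole dataset in RAM.
path = os.path.join(tempfile.mkdtemp(), "col.npy")
np.save(path, np.arange(1_000_000, dtype=np.int64))
col = np.load(path, mmap_mode="r")  # pages are faulted in lazily on access
assert col[123_456] == 123_456
```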