Would it be fair to say that C++ knowledge is mandatory to create language bindings for parquet?

There are lots of different questions in this thread :slight_smile: If you want to use the Parquet C++ library to write Parquet files, you must talk to it in a language it understands. Which means you probably need to give it Arrow data.

@antoine thanks for the pointer! I looked through it, and here is how I understood this would work. Am I on the right track? I’m still not sure I properly understand how this is all supposed to work :slight_smile:

We would in Julia allocate arrays and data structures that conform to the C Data interface layout. Right now I think that should be fairly straightforward. So then we have a pointer to one of these C Data interface structures.
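To make "conform to the C Data interface layout" a bit more concrete, here is a rough sketch for a plain int64 column without nulls. It is written in C++ rather than Julia purely for illustration. The `ArrowArray` struct is copied from the C Data Interface spec; `Int64Export`, `release_int64` and `export_int64` are made-up names, and the `std::vector` only stands in for whatever memory the producer (for example Julia) already owns.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Layout copied from the Arrow C Data Interface specification.
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

// Hypothetical producer-side state that keeps the memory alive until release.
struct Int64Export {
  std::vector<int64_t> values;  // stand-in for memory owned by the producer
  const void* buffers[2];       // [0] validity bitmap, [1] data
};

// Release callback: free the producer state and mark the struct as released.
static void release_int64(ArrowArray* array) {
  assert(array->release != nullptr);
  delete static_cast<Int64Export*>(array->private_data);
  array->release = nullptr;
}

// Hypothetical helper: describe `values` as an int64 Arrow array without nulls.
void export_int64(std::vector<int64_t> values, ArrowArray* out) {
  auto* state = new Int64Export{std::move(values), {nullptr, nullptr}};
  state->buffers[0] = nullptr;               // no nulls, so no validity bitmap
  state->buffers[1] = state->values.data();  // the struct only points at the data
  out->length = static_cast<int64_t>(state->values.size());
  out->null_count = 0;
  out->offset = 0;
  out->n_buffers = 2;
  out->n_children = 0;
  out->buffers = state->buffers;
  out->children = nullptr;
  out->dictionary = nullptr;
  out->release = &release_int64;
  out->private_data = state;
}
```

Whoever receives the struct reads the values through `buffers[1]` and calls `release` once it is done, which is how the producer learns the memory can be freed.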

Do I then call one of the functions from bridge.h and pass this pointer to the structures we allocated in Julia to it? And then I get something back from those functions that I can for example pass to the parquet C++ writer? And all of that wouldn’t require that a copy of the original array data structures has to be made?

That’s the idea… But the point of the C Data Interface is to be able to expose or ingest Arrow data without taking a dependency on the Arrow C++ library. If you already plan to take a dependency on the Arrow C++ library (for example because you want to use the C++ Parquet implementation), then I’m not sure taking a detour through the C Data Interface is useful. You can just as well construct a regular C++ Array instance around your data (unless you’re extremely uncomfortable with C++, but proficient in C, in which case the C Data Interface may help).

The example given in the spec may point you to the kind of scenarios where the C Data Interface is really useful. Say database engine FooDB wants to expose a C client API that gives out Arrow-compatible data, but without burdening itself with a dependency on Arrow C++ (because other client APIs are available). Then it can expose a C client API that basically gives out a C struct ArrowArray.
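As a rough sketch of the FooDB scenario, the type-description half of such an API could look something like this. The `ArrowSchema` struct and the `"i"` format string for int32 come from the C Data Interface spec; `foodb_export_int32_schema`, `release_schema` and `kArrowFlagNullable` are hypothetical names, and nothing here depends on Arrow C++.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Layout copied from the Arrow C Data Interface specification.
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

constexpr int64_t kArrowFlagNullable = 2;  // ARROW_FLAG_NULLABLE in the spec

// Release callback: free the copied column name and mark the struct as released.
static void release_schema(ArrowSchema* schema) {
  delete static_cast<std::string*>(schema->private_data);
  schema->release = nullptr;
}

// Hypothetical FooDB entry point: describe a nullable int32 column.
void foodb_export_int32_schema(const char* column_name, ArrowSchema* out) {
  assert(column_name != nullptr);
  // Copy the name so it stays valid until the consumer calls release.
  auto* name = new std::string(column_name);
  out->format = "i";  // "i" is the spec's format string for int32
  out->name = name->c_str();
  out->metadata = nullptr;
  out->flags = kArrowFlagNullable;
  out->n_children = 0;
  out->children = nullptr;
  out->dictionary = nullptr;
  out->release = &release_schema;
  out->private_data = name;
}
```

A client that does link against Arrow C++ can hand such a struct to the import functions in bridge.h; a client that does not can still read the fields directly.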

Ok, I am figuring things out as I go along. Seems like the first step is to follow the steps in Building Arrow C++:

git clone https://github.com/apache/arrow.git
cd arrow/cpp
mkdir release
cd release
cmake .. -DARROW_PARQUET=ON
make parquet

and this will build the parquet library used in here

Now I just need to come up with the table in

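#include "arrow/io/file.h"         // arrow::io::FileOutputStream
#include "parquet/arrow/writer.h"  // parquet::arrow::WriteTable

{
   // Open the output file; the macro throws if opening fails.
   std::shared_ptr<arrow::io::FileOutputStream> outfile;
   PARQUET_ASSIGN_OR_THROW(
      outfile,
      arrow::io::FileOutputStream::Open("test.parquet"));

   // Write `table` (an arrow::Table) with a row group size (chunk_size) of 3.
   PARQUET_THROW_NOT_OK(
      parquet::arrow::WriteTable(table, arrow::default_memory_pool(), outfile, 3));
}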

and turn that into a function so I can use CxxWrap.jl.

Now, it seems like a few things still need to be done.

  1. Write a Julia DataFrame into an Arrow blob structure (potentially leveraging Arrow.jl)
  2. Write a C++ function, wrapped with CxxWrap.jl, that calls the Parquet write function (yet to be written) to write the Arrow blob into a Parquet file

These are notes for me in case I forget. Also for others to let me know whether I am on the right track.