Would it be fair to say that C++ knowledge is mandatory to create language bindings for parquet?

I see. You are right. I must have confused it with Feather.jl because of this issue: https://github.com/JuliaData/Feather.jl/issues/124

That was the same dataset saved as Parquet from Python and R, which can’t be read back in Julia. I might submit an issue at some point.

I have opened up 5 issues now, see https://github.com/queryverse/ParquetFiles.jl/issues

Awesome, thanks!

Most of them are due to Parquet.jl currently not being able to handle certain column names. I opened a PR that fixes some of those issues and almost fixes a third problem :wink: I’ll need help with the others from the Parquet.jl author.

Ok. I am mulling whether to take up a contract to create a Parquet writer in Julia. The issue is that I have very little C++ experience. Would you say it’s feasible for an experienced Python/R programmer to learn enough C++ to get it going? Which parts of C++ should I focus on learning? Any book recommendations?

I have 40 hours to complete a basic writer. Basically one week of full time work.

If I were you, I would work on creating a native Julia implementation of the brand new Arrow C data interface https://github.com/apache/arrow/blob/master/docs/source/format/CDataInterface.rst. A small amount of wrapper code could then be created to take a Julia-generated blob of Arrow memory and write a Parquet file using the C++ API.

If instead you want to use the low-level Parquet write API, you can certainly do that. The main files you need to work with are src/parquet/file_writer.h and src/parquet/column_writer.h. I’m not sure how long it would take to learn enough C++ to work with these APIs since everyone is different. If you have experience with C programming in general and its value types (pointers, references, values) then the jump to C++ is easier since mostly you need to learn about how C++ classes work (constructors, destructors, etc.) and some C++11-specific features like std::move.
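To make the C++ features named above a little more concrete, here is a minimal sketch of a class with a constructor, a destructor and move support. `ColumnBuffer` and `consume` are hypothetical names invented for illustration; they are not part of the Parquet or Arrow API.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical buffer type, used only to illustrate constructors,
// destructors and std::move. Not part of Parquet or Arrow.
class ColumnBuffer {
 public:
  // Constructor: acquires the resource (zero-initialized storage for n values).
  explicit ColumnBuffer(std::size_t n) : data_(new double[n]()), size_(n) {}

  // Destructor: releases the storage automatically when the object dies.
  ~ColumnBuffer() { delete[] data_; }

  // Copying is disabled so that exactly one object owns the storage.
  ColumnBuffer(const ColumnBuffer&) = delete;
  ColumnBuffer& operator=(const ColumnBuffer&) = delete;

  // Move constructor: takes over the pointer instead of copying the values.
  ColumnBuffer(ColumnBuffer&& other) noexcept
      : data_(other.data_), size_(other.size_) {
    other.data_ = nullptr;
    other.size_ = 0;
  }

  // Move assignment: frees the current storage, then takes over the other's.
  ColumnBuffer& operator=(ColumnBuffer&& other) noexcept {
    if (this != &other) {
      delete[] data_;
      data_ = other.data_;
      size_ = other.size_;
      other.data_ = nullptr;
      other.size_ = 0;
    }
    return *this;
  }

  std::size_t size() const { return size_; }

  // Bounds-checked element access (checked only in debug builds).
  double& at(std::size_t i) {
    assert(i < size_);
    return data_[i];
  }

 private:
  double* data_;
  std::size_t size_;
};

// Takes the buffer by value, so callers hand over ownership with std::move.
std::size_t consume(ColumnBuffer buf) { return buf.size(); }
```

After `ColumnBuffer b(std::move(a));` the storage belongs to `b` and `a` is left empty, which is the same ownership pattern you run into when passing buffers around in larger C++ APIs.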

Aside from C++ language features, there might be some initial shock from working with C++ build systems and packaging libraries for installation.

On the bright side, C++ as a language isn’t going anywhere so the investment of time would pay dividends longer-term.

Thank you very much for the advice!

I read the C Data Interface spec. I only did one semester of C more than 10 years ago, but I think given some time I can learn it.

I see there are two formats, one for the schema and one for the data.

Here is my understanding of the general approach. Say I allocate two memory blobs, and in one I write the schema and in the other I write the data. Then I can call a C++ function (which is supposedly easy to do from Julia) that points to those two blobs and writes out a Parquet file?

Is this the page about writing Parquet files from C++ that I should be checking? Currently it has a TODO and it says write this.

https://arrow.apache.org/docs/cpp/parquet.html#writing

This new C data interface looks very promising!

Is there some more documentation about it somewhere?

Would we then for example compile Arrow C++ and the C interface (the glib stuff, I guess?) into a shared library, and then from Julia load those shared libraries, and then just call some C function a la write_parquet_file(filename, pointer_to_c_data_interface_structure) in these shared libraries? If so, that would be fantastic!

I think the info I couldn’t find right now is mostly what functions from the arrow libraries accept these pointers.
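For what it is worth, the shape of such a C entry point could look roughly like the sketch below. Everything here is hypothetical: `write_parquet_file` and `write_parquet_last_error` are made-up names (Arrow does not ship them), and the actual conversion and Parquet writing is left as a placeholder. The only point of the sketch is the boundary itself: a plain C signature that Julia can call, with C++ exceptions caught and turned into an error code so they never cross into the caller.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Forward declarations only. The real layouts are defined by the Arrow
// C Data Interface spec; this sketch just passes the pointers through.
struct ArrowSchema;
struct ArrowArray;

namespace {
// Message of the most recent failure on this thread.
thread_local std::string last_error;

// Placeholder for the real work (an assumption of this sketch): a real
// version would import the two structures into Arrow C++ and hand the
// result to the Parquet writer. It may throw, like most C++ code.
void write_impl(const std::string& filename, ArrowSchema* schema,
                ArrowArray* array) {
  if (filename.empty()) throw std::invalid_argument("empty filename");
  if (schema == nullptr || array == nullptr) {
    throw std::invalid_argument("null Arrow structure");
  }
  // ... conversion and writing would go here ...
}
}  // namespace

extern "C" {

// Hypothetical entry point: returns 0 on success and 1 on failure.
int write_parquet_file(const char* filename, ArrowSchema* schema,
                       ArrowArray* array) {
  try {
    if (filename == nullptr) throw std::invalid_argument("null filename");
    write_impl(filename, schema, array);
    last_error.clear();
    return 0;
  } catch (const std::exception& e) {
    last_error = e.what();  // remember the message for the caller
    return 1;
  } catch (...) {
    last_error = "unknown error";
    return 1;
  }
}

// Hypothetical accessor for the last error message on this thread.
const char* write_parquet_last_error() { return last_error.c_str(); }

}  // extern "C"
```

Because both functions have C linkage and use only pointers and integers, a shared library exporting them would be callable from Julia with `ccall`, without any C++ wrapper generator.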

I think the next step for the Julia community is to try to cross-compile the whole arrow stuff using https://binarybuilder.org/.

@davidanthoff So I take it you have the technical understanding here to know how to approach this? Basically, given my limited background in R, Julia, and Python, I don’t really know how to take this further.

Yes, I think I roughly know how to push this forward. I’ll play around with building the arrow stuff with Julia binary builder next, that is clearly the first step we need to take.

Ideally I’d like to help, but I don’t feel I understand enough C to do so. I am learning C++ though.

The C Data Interface implementation is part of the C++ library and enabled by default. You simply need to include arrow/c/bridge.h to export and import data from/to Arrow C++ using the C Data Interface.

Also, you can find some useful C helpers in arrow/c/helpers.h.

(the Arrow C/Glib binding is unrelated, it’s basically a GObject wrapper around Arrow C++ APIs)

The original question was about writing Parquet files. Is what you mentioned a necessary step towards that goal?

There are lots of different questions in this thread :slight_smile: If you want to use the Parquet C++ library to write Parquet files, you must talk to it in a language it understands. Which means you probably need to give it Arrow data.

@antoine thanks for the pointer! I looked through that now, and here is how I interpreted how this would work. Am I on the right track with that? I’m still not sure whether I understand properly how this is all supposed to work :slight_smile:

We would in Julia allocate arrays and data structures that conform to the C Data interface layout. Right now I think that should be fairly straightforward. So then we have a pointer to one of these C Data interface structures.

Do I then call one of the functions from bridge.h and pass this pointer to the structures we allocated in Julia to it? And then I get something back from those functions that I can for example pass to the parquet C++ writer? And all of that wouldn’t require that a copy of the original array data structures has to be made?

That’s the idea… But the point of the C Data Interface is to be able to expose or ingest Arrow data without taking a dependency on the Arrow C++ library. If you already plan to take a dependency on the Arrow C++ library (for example because you want to use the C++ Parquet implementation), then I’m not sure taking a detour through the C Data Interface is useful. You can just as well construct a regular C++ Array instance around your data (unless you’re extremely uncomfortable with C++, but proficient in C, in which case the C Data Interface may help).

The example given in the spec may point you to the kind of scenarios where the C Data Interface is really useful. Say database engine FooDB wants to expose a C client API that gives out Arrow-compatible data, but without burdening itself with a dependency on Arrow C++ (because other client APIs are available). Then it can expose a C client API that basically gives out a C struct ArrowArray.
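As a rough illustration of the producer side, here is a sketch that fills the two C Data Interface structures for a single non-nullable int32 column without linking against Arrow C++. The struct layouts and the "i" format string follow the spec; `export_int32_column` and the two release callbacks are hypothetical helpers written for this example, and the sketch assumes the `name` string outlives the schema.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

extern "C" {

// Layouts as given in the Arrow C Data Interface specification.
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

// Hypothetical release callback: the schema owns no heap memory in this
// sketch, so it only marks itself as released, as the spec requires.
static void release_schema(struct ArrowSchema* schema) {
  schema->release = nullptr;
}

// Hypothetical release callback: frees the data buffer and the buffer
// list, then marks the array as released.
static void release_int32_array(struct ArrowArray* array) {
  std::free(const_cast<void*>(array->buffers[1]));
  std::free(static_cast<void*>(array->buffers));
  array->release = nullptr;
}

}  // extern "C"

// Hypothetical helper: copies n int32 values into a newly allocated,
// null-free ArrowArray and describes it with a matching ArrowSchema.
void export_int32_column(const int32_t* values, int64_t n, const char* name,
                         ArrowSchema* schema, ArrowArray* array) {
  assert(n >= 0);

  schema->format = "i";  // "i" is the spec's format string for int32
  schema->name = name;   // assumption: caller keeps this string alive
  schema->metadata = nullptr;
  schema->flags = 0;     // not nullable
  schema->n_children = 0;
  schema->children = nullptr;
  schema->dictionary = nullptr;
  schema->release = &release_schema;
  schema->private_data = nullptr;

  const std::size_t bytes = sizeof(int32_t) * static_cast<std::size_t>(n);
  void* data = std::malloc(bytes > 0 ? bytes : 1);
  const void** buffers =
      static_cast<const void**>(std::malloc(2 * sizeof(void*)));
  assert(data != nullptr && buffers != nullptr);  // sketch: no real OOM handling
  if (n > 0) std::memcpy(data, values, bytes);
  buffers[0] = nullptr;  // validity bitmap, omitted because null_count == 0
  buffers[1] = data;     // the int32 values

  array->length = n;
  array->null_count = 0;
  array->offset = 0;
  array->n_buffers = 2;
  array->n_children = 0;
  array->buffers = buffers;
  array->children = nullptr;
  array->dictionary = nullptr;
  array->release = &release_int32_array;
  array->private_data = nullptr;
}
```

A consumer reads the values through `array->buffers[1]` and calls `array->release(array)` and `schema->release(schema)` when done; the producer never needs to know which library is on the other side.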

Ok, I am figuring things out as I go along. It seems the first step is to follow the steps in Building Arrow C++:

git clone https://github.com/apache/arrow.git
cd arrow/cpp
mkdir release
cd release
cmake .. -DARROW_PARQUET=ON
make parquet

and this will build the parquet library used in here

Now I just need to come up with the table in

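#include <memory>                  // std::shared_ptr

#include "arrow/io/file.h"         // arrow::io::FileOutputStream
#include "arrow/memory_pool.h"     // arrow::default_memory_pool
#include "parquet/arrow/writer.h"  // parquet::arrow::WriteTable
#include "parquet/exception.h"     // PARQUET_ASSIGN_OR_THROW, PARQUET_THROW_NOT_OK

{
   // `table` is the arrow::Table that still has to be constructed.
   std::shared_ptr<arrow::io::FileOutputStream> outfile;
   PARQUET_ASSIGN_OR_THROW(
      outfile,
      arrow::io::FileOutputStream::Open("test.parquet"));

   // Write the table in row groups of at most 3 rows.
   PARQUET_THROW_NOT_OK(
      parquet::arrow::WriteTable(table, arrow::default_memory_pool(), outfile, 3));
}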

and turn that into a function so I can use CxxWrap.jl.

Now, it seems like a few things still need to be done.

  1. Write a Julia DataFrame into an Arrow blob structure (potentially leveraging Arrow.jl)
  2. Write the C++ function, wrapped with CxxWrap.jl, that calls the Parquet write function (yet to be written) to write the Arrow blob into a Parquet file

These are notes for me in case I forget, and also for others to let me know if I am more or less on the right track.

Hey @evalparse, I found your “Arrow C++ for the completely clueless” post here, and it helped me get started with reading and writing Parquet files in C++. Exploring further, I wanted to encrypt the Parquet files, so I tried the built-in encryption reader/writer code in the examples, but I am getting an error from the CMake build in cpp/examples/parquet/.

Error:
-- Building using CMake version: 3.10.2
-- Configuring done
-- Generating done
-- Build files have been written to: /home/sandesh/apachearrow/arrow/cpp/examples/parquet/encryptlib
[ 12%] Linking CXX executable parquet-stream-api-example
/usr/bin/ld: cannot find -lparquet_static
collect2: error: ld returned 1 exit status
CMakeFiles/parquet-stream-api-example.dir/build.make:94: recipe for target ‘parquet-stream-api-example’ failed
make[2]: *** [parquet-stream-api-example] Error 1
CMakeFiles/Makefile2:67: recipe for target ‘CMakeFiles/parquet-stream-api-example.dir/all’ failed
make[1]: *** [CMakeFiles/parquet-stream-api-example.dir/all] Error 2
Makefile:83: recipe for target ‘all’ failed
make: *** [all] Error 2

Also, when I compile the encryption code directly and then run it, I get a runtime error saying it was built without OpenSSL, so I think that is directly linked to the CMake build.

Have you gone through this? If so, your guidance would be appreciated.

No idea. Sorry. I ended up writing a Parquet writer in pure Julia.