Would it be fair to say that C++ knowledge is mandatory to create language bindings for parquet?

I think Julia lacks a good parquet reader. I toyed with the idea of trying to help out on the Julia parquet reader but found that there are many hurdles, including not knowing how to use protobuf (or some similar library).

Is knowledge of C++ mandatory for implementing parquet in other languages?

No, it’s not mandatory, but you do need to be able to read and write the Thrift metadata (not sure if there is a Thrift implementation for Julia). Fully (and correctly) implementing the format is a large project, though (speaking from my own experience), so it is likely more expedient to create bindings to the C++ library in Julia rather than building a Julia-native implementation.

I should point out there are a number of technologies needed for a complete Parquet implementation:

  • Serialization (Thrift)
  • Compressors (Snappy, zlib, lz4, zstd; there is also LZO, but it seems to have fallen out of fashion). You can mostly get away with only zlib and Snappy.
  • Encryption (OpenSSL) – though this is new to the format in the last 18 months and not widely used yet
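
To make the Thrift point a bit more concrete: in an unencrypted Parquet file the Thrift-encoded metadata sits in a footer, framed by the 4-byte magic `PAR1` at both ends of the file and a 4-byte little-endian length just before the trailing magic. Below is a minimal sketch of locating that footer in an in-memory file image. The names `FooterLocation` and `locate_footer` are made up for this illustration and do not come from any Parquet library; decoding the footer bytes would still need a Thrift compact-protocol reader.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Where the Thrift-encoded file metadata lives inside the file image.
// (Hypothetical helper type, for illustration only.)
struct FooterLocation {
    std::size_t offset;    // byte offset at which the metadata starts
    std::uint32_t length;  // metadata length in bytes
};

// Returns true and fills *out if the buffer is framed like an unencrypted
// Parquet file: "PAR1" ... metadata, 4-byte little-endian length, "PAR1".
bool locate_footer(const std::vector<std::uint8_t>& file, FooterLocation* out) {
    assert(out != nullptr);
    static const char kMagic[4] = {'P', 'A', 'R', '1'};

    // Smallest possible framing: leading magic + length field + trailing magic.
    if (file.size() < 12) return false;
    if (std::memcmp(file.data(), kMagic, 4) != 0) return false;
    if (std::memcmp(file.data() + file.size() - 4, kMagic, 4) != 0) return false;

    // Decode the length byte by byte so the result does not depend on host endianness.
    const std::uint8_t* p = file.data() + file.size() - 8;
    const std::uint32_t length = static_cast<std::uint32_t>(p[0]) |
                                 (static_cast<std::uint32_t>(p[1]) << 8) |
                                 (static_cast<std::uint32_t>(p[2]) << 16) |
                                 (static_cast<std::uint32_t>(p[3]) << 24);

    // The metadata has to fit between the leading magic and the length field.
    if (length > file.size() - 12) return false;

    out->offset = file.size() - 8 - length;
    out->length = length;
    return true;
}
```

Everything past this point (schema, row groups, column chunk offsets) is inside the Thrift structure, which is why a Thrift implementation is the first hard requirement.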

Julia does have a native parquet reader, see here, so I think all the pieces that are needed should be there as native Julia code. I’m sure one could improve it, of course!

I think the main thing missing right now is a parquet writer for Julia.

I would say it’s not mature. I have tried reading many parquet files that I have saved, and it seems to error quite often.

Open issues that report these problems!

Did that before, IIRC.

I don’t see any issues from you, open or closed, in the Parquet.jl repo :slight_smile:

I see. You are right. I must have gotten confused with Feather.jl with this issue https://github.com/JuliaData/Feather.jl/issues/124

It was from the same dataset saved in parquet in python and R but can’t be read back in Julia. I might submit an issue at some point.

I have opened up 5 issues now, see https://github.com/queryverse/ParquetFiles.jl/issues

Awesome, thanks!

Most of them are due to Parquet.jl currently not being able to handle certain column names. I opened a PR that fixes some of those issues and almost fixes a third problem :wink: I’ll need help with the others from the Parquet.jl author.

Ok. I am mulling whether to take up a contract to create a Parquet writer in Julia. The issue is I have very little C++ experience. Would you say it’s feasible for an experienced Python/R programmer to learn enough C++ to get it going? Which parts of C++ should I focus on learning? Any book recommendations?

I have 40 hours to complete a basic writer. Basically one week of full time work.

If I were you, I would work on creating a native Julia implementation of the brand new Arrow C data interface https://github.com/apache/arrow/blob/master/docs/source/format/CDataInterface.rst. A small amount of wrapper code could then be created to take a Julia-generated blob of Arrow memory and write a Parquet file using the C++ API.
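
To give a feel for what "a blob of Arrow memory" means here, this is a minimal sketch of the producer side for a single non-nullable int64 column. The two struct layouts and the `"l"` format string follow the C Data Interface specification; the helper names (`Int64Holder`, `export_int64_schema`, `export_int64_array`) are hypothetical and exist only for this example. A Julia implementation would fill the same fields from Julia-owned memory.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Struct layouts as given in the Arrow C Data Interface specification.
struct ArrowSchema {
    const char* format;
    const char* name;
    const char* metadata;
    std::int64_t flags;
    std::int64_t n_children;
    ArrowSchema** children;
    ArrowSchema* dictionary;
    void (*release)(ArrowSchema*);
    void* private_data;
};

struct ArrowArray {
    std::int64_t length;
    std::int64_t null_count;
    std::int64_t offset;
    std::int64_t n_buffers;
    std::int64_t n_children;
    const void** buffers;
    ArrowArray** children;
    ArrowArray* dictionary;
    void (*release)(ArrowArray*);
    void* private_data;
};

namespace {

// Keeps the exported memory alive until the consumer calls release().
struct Int64Holder {
    std::vector<std::int64_t> values;
    const void* buffers[2];  // [0] validity bitmap (absent here), [1] values
};

void release_schema(ArrowSchema* schema) {
    // Only static strings were exported, so there is nothing to free.
    schema->release = nullptr;  // a null callback marks the struct as released
}

void release_array(ArrowArray* array) {
    delete static_cast<Int64Holder*>(array->private_data);
    array->release = nullptr;
}

}  // namespace

// Describes a non-nullable int64 column. `name` must outlive the schema
// (for example a string literal), since the pointer is stored as-is.
void export_int64_schema(const char* name, ArrowSchema* out) {
    assert(out != nullptr);
    out->format = "l";  // "l" is the format string for int64
    out->name = name;
    out->metadata = nullptr;
    out->flags = 0;  // ARROW_FLAG_NULLABLE not set
    out->n_children = 0;
    out->children = nullptr;
    out->dictionary = nullptr;
    out->release = &release_schema;
    out->private_data = nullptr;
}

// Moves `values` into a heap-allocated holder and exposes it as an ArrowArray.
void export_int64_array(std::vector<std::int64_t> values, ArrowArray* out) {
    assert(out != nullptr);
    Int64Holder* holder = new Int64Holder();
    holder->values = std::move(values);
    holder->buffers[0] = nullptr;  // no validity bitmap because null_count == 0
    holder->buffers[1] = holder->values.data();

    out->length = static_cast<std::int64_t>(holder->values.size());
    out->null_count = 0;
    out->offset = 0;
    out->n_buffers = 2;
    out->n_children = 0;
    out->buffers = holder->buffers;
    out->children = nullptr;
    out->dictionary = nullptr;
    out->release = &release_array;
    out->private_data = holder;
}
```

The key design point is the `release` callback: whoever produced the memory also decides how it is freed, so the consumer (the C++ wrapper) never needs to know that the data came from Julia.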

If instead you want to use the low-level Parquet write API, you can certainly do that. The main files you need to work with are src/parquet/file_writer.h and src/parquet/column_writer.h. I’m not sure how long it would take to learn enough C++ to work with these APIs since everyone is different. If you have experience with C programming in general and its value types (pointers, references, values) then the jump to C++ is easier since mostly you need to learn about how C++ classes work (constructors, destructors, etc.) and some C++11-specific features like std::move.
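
As a rough idea of the C++ features mentioned above, here is a small sketch of a class with a constructor, a destructor, and move semantics. The `Buffer` class and the `consume` function are invented for this illustration and are not part of the Parquet C++ API; they only show the pattern (an object that owns memory and hands ownership over via `std::move`) that you will run into when working with such APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <utility>

// A move-only owner of a block of raw memory (hypothetical example class).
class Buffer {
public:
    // Constructor: acquires the resource.
    explicit Buffer(std::size_t size)
        : data_(static_cast<unsigned char*>(std::malloc(size))), size_(size) {
        assert(data_ != nullptr || size == 0);
    }

    // Destructor: releases the resource automatically at end of scope.
    ~Buffer() { std::free(data_); }

    // Copying is disabled so two objects never free the same memory.
    Buffer(const Buffer&) = delete;
    Buffer& operator=(const Buffer&) = delete;

    // Move constructor: steals the pointer and leaves the source empty.
    Buffer(Buffer&& other) noexcept : data_(other.data_), size_(other.size_) {
        other.data_ = nullptr;
        other.size_ = 0;
    }

    // Move assignment: frees what we hold, then steals from the source.
    Buffer& operator=(Buffer&& other) noexcept {
        if (this != &other) {
            std::free(data_);
            data_ = other.data_;
            size_ = other.size_;
            other.data_ = nullptr;
            other.size_ = 0;
        }
        return *this;
    }

    unsigned char* data() { return data_; }
    std::size_t size() const { return size_; }

private:
    unsigned char* data_;
    std::size_t size_;
};

// Takes the buffer by value, so the caller has to hand over ownership
// with std::move; the memory is freed when `buffer` goes out of scope.
std::size_t consume(Buffer buffer) {
    return buffer.size();
}
```

Calling `consume(std::move(b))` leaves `b` empty but still safe to destroy, which is roughly the mental model needed for most modern C++ library interfaces.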

Aside from C++ language features, there might be some initial shock from working with C++ build systems and packaging libraries for installation.

On the bright side, C++ as a language isn’t going anywhere so the investment of time would pay dividends longer-term.

Thank you very much for the advice!

I read the C Data Interface spec. I only did one semester of C more than 10 years ago, but I think given some time I can learn it.

I see there are two formats, one for the schema and one for the data.

Here is my understanding of the general approach: say I allocate two memory blobs, and in one I write the schema and in the other I write the data. Then I can call a C++ function (which is supposedly easy to do in Julia) that is pointed at those two blobs and writes out a parquet file?

Is this the page about writing parquet files from C++ that I should be checking? Currently it only has a TODO saying that it still needs to be written.


This new C data interface looks very promising!

Is there some more documentation about it somewhere?

Would we then for example compile Arrow C++ and the C interface (the glib stuff, I guess?) into a shared library, and then from Julia load those shared libraries, and then just call some C function a la write_parquet_file(filename, pointer_to_c_data_interface_structure) in these shared libraries? If so, that would be fantastic!

I think the info I couldn’t find so far is mostly which functions from the arrow libraries accept these pointers.

I think the next step for the Julia community is to try to cross-compile the whole arrow stuff using https://binarybuilder.org/.

@davidanthoff So I take it you have the technical understanding here to know how to approach this? Basically, given my limited background in R, Julia, and Python, I don’t really know how to take this further.

Yes, I think I roughly know how to push this forward. I’ll play around with building the arrow stuff with Julia binary builder next, that is clearly the first step we need to take.

Ideally I’d like to help, but I don’t feel I understand enough C to do so. I am learning C++ though.

The C Data Interface implementation is part of the C++ library and enabled by default. You simply need to include arrow/c/bridge.h to export and import data from/to Arrow C++ using the C Data Interface.

Also, you can find some useful C helpers in arrow/c/helpers.h.

(the Arrow C/Glib binding is unrelated, it’s basically a GObject wrapper around Arrow C++ APIs)
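
For the shared-library idea discussed in this thread, the C-callable entry point could have roughly the following shape. This is only a sketch under stated assumptions: `consume_int64_column` is a hypothetical name, it handles just a non-nullable int64 column, and it sums the values as a stand-in for real work so that the example stays self-contained. A real wrapper would instead call the import functions declared in arrow/c/bridge.h at the marked spot and pass the result on to the Parquet writer.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Struct layouts as given in the Arrow C Data Interface specification.
struct ArrowSchema {
    const char* format;
    const char* name;
    const char* metadata;
    std::int64_t flags;
    std::int64_t n_children;
    ArrowSchema** children;
    ArrowSchema* dictionary;
    void (*release)(ArrowSchema*);
    void* private_data;
};

struct ArrowArray {
    std::int64_t length;
    std::int64_t null_count;
    std::int64_t offset;
    std::int64_t n_buffers;
    std::int64_t n_children;
    const void** buffers;
    ArrowArray** children;
    ArrowArray* dictionary;
    void (*release)(ArrowArray*);
    void* private_data;
};

// Hypothetical entry point with C linkage, so it can be called from another
// language through its C foreign-function interface.
// Takes ownership of both structs. Returns the number of rows consumed,
// or -1 if the input was already released or is not a plain int64 column.
extern "C" std::int64_t consume_int64_column(ArrowSchema* schema,
                                             ArrowArray* array,
                                             std::int64_t* sum_out) {
    assert(schema != nullptr && array != nullptr && sum_out != nullptr);

    // A null release callback means the producer's data is already gone.
    if (schema->release == nullptr || array->release == nullptr) return -1;

    std::int64_t rows = -1;
    if (std::strcmp(schema->format, "l") == 0 && array->n_buffers == 2 &&
        array->null_count == 0) {
        // buffers[1] holds the values; honor the logical offset.
        const std::int64_t* values =
            static_cast<const std::int64_t*>(array->buffers[1]) + array->offset;

        // Stand-in for the real work (import via arrow/c/bridge.h, then write).
        std::int64_t sum = 0;
        for (std::int64_t i = 0; i < array->length; ++i) sum += values[i];
        *sum_out = sum;
        rows = array->length;
    }

    // The consumer is done, so let the producer free its memory.
    array->release(array);
    schema->release(schema);
    return rows;
}
```

Because the function owns the structs once it is called, the calling side only has to keep its memory alive until its own release callback fires.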

The original question was about writing parquet files. Is what you mentioned a necessary step towards that goal?