Go - Apache Arrow Parquet

Hey everyone,

I’m currently in the process of porting the C++ implementation of Parquet in the Apache Arrow project to Go. Many projects and companies have built, and are building, their data lakes and persistence layers on Parquet. Apache Spark uses it heavily for persistence (including Databricks Delta Lake).

To me, this is the missing component for people to truly begin using the Go implementation of Arrow within existing data architectures.

If you have any interest in this project, give this post a like / bookmark it as it will keep me motivated to finish the port. Also, if you have specific use cases feel free to drop them in here so I can keep them in mind as I continue with the port.

Things in the code base are rather in flux at the moment as I figure out how to resolve various differences between the features of C++ and Go. As soon as I have a solid chunk of the port working, I’ll create a PR in the Apache Arrow project on GitHub and let everyone know here.

Hi @nickpoorman, sounds like a great effort. I would definitely share the project with the Apache Arrow community when you have a chance as there are likely other Arrow Go developers who would be interested in this. When you mentioned it, it was the first I’d heard of someone working on it, so letting people know (even by opening a JIRA issue) reduces the chance of people duplicating work.


@wesm I created the issue in JIRA: https://issues.apache.org/jira/browse/ARROW-7905

I only recently took up the effort. Previously, I was using another Parquet library in Go, but I came to the conclusion that it would be better to align with the Arrow version of Parquet, so I figured this was a good time to start porting the C++ version over to Go.

@nickpoorman have you seen this? https://github.com/xitongsys/parquet-go

also: Sebastien Binet (gonum core dev) has a guide on Arrow: https://blog.gopheracademy.com/advent-2018/go-arrow/

@chewxy That’s the library I was using. Due to how its API is designed, I ended up having to manually write much of the mapping between native Go types and Arrow types. When I got to Parquet logical types, they were nonexistent in that library. Also, the parquet.thrift maintained in the Arrow project was out of sync with the one in the parquet-go library. As more and more edge cases cropped up, I decided it was easier to port everything over, which would also give me more control over the integration between Parquet and Arrow.

Reading Sebastien’s guide is what actually lit the fire in me to pursue data engineering and data science in Go about a year ago. I’ve been following his work ever since.
