Attempting to write "Arrow C++ for the completely clueless"

I am trying to write a minimal C++ program that writes a parquet file, and hopefully, eventually, learn enough to be able to use C++ to write parquet files from Julia. I have already started work on that and some notes can be found here.

I am completely clueless about C++, but one book and 10 videos later I seem to be getting somewhere. So more help would be much appreciated!

Here are some of the steps I took to get something going, so I don't forget how I got here. I am not 100% sure whether all of these steps are needed, due to my cluelessness, but here we go.

1. Prepare to install (build?) Thrift

You can follow the guide here. Thrift also seems to require Boost, so I am copying and pasting the code from the link so I don't lose it:

# install a bunch of things needed by Thrift
sudo apt-get install automake bison flex g++ git libboost-all-dev libevent-dev libssl-dev libtool make pkg-config

wget http://ftp.debian.org/debian/pool/main/a/automake-1.15/automake_1.15-3_all.deb
sudo dpkg -i automake_1.15-3_all.deb

wget http://sourceforge.net/projects/boost/files/boost/1.60.0/boost_1_60_0.tar.gz
tar xvf boost_1_60_0.tar.gz
cd boost_1_60_0
./bootstrap.sh
sudo ./b2 install

2. Build Thrift by following this tutorial

Download a copy of Thrift from here.

Untar and uncompress it, e.g. (tar -xzvf thrift.gz), then cd into the extracted directory.

The tutorial says to do the following, but it was NOT ENOUGH for me:

./configure && make

Note: I was using WSL2 and happened to have Node.js installed on Windows as well, which was causing issues, so the below worked better for me:

./configure --without-nodejs --with-boost=/usr/local/<path to boost>
make
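
The standard autotools flow would then install it with the following (I am not 100% sure this step was strictly required for the Arrow build, so take it as a guess):

sudo make install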

3. Compile arrow

First you need to "compile" the Arrow C++ source. I don't know what the technical term is, but it uses cmake and make; see this SO post.

git clone https://github.com/apache/arrow.git
cd arrow/cpp
mkdir release
cd release
cmake .. -DCMAKE_INSTALL_PREFIX=<install_path> -DARROW_PARQUET=ON
make
make install

I think I needed the following because some folder couldn't be written to or something:

sudo make install
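
To sanity-check the install (just my own suggestion, assuming the default layout), the Arrow and Parquet shared libraries should now be under <install_path>/lib:

ls <install_path>/lib | grep -E 'arrow|parquet'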

4. Compile a program

For me, adding

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<install_path>/lib

was necessary to RUN without error (it will compile OK but fail with an error at runtime).

Make a minimal C++ file which is an amalgamation of this example and the writing parquet example code

// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#include <cstdint>
#include <iostream>
#include <vector>

#include <arrow/api.h>
#include "parquet/arrow/writer.h"
#include "parquet/exception.h"
#include "arrow/io/file.h"

using arrow::DoubleBuilder;
using arrow::Int64Builder;
using arrow::ListBuilder;

// While we want to use columnar data structures to build efficient operations, we
// often receive data in a row-wise fashion from other systems. In the following,
// we want to give a brief introduction to the classes provided by Apache Arrow by
// showing how to transform row-wise data into a columnar table.
//
// The data in this example is stored in the following struct:
struct data_row
{
    int64_t id;
    double cost;
    std::vector<double> cost_components;
};

// Transforming a vector of structs into a columnar Table.
//
// The final representation should be an `arrow::Table` which in turn
// is made up of an `arrow::Schema` and a list of
// `arrow::ChunkedArray` instances. As the first step, we will iterate
// over the data and build up the arrays incrementally.  For this
// task, we provide `arrow::ArrayBuilder` classes that help in the
// construction of the final `arrow::Array` instances.
//
// For each type, Arrow has a specially typed builder class. For the primitive
// values `id` and `cost` we can use the respective `arrow::Int64Builder` and
// `arrow::DoubleBuilder`. For the `cost_components` vector, we need to have two
// builders, a top-level `arrow::ListBuilder` that builds the array of offsets and
// a nested `arrow::DoubleBuilder` that constructs the underlying values array that
// is referenced by the offsets in the former array.
arrow::Status VectorToColumnarTable(const std::vector<struct data_row> &rows,
                                    std::shared_ptr<arrow::Table> *table)
{
    // The builders are more efficient using
    // arrow::jemalloc::MemoryPool::default_pool() as this can increase the size of
    // the underlying memory regions in-place. At the moment, arrow::jemalloc is only
    // supported on Unix systems, not Windows.
    arrow::MemoryPool *pool = arrow::default_memory_pool();

    Int64Builder id_builder(pool);
    DoubleBuilder cost_builder(pool);
    ListBuilder components_builder(pool, std::make_shared<DoubleBuilder>(pool));
    // The following builder is owned by components_builder.
    DoubleBuilder &cost_components_builder =
        *(static_cast<DoubleBuilder *>(components_builder.value_builder()));

    // Now we can loop over our existing data and insert it into the builders. The
    // `Append` calls here may fail (e.g. we cannot allocate enough additional memory).
    // Thus we need to check their return values. For more information on these values,
    // check the documentation about `arrow::Status`.
    for (const data_row &row : rows)
    {
        ARROW_RETURN_NOT_OK(id_builder.Append(row.id));
        ARROW_RETURN_NOT_OK(cost_builder.Append(row.cost));

        // Indicate the start of a new list row. This will memorise the current
        // offset in the values builder.
        ARROW_RETURN_NOT_OK(components_builder.Append());
        // Store the actual values. With no validity bitmap passed, the underlying
        // builder treats all added values as valid, i.e. non-null.
        ARROW_RETURN_NOT_OK(cost_components_builder.AppendValues(row.cost_components.data(),
                                                                 row.cost_components.size()));
    }

    // At the end, we finalise the arrays, declare the (type) schema and combine them
    // into a single `arrow::Table`:
    std::shared_ptr<arrow::Array> id_array;
    ARROW_RETURN_NOT_OK(id_builder.Finish(&id_array));
    std::shared_ptr<arrow::Array> cost_array;
    ARROW_RETURN_NOT_OK(cost_builder.Finish(&cost_array));
    // No need to invoke cost_components_builder.Finish because it is implied by
    // the parent builder's Finish invocation.
    std::shared_ptr<arrow::Array> cost_components_array;
    ARROW_RETURN_NOT_OK(components_builder.Finish(&cost_components_array));

    std::vector<std::shared_ptr<arrow::Field>> schema_vector = {
        arrow::field("id", arrow::int64()), arrow::field("cost", arrow::float64()),
        arrow::field("cost_components", arrow::list(arrow::float64()))};

    auto schema = std::make_shared<arrow::Schema>(schema_vector);

    // The final `table` variable is the one we then can pass on to other functions
    // that can consume Apache Arrow memory structures. This object has ownership of
    // all referenced data, thus we don't have to care about undefined references once
    // we leave the scope of the function building the table and its underlying arrays.
    *table = arrow::Table::Make(schema, {id_array, cost_array, cost_components_array});

    return arrow::Status::OK();
}

arrow::Status ColumnarTableToVector(const std::shared_ptr<arrow::Table> &table,
                                    std::vector<struct data_row> *rows)
{
    // To convert an Arrow table back into the same row-wise representation as in the
    // above section, we first will check that the table conforms to our expected
    // schema and then will build up the vector of rows incrementally.
    //
    // For the check if the table is as expected, we can utilise solely its schema.
    std::vector<std::shared_ptr<arrow::Field>> schema_vector = {
        arrow::field("id", arrow::int64()), arrow::field("cost", arrow::float64()),
        arrow::field("cost_components", arrow::list(arrow::float64()))};
    auto expected_schema = std::make_shared<arrow::Schema>(schema_vector);

    if (!expected_schema->Equals(*table->schema()))
    {
        // The table doesn't have the expected schema thus we cannot directly
        // convert it to our target representation.
        return arrow::Status::Invalid("Schemas are not matching!");
    }

    // As we have ensured that the table has the expected structure, we can unpack the
    // underlying arrays. For the primitive columns `id` and `cost` we can use the high
    // level functions to get the values whereas for the nested column
    // `cost_components` we need to access the C-pointer to the data to copy its
    // contents into the resulting `std::vector<double>`. Here we need to be careful to
    // also add the offset to the pointer. This offset is needed to enable zero-copy
    // slicing operations. While this could be adjusted automatically for double
    // arrays, this cannot be done for the accompanying bitmap as often the slicing
    // border would be inside a byte.

    auto ids =
        std::static_pointer_cast<arrow::Int64Array>(table->column(0)->chunk(0));
    auto costs =
        std::static_pointer_cast<arrow::DoubleArray>(table->column(1)->chunk(0));
    auto cost_components =
        std::static_pointer_cast<arrow::ListArray>(table->column(2)->chunk(0));
    auto cost_components_values =
        std::static_pointer_cast<arrow::DoubleArray>(cost_components->values());
    // To enable zero-copy slices, the native values pointer might need to account
    // for this slicing offset. This is not needed for the higher level functions
    // like Value(…) that already account for this offset internally.
    const double *ccv_ptr = cost_components_values->data()->GetValues<double>(1);

    for (int64_t i = 0; i < table->num_rows(); i++)
    {
        // Another simplification in this example is that we assume that there are
        // no null entries, i.e. each row is filled with valid values.
        int64_t id = ids->Value(i);
        double cost = costs->Value(i);
        const double *first = ccv_ptr + cost_components->value_offset(i);
        const double *last = ccv_ptr + cost_components->value_offset(i + 1);
        std::vector<double> components_vec(first, last);
        rows->push_back({id, cost, components_vec});
    }

    return arrow::Status::OK();
}

#define EXIT_ON_FAILURE(expr)                            \
    do                                                   \
    {                                                    \
        arrow::Status status_ = (expr);                  \
        if (!status_.ok())                               \
        {                                                \
            std::cerr << status_.message() << std::endl; \
            return EXIT_FAILURE;                         \
        }                                                \
    } while (0)


int main(int argc, char *argv[]) {
    std::vector<data_row> rows = {
        {1, 1.0, {1.0}}, {2, 2.0, {1.0, 2.0}}, {3, 3.0, {1.0, 2.0, 3.0}}};

    std::shared_ptr<arrow::Table> table;
    EXIT_ON_FAILURE(VectorToColumnarTable(rows, &table));

    // std::vector<data_row> expected_rows;
    // EXIT_ON_FAILURE(ColumnarTableToVector(table, &expected_rows));

    // assert(rows.size() == expected_rows.size());

    std::shared_ptr<arrow::io::FileOutputStream> outfile;

    PARQUET_ASSIGN_OR_THROW(
        outfile,
        arrow::io::FileOutputStream::Open("test.parquet"));

    PARQUET_THROW_NOT_OK(
        parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), outfile, 3));

    return EXIT_SUCCESS;
}

The code just writes out a parquet file, and it succeeded!

And compile it

g++ main.cpp -I<install_path>/include -L<install_path>/lib -lparquet -larrow -o main
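
As an alternative to exporting LD_LIBRARY_PATH (my own suggestion, not from the guides above), you can bake the library path into the binary at link time with an rpath:

g++ main.cpp -I<install_path>/include -L<install_path>/lib -Wl,-rpath,<install_path>/lib -lparquet -larrow -o main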

Yay! Now I can run ./main and generate a parquet file!
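
To double-check that the file is actually readable, here is a minimal read-back sketch based on the official parquet-arrow reader-writer example. I have not tested this exact form, so treat it as a starting point:

#include <iostream>
#include <memory>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>
#include <parquet/exception.h>

int main()
{
    // Open the file written by the program above.
    std::shared_ptr<arrow::io::ReadableFile> infile;
    PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open("test.parquet"));

    // Read the whole file back into an arrow::Table.
    std::unique_ptr<parquet::arrow::FileReader> reader;
    PARQUET_THROW_NOT_OK(
        parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
    std::shared_ptr<arrow::Table> table;
    PARQUET_THROW_NOT_OK(reader->ReadTable(&table));

    std::cout << "Read " << table->num_rows() << " rows and "
              << table->num_columns() << " columns" << std::endl;
    return 0;
}

Compile it the same way as main.cpp, just with a different output name.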

Next Steps

I still feel like there's a lot to do before I can achieve a Julia parquet write via C++, but I think the missing steps are:

  1. Make a Docker image to reproduce the above. Technically this is not needed, but I want to produce it to make sure I understand the process
  2. Write a C++ program that accepts array blobs and writes out a parquet file (see the sketch after this list)
  3. Wrap the C++ program with CxxWrap.jl
  4. Write a Julia package to use the above (potentially leveraging Arrow.jl)
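
For step 2, here is a rough, untested sketch of what I have in mind. The function name write_parquet and its signature are made up by me; raw pointers plus a length are roughly what would arrive from Julia via ccall or CxxWrap.jl:

#include <cstdint>
#include <memory>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/writer.h>
#include <parquet/exception.h>

// Hypothetical entry point: takes raw C buffers and writes a
// two-column parquet file. Returns 0 on success, 1 on any failure.
extern "C" int write_parquet(const int64_t *ids, const double *costs,
                             int64_t length, const char *path)
{
    try
    {
        // Copy the raw buffers into Arrow arrays via builders.
        arrow::Int64Builder id_builder;
        arrow::DoubleBuilder cost_builder;
        PARQUET_THROW_NOT_OK(id_builder.AppendValues(ids, length));
        PARQUET_THROW_NOT_OK(cost_builder.AppendValues(costs, length));

        std::shared_ptr<arrow::Array> id_array, cost_array;
        PARQUET_THROW_NOT_OK(id_builder.Finish(&id_array));
        PARQUET_THROW_NOT_OK(cost_builder.Finish(&cost_array));

        // Assemble the schema and table, then write them out.
        auto schema = arrow::schema({arrow::field("id", arrow::int64()),
                                     arrow::field("cost", arrow::float64())});
        auto table = arrow::Table::Make(schema, {id_array, cost_array});

        std::shared_ptr<arrow::io::FileOutputStream> outfile;
        PARQUET_ASSIGN_OR_THROW(outfile, arrow::io::FileOutputStream::Open(path));
        PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(
            *table, arrow::default_memory_pool(), outfile, length));
        return 0;
    }
    catch (...)
    {
        // Don't let C++ exceptions escape across the C boundary.
        return 1;
    }
}

The idea is that Julia would then call this function with pointers obtained from Julia arrays.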

Any help offered would be much appreciated!

@davidanthoff

@evalparse it seems like this discussion would be better to have on the Arrow channels (dev@arrow.apache.org in particular) where more Arrow developers can help with your questions.

I wouldn’t recommend building Thrift yourself unless you have exhausted all of your other options somehow. Many Linux package managers have a new enough Thrift that there is no need to build it yourself (for example, apt install libthrift-dev on Debian). The Arrow C++ build system will also build it for you (pass -DThrift_SOURCE=BUNDLED to cmake. Note that Boost and all other build dependencies can be built this way). The requisite Thrift symbols are statically linked in libparquet.so so there is no runtime dependency on libthrift.so.
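
Concretely, the step-3 cmake invocation above would become something like:

cmake .. -DCMAKE_INSTALL_PREFIX=<install_path> -DARROW_PARQUET=ON -DThrift_SOURCE=BUNDLED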

If you have more questions about the Arrow C++ project in particular we’ll be happy to help on the Arrow mailing lists.


I see. I couldn't find instructions on installing Thrift that way, or I may have missed them in the Arrow docs.

I think I tried that option, but it gave errors. I am totally new to C++ (3 days in), so I will take my questions to the Arrow mailing list from now on. Thanks!