Pooling efforts on Continuous Benchmarking (CB)

One of the perennial “thorns in my side” in long-term maintenance of data analytics code is that of benchmarking and continuous performance monitoring.

The basic idea of Continuous Benchmarking (I’ll call this CB henceforth, à la CI and CD) is the following:

  • Developer builds a “benchmark suite” containing a large number of benchmarks, which may measure microperformance (codepaths taking as little as single-digit microseconds) or macroperformance (things taking seconds or minutes)
  • For each commit to a codebase, the benchmarks are run in a controlled, consistent environment (this is very important) and the results are recorded in some kind of database (e.g. SQLite)
  • For each benchmark, you can observe the performance within that controlled environment over time.
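The storage side of this loop can be sketched with nothing but the Python standard library. The table layout below is a minimal made-up schema (the column names are my own, not what ASV or vbench actually use):

```python
import sqlite3

# Minimal, made-up schema: one row per (commit, benchmark, machine) result.
conn = sqlite3.connect(":memory:")  # use a file path for a real database
conn.execute("""
    CREATE TABLE IF NOT EXISTS results (
        commit_sha   TEXT NOT NULL,
        benchmark    TEXT NOT NULL,
        machine     TEXT NOT NULL,
        timestamp    TEXT NOT NULL,
        elapsed_sec  REAL NOT NULL
    )
""")

# Record one benchmark result for one commit (values are illustrative).
conn.execute(
    "INSERT INTO results VALUES (?, ?, ?, ?, ?)",
    ("abc1234", "bench_parse_csv", "ci-box-01", "2019-04-01T12:00:00", 0.0042),
)
conn.commit()

# Later: observe one benchmark's history over time on one machine.
rows = conn.execute(
    "SELECT commit_sha, elapsed_sec FROM results "
    "WHERE benchmark = ? AND machine = ? ORDER BY timestamp",
    ("bench_parse_csv", "ci-box-01"),
).fetchall()
```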

The purpose of CB is to identify performance regressions and protect developers’ code optimization labors from unintentional slowdowns. For me, there are few things more vexing than finding that some performance-sensitive data processing that you slaved over to make faster became slower over some period of time without knowing why.

For the highest quality results, it is best to run the benchmarks on the same physical machine every time, preferably without pollution from other jobs running on it at the same time. If you try doing CB in public CI services (e.g. Travis CI, Appveyor, CircleCI) the results will in general be useless – especially for microperformance – due to inconsistency with what processor is used and other issues like load on the bare metal where the VM is running.

There have been various tools created to help with CB.

Critically, tools like vbench and ASV have code that automates the benchmark execution and data collection. In other words, for each commit:

  • Check out the codebase at a particular git commit
  • Rebuild the project (including any C extensions) at that commit
  • Run the benchmarks and insert the results into a database (e.g. SQLite is used in ASV and vbench)
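A minimal sketch of that per-commit loop follows; the `make` build and the `run_benchmarks.py` entry point are placeholders for whatever a real project actually uses:

```python
import subprocess

def commands_for_commit(sha):
    """The per-commit pipeline as a list of commands. The build and
    benchmark invocations are placeholders -- swap in your project's real
    ones (e.g. cmake, cargo, mvn)."""
    return [
        ["git", "checkout", sha],
        ["make", "-j4"],  # rebuild, including any C extensions
        ["python", "run_benchmarks.py", "--json", f"results-{sha}.json"],
    ]

def benchmark_commit(sha, runner=subprocess.run):
    # `runner` is injectable so the pipeline can be dry-run or tested.
    for cmd in commands_for_commit(sha):
        runner(cmd, check=True)
    # ...then parse the results file and insert rows into the database
```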

I have found when discussing CB that people sometimes hand wave over the benchmark automation problem. For example, there is the Codespeed Python project, but it does not deal with the mechanics of checkout-build-benchmark-collect-and-store-data.

There have been some CB tools created for other programming languages (interested to hear about the ones I don’t know about!). There are some issues I’d like to discuss and see if there are people interested in collaborating on.

Critically, most CB tools are language-specific (e.g. just for Python). This means that many common problems (database schema design, data collection, data management, website generation) have to be solved over and over again for each language. This language-specificity arose as a problem for us in Apache Arrow where we have code so far in about 11 different programming languages. We have benchmarks written both in C++ and Python, for example, but no tools to collect and manage the C++ benchmark data (while we have set up ASV for the Python benchmarks).

This experience has left me yearning for a non-language-specific CB framework. The idea would be as follows:

  • A sufficiently general database schema for storing benchmark data, allowing results from different machines
  • Some code to extract machine information (CPU/GPU information, OS / Linux kernel version, relevant installed dependencies)
  • A “benchmark runner” program providing for pluggable build/rebuild logic and pluggable data collectors, with collectors specific to target programming languages or benchmarking libraries (e.g. Google Benchmark for C++; analogous tools for Python, Go, Java, Rust, etc.)
  • A tool to generate a website (static or dynamic) to browse the stored benchmark data
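As a sketch of one such pluggable collector: Google Benchmark can already emit JSON (via `--benchmark_format=json`), so a C++ collector mostly amounts to normalizing that report into language-neutral records. The output record fields below are a made-up common schema, not a standard:

```python
import json

def collect_google_benchmark(report_text):
    """Normalize a Google Benchmark JSON report (as produced with
    --benchmark_format=json) into language-neutral result records."""
    report = json.loads(report_text)
    return [
        {
            "name": b["name"],
            "value": b["real_time"],
            "unit": b.get("time_unit", "ns"),
            "iterations": b["iterations"],
        }
        for b in report["benchmarks"]
    ]

# Abbreviated sample of what the C++ binary emits.
sample = """{
  "context": {"num_cpus": 8},
  "benchmarks": [
    {"name": "BM_TakeInt64", "iterations": 1024,
     "real_time": 812.5, "cpu_time": 810.0, "time_unit": "ns"}
  ]
}"""
records = collect_google_benchmark(sample)
```

A collector for another language would implement the same record shape against that language's native benchmark output.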

There are probably some other nice-to-have features (like a REST API to enable remote benchmark machines to “report in” data to a central server), but this would get things started. We definitely need this in Apache Arrow, but it would make sense to develop the software in a general-purpose fashion so it can be reused in other OSS projects.


One of the things I really love about Go is how benchmarking is a first class citizen.

As for running on consistent hardware, this seems like something ripe for academic institutions. I know DL had benchmarks running regularly on the 120-core servers at SUNY Oswego, benchmarking ForkJoin for the JVM. This problem sounds like a great grad-student thesis project.


Completely agree with the proposal!
Especially +1 for Google Benchmark for C++, and potentially adapting its output to a format ingestible by ASV.


@teju85 Yes, I think it is essential to be able to ingest data from any benchmark framework.

@nickpoorman indeed, it seems that many features of Google’s Benchmark library for C++ (https://github.com/google/benchmark) are built into Go’s testing package https://golang.org/pkg/testing/.

Tooling for CB in Go seems a bit thin; a few things I found:

So it seems like we would need to write a tool to execute a Go project’s benchmarks and ingest the results into a common results DB schema.
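For example, a small (hypothetical) ingestion tool could parse the text that `go test -bench=.` prints, since each result line follows a fixed `name iterations ns/op` shape:

```python
import re

# One line of `go test -bench=.` output looks like:
#   BenchmarkFib10-8   3000000   432 ns/op
# (the -8 suffix is GOMAXPROCS, which we keep as part of the name)
BENCH_LINE = re.compile(r"^(Benchmark\S+)\s+(\d+)\s+([\d.]+) ns/op")

def parse_go_bench(output):
    """Extract benchmark records from `go test -bench` text output,
    skipping non-result lines like 'goos: linux' and 'PASS'."""
    records = []
    for line in output.splitlines():
        m = BENCH_LINE.match(line)
        if m:
            records.append({
                "name": m.group(1),
                "iterations": int(m.group(2)),
                "ns_per_op": float(m.group(3)),
            })
    return records
```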

ASV can be a bit clunky, but it is worth noting that it is not exactly Python-specific.
The tests must be written in Python, which requires either a Python binding or launching a subprocess from a Python script. In either case, for more precise measurements of the code you want to measure, you’re better off having that script return the results via a tracking benchmark than using the wall-time benchmark or the Python-specific benchmark runners.
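For the record, an ASV tracking benchmark is just a method whose name starts with `track_` and whose return value ASV stores over time. A sketch of the subprocess pattern described above, where the child command stands in for a real non-Python benchmark binary:

```python
import json
import subprocess
import sys

class TrackExternalBenchmark:
    """ASV 'tracking benchmark': ASV records the return value of any
    track_* method; `unit` tells ASV how to label it."""
    unit = "seconds"

    def track_cpp_sort(self):
        # Hypothetical: launch an external benchmark process that prints
        # its own timing as JSON, so Python overhead is excluded from the
        # measurement. Here a trivial child process stands in for a real
        # C++ benchmark binary.
        out = subprocess.check_output(
            [sys.executable, "-c",
             "import json; print(json.dumps({'elapsed': 0.0123}))"]
        )
        return json.loads(out)["elapsed"]
```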

@kvngrmn Do you know of any non-Python projects using ASV to monitor computational performance (particularly at the micro-level, things taking < 1 second or < 100ms)?

Not offhand, but that time resolution is not uncommon.

I’m very interested in this topic. I often see cross-project changes that affect the benchmarks or integration tests of other projects. I actually put in a grant request via NumFOCUS to AWS for compute time to run such things for our projects, but haven’t built anything yet.


Our likely path (at least for Arrow benchmarking) will probably be either to fork an existing project written in Python (like ASV) – possibly upstreaming changes if there is sufficient interest – or to start a new code base from scratch. My worry with starting from something like ASV is that there might be too many Python-specific assumptions built into its design. If we’re the first group of OSS devs to make a move on this, we’ll certainly let folks here know and advertise elsewhere (e.g. on Twitter).


What’s the next step here? Some thoughts:

  • A git repo collecting features for a fuller PEP-style proposal
  • Examples showing different methodologies (e.g. coming from the HPC world, I can point to many different attempts; since benchmarking is how vendors get paid, there are a lot of things out there)
  • A wider call for community participation

Sure. I just created a new GitHub org and repository.

Seems that the next step would be a requirements document

I would be interested in helping on any effort that starts here. One common issue with many benchmarking frameworks is that they present results without the necessary statistical analysis to make informed decisions.

For example, often when you are benchmarking you are trying to determine the effect a certain change has on performance. In this case, what you are looking for is the ratio between the two results, with a confidence interval. I like the method proposed by Kalibera and Jones (shorter paper, longer manuscript) for computing confidence intervals of this ratio.

Another key insight in their work is that there are often multiple levels of reproduction you can do and you need some way of figuring out how to best use your time. This is an even bigger problem with optimizing JITs like the JVM, but is also true for AOT compiled programs.

I implemented some subset of the work in the paper a while back that gives a confidence interval for the ratio of the means of two runs of results. It would be nice to make this sort of statistical analysis the default for folks comparing benchmark results.
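This is not the full Kalibera–Jones method, but as an illustration of the kind of analysis: a plain bootstrap confidence interval for the ratio of two runs’ means (the function name and resampling scheme are my own simplification, not from the paper):

```python
import random
from statistics import mean

def ratio_ci(old, new, n_boot=10000, alpha=0.05, seed=0):
    """Bootstrap CI for mean(old) / mean(new), where `old` and `new` are
    lists of timings. A ratio > 1 with a CI excluding 1 suggests `new`
    really is faster. Simplified: resamples raw timings, ignoring the
    multi-level (iteration/invocation/build) structure Kalibera-Jones
    model."""
    rng = random.Random(seed)
    ratios = sorted(
        mean(rng.choices(old, k=len(old))) / mean(rng.choices(new, k=len(new)))
        for _ in range(n_boot)
    )
    lo = ratios[int(alpha / 2 * n_boot)]
    hi = ratios[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

If the interval contains 1.0, the benchmark comparison does not support claiming a speedup or slowdown.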

I’m planning to write up a requirements document for the project this week, with any luck. I’ll post a link to it here and circulate it for comments.


I just drew up this requirements document to help the discussion along.

There are probably some things I missed and some things that are unclear so please leave comments on the PR and I will clarify unclear things or add missing details. Happy to continue discussing other things here also.

I just updated the PR based on a few comments that came in. More comments would be most welcome.

Hi all, thanks so much for your comments on the PR. I committed the draft requirements document after incorporating the feedback. We could obviously continue to discuss and debate requirements, but it seems like what is there is “good enough” to enable a rough implementation sketch (or at least breaking down the project into some distinct “work areas”) so that some code can begin to be written. An agile planning approach will probably be more effective than a waterfall one, as I don’t think we’ll be able to eliminate all uncertainties around details up front. As pieces begin falling into place, continuing to collect detailed feedback from different kinds of end users will help to navigate the implementation work.

As far as timeline, I am not sure. I personally am fully committed through the end of Q2. One or more of my Ursa Labs colleagues may be able to take up some of the initial work with the goal of having a POC for collecting benchmark data from Apache Arrow for a few of our supported programming languages (even just C++ and Python would be a good start).