One of the perennial “thorns in my side” in the long-term maintenance of data analytics code is benchmarking and continuous performance monitoring.
The basic idea of Continuous Benchmarking (henceforth CB, à la CI and CD) is the following:
- The developer builds a “benchmark suite” containing a large number of benchmarks, which may measure microperformance (code paths taking as little as single-digit microseconds) or macroperformance (operations taking seconds or minutes)
- For each commit to the codebase, the benchmarks are run in a controlled, consistent environment (this is very important) and the results are recorded in some kind of database, e.g. SQLite (see the sketch after this list)
- For each benchmark, you can observe the performance within that controlled environment over time.
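To make this concrete, here is a minimal sketch of the run-and-record step in Python with SQLite. The table layout and the `run_benchmark` helper are my own illustrative inventions, not the schema of any existing tool:

```python
import sqlite3
import time

# Illustrative schema: one row per (commit, benchmark) result
conn = sqlite3.connect("benchmarks.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS results (
    commit_hash TEXT,
    benchmark   TEXT,
    duration_s  REAL,                            -- mean wall-clock time per run
    recorded_at TEXT DEFAULT CURRENT_TIMESTAMP
)
""")

def run_benchmark(func, n_iterations=100):
    # Time many iterations so microsecond-scale codepaths are measurable
    start = time.perf_counter()
    for _ in range(n_iterations):
        func()
    return (time.perf_counter() - start) / n_iterations

def record(commit_hash, name, func):
    conn.execute(
        "INSERT INTO results (commit_hash, benchmark, duration_s) VALUES (?, ?, ?)",
        (commit_hash, name, run_benchmark(func)),
    )
    conn.commit()

# Example: record("abc123", "sum_1e6_floats",
#                 lambda: sum(float(i) for i in range(1_000_000)))
```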
The purpose of CB is to identify performance regressions and protect developers’ optimization labors from unintentional slowdowns. For me, there are few things more vexing than discovering that some performance-sensitive data processing you slaved over to make faster has become slower at some point, without knowing why.
For the highest quality results, it is best to run the benchmarks on the same physical machine every time, preferably without pollution from other jobs running on it at the same time. If you try doing CB in public CI services (e.g. Travis CI, AppVeyor, CircleCI), the results will in general be useless – especially for microperformance – due to inconsistency in which processor is used and other issues like load on the bare metal where the VM is running.
There have been various tools created to help with CB.
- Over 8 years ago I created a CB microframework called vbench for use in pandas; it remained in use there until version 0.15.0 or so.
- Other Python developers created a more mature realization of the CB concept with the Airspeed Velocity (ASV) package. ASV is what pandas uses now for benchmarks, and it is capable of generating a static site for exploring the benchmark data.
Critically, tools like vbench and ASV include code that automates benchmark execution and data collection. In other words, for each commit (see the sketch after this list):
- Check out the codebase at a particular git commit
- Rebuild the project (including any C extensions) at that commit
- Run the benchmarks and insert the results into a database (e.g. SQLite is used in ASV and vbench)
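A stripped-down version of that loop might look like the following. Everything here (the `run_benchmarks.py` entry point, the pip-based rebuild, the CSV-ish output format) is a hypothetical stand-in for project-specific machinery:

```python
import subprocess

def benchmark_commit(repo_dir, commit_hash):
    """Checkout-build-benchmark-collect for a single commit (illustrative)."""
    # 1. Check out the codebase at the commit
    subprocess.run(["git", "checkout", commit_hash], cwd=repo_dir, check=True)
    # 2. Rebuild the project, including any C extensions
    #    (assumes a pip-installable Python project; adapt per language)
    subprocess.run(["pip", "install", "-e", "."], cwd=repo_dir, check=True)
    # 3. Run the suite; assume it prints one "name,duration_s" line per benchmark
    proc = subprocess.run(
        ["python", "run_benchmarks.py"],  # hypothetical entry point
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    # 4. Parse results for insertion into the database
    return [
        (commit_hash, name, float(duration))
        for name, duration in
        (line.split(",") for line in proc.stdout.splitlines())
    ]
```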
I have found when discussing CB that people sometimes hand-wave over the benchmark automation problem. For example, there is the Codespeed Python project, but it does not deal with the mechanics of checkout-build-benchmark-collect-and-store-data.
There have been some CB tools created for other programming languages (I’m interested to hear about the ones I don’t know about!). There are some issues I’d like to discuss, and I’d like to see if there are people interested in collaborating on them.
Unfortunately, most CB tools are language-specific (e.g. just for Python). This means that many common problems (database schema design, data collection, data management, website generation) have to be solved over and over again for each language. This language-specificity became a problem for us in Apache Arrow, where we so far have code in about 11 different programming languages. We have benchmarks written in both C++ and Python, for example, but no tools to collect and manage the C++ benchmark data (while we have set up ASV for the Python benchmarks).
This experience has left me yearning for a non-language-specific CB framework. The idea would be as follows:
- A sufficiently general database schema for storing benchmark data, allowing results from different machines
- Some code to extract machine information (CPU/GPU information, OS / Linux kernel version, relevant installed dependencies); see the sketch after this list
- A “benchmark runner” program providing pluggable build/rebuild logic and pluggable data collectors, with collectors specific to target programming languages or benchmarking libraries (e.g. Google Benchmark for C++, plus analogous collectors for Python, Go, Java, Rust, etc.)
- A tool to generate a website (static or dynamic) to browse the stored benchmark data
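As a rough sketch of the first two bullets, machine fingerprinting and a machines table could start out like this. The schema is illustrative only, and a real implementation would collect far more detail (CPU model, GPU, dependency versions):

```python
import json
import platform
import sqlite3

def machine_info():
    """A coarse machine fingerprint so results from different hosts can be
    stored side by side and compared apples-to-apples."""
    return {
        "hostname": platform.node(),
        "cpu": platform.processor(),
        "arch": platform.machine(),
        "os": platform.system(),
        "os_release": platform.release(),  # e.g. the Linux kernel version
    }

conn = sqlite3.connect("benchmarks.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS machines (
    machine_id INTEGER PRIMARY KEY,
    info       TEXT   -- JSON blob of the fingerprint above
)
""")
conn.execute("INSERT INTO machines (info) VALUES (?)",
             (json.dumps(machine_info()),))
conn.commit()
```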
There are probably some other nice-to-have features (like a REST API to enable remote benchmarkers to “report” data in to a central server), but this would get things started. We definitely need this in Apache Arrow, but it would make sense to develop the software in a general-purpose fashion so it can be reused in other OSS projects.
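The client side of that reporting API could be as simple as posting the collected results as JSON. The endpoint path and payload shape below are, of course, hypothetical:

```python
import json
import urllib.request

def report_results(server_url, machine, results):
    """POST collected benchmark results to a central server."""
    payload = json.dumps({"machine": machine, "results": results}).encode()
    req = urllib.request.Request(
        server_url + "/api/results",  # hypothetical endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status  # e.g. 200 on success
```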