Announcing Remote Build Execution

Alex Eagle

Remote Build Execution (RBE) is a technique for off-loading computation of a wide build and test graph to a farm of worker machines. It can vastly speed up development when changes affect a large subgraph of a monorepo.

Aspect is pleased to announce that this is now a supported feature of our Workflows platform on both GCP and AWS, in use on our OSS rulesets and rolling out to users today!

What is Remote Build Execution (RBE)?

Bazel has a built-in scheduler. It tries to parallelize build steps as much as possible given the estimated available resources. For example, on an eight-core machine, it might run eight different test actions concurrently, if it determines other resources can allow it (for example, available system memory).
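
To make the scheduling idea concrete, here is a minimal sketch in Python. It is not Bazel's actual scheduler; the `Action` type, the `admit` function, and the resource numbers are all hypothetical, and it only assumes that each action carries a CPU and memory estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    cpus: int        # estimated CPU cores the action needs (hypothetical)
    memory_mb: int   # estimated memory the action needs (hypothetical)


def admit(queue, free_cpus, free_memory_mb):
    """Split queued actions into (running, still_queued).

    An action starts only if both its CPU and memory estimates fit in what
    is still free; otherwise it waits for a later scheduling pass.
    """
    running, waiting = [], []
    for action in queue:
        if action.cpus <= free_cpus and action.memory_mb <= free_memory_mb:
            free_cpus -= action.cpus
            free_memory_mb -= action.memory_mb
            running.append(action)
        else:
            waiting.append(action)
    return running, waiting


# Ten hypothetical single-core tests on an eight-core, 16 GB machine.
tests = [Action(f"test_{i}", cpus=1, memory_mb=1024) for i in range(10)]
running, waiting = admit(tests, free_cpus=8, free_memory_mb=16 * 1024)
print(len(running), "running,", len(waiting), "queued")
```

The two tests left in the queue are exactly the work that a remote farm could pick up instead of waiting.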

When resources on the “local” machine are exhausted, Bazel queues actions that otherwise might have been able to run immediately. Remote Build Execution allows additional resources on other computers to be added to the build. Instead of queuing, Bazel then uses a remote API to send RPCs to that “farm” of computers. The inputs are identified (and uploaded, if needed), then an RPC schedules the action to run remotely.
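
The "identify, upload if needed, then execute" flow can be sketched with an in-memory stand-in for the farm. The real protocol is the gRPC-based Remote Execution API, which has calls such as `FindMissingBlobs` and `Execute`; the `FakeExecutionFarm` class and its method names below are hypothetical simplifications, not that API.

```python
import hashlib


class FakeExecutionFarm:
    """In-memory stand-in for a remote cache plus execution service."""

    def __init__(self):
        self.blobs = {}  # digest -> bytes already stored remotely

    def find_missing(self, digests):
        return [d for d in digests if d not in self.blobs]

    def upload(self, digest, data):
        self.blobs[digest] = data

    def execute(self, command, input_digests):
        missing = self.find_missing(input_digests)
        if missing:
            raise RuntimeError(f"inputs not uploaded: {missing}")
        return f"ran {command!r} with {len(input_digests)} inputs"


def digest_of(data):
    return hashlib.sha256(data).hexdigest()


def run_remotely(farm, command, inputs):
    """Identify inputs by digest, upload only what is missing, then execute."""
    by_digest = {digest_of(data): data for data in inputs.values()}
    missing = farm.find_missing(list(by_digest))
    for digest in missing:
        farm.upload(digest, by_digest[digest])
    return farm.execute(command, list(by_digest)), len(missing)


farm = FakeExecutionFarm()
sources = {"a.cc": b"int a();", "b.cc": b"int b();"}
_, uploaded_first = run_remotely(farm, "compile", sources)
sources["b.cc"] = b"int b() { return 1; }"  # only one file changes
_, uploaded_second = run_remotely(farm, "compile", sources)
print(uploaded_first, uploaded_second)
```

Because inputs are addressed by content digest, the second run uploads only the file that changed.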

In some cases this makes the overall build faster. In this post we’ll discuss which cases those are, so you can decide if RBE is right for your organization.

Remote Build Execution is misunderstood

Aspect’s competitors have offered Remote Build Execution from the beginning. Their pricing is based on usage (either on the upper bound on the number of executors or on Cloud Compute resource consumption). Perhaps due to this perverse incentive, they have positioned RBE as the way to accelerate build and test, and they encourage every user to adopt it. From reading their website, an engineer can reasonably come away with the false impression that RBE is the first step to speeding up a slow build. However, this naive view usually results in much higher costs. You may have heard me in a conference talk describe RBE as “the performance optimization of last resort”.

Our first goal is “Minimal Execution”, which is where a build is incremental thanks to a very high cache hit rate. Of course “Minimal Execution” is faster and cheaper than either Local or Remote execution. That’s why Aspect doesn’t have a usage-based pricing model! Later in this article, I’ll dig more into the reasons that a small-to-medium sized codebase and team might not get enough benefits from RBE to make it worthwhile. For now, just be aware that the highest-order factor for ANY Bazel project is the remote cache: it is how highly incremental builds skip work, and it is the bottleneck when Bazel looks up cache hits to avoid re-work.

Aspect’s largest users like Airtable do benefit from Remote Build Execution. These companies have hundreds of engineers working on a codebase with millions of lines of code. Even after minimizing the amount of execution, their build graph shape still lends itself to wide parallelism on a typical product engineer’s change.

About Aspect’s RBE

Aspect has open-source in our DNA. So it was obvious from the beginning that we’d build on excellent, well-maintained and battle-tested Remote Cache and Execution software. We chose Buildbarn! It powers hundreds of developers at Apple and has a strong Slack community. (Apple’s open-source office doesn’t like to publicize their projects, so it’s not obvious that Buildbarn is built there!)

Our Buildbarn remote cache deployments have been live at every Workflows customer for the last 18 months. The cache is highly available and scalable, and has been rock-solid. So adding RBE was “just” a matter of enabling another Buildbarn component.

In practice, it was not that easy. Although Buildbarn has an “example deployments” repository, there’s a lot of missing documentation. Users are forced to read comments on the protocol buffer definitions to understand a lot of the fields. If you decide to deploy Buildbarn yourself, and run into problems, we recommend our partner Meroton for professional services.

Moving execution off of Bazel’s host machine has the architectural benefit of separation of concerns. You can choose compute instance types to match the profile of your Bazel actions, rather than needing to bend the actions to fit the available resources on the machine where Bazel runs. Separating the environment where the CI system executes steps from the environment where actions execute also adds security and maintainability benefits. This is especially true when the remote cache and execution are accessed from developer machines.

As a special case of this separation, some builds need to execute actions on various hardware platforms. Aspect has several autonomous driving customers with custom circuit boards. Remote execution allows Bazel to run on a standard and inexpensive instance type like AWS Graviton, while some test logic can execute on a runner process on the board. Bazel itself understands execution platforms, and Buildbarn allows workers on multiple platforms within a single cluster. Aspect Workflows includes the configuration needed to connect Bazel and Buildbarn ensuring that each action runs on the right hardware and operating system.
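
As a rough illustration of routing actions to the right hardware, the sketch below matches an action's required platform properties against a pool of workers. The worker names, property keys, and the subset-matching rule are all assumptions made for illustration; this is not Buildbarn's or Bazel's actual matching logic, and real schedulers may require an exact match.

```python
def matches(worker_properties, required_properties):
    """A worker qualifies when it provides every property the action requires."""
    return all(
        worker_properties.get(key) == value
        for key, value in required_properties.items()
    )


def route(action_platform, workers):
    """Return names of workers able to run an action for the given platform."""
    return [name for name, props in workers.items() if matches(props, action_platform)]


# Hypothetical worker pool; the property names are illustrative only.
WORKERS = {
    "graviton-1": {"os": "linux", "arch": "arm64"},
    "x86-1": {"os": "linux", "arch": "x86_64"},
    "board-1": {"os": "linux", "arch": "arm64", "device": "custom-board"},
}

on_board = route({"os": "linux", "arch": "arm64", "device": "custom-board"}, WORKERS)
on_x86 = route({"os": "linux", "arch": "x86_64"}, WORKERS)
print(on_board, on_x86)
```

Here the hardware test lands only on the board runner, while an ordinary x86 compile never touches it.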

Customers may wish to distribute a "platform-compatible" workload over a variety of compute options, with various scaling parameters and compute specificity. Buildbarn accounts for this with a concept called "size classes". With no effort required by the user, Buildbarn may schedule a single action across the gamut of available size classes, and it uses the outcomes of those runs to schedule that action in the future on the least expensive size class where it is likely to succeed. This still delivers fast results, while optimizing the workload over time to reduce costs.
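
The "least expensive size class where it is likely to succeed" idea can be sketched as follows. This is a deliberately simplified model, not Buildbarn's actual algorithm; the `SizeClass` type, the relative costs, the recorded history, and the 90% threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SizeClass:
    name: str
    cost_per_minute: float  # hypothetical relative price
    successes: int = 0
    attempts: int = 0

    def success_rate(self):
        # With no history, report 0.0 so an untried class is not trusted yet.
        return self.successes / self.attempts if self.attempts else 0.0


def pick_size_class(size_classes, min_success_rate=0.9):
    """Return the cheapest size class whose observed success rate is high enough.

    Falls back to the most expensive class when none qualifies.
    """
    by_cost = sorted(size_classes, key=lambda s: s.cost_per_minute)
    for size_class in by_cost:
        if size_class.success_rate() >= min_success_rate:
            return size_class
    return by_cost[-1]


# Hypothetical history for one action across three size classes.
history = [
    SizeClass("small", 1.0, successes=2, attempts=10),   # often times out
    SizeClass("medium", 2.0, successes=19, attempts=20),
    SizeClass("large", 4.0, successes=20, attempts=20),
]
chosen = pick_size_class(history)
print(chosen.name)
```

With this history the action settles on the middle size class: the small one fails too often, and the large one costs more without a meaningful gain in reliability.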

RBE also serves as a stop-gap when a build is non-incremental, typically when the Remote Cache is unavailable or has been intentionally cleared as part of an infrastructure rollout.

Watch for a formal Case Study on Aspect Workflows RBE at some of our customers, coming soon!

When to consider Local rather than RBE

As I've said at conference talks, we consider RBE to be the "performance optimization of last resort". We believe strongly that reducing execution is the first step, to minimize cloud compute costs.

Here are other factors that make Bazel’s Local execution strategy a better choice than RBE for small-to-medium sized repositories:

  1. When typical developer activity results in invalidating many expensive test actions, compute costs will be high (regardless of whether execution is local or remote). If your organization’s budget doesn’t allow for re-computing all these tests, you may want to run them less frequently (e.g. only after merge, or nightly). We often see this in robotics or autonomous driving, where physics simulations run as part of a test. This “test selection” strategy is in contrast to the typical Bazel idiom which is to “test everything affected”.

  2. RBE requires strictness in configuring Bazel’s toolchains and hermeticity to account for the “host” and “execution” platforms differing. This can be a big task, and gets harder as you work through the “long tail” of atypical build actions. As our competitor writes in one of their case studies:

     “Migrating a Bazel project to remote execution can be a daunting task”

  3. A wide cache miss is often caused by an infra engineer who changed an SDK version, or some other configuration change that invalidates a lot of the graph. In that use case, the engineer probably doesn't expect the same performance as a product engineer making routine changes.

  4. It’s easy to add more Local resources. By provisioning a larger machine for Bazel, the built-in scheduling will parallelize build actions over the available resources such as multiple CPU cores. By occupying such a machine for short time periods, we can avoid the high cost implications of these expensive instance types.

  5. Your build graph may not be amenable to parallelization. An action cannot run until its inputs are available, so when a build is slow due to the “critical path” of such serialized actions, adding more compute resources doesn’t make the build faster.
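
To see why the critical path caps the benefit of extra workers, here is a small sketch that compares the critical path of a build graph to its total work. The graph, action names, and durations are hypothetical; the code only assumes each action has a duration and a list of dependencies.

```python
from functools import lru_cache

# Hypothetical build graph: action -> (duration in seconds, dependencies).
GRAPH = {
    "gen_proto": (10, []),
    "compile_lib": (30, ["gen_proto"]),
    "compile_app": (20, ["compile_lib"]),
    "unit_test": (15, ["compile_lib"]),
    "integration_test": (40, ["compile_app"]),
}


@lru_cache(maxsize=None)
def finish_time(action):
    """Earliest finish time of an action given unlimited workers."""
    duration, deps = GRAPH[action]
    return duration + max((finish_time(dep) for dep in deps), default=0)


# Longest dependency chain: the build cannot finish faster than this.
critical_path = max(finish_time(action) for action in GRAPH)
# Sum of all durations: the build time on a single worker.
total_work = sum(duration for duration, _ in GRAPH.values())
print(critical_path, total_work)
```

In this example a single worker needs 115 seconds, and an unlimited farm still needs 100, so remote execution would buy very little. A wide graph with a short critical path is where the farm pays off.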

Try Aspect Workflows

Try Aspect's RBE solution by requesting a free trial of Aspect Workflows. Our engineering team will deploy an instance to try out with your real workload, and gather the data to help decide whether RBE is a net benefit for your team!
