Our first design for test parallelization failed

by Dan Manges

Two weeks ago, we announced the first public release of ABQ, a universal test runner designed to split test suites across parallel jobs in CI. We learned a lot about how to build a tool like this, because our first approach to building ABQ failed.

Background

The core idea behind ABQ is that queues are typically the best mechanism for distributing workloads. Without ABQ, most CI workflows pre-partition tests for parallel execution by splitting the test files into distinct groups.

However, queues are a better approach for distributing tests among multiple CI jobs for the same reason they’re effective at distributing other work: workers pull from the queue as soon as they finish their previous test, so all of the processes finish around the same time. With pre-partitioning, one unlucky partition can keep running long after every other process has finished.
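
To make that concrete, here’s a minimal simulation (with hypothetical test durations; this isn’t ABQ’s code): eight tests, one of them slow, split across two workers both ways. Pulling the next test whenever a worker goes idle is equivalent to always handing the test to the least-loaded worker.

```go
// Hypothetical simulation, not ABQ's code. Compares wall-clock time for
// pre-partitioned vs queue-based test distribution across two workers.
package main

import "fmt"

func main() {
	// Hypothetical per-test durations in seconds: one slow test, seven fast.
	tests := []int{30, 5, 5, 5, 5, 5, 5, 5}
	const workers = 2

	// Pre-partitioning: assign tests round-robin to fixed groups up front.
	partitioned := make([]int, workers)
	for i, d := range tests {
		partitioned[i%workers] += d
	}

	// Queue-based: a worker pulls the next test as soon as it goes idle,
	// which is equivalent to assigning each test to the least-loaded worker.
	queued := make([]int, workers)
	for _, d := range tests {
		least := 0
		for w := range queued {
			if queued[w] < queued[least] {
				least = w
			}
		}
		queued[least] += d
	}

	// Wall-clock time for the suite is the busiest worker's total.
	fmt.Println("pre-partitioned worker totals:", partitioned) // [45 20] -> 45s
	fmt.Println("queue-based worker totals:   ", queued)       // [35 30] -> 35s
}
```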

Initial Approach

Our first approach was exactly what most engineers would expect for queue-based test distribution (a sketch follows the list):

  • n workers would all spawn and connect to the queue
  • we configured worker 0 to be responsible for the test manifest – when it spawned, it would generate the list of tests to run and populate the queue
  • once the manifest was ready, workers would start pulling from the queue
  • workers would push test results back to the queue, and we also configured worker 0 to be responsible for printing the test results
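
Here’s a rough in-process sketch of that protocol. Go channels stand in here for what is really a separate networked queue service, and the test names are made up.

```go
// An in-process sketch of the original protocol; channels stand in for the
// real networked queue. Test names are hypothetical.
package main

import (
	"fmt"
	"sync"
)

func main() {
	const numWorkers = 4
	queue := make(chan string, 16)   // the test manifest queue
	results := make(chan string, 16) // results pushed back by workers

	var wg sync.WaitGroup
	for id := 0; id < numWorkers; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if id == 0 {
				// Worker 0 generates the manifest and populates the queue.
				for _, t := range []string{"test_a", "test_b", "test_c", "test_d"} {
					queue <- t
				}
				close(queue)
			}
			// Every worker (0 included) pulls tests once they're available
			// and pushes results back.
			for test := range queue {
				results <- fmt.Sprintf("%s passed (worker %d)", test, id)
			}
		}(id)
	}

	go func() { wg.Wait(); close(results) }()

	// Printing the aggregated results was also worker 0's job.
	for r := range results {
		fmt.Println(r)
	}
}
```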

For the most part this approach worked well, but we ran into a few challenges significant enough that we had to scrap it completely.

Flaky Tests Cause Major Problems

Due to the possibility of flaky tests, CI jobs need to be retryable. To provide this capability, we initially synchronized the exit status across the cluster. Each worker would wait to exit until the entire test suite was finished so that it knew whether to exit with a zero or non-zero status. That way, engineers could easily click a button to retry all failed jobs in CI.
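
Conceptually, the synchronization looked something like this (a sketch, not ABQ’s actual implementation): every worker reports its own outcome, then blocks until a coordinator broadcasts the suite-wide verdict.

```go
// A conceptual sketch of exit status synchronization, not ABQ's actual
// implementation: every worker waits for the suite-wide verdict so that
// all workers exit with the same status.
package main

import (
	"fmt"
	"sync"
)

func main() {
	// Hypothetical per-worker outcomes: worker 1 had a failing test.
	workerPassed := []bool{true, false, true}

	localResults := make(chan bool)
	verdict := make(chan bool)

	// Coordinator: collect every worker's outcome, then broadcast whether
	// the suite as a whole passed.
	go func() {
		ok := true
		for range workerPassed {
			if !<-localResults {
				ok = false
			}
		}
		for range workerPassed {
			verdict <- ok
		}
	}()

	var wg sync.WaitGroup
	for id, passed := range workerPassed {
		wg.Add(1)
		go func(id int, passed bool) {
			defer wg.Done()
			localResults <- passed
			// A worker that finished early sits here, idle but still
			// consuming CI minutes, until the whole suite is done.
			status := 0
			if !<-verdict {
				status = 1
			}
			fmt.Printf("worker %d exits with status %d\n", id, status)
		}(id, passed)
	}
	wg.Wait()
}
```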

We implemented the exit status synchronization, but this solution was undesirable to the teams beta testing ABQ, who were concerned about their billable CI minutes for two reasons.

First, some test suites, such as end-to-end integration tests, consist of a small number of very slow tests. To synchronize the exit status, we had to have every process wait to exit until the last test finished. Despite ABQ’s superior distribution mechanisms, a handful of very slow tests could still leave workers waiting to exit, consuming billable CI minutes while sitting effectively idle.

Second, retrying the entire cluster uses substantially more minutes than retrying an individual worker. Additionally, if tests are flaky, retrying the entire cluster compounds the probability of hitting a flake, because every retried job has to pass for the retry to succeed. For example, if each job independently passes a retry 90% of the time, a whole-cluster retry across five jobs succeeds only about 59% of the time (0.9^5), whereas retrying the single failed job succeeds 90% of the time.

Automated Retries

We attempted to mitigate the issues with flaky tests by building automated retries into ABQ. We hoped that the retries would help eliminate failures due to flakiness, reducing how often entire test suites needed to be retried.
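
Conceptually, the retry logic looked something like this (a hypothetical sketch, not ABQ’s real code): rerun a failing test a bounded number of times and only report failure if every attempt fails.

```go
// A hypothetical sketch of automated retries, not ABQ's real code.
package main

import (
	"fmt"
	"math/rand"
)

// runTest stands in for executing one test; here it flakes randomly.
func runTest(name string) bool {
	return rand.Float64() > 0.3 // hypothetical 30% flake rate
}

// runWithRetries makes up to 1 + maxRetries attempts and only reports
// failure if every attempt fails.
func runWithRetries(name string, maxRetries int) bool {
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if runTest(name) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println("flaky_test passed:", runWithRetries("flaky_test", 2))
}
```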

Although the automated retries worked well, beta testing still surfaced too many scenarios where entire test suites needed to be retried.

We ultimately came to the conclusion that the clustered queue-based approach with a synchronized exit status wasn’t going to be viable.

Atomic Workers

We completely overhauled ABQ’s design to address these shortcomings. Based on what we learned, we decided to do the following (sketched below):

  • have each worker exit as soon as it was done pulling tests from the queue, with an exit status specific to whether any tests assigned to that worker failed
  • have the queue persist which tests were dispatched to which workers, so that individual workers could be retried and run the exact same list of tests that they originally pulled from the queue
  • build an abq report command that could be run after the entire cluster finishes to display an aggregated list of test results. Beta testers loved having an aggregate view of test failures, and we wanted to keep that feature!
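
Here’s a hypothetical sketch of that model (illustrative names and types, not ABQ’s real data structures): the queue records every dispatch, each worker’s exit status reflects only its own tests, and the recorded assignment lets a single worker be retried with exactly the tests it originally ran.

```go
// A hypothetical sketch of the atomic worker model; names and types are
// illustrative, not ABQ's real data structures.
package main

import "fmt"

// Queue persists which tests were dispatched to which worker.
type Queue struct {
	pending    []string
	dispatched map[int][]string
}

// Pull hands the next test to a worker and records the assignment.
func (q *Queue) Pull(workerID int) (string, bool) {
	if len(q.pending) == 0 {
		return "", false
	}
	test := q.pending[0]
	q.pending = q.pending[1:]
	q.dispatched[workerID] = append(q.dispatched[workerID], test)
	return test, true
}

// Replay returns the exact tests a worker originally ran, so a retried
// worker can run the same list again.
func (q *Queue) Replay(workerID int) []string { return q.dispatched[workerID] }

func main() {
	q := &Queue{
		pending:    []string{"test_a", "test_b", "test_c", "test_d"},
		dispatched: map[int][]string{},
	}
	failing := map[string]bool{"test_c": true} // hypothetical failing test

	// Alternate pulls to mimic two workers draining the queue concurrently.
	exitStatus := map[int]int{0: 0, 1: 0}
	for id := 0; ; id = 1 - id {
		test, ok := q.Pull(id)
		if !ok {
			break // queue drained: each worker exits immediately
		}
		if failing[test] {
			exitStatus[id] = 1 // reflects only this worker's own tests
		}
	}

	fmt.Println("exit statuses:", exitStatus)        // map[0:1 1:0]
	fmt.Println("retry worker 0 with:", q.Replay(0)) // [test_a test_c]
}
```

Because the dispatch record persists, only the failed worker needs to rerun, and it sees exactly the tests it saw the first time.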

With these changes in place, ABQ worked superbly in CI. We made ABQ generally available and released it as open source.

Future Possibilities

We’re happy to have shipped a new universal test runner for running tests in parallel jobs. At the same time, it’s clear that more could be done to facilitate compute-efficient retries of flaky tests in CI workflows. We also built Captain to detect and mitigate flaky tests, and we have even more ideas in store.

We’d love to chat with anybody thinking about these problems. Say hello on Discord or reach out at [email protected]
