by Dan Manges
Two weeks ago, we announced the first public release of ABQ, a universal test runner designed to split test suites into parallel jobs in CI. We learned a lot about how to build such a tool, because our first approach to building ABQ failed.
The core idea behind ABQ is that queues are typically the best mechanism for distributing workloads. Without ABQ, most CI workflows will pre-partition tests for parallel execution by splitting the files into distinct groups.
However, queues are a better approach for distributing tests among multiple CI jobs for the same reason they’re effective at distributing other work. They allow workers to pull from the queue as fast as possible, resulting in all of the processes finishing around the same time. With pre-partitioning, scenarios can arise where one process runs longer than all of the others.
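The difference can be sketched with a toy simulation (illustrative Python with made-up durations, not ABQ's actual implementation). One slow file makes a fixed partition lopsided, while queue-based pulling keeps both workers busy:

```python
# Sketch: why pulling from a queue beats pre-partitioning files.
# Hypothetical per-file durations (seconds); one slow file dominates.
durations = [30, 2, 2, 2, 2, 2, 2, 2]
workers = 2

# Pre-partitioning: split the files into fixed groups up front.
groups = [durations[i::workers] for i in range(workers)]
prepartition_makespan = max(sum(g) for g in groups)

# Queue: each worker pulls the next file as soon as it is free.
# (Assigning each job to the least-loaded worker models that.)
loads = [0] * workers
for d in durations:
    i = loads.index(min(loads))  # the worker that frees up first
    loads[i] += d
queue_makespan = max(loads)

print(prepartition_makespan, queue_makespan)  # prints: 36 30
```

With pre-partitioning, the worker that drew the slow file runs for 36 seconds while the other idles after 8; with the queue, both finish within the length of the single slowest file.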
Our first approach was exactly what most engineers would expect for queue-based test distribution:

- n workers would all spawn and connect to the queue
- worker 0 would be responsible for generating the test manifest – when it spawned, it'd generate a list of tests to run and populate the queue
- worker 0 would be responsible for printing the test results
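That architecture can be sketched in a few lines (illustrative Python with hypothetical names; ABQ distributes work across machines, not threads in one process):

```python
import queue
import threading

# Sketch of the first architecture: worker 0 populates a shared queue
# with the test manifest, all workers pull tests from it, and results
# are collected so worker 0 can print them.
tests = queue.Queue()
results = queue.Queue()

def worker(worker_id: int) -> None:
    if worker_id == 0:
        # Worker 0 generates the manifest and fills the queue.
        for name in ["test_a", "test_b", "test_c", "test_d"]:
            tests.put(name)
    while True:
        try:
            name = tests.get(timeout=0.5)
        except queue.Empty:
            return  # queue drained: this worker is done
        results.put((worker_id, name, "passed"))  # pretend the test ran

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Worker 0 would print each result; here we just count them.
print(results.qsize())  # prints: 4
```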
For the most part this approach worked well, but we ran into a few challenges significant enough that we had to scrap it completely.
Due to the possibility of flaky tests, CI jobs need to be retryable. To provide this capability, we initially synchronized the exit status across the cluster: each worker would wait to exit until the entire test suite had finished, so that it knew whether to exit with a zero or non-zero status. That way, engineers could easily click a button to retry all failed jobs in CI.
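The synchronization can be sketched with a barrier (illustrative Python, hypothetical names): every worker records whether its own tests passed, then blocks until the whole suite is done, and all workers exit with the same suite-wide status.

```python
import threading

# Sketch of cluster-wide exit status synchronization (our first design).
workers = 3
failed = threading.Event()          # set if any worker saw a failure
done = threading.Barrier(workers)   # releases once every worker arrives
exit_codes = [None] * workers

def worker(worker_id: int, own_tests_passed: bool) -> None:
    if not own_tests_passed:
        failed.set()
    done.wait()  # idle until the slowest worker finishes
    exit_codes[worker_id] = 1 if failed.is_set() else 0

outcomes = [True, False, True]  # pretend worker 1 saw a failure
threads = [
    threading.Thread(target=worker, args=(i, ok))
    for i, ok in enumerate(outcomes)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(exit_codes)  # prints: [1, 1, 1]
```

Every worker exits non-zero even though only one saw a failure — which is exactly what makes "retry all failed jobs" a one-click operation, and also what keeps fast workers idling at the barrier.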
We implemented the exit status synchronization, but teams beta testing ABQ found it undesirable: they were concerned about their billable CI minutes, for two reasons.
First, some test suites, such as end-to-end integration tests, consist of a small number of very slow tests. To synchronize the exit status, we had to have every process wait to exit until the last test finished. Despite ABQ’s superior distribution mechanisms, having a small number of very slow tests could still result in workers waiting to exit, taking up CI minutes while effectively being idle.
Second, retrying the entire cluster uses substantially more minutes than retrying an individual worker. Additionally, if tests are flaky, retrying the entire cluster increases the compounded probability of having a flake fail, reducing the likelihood of retries succeeding.
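A worked example makes the compounding concrete (hypothetical numbers): suppose a suite has 10 flaky tests that each pass 95% of the time, spread evenly across 5 workers.

```python
# Compounded flake risk: retrying one worker vs the whole cluster.
# Hypothetical: 10 flaky tests, each passing 95% of the time,
# spread across 5 workers (2 flaky tests per worker).
p_pass = 0.95

# Retrying one failed worker re-runs only its 2 flaky tests.
p_worker_retry_green = p_pass ** 2    # ~0.90

# Retrying the whole cluster re-runs all 10 flaky tests.
p_cluster_retry_green = p_pass ** 10  # ~0.60

print(p_worker_retry_green, p_cluster_retry_green)
```

A single-worker retry goes green roughly 90% of the time, while a whole-cluster retry succeeds only about 60% of the time — every extra flaky test in the retry multiplies in another chance to fail.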
We attempted to mitigate the issues with flaky tests by building automated retries into ABQ. We hoped that the retries would help eliminate failures due to flakiness, reducing the frequency that test suites needed to be retried.
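The idea is simple (a minimal sketch with hypothetical names, not ABQ's actual retry API): re-run a failing test up to a fixed number of attempts and count any pass as an overall pass.

```python
# Sketch of automated test retries: a flaky test passes overall
# if any of its attempts passes.
def run_with_retries(run_test, attempts: int = 3) -> bool:
    for _ in range(attempts):
        if run_test():
            return True
    return False

# Usage: a deterministic stand-in for a flaky test that fails once.
calls = {"n": 0}
def flaky_test() -> bool:
    calls["n"] += 1
    return calls["n"] > 1  # fails on the first attempt, then passes

print(run_with_retries(flaky_test))  # prints: True
```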
Although the automated retries worked well, we still too frequently saw scenarios in beta testing where entire test suites needed to be retried.
We ultimately came to the conclusion that the clustered queue-based approach with a synchronized exit status wasn’t going to be viable.
We completely overhauled the approach to ABQ to address these shortcomings. In response to what we learned, we dropped the cluster-wide synchronized exit status, so that each worker could exit as soon as its tests finished and be retried individually.
With these changes in place, ABQ worked superbly well on CI. We made ABQ generally available and released it as open source.
We’re happy to have shipped a new universal test runner for running tests in parallel jobs. At the same time, it’s clear that more could be done to facilitate compute-efficient retries of flaky tests in CI workflows. We also built Captain to detect and mitigate flaky tests, and we have even more ideas in store.
We’d love to chat with anybody thinking about these problems. Say hello on Discord or reach out at [email protected].