Our first design for test parallelization failed

ABQ
Apr 19, 2023
Dan Manges

Two weeks ago, we announced the first public release of ABQ, a universal test runner designed to split test suites into parallel jobs in CI. We learned a lot about how to build a tool like this, because our first attempt at building ABQ failed.

Background

The core idea behind ABQ is that queues are typically the best mechanism for distributing workloads. Without ABQ, most CI workflows pre-partition tests for parallel execution by splitting the test files into distinct groups.

However, queues are a better approach for distributing tests among multiple CI jobs for the same reason they’re effective at distributing other work. They allow workers to pull from the queue as fast as possible, resulting in all of the processes finishing around the same time. With pre-partitioning, scenarios can arise where one process runs longer than all of the others.
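A small simulation can make the difference concrete. This is only a sketch, not how ABQ is implemented: the test durations, the round-robin grouping, and both function names are assumptions chosen for illustration.

```python
import heapq

def static_partition_makespan(durations, workers):
    """Assign tests to fixed groups up front (round-robin, an assumed
    strategy) and return the total time of the slowest group."""
    totals = [0.0] * workers
    for i, duration in enumerate(durations):
        totals[i % workers] += duration
    return max(totals)

def queue_makespan(durations, workers):
    """Let each worker pull the next test as soon as it is free and
    return the time at which the last worker finishes."""
    free_at = [0.0] * workers  # min-heap of times at which workers go idle
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        heapq.heappush(free_at, start + duration)
    return max(free_at)

# Hypothetical test durations in minutes: two slow tests among fast ones.
durations = [9, 1, 1, 1, 9, 1, 1, 1]

static_time = static_partition_makespan(durations, workers=2)
queue_time = queue_makespan(durations, workers=2)
print(static_time, queue_time)
```

With these made-up numbers, the round-robin split puts both slow tests in the same group, so one worker runs for 20 minutes while the other finishes in 4. Pulling from a queue finishes both workers at 12 minutes.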

Initial Approach

Our first approach is exactly what most engineers would expect for queue-based test distribution:

For the most part this approach worked well, but we ran into a few challenges significant enough that we had to scrap it completely.

Flaky Tests Cause Major Problems

Because tests can be flaky, CI jobs need to be retryable. To provide this capability, we initially synchronized the exit status across the cluster. Each worker waited to exit until the entire test suite had finished, so that it knew whether to exit with a zero or non-zero exit status. That way, engineers could click a button to retry all failed jobs in CI.

We implemented the exit status synchronization, but teams beta testing ABQ found it undesirable. They were concerned about their billable CI minutes, for two reasons.

First, some test suites, such as end-to-end integration tests, consist of a small number of very slow tests. To synchronize the exit status, we had to have every process wait to exit until the last test finished. Despite ABQ’s superior distribution mechanisms, having a small number of very slow tests could still result in workers waiting to exit, taking up CI minutes while effectively being idle.
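The cost of waiting can be estimated with simple arithmetic. The sketch below assumes CI minutes are billed per worker for as long as the process is alive, and it ignores startup time and billing rounding. The finish times and the function name are hypothetical.

```python
def billable_minutes(finish_times, synchronized):
    """Total billed minutes across workers.

    finish_times: minute at which each worker ran out of tests to execute.
    synchronized: if True, every worker stays alive until the last one
    finishes, as with a synchronized exit status.
    """
    if synchronized:
        return len(finish_times) * max(finish_times)
    return sum(finish_times)

# Hypothetical suite: three workers finish early, one is stuck on a slow test.
finish = [4, 4, 5, 12]

with_sync = billable_minutes(finish, synchronized=True)
without_sync = billable_minutes(finish, synchronized=False)
idle = with_sync - without_sync
print(with_sync, without_sync, idle)
```

In this made-up case, synchronizing the exit status bills 48 minutes instead of 25, so 23 of the billed minutes are spent idle.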

Second, retrying the entire cluster uses substantially more minutes than retrying an individual worker. Additionally, if tests are flaky, retrying the entire cluster reruns every test, which compounds the probability that some flaky test fails and lowers the likelihood that the retry succeeds.
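To illustrate the compounding, suppose each job independently has the same chance of hitting a flaky failure. Both the independence and the 5% rate are assumptions made for this sketch, and the function name is hypothetical.

```python
def retry_success_probability(flake_rate, jobs_rerun):
    """Probability that a retry passes when every rerun job must avoid a
    flaky failure, assuming jobs flake independently at the same rate."""
    return (1 - flake_rate) ** jobs_rerun

flake_rate = 0.05  # hypothetical: each job flakes 5% of the time

single_job = retry_success_probability(flake_rate, jobs_rerun=1)
whole_cluster = retry_success_probability(flake_rate, jobs_rerun=10)
print(round(single_job, 2), round(whole_cluster, 2))
```

Under these assumptions, retrying one job succeeds 95% of the time, while retrying a 10-job cluster succeeds only about 60% of the time.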

Automated Retries

We attempted to mitigate the issues with flaky tests by building automated retries into ABQ. We hoped that the retries would help eliminate failures due to flakiness, reducing how often test suites needed to be retried.
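The general idea of retrying an individual test can be sketched as follows. This is not ABQ's implementation: the function name, the attempt limit, and the stand-in flaky test are all hypothetical.

```python
from typing import Callable, Tuple

def run_with_retries(test: Callable[[], bool], max_attempts: int = 3) -> Tuple[bool, int]:
    """Run a test until it passes or the attempt limit is reached.

    Returns (passed, attempts_used). A test counts as passing if any
    single attempt passes.
    """
    for attempt in range(1, max_attempts + 1):
        if test():
            return True, attempt
    return False, max_attempts

# Stand-in for a flaky test: fails twice, then passes.
outcomes = iter([False, False, True])
passed, attempts = run_with_retries(lambda: next(outcomes))
print(passed, attempts)
```

Here the stand-in test passes on its third attempt, so the suite would not need to be retried on its account.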

Although the automated retries worked well, in beta testing we still too often saw entire test suites needing to be retried.

We ultimately came to the conclusion that the clustered queue-based approach with a synchronized exit status wasn’t going to be viable.

Atomic Workers

We completely overhauled the approach to ABQ to address these shortcomings. In response to what we learned, we decided to:

With these changes in place, ABQ worked superbly well on CI. We made ABQ generally available and released it as open source.

Future Possibilities

We’re happy to have shipped a new universal test runner for running tests in parallel jobs. At the same time, it’s clear that more could be done to facilitate compute-efficient retries of flaky tests in CI workflows. We also built Captain to detect and mitigate flaky tests, and we have even more ideas in store.

We’d love to chat with anybody thinking about these problems. Say hello on Discord or reach out at [email protected]
