In the Hacker News comments on "I hate GitHub Actions with passion," gen220 asked:
> If you wanted a better version of GitHub Actions/CI, it would presumably need to be more opinionated and have more constraints?
>
> Who here has been thinking about this problem? Have you come up with any interesting ideas? What's the state of the art in this space?
>
> GHA was designed in ~2018. What would it look like if you designed it today, with all we know now?
Here are some interesting ideas.
#Feedback loop
GitHub Actions: commit and push to run
State of the art: use a local CLI to start a run
Having to make a work-in-progress commit and push to test workflow file changes results in a painfully slow feedback loop. This was the primary complaint of the post being discussed on HN.
Solving this problem is fairly easy: provide a mechanism to start a run from a local CLI. Additionally, parameterize run definitions so that the implementation is not coupled directly to version control webhooks.
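As a sketch of that decoupling, the same parameterized run definition can be invoked from a webhook handler or from a local CLI. All names and payload fields below are hypothetical, not any real platform's API:

```python
# Sketch: one parameterized entry point for starting runs, so the CI engine
# is not coupled to version-control webhooks. All names are hypothetical.

def start_run(repo: str, ref: str, definition: str) -> dict:
    """Start a run from explicit parameters, regardless of who supplies them."""
    return {"repo": repo, "ref": ref, "definition": definition, "status": "queued"}

def from_webhook(payload: dict) -> dict:
    """A push webhook is just one caller that maps its payload to parameters."""
    return start_run(payload["repository"], payload["after"], ".ci/pipeline.yml")

def from_cli(args: list[str]) -> dict:
    """A local CLI is another caller -- no commit or push required."""
    repo, ref, definition = args
    return start_run(repo, ref, definition)

# Both entry points converge on the same run parameters:
print(from_webhook({"repository": "acme/app", "after": "abc123"}))
print(from_cli(["acme/app", "work-in-progress", ".ci/pipeline.yml"]))
```

Because the run is just a function of explicit parameters, testing a workflow change no longer requires a throwaway commit.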
#Parallelization
GitHub Actions: separate job definitions which correspond to separate VMs
State of the art: graph-based task execution, machine agnostic
With GitHub Actions, parallelization happens by defining separate jobs. Every time an engineer adds something new to a pipeline, they have to decide whether to create a new job for it or tack it onto an existing one. Each job repeats the same setup steps, creating duplication in both the configuration and the execution, slowing down pipelines and driving up costs.
With graph-based task execution, all of that goes away. Engineers don't have to think about parallelization: tasks run with maximum parallelism based purely on the dependencies between steps. Additionally, each step only needs to be executed once, regardless of how many downstream steps depend on it.
GitHub Actions uses a server-centric model; the state of the art is serverless.
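The scheduling idea can be sketched in a few lines: tasks declare their dependencies, and the executor runs whatever is ready in parallel. The task names are illustrative:

```python
# Sketch of graph-based task execution: tasks declare dependencies, and the
# scheduler runs every ready task concurrently. Task names are made up.
from concurrent.futures import ThreadPoolExecutor

tasks = {
    "install-deps": [],
    "lint":         ["install-deps"],
    "build":        ["install-deps"],
    "unit-tests":   ["build"],
    "e2e-tests":    ["build"],
}

def execute(graph):
    done, order = set(), []
    with ThreadPoolExecutor() as pool:
        while len(done) < len(graph):
            # Every task whose dependencies are all satisfied is ready.
            ready = [t for t, deps in graph.items()
                     if t not in done and all(d in done for d in deps)]
            # All ready tasks run concurrently; here the "work" is a no-op.
            for finished in pool.map(lambda t: t, ready):
                done.add(finished)
            order.append(sorted(ready))
    return order

print(execute(tasks))
# → [['install-deps'], ['build', 'lint'], ['e2e-tests', 'unit-tests']]
```

Note that `install-deps` executes exactly once even though three downstream tasks depend on it, and `lint` never waits on `build`.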
#Caching
GitHub Actions: manual cache keys, manual archiving and restoring of directories
State of the art: automatic, content-based caching for every step
Caching is an important part of speeding up pipelines. On GitHub Actions, cache keys have to be configured manually using hashFiles.
Unfortunately, this is highly error-prone. It's easy to omit a file from the list in hashFiles that actually affects the execution, which will result in getting a false positive on a cache hit.
Additionally with GitHub Actions, you have to manually archive and restore directories when caching, and then manually no-op certain commands if you do not want them to run in the case of a cache hit.
With content-based caching, all of that complexity goes away. Any time the CI platform sees the same command being executed on identical source file contents as a previous execution, it automatically produces a cache hit and skips execution. This behavior can apply to every step in the pipeline. With sandboxing, files not specified as inputs are not present on disk at all, entirely eliminating false positives.
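A minimal sketch of content-based caching, assuming the cache key is derived from the command plus the exact contents of its declared input files:

```python
# Sketch of content-based caching: the cache key is a hash of the command
# plus the exact contents of its input files, so any change produces a miss.
import hashlib

CACHE: dict[str, str] = {}

def run_cached(command: str, input_files: dict[str, bytes]) -> tuple[str, bool]:
    h = hashlib.sha256(command.encode())
    for name in sorted(input_files):        # order-independent key
        h.update(name.encode())
        h.update(input_files[name])
    key = h.hexdigest()
    if key in CACHE:                        # identical command + inputs
        return CACHE[key], True             # -> cache hit, skip execution
    result = f"output of {command}"         # stand-in for real execution
    CACHE[key] = result
    return result, False

_, hit1 = run_cached("npm test", {"package.json": b"{}", "app.js": b"v1"})
_, hit2 = run_cached("npm test", {"package.json": b"{}", "app.js": b"v1"})
_, hit3 = run_cached("npm test", {"package.json": b"{}", "app.js": b"v2"})
print(hit1, hit2, hit3)   # False True False
```

There is no hand-maintained key to get wrong: editing any input byte invalidates the cache, and nothing else does.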
#Debugging remotely
GitHub Actions: add debugging statements to workflows and start a new run
State of the art: remote breakpoints to pause runs
GitHub Actions has no built-in mechanism to open a remote console on a running agent. As a result, engineers resort to print-line debugging or guessing at fixes on a several-minute feedback loop.
Setting a remote breakpoint can greatly accelerate debugging. Pause execution at the desired point, open a console, and figure out what's wrong.
#Debugging locally
The difficulty of debugging remotely motivates people to look for ways to run pipelines locally.
GitHub Actions: third-party nektos/act repo that only works sometimes
State of the art: pull container images corresponding to any step in the pipeline
With no way to easily debug remotely and a desire for faster feedback loops, it's not a surprise that the third-party nektos/act repo has over 65k stars on GitHub. People want better workflows.
Because file system state is already captured for graph-based task execution, it's possible to pull a container image corresponding to any step in a pipeline. Even when remote debugging is easy, pulling a container image for any step can be quite helpful.
#Retries
GitHub Actions: have to cancel in-progress jobs to retry other failed jobs
State of the art: can retry jobs while other jobs are in progress
With GitHub Actions, if a job fails and you retry it, all of the other jobs which are running will be cancelled.
This problem is easy to solve: just allow retrying some jobs while others are still running.
#Retries again
GitHub Actions: every retry starts from the beginning of the job
State of the art: resume retries at the point of individual steps
With GitHub Actions, every retry starts from the top. That means if a step fails after 10 other steps have run, those 10 steps are going to be re-executed as well.
With a graph-based approach, retries start with the step that failed rather than starting at the top.
#Base images
GitHub Actions: 74 GiB image that bundles far too much
State of the art: use container images
GitHub Actions made a poor decision in bundling a large amount of software into its base images. Doing so makes dependencies implicit instead of explicit and makes workflows less portable. It also hurts performance.
Instead, it's best to use minimal containers as base images for performance, explicitness, and portability.
#Third-party code
GitHub Actions: Actions, which are proprietary to GitHub and only work in the context of Actions
State of the art: portable bash scripts calling CLIs on a generic container base image
Third-party Actions are one of the greatest assets of GitHub Actions, and also one of its greatest downsides. While reusing code that somebody else wrote to automate something can be a great aid in initial development, it makes pipelines harder to understand. Actions inherently run only in GitHub Actions, which makes debugging them or running them in other contexts impossible. That's far too proprietary for a coding platform that's supposed to be about openness.
It's better to implement third-party integrations as portable bash scripts instead of proprietary JavaScript. CLIs and APIs are the ubiquitous interfaces for interacting with services from scripts; we don't need platform-specific code.
#Supply chain security
GitHub Actions: Actions often sourced from mutable git branches
State of the art: lock file
GitHub has partially addressed this (actions can be pinned to a full commit SHA), but many engineering teams still pull their third-party dependencies from branches. A malicious commit pushed to a branch is instantly pulled into their pipelines.
The state of the art is using a lockfile so that updates are only applied when desired. In addition to improving security, this approach also improves stability of pipelines.
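A minimal sketch of how a lockfile decouples "which branch I follow" from "which commit I run"; the resolver, dependency name, and SHA below are all hypothetical:

```python
# Sketch of a lockfile for third-party CI dependencies: the pipeline resolves
# a branch to a commit SHA once, records it, and only changes it on an
# explicit update. The resolver and SHA here are hypothetical.
import json

def resolve_branch(dep: str) -> str:
    # Stand-in for asking the forge what a branch currently points at.
    return {"some-org/setup-tool@main": "9f3b2c1"}[dep]

def install(dep: str, lockfile: dict) -> str:
    if dep in lockfile:
        return lockfile[dep]              # pinned: a pushed commit changes nothing
    lockfile[dep] = resolve_branch(dep)   # first install records the pin
    return lockfile[dep]

lock: dict[str, str] = {}
first = install("some-org/setup-tool@main", lock)
# Even if the branch moves later, the locked SHA is what gets installed:
second = install("some-org/setup-tool@main", lock)
print(first == second, json.dumps(lock))
```

An update command would re-run the resolver and rewrite the lockfile in a reviewable commit, which is where the stability benefit comes from.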
#Failure summaries
GitHub Actions: clicking into each individual job that failed and scrolling through logs
State of the art: summarizing all failures, parsing test and linter failures
With GitHub Actions, when there are failures across multiple jobs, you have to click into each individual job to see what failed, scrolling through each individual log file.
Although CI pipelines run command-line utilities designed for text-based terminals, CI results are displayed in browsers, which can render far richer formatting. 95% of the time, an engineer looking at build results wants to see what failed. Test results and linter errors should be parsed and summarized, and failures across all jobs should be aggregated into a unified view.
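A sketch of cross-job failure aggregation; the job names and log format here are invented for illustration:

```python
# Sketch of aggregating failures across jobs into one summary, instead of
# clicking through each job's raw log. The log format is invented.
job_logs = {
    "unit-tests": ["PASS test_login", "FAIL test_checkout: assert 402 == 200"],
    "lint":       ["FAIL app.py:14: unused import 'os'"],
    "build":      ["PASS compile"],
}

def summarize(logs: dict[str, list[str]]) -> list[str]:
    summary = []
    for job, lines in logs.items():
        for line in lines:
            if line.startswith("FAIL"):
                # Prefix each failure with its job so one view covers all jobs.
                summary.append(f"[{job}] {line[5:]}")
    return summary

for failure in summarize(job_logs):
    print(failure)
```

A real implementation would parse structured output (JUnit XML, linter JSON) rather than grepping log lines, but the aggregation step is the same.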
#Retrying failed tests
GitHub Actions: if one flaky test fails, retry the entire job
State of the art: retry only the one test that failed
With GitHub Actions, if a single test fails in a test suite, the entire job including all of the setup needs to be executed again. This can make flaky tests especially painful as the compounding effect can result in other flaky tests failing on the retry.
Ideally flaky tests would be eliminated entirely, but they're a fact of life for many engineering teams. Being able to retry only the individual tests that failed, instead of the entire job, can provide a big boost to engineering productivity.
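The idea can be sketched as running the full suite once, then retrying only the failed subset; the test names and the flaky-test simulation are contrived:

```python
# Sketch: rerun only the failed tests, reusing the already-built environment,
# instead of re-executing the entire job. Test names are invented.

def run_tests(tests: list[str], flaky_tests_fail: bool) -> list[str]:
    # Stand-in for a real test runner: flaky tests fail when the flag is set.
    return [t for t in tests if t.endswith("_flaky") and flaky_tests_fail]

all_tests = ["test_auth", "test_cart_flaky", "test_search"]
failed = run_tests(all_tests, flaky_tests_fail=True)     # first run: one flake
# The retry executes only the failed subset; setup and the tests that
# already passed are not repeated.
still_failing = run_tests(failed, flaky_tests_fail=False)
print(failed, still_failing)    # ['test_cart_flaky'] []
```

Because the retry touches one test instead of the whole suite, the odds of a different flake failing on the retry drop dramatically.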
#Docker services
GitHub Actions: custom syntax to specify which containers to run
State of the art: `docker compose up`
GitHub Actions has a custom syntax for running background containers.
A better approach is to use common toolchains and run `docker compose up` in a background process.
#Resource provisioning
GitHub Actions: all steps run on a single machine
State of the art: different steps can run on different machines
With the VM-centric model in GitHub Actions, allocating more resources for one step in a pipeline can result in being over-provisioned for the rest of the pipeline.
With a graph-based execution model, resource provisioning can vary per step. One step can execute with 16 CPUs, while other steps in the pipeline only use 2 CPUs.
#Dynamic tasks
GitHub Actions: pipeline definitions statically defined
State of the art: pipeline definitions can be generated at runtime
GitHub Actions requires pipelines to be statically defined up front. This often forces extra complexity into the configuration to vary pipeline behavior based on runtime conditions.
A better approach is to allow defining tasks at runtime. In addition to providing flexibility, this also enables handling complexity with code rather than with expressions in a static YAML definition.
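A sketch of runtime task generation, assuming the platform accepts task definitions produced by code; here, one test task per discovered package, with package names purely illustrative:

```python
# Sketch of dynamic task generation: the pipeline definition is produced by
# code at runtime (one test task per package) rather than being a static
# YAML document. Package names are illustrative.

def discover_packages() -> list[str]:
    # Stand-in for scanning the repository, e.g. for package.json files.
    return ["api", "web", "worker"]

def generate_tasks() -> list[dict]:
    tasks = [{"name": "install", "run": "npm ci", "deps": []}]
    for pkg in discover_packages():
        tasks.append({
            "name": f"test-{pkg}",
            "run": f"npm test --workspace {pkg}",
            "deps": ["install"],
        })
    return tasks

for task in generate_tasks():
    print(task["name"], "->", task["deps"])
```

Adding a fourth package to the repository changes the pipeline with zero configuration edits, and the generated tasks plug straight into the dependency graph described earlier.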
#RWX is the state of the art
Everything described in this post is available in RWX.