How Roc Compiles Closures

by Ayaz Hafiz

At RWX, we’re investing some of our engineering resources towards development of the Roc language project. Roc is a new, pre-alpha purely-functional language that offers a trove of new developer productivity features. We won’t get into all of them today, but we’d like to tell you about how Roc compiles closures with excellent runtime performance characteristics. One of Roc’s goals is to produce performant binaries when they’re needed, and how closures are represented at runtime is an important part of the design space.

You might know closures as "lambdas" or "function objects" from other languages. Roc takes an approach to closure compilation that differs from most other language compilers by applying a technique known as defunctionalization at the type-system level. We’ll discuss what all that means below!

This is the first part of a long story about compiling closures; in the future, we’ll describe a novel mechanism for compiling Roc-style closures in the presence of ad-hoc polymorphism like Rust’s traits, C++ concepts, or Go’s interfaces [1].

Defining Closures

There are many different names for the concept we’ll be talking about today. To be precise, I’d like to talk about a language feature that allows you to write functions that can

  1. be constructed on-the-fly
  2. capture values from the lexical scope they’re constructed in
  3. be used as values in the language, e.g. passed around to other functions
  4. conditionally be used in place of other functions of the same shape [2]

Here’s a concrete example of a closure satisfying the above criteria in Python (we’ll see C++ and Roc versions of the same function later on). The snippet defines a function that takes an integer n and conditionally returns a closure that takes an integer m and combines it with n via some operation - either addition or multiplication.

def increase_with(n, increase_op):
  if increase_op == "add":
    return lambda m: m + n
  else:
    assert increase_op == "mul"
    return lambda m: m * n

Closure Conversion

The Idea

Compiling closures doesn’t have to be tricky, but making their runtime representation efficient may take some care. If we think about a non-capturing function (say, a top-level function exported by a library in any language), a call to it is determined by two things:

  1. The code of the function
  2. The parameters passed to the function at the call site

For a caller of a function, only the code of the function is outside information, since the caller will provide the parameters. So, to represent a non-capturing function, only the function’s address needs to be relayed. Want to conditionally call one of two functions? Hide the address of the function behind a variable, and have the caller dereference that variable before calling.
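
In C, for example, that variable is just an ordinary function pointer. Here’s a minimal sketch of the idea (the functions here are hypothetical, for illustration):

#include <stdio.h>

int add_one(int x) { return x + 1; }
int double_it(int x) { return x * 2; }

int main(void) {
  int use_add = 1; // imagine this is decided at runtime
  // The function's address is the only outside information the caller needs.
  int (*op)(int) = use_add ? add_one : double_it;
  printf("%d\n", op(41)); // dereference the variable, then call
  return 0;
}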

Of course, the only difference between closures and non-capturing functions is that a closure captures variables! But that makes all the difference - those captured variables are now also outside information, and must be stored somewhere.

Captured variables are determined at runtime, not statically determined like the code for a function, so they can’t be stored alongside the code of the function. Nor can captured variables be determined by the caller of the function, since they were fixed at the time the closure was defined. If the caller always knew what they were, there would be no point to capturing those variables - they could be passed as regular parameters!

That means closures must be identified by both their function code, and their captured variables (the latter is sometimes called an "environment"). And there is no way around this - we are forced to provide our callers both pieces of information. In fact, as we’ll see below, the most common way to compile closures is as a tuple consisting of an address to the function, and an address to the captured environment!

Okay, I just gave you a lot of words, so let me give you a concrete example. Suppose we wanted to implement closures in C using this technique (C has no language-level concept equivalent to the kind of closures I described). Let’s do so for our increase_with example from above:

#include <stdlib.h>

struct any_closure {
    void* closure_addr; // Address to the function constituting the closure
    void* closure_env;  // Captured environment of the closure
};

struct add_closure_env {
  int n;
};

// The implementation of the adding closure.
int add_closure(int m, void* env) {
  int n = ((struct add_closure_env*) env)->n;
  return m + n;
}

struct mul_closure_env {
  int n;
};

// The implementation of the multiplying closure.
int mul_closure(int m, void* env) {
  int n = ((struct mul_closure_env*) env)->n;
  return m * n;
}

struct any_closure increase_with(int n, int increase_op) {
  if (increase_op == 0) { // add
    // Allocate the environment `add_closure` needs on the heap, and return a pointer to it.
    struct add_closure_env* env = (struct add_closure_env*) malloc(sizeof(struct add_closure_env));
    env->n = n;
    struct any_closure clos = {
        .closure_addr=(void*) &add_closure,
        .closure_env=env,
    };
    return clos;
  } else {
    // Allocate the environment `mul_closure` needs on the heap, and return a pointer to it.
    struct mul_closure_env* env = (struct mul_closure_env*) malloc(sizeof(struct mul_closure_env));
    env->n = n;
    struct any_closure clos = {
        .closure_addr=(void*) &mul_closure,
        .closure_env=env,
    };
    return clos;
  }
}
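
To round out the example, here’s a sketch of how a caller invokes one of these closures, relying on the fact that both closures in this particular example share the signature int(int, void*):

typedef int (*increase_fn)(int, void*);

int call_closure(struct any_closure clos, int m) {
  // Recover the concrete function type, then pass the environment along.
  increase_fn fn = (increase_fn) clos.closure_addr;
  return fn(m, clos.closure_env);
}

For example, call_closure(increase_with(10, 0), 5) evaluates to 15. (A real implementation would also need to free the heap-allocated environment at some point.)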

Let’s walk over a few finer points about this implementation strategy:

  • Note that the add_closure and mul_closure definitions now take an additional argument: the explicit closure environment. As mentioned before, this is unavoidable.

  • any_closure is defined as an opaque pointer to a function, and an opaque pointer to a closure environment. This is mostly for convenience; we can cast them to the types we expect at a usage site, with no runtime cost.

    Indeed, for our example, we could be a lot more specific than using void pointers, since increase_with returns closures whose functions have the same signature, and moreover their environments are defined the same way!

At this point, you might be thinking: “hey, actually, we don’t need the closure environment to be a pointer unless it needs to live on the heap - in other cases, it can be placed on the stack, and the type of closure_env here can be a union.” And that would be true, if we always knew what type of closure_env the function at closure_addr expected [3]. Let’s hang on to that thought going forward.
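
Concretely, that thought looks something like the sketch below, which builds on the structs above and only works under the assumption that the caller always knows the environment’s layout:

union increase_env {
  struct add_closure_env add;
  struct mul_closure_env mul;
};

struct inline_closure {
  void* closure_addr;     // still a pointer to the function...
  union increase_env env; // ...but the environment needs no heap allocation
};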

This strategy - adding an environment parameter to capturing function definitions and carrying the closure environment alongside the function pointer - is generally known as closure conversion, and is pretty much what every compiler that supports closures must do to some extent [4].

Before we get to what Roc does differently, I’d like to show you concretely that this strategy of packing a function pointer and environment into a value aligns with what Python and C++ do [5].

What Python does

There are a lot of Python implementations, so I’ll just talk about CPython and its bytecode VM.

So, dumping bytecode - CPython includes in its standard library a nice module, dis, to disassemble Python bytecode. Well, for us it’s nice because it’s an easy way to examine bytecode; for the CPython maintainers, I don’t know... something something Hyrum’s law, right? Anyway, let’s disassemble increase_with by calling dis.dis(increase_with).

The output is remarkably readable - thank you, CPython developers! And it tells us how closures are constructed: a MAKE_FUNCTION instruction builds a function object. Python’s function objects carry more than just the code of the function and its captured variables, but they do carry at least that!

What C++ does

In C++, a lot of values are objects, and closures (called lambdas in C++) are no different. C++ lambdas are de-sugared to function objects (called functors) that define what it means to call an instance of such an object via a call operator [6]. That is, the following code

auto increment_from(int num) {
  return [num]() { return num + 1; };
}

desugars to an anonymous class instance, roughly in the form presented below. In C++, auto directs the compiler to deduce the type of an item.

auto increment_from(int num) {
  class increment_from_functor {
    public:
      // Constructor
      increment_from_functor(int num): num{num} {}

      // Defines a way to call an instance of `increment_from_functor`
      // with no arguments, returning an int.
      int operator()() const {
        return num + 1;
      }
    private:
      int num;
  } increment_from_functor_inst {num};

  return increment_from_functor_inst;
}

This is a pretty efficient way to represent a closure, and is the way you or I would write one by hand. You can decide how the captured values should be stored (stack- or heap-allocated). Compare this to a hypothetical unconditional version of the C code we wrote before - it’s mostly the same, sans some heap allocations.

Unfortunately, this efficient representation doesn’t immediately scale to conditional closures (property [4] of our definition for what it means for something to be a closure). That’s because C++ is nominally-typed, and nominal typing rules don’t change for functors, which as we saw desugar to unique classes. That means the following code doesn’t compile:

enum class IncreaseOp { add, mul };

auto increase_with(int m, IncreaseOp increase_op) {
    if (increase_op == IncreaseOp::add) {
        return [m](int n) { return m + n; }; // error: the two lambdas have distinct types
    } else {
        return [m](int n) { return m * n; };
    }
}

C++ provides a nice abstraction to deal with this in the std::function container, which allows you to store functions polymorphically, at the expense of additional pointer indirection at runtime and possible heap allocation. std::function has a constructor that accepts any functor whose call operator matches its parameterized type, which is why the following version does compile:

#include <functional>

std::function<int(int)> increase_with(int m, IncreaseOp increase_op) {
    if (increase_op == IncreaseOp::add) {
        return [m](int n) { return m + n; };
    } else {
        return [m](int n) { return m * n; };
    }
}

That’s a nice ergonomics feature, but it incurs a runtime performance cost you might need to consider! If only we could do better…

Roc uses Lambda Sets

Roc implements closure conversion (as mentioned, we have to get the captured environment to a closure’s function definition, and that environment must be passed alongside the function pointer no matter what). But Roc makes closure conversion really efficient, by eliminating function pointer dereferences in favor of switch statements on integers, and always optimally deciding at compile-time whether captured environments for an arbitrary closure should be stack or heap-allocated (even when closure environments are decided conditionally).

I know that’s a lot of words, and I promise I’ll walk through all of it. But first, credit where it’s due - this technique is known as defunctionalization, and has been around for a while. It is usually employed as an optimization in the backend of compilers, and as such, is not always guaranteed. Roc ensures defunctionalization is always applied by representing it at the type-system level. The earliest work regarding type-based defunctionalization that I am aware of comes from Bell, Bellegarde, and Hook (1997), but the particular flavor of type-based defunctionalization Roc uses, known as lambda sets, is due to the work of Brandon, et al. (unpublished). See also Cejtin, Jagannathan, and Weeks (2000), which describes an auxiliary flow-based technique used in MLton. The implementation of specialization of lambda sets in Roc is in great part thanks to Folkert de Vries.

Defunctionalization

Recall that with our “function pointer and captured environment” implementation strategy for closures, we have to dereference a function pointer in order to figure out what function we need to call. But why should we have to do that? We know that increase_with only ever returns one of two closures. What if we just embed that information in a flag, and conditionally call the function we need based on the value of the flag? Doing so, we could produce a version of our C increase_with with the following diff:

diff --git a/increase_with.c b/increase_with.c
index 1bcc54c..315ea19 100644
--- a/increase_with.c
+++ b/increase_with.c
@@ -1,7 +1,9 @@
 #include <stdlib.h>

-struct any_closure {
-    void* closure_addr; // Address to the function constituting the closure
+enum increase_with_closure_flag { add, mul };
+
+struct increase_with_closure {
+    enum increase_with_closure_flag flag; // The function to dispatch to
     void* closure_env;  // Captured environment of the closure
 };

@@ -25,13 +27,13 @@ int mul_closure(int m, void* env) {
  return m * n;
 }

-struct any_closure increase_with(int n, int increase_op) {
+struct increase_with_closure increase_with(int n, int increase_op) {
  if (increase_op == 0) { // add
     // Allocate the environment `add_closure` needs on the heap, and return a pointer to it.
     struct add_closure_env* env = (struct add_closure_env*) malloc(sizeof(struct add_closure_env));
     env->n = n;
-    struct any_closure clos = {
-        .closure_addr=(void*) &add_closure,
+    struct increase_with_closure clos = {
+        .flag=add,
         .closure_env=env,
     };
     return clos;
@@ -39,8 +41,8 @@ struct any_closure increase_with(int n, int increase_op) {
     // Allocate the environment `mul_closure` needs on the heap, and return a pointer to it.
     struct mul_closure_env* env = (struct mul_closure_env*) malloc(sizeof(struct mul_closure_env));
     env->n = n;
-    struct any_closure clos = {
-        .closure_addr=(void*) &mul_closure,
+    struct increase_with_closure clos = {
+        .flag=mul,
         .closure_env=env,
     };
     return clos;

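The call site, not shown in the diff, now dispatches on the flag instead of chasing a pointer. A sketch:

int call_increase(struct increase_with_closure clos, int m) {
  switch (clos.flag) { // an integer compare, not a pointer dereference
    case add: return add_closure(m, clos.closure_env);
    case mul: return mul_closure(m, clos.closure_env);
  }
  return 0; // unreachable
}
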
Besides eliminating at least one load from memory, this strategy can have other advantages. The flag is smaller than the closure_addr pointer it replaces, so a compiler may be able to produce a smaller memory representation for structs composed of increase_with_closure values.

Let’s take this a step further. From here on, we’ll deal mostly with Roc syntax; I’ll explain basic concepts as they come up.

The typical example of defunctionalization has to do with higher-order functions. For example, suppose we have a function map : List a, (a -> b) -> List b that takes a list of any element type a, a function that maps the element type to any other type b, and returns a list of the mapped elements. Now suppose we define a function

greetNames : List Str, Str -> List Str
greetNames = \names, greeting ->
  mapper = \name -> "\(greeting), \(name)!"  # Roc’s string interpolation
  List.map names mapper

that takes a list of names, a greeting, and builds a list of greeting strings corresponding to the given names. Notice that with the function-and-environment-pointer implementation strategy for compiling closures, the implementation of List.map would have to dereference a function pointer and environment pointer to determine the mapping function. But in the particular case of greetNames, we know that the call to List.map can only ever take one closure - namely mapper!

That means we could define a specialized version of List.map for this usage that only ever calls mapper directly for each element, rather than needing to follow a dereference. And, in cases where List.map takes a closure that might point to multiple function addresses, we’d only need to switch on an integer, like we did with our increase_with example. This technique is known as defunctionalization [7][8].

Most usages of defunctionalization I’m aware of apply it as an optimization in the middle or backend of a compiler, over an intermediate representation. Moreover, such optimizations are typically global - in our example, a compiler would search for all usages of List.map, modify each call site to pass a unique integer instead of a function pointer, and modify the implementation of List.map to switch over the passed integer to dispatch to the appropriate function. This avoids creating specialized definitions of List.map for each different closure it’s called with, but makes incremental compilation harder - you have to be careful when a new usage of List.map is added, or one is deleted, in determining what parts of the compiled program can be re-used.
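
In C terms, the global flavor looks something like the sketch below (over ints, for brevity; all names are illustrative):

#include <stddef.h>

// Every closure the whole program ever passes to `map` gets a tag here.
enum map_tag { TAG_ADD_N /* , ...one tag per closure in the program */ };

struct add_n_env { int n; }; // the captures of the TAG_ADD_N closure

static int dispatch(enum map_tag tag, void* env, int elem) {
  switch (tag) {
    case TAG_ADD_N: return elem + ((struct add_n_env*) env)->n;
  }
  return elem; // unreachable
}

void map(int* xs, size_t len, enum map_tag tag, void* env) {
  for (size_t i = 0; i < len; i++)
    xs[i] = dispatch(tag, env, xs[i]); // no function pointer in sight
}

Adding or removing a closure anywhere in the program changes map_tag and dispatch, which is exactly the incremental-compilation hazard just described.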

Lambda Sets

Critically, a compiler cannot guarantee that defunctionalization can be applied unless it knows which values are functions, and moreover, which set of functions each of those values might actually dispatch to. So, why don’t we record that set of functions at the very point we figure out whether a value is a function or not - in the type system?

That’s the key insight of lambda sets - by bookkeeping, in the function type, the set of functions a closure might dispatch to, you have all the information you need to implement defunctionalization during compilation, and can hence guarantee that it always happens. Moreover, all the information is available right at your feet: if the compiler knows a value has a function type, it also knows exactly what functions that value can be.

Let me show you what this looks like. Roc’s compiler generates a unique name for each function that a program uses, and this name is stored in the type inferred for a function value. Below, I’ve annotated the inferred types for values in our increaseWith example; the symbols #clos1 and #clos2 correspond to unique names generated for the two anonymous closures used in the function.

increaseWith = \n, increaseOp ->
  when increaseOp is
    Add -> \m -> m + n
    #      ^^^^^^^^^^^ {} -[[#clos1 I64]]-> I64
    Mul -> \m -> m * n
    #      ^^^^^^^^^^^ {} -[[#clos2 I64]]-> I64

The type for increaseWith itself would be inferred as [9]

increaseWith : I64, [Add, Mul] -[[increaseWith]]-> (I64 -[[#clos1 I64, #clos2 I64]]-> I64)

[[#clos1 I64]] describes a lambda set that consists of exactly one function, #clos1, which captures one variable, a 64-bit signed integer. (The name of the captured variable isn’t stored in the type; it could be, but Roc’s implementation stores that name elsewhere.) So what is the inferred type of increaseWith saying? Well, it says it’s a function value that resolves to one possible function, increaseWith, which captures nothing and takes as parameters a 64-bit signed integer and an [Add, Mul] tag union value. It returns a function value that resolves to either #clos1, which captures an I64, or #clos2, which also captures an I64.

[#clos1 I64, #clos2 I64] is a tag union - a data type in Roc that allows you to specify a value that can take on one of a number of disjoint variants. Tag unions are Roc’s sum types, also known as enums, disjoint unions, or variants in other languages. For lambda sets, the tag names are the possible functions in the lambda set, and the payloads of the tags are the types of the captured variables. This syntactic similarity is no coincidence - lambda sets are compiled just like tag unions! Roc effectively [10] desugars our increaseWith example into

clos1 = \m, n -> m + n

clos2 = \m, n -> m * n

increaseWith = \num, increaseOp ->
  when increaseOp is
    Add -> Clos1 num
    Mul -> Clos2 num

and calls like

n = 20
(increaseWith 10 Add) n

are effectively [10] desugared to

n = 20
when increaseWith 10 Add is
  Clos1 m -> clos1 m n
  Clos2 m -> clos2 m n

And that’s how Roc compiles closures - with no pointer chasing to figure out what function to call, explicitly enumerating the possible calling options, in the same way a Roc developer would do by-hand.
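
Translated back into the C of the earlier sections, the compiled result looks roughly like the sketch below (Roc’s actual code generation differs in its details):

#include <stdint.h>

enum increase_tag { CLOS1, CLOS2 };

// The lambda set [[#clos1 I64, #clos2 I64]], as a tag-union-like struct.
struct increase_closure {
  enum increase_tag tag; // which function of the lambda set this is
  int64_t n;             // both #clos1 and #clos2 capture a single I64
};

int64_t dispatch_increase(struct increase_closure c, int64_t m) {
  switch (c.tag) {
    case CLOS1: return m + c.n; // body of #clos1
    case CLOS2: return m * c.n; // body of #clos2
  }
  return 0; // unreachable
}

Notice there is no heap allocation and no function pointer: the captured environment lives inline in the struct, and dispatch is a switch over an integer tag.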

Moreover, because Roc monomorphizes type-polymorphic functions, and lambda sets are part of function types, Roc will produce unique instantiations of functions that use or produce unique lambda sets. Taking our List.map example from before

greetNames : List Str, Str -> List Str
greetNames = \names, greeting ->
  mapper = \name -> "\(greeting), \(name)!"
  List.map names mapper

a unique specialization of List.map whose body only ever calls the function mapper will be created by Roc’s compiler, because the usage of List.map’s inferred type (with lambda sets) is List Str, (Str -[[mapper Str]]-> Str) -> List Str. That’s the optimal way we’d define List.map for this usage if we wrote it by hand, but we don’t have to - the compiler will do it for us!
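
A sketch of that specialization in C (all names are illustrative, and this is only the shape of the generated code, not Roc’s actual output):

#include <stdio.h>
#include <stddef.h>

struct mapper_env { const char* greeting; }; // mapper's single capture

// The body of `mapper`, with its environment made explicit.
static void mapper(struct mapper_env env, const char* name, char* out, size_t cap) {
  snprintf(out, cap, "%s, %s!", env.greeting, name);
}

// The specialization of List.map for the lambda set [[mapper Str]]:
// the callee is statically known, so the call is direct and trivially inlinable.
static void list_map_for_mapper(const char** names, size_t len,
                                char out[][64], struct mapper_env env) {
  for (size_t i = 0; i < len; i++)
    mapper(env, names[i], out[i], 64);
}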

This kind of defunctionalization-by-default can be very powerful - not only can you save a dereference or two, but the compiler can intelligently determine when captured environments can be passed around on the stack versus the heap, regardless of how conditionally the closure environment is chosen. In fact, for the vast majority of programs, Roc will always place captured environments on the stack (they must be heap-allocated when lambda sets are recursive - that can happen, but it’s rare!).

Moreover, since this technique explicitly enumerates where function calls dispatch to, it provides other optimizations, like inlining, exactly the information they need to do their job well. Functions hidden behind a pointer can rarely be inlined at call sites!

Tradeoffs

Like every choice, lambda sets come with their own challenges and disadvantages. We’re optimistic that their runtime performance characteristics and capacity to unblock other, powerful compiler optimizations over regions of a program make them the right choice for Roc. Nevertheless, here are a few considerations that need to be taken into account:

  • The time to monomorphize a program increases proportionally with the number of unique lambda sets present in a program. An alternative to this is to globally update lambda sets and their dispatch sites with all lambdas of a type in the program, but this has other problems as described previously, in particular with regards to incremental compilation.

  • At the limit, a monomorphized lambda set can describe the entire tree of calls through a program, from its entry point until its exit point. This happens when a function captures a calling function that captures another calling function, and so on. This situation does arise in practice, and we’ve found that care is needed to maintain compiler performance when it happens. Ensuring good compiler performance in such cases is an active area of Roc’s development.

  • A key assumption in assessing the runtime performance of the lambda set compilation technique is that loading and switching on an integer (e.g. implemented as a jump table) is faster than loading and dereferencing a function pointer.

    This is a reasonable assumption in general (especially when there is nothing to switch over - that is, when the switch is constant); however, whether it holds for a specific use case on a specific operating system and architecture is impossible to determine without benchmarks.

    If you’re familiar with computer architecture, you may recognize that the fundamental tradeoff in the defunctionalization approach boils down to the efficacy of a CPU’s branch predictor (for the conditional-dispatch case) relative to its branch target predictor (for the pointer-chasing case). I am not aware of any existing studies that measure the relative efficacies of these predictors across CPUs; if you are aware of any, please let us know!

  • Type-based defunctionalization, and lambda sets in particular, are a relatively new idea with little prior art. That means there are few, if any, references to turn to when questions arise, and there are open questions about the full power and limitations of lambda sets.

So, that’s a whirlwind tour of how to compile closures, and of the unique and highly-performant technique Roc uses. We haven’t covered all the nuances, but I hope this has given you a sense of the tradeoffs and decisions compiler authors make in support of language features like closures.

Before we leave, you may have been wondering why the lambda set syntax includes two opening and closing brackets (-[[ ... ]]->) rather than just one. That’s due to an extension of lambda sets we’ll discuss in the future! As a teaser (and a challenge to the interested), I’ll say that it’s not clear how type-based defunctionalization should interact with typeclasses, or ad-hoc polymorphism in general. If you’d like to think about this problem beforehand, consider what happens when a caller dispatches to a function hash that has no single implementation, but is instead determined conditionally based on the type fed into it (that is, it is implemented as part of an interface, or typeclass). We’ll share Roc’s answer to this challenge soon.


  1. Roc supports ad-hoc polymorphism via a feature called “abilities”. They are comparable to typeclasses, traits, or interfaces in other languages.
  2. Let’s say two functions have the same shape if the language permits you to call them and use their result in the same way; for statically-typed languages that might be enforced during type checking.
  3. If at this point you’re saying “wait a minute…,” just hold on, I think I’m with you. We’ll get there in no time.
  4. You may sometimes hear of a related technique known as lambda lifting. Lambda lifting is a way of eliminating variable captures by re-writing a program with capturing functions lifted to top-level functions, where the captured values become formal parameters to the function. This might sound a lot like closure conversion, and it is - the main difference is that you can no longer pass around capturing functions as values conditionally, at least not without also doing closure conversion, because the whole idea of captures has been eliminated in such a program!
  5. To be precise, C++ implements closures as function objects. In my opinion, this distinction is only nominal; at the end of the day, the compiled code still assembles to a function pointer and a bag of captured variables.
  6. Rust uses roughly the same model to compile closures, though it is a little more complicated for the developer due to Rust’s ownership semantics.
  7. Defunctionalization has interesting applications besides efficient higher-order-function compilation - see this blog post for a discussion.
  8. In languages with objects and dynamic dispatch, defunctionalization of function objects might fall out of the common optimization of devirtualization.
  9. Importantly, the lambda sets that appear in the arrow types are never exposed to the user - that is, you cannot explicitly tell the compiler what lambda sets a function type consists of. They are entirely an implementation detail, known only during compilation.
  10. For brevity I’ve elided some important details; this is not exactly the transformation, but the main idea is the same.
