11 Function factories

11.1 Introduction

A function factory is a function that makes functions. Here’s a simple example: we use a function factory (power()) to make two child functions (square() and cube()):
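A minimal sketch of such a factory (the exact body is assumed; it manufactures functions that raise their input to a fixed power):

```r
power <- function(exp) {
  force(exp)          # evaluate the exponent now, rather than lazily
  function(x) {
    x ^ exp
  }
}

square <- power(2)
cube <- power(3)
```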

These are regular functions so we can call them as usual:
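Continuing the sketch above:

```r
square(3)
#> [1] 9
cube(3)
#> [1] 27
```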

Of the three main FP tools, function factories are probably the least useful. However, they do come in handy from time to time, and the examples in this chapter will show you when and why. A recurring theme is that function factories tend not to reduce overall complexity, but instead allow you to partition complexity into a small number of pieces that can be more easily understood.

Outline

  • Section 11.2 begins the chapter with a description of how function factories work, pulling together ideas from scoping and environments.

  • Section 11.3 uses function factories to help solve maximum likelihood problems, where the function factory has many arguments and you call the child function many times.

  • Section 11.4 uses bootstrapping as motivation for function factories that can do some work up front. The results are cached, and can then be used again by the generated function.

  • Section 11.5 shows how you can use <<- with function factories in order to preserve state across function calls. You’ll learn a richer approach in R6, but a function factory can be useful for simple cases, like capturing the number of times a function is called.

  • Section 11.6 explores numerical integration: starting with simple pieces like the midpoint, trapezoid, Simpson, and Boole rules, and showing how they can all be generated with a single function factory.

  • Section 11.7 shows how you can combine function factories with functionals to rapidly generate a family of functions from data.

Function factories are an important building block for very useful function operators, which you’ll learn about in the next chapter.

Prerequisites

11.2 Factory fundamentals

“An object is data with functions. A closure is a function with data.” — John D. Cook

Function factories work because functions keep a reference to the environment in which they are defined. Technically, this environment is called the enclosing environment, and hence functions in R are also known as closures.

When you print a closure, you don’t see anything terribly useful:
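Continuing the sketch, printing the two child functions shows the same body for both; only the enclosing environment (an address that varies from session to session) differs:

```r
square
cube
```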

That’s because the function itself doesn’t change. The difference is the enclosing environment, environment(square). That’s easier to see if we draw a diagram:

There’s a lot going on in this diagram, but we can simplify things with a couple of conventions:

  • Any free floating symbol lives in the global environment.
  • Any environment without an explicit parent inherits from the global environment.

These conventions allow me to redraw the diagram more simply:

We can verify the diagram with some code that looks at the environment of the function:
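One way to do that with base R, using the child functions from the sketch above:

```r
environment(square)
environment(cube)
```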

The execution environment of a function is usually ephemeral. But a child function binds its enclosing environment (the factory's execution environment), and that binding keeps the execution environment alive until the child function is garbage collected. This property makes function factories work, but it requires a little care when you have more complicated code: it's easy to keep variables alive that you don't care about, and this can chew up memory. In the following code, note that f2() is so large because its environment contains the million-element vector x:
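A sketch of that situation (function names are assumptions; lobstr::obj_size() reports the size of an object together with the environments it references):

```r
f1 <- function() {
  x <- runif(1e6)   # a million random numbers (~8 MB)
  10
}

f2 <- function() {
  x <- runif(1e6)
  function() 10
}

lobstr::obj_size(f1())  # just the number 10
lobstr::obj_size(f2())  # roughly 8 MB: the returned function keeps x alive
```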

11.2.1 Exercises

  1. Base R contains two function factories, approxfun() and ecdf(). Read their documentation and experiment to figure out what the functions do and what they return.

  2. What does the following statistical function do? What would be a better name for it? (The existing name is a bit of a hint.)

  3. Create a function that creates functions that compute the ith central moment of a numeric vector. You can test it by running the following code:

  4. Create a function pick() that takes an index, i, as an argument and returns a function with an argument x that subsets x with i.

11.3 Maximum likelihood estimation

Our first motivating example for function factories comes from statistics. The goal of maximum likelihood estimation (MLE) is to find the “most likely” parameters of a distribution given a set of data. In MLE, we have two sets of inputs: the data, which is fixed for a given problem, and the parameters, which vary as we try to find the maximum. These two kinds of inputs make the problem well suited for function factories: given a dataset, we create a new function that returns the likelihood for specified parameters.

The following example shows how we might find the maximum likelihood estimate for \(\lambda\), if our data come from a Poisson distribution. First, we create a function factory that, given a dataset, returns a function that computes the negative log likelihood (NLL)[26] for parameter lambda.
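A minimal sketch of such a factory (the implementation details are assumed):

```r
poisson_nll <- function(x) {
  # these quantities depend only on the data, so compute them once
  n <- length(x)
  sum_x <- sum(x)
  const <- sum(lfactorial(x))

  function(lambda) {
    n * lambda - sum_x * log(lambda) + const
  }
}
```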

Note how the closure allows us to precompute the values that depend only on the data, and so are constant with respect to lambda.

We can use this function factory to generate specific NLL functions for input data. Then optimise() allows us to find the best values (the maximum likelihood estimates), given a generous starting range.
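A sketch with one illustrative dataset (the data vector is an assumption; its mean is 32.1):

```r
x1 <- c(41, 30, 31, 38, 29, 24, 30, 29, 31, 38)
nll1 <- poisson_nll(x1)

# optimise() minimises by default, which is why we work with the
# *negative* log likelihood
optimise(nll1, interval = c(0, 100))
```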

We can check that these values are correct by comparing them to the analytic solution: in this case, it’s just the mean of the data, 32.1 and 5.467.

Things are slightly less elegant when we generalise to more parameters, because optim(), the generalisation of optimise(), calls the function with a single argument containing a vector of parameters.
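A hedged sketch of what that looks like, using a normal distribution (the choice of distribution and the helper name are assumptions): the child function now takes a single vector theta rather than separate arguments.

```r
normal_nll <- function(x) {
  n <- length(x)
  function(theta) {
    mu <- theta[1]
    sigma <- theta[2]
    n * log(sigma) + sum((x - mu) ^ 2) / (2 * sigma ^ 2)
  }
}

# L-BFGS-B lets us keep sigma positive via a lower bound
optim(
  c(mean(x1), sd(x1)),
  normal_nll(x1),
  method = "L-BFGS-B",
  lower = c(-Inf, 1e-6)
)
```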

11.4 Bootstrap generators

Continuing with the statistical theme, bootstrapping is another place where function factories can be useful.
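A sketch of a bootstrap generator (the implementation is assumed): the factory captures the data frame and the variable name, and the generated function draws one bootstrap resample each time it is called.

```r
boot_permute <- function(df, var) {
  n <- nrow(df)
  force(var)

  function() {
    col <- df[[var]]
    col[sample(n, replace = TRUE)]
  }
}

boot_mtcars <- boot_permute(mtcars, "mpg")
head(boot_mtcars())
```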

You might worry that this produces a function that takes up a lot of memory:

But boot_mtcars() doesn’t make a copy of df, it just captures a reference to it. That means the size of mtcars and boot_mtcars() together is much smaller than you might expect:
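A sketch of how we might check this with lobstr (obj_size() counts shared objects only once when given multiple arguments):

```r
lobstr::obj_size(mtcars)
lobstr::obj_size(boot_mtcars)

# together they take barely more space than mtcars alone,
# because the closure only references mtcars
lobstr::obj_size(mtcars, boot_mtcars)
```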

The advantage of a function factory is clearer with a parametric bootstrap, where we first have to fit a model. This is a clear setup step that can be done once.
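A sketch of a parametric bootstrap generator (names and implementation are assumptions): the model is fitted once in the factory, and only the fitted values and residuals are kept for the generated function.

```r
boot_model <- function(df, formula) {
  mod <- lm(formula, data = df)
  fitted <- unname(fitted(mod))
  resid <- unname(resid(mod))
  rm(mod)   # drop the full model object so the closure doesn't keep it alive

  function() {
    fitted + sample(resid)
  }
}

boot_mtcars2 <- boot_model(mtcars, mpg ~ wt)
head(boot_mtcars2())
```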

11.5 Mutable state with <<-

Having variables at two levels allows you to maintain state across function invocations. This is possible because while the execution environment is refreshed every time, the enclosing environment is constant. The key to managing variables at different levels is the double arrow assignment operator (<<-). Unlike the usual single arrow assignment (<-) that always assigns in the current environment, the double arrow operator will keep looking up the chain of parent environments until it finds a matching name. (Binding names to values has more details on how it works.)

Together, a static parent environment and <<- make it possible to maintain state across function calls. The following example shows a counter that records how many times a function has been called. Each time new_counter is run, it creates an environment, initialises the counter i in this environment, and then creates a new function.
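A minimal sketch of such a counter factory (the body is assumed):

```r
new_counter <- function() {
  i <- 0

  function() {
    i <<- i + 1   # modify i in the enclosing environment, not a local copy
    i
  }
}
```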

The new function is a closure, and its enclosing environment is the environment created when new_counter() is run. Ordinarily, function execution environments are temporary, but a closure maintains access to the environment in which it was created. In the example below, closures counter_one() and counter_two() each get their own enclosing environments when run, so they can maintain different counts.
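Continuing the sketch:

```r
counter_one <- new_counter()
counter_two <- new_counter()

counter_one()
#> [1] 1
counter_one()
#> [1] 2
counter_two()
#> [1] 1
```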

The counters get around the “fresh start” limitation by not modifying variables in their local environment. Since the changes are made in the unchanging parent (or enclosing) environment, they are preserved across function calls.

What happens if you don’t use a closure? What happens if you use <- instead of <<-? Make predictions about what will happen if you replace new_counter() with the variants below, then run the code and check your predictions.
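The variants are not reproduced here; a plausible pair, matching the two questions above, might look like this (names are assumptions):

```r
i <- 0
new_counter2 <- function() {   # no closure: i lives in the global environment
  i <<- i + 1
  i
}

new_counter3 <- function() {   # a closure, but <- instead of <<-
  i <- 0
  function() {
    i <- i + 1
    i
  }
}
```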

Modifying values in a parent environment is an important technique because it is one way to generate “mutable state” in R. Mutable state is normally hard because every time it looks like you’re modifying an object, you’re actually creating and then modifying a copy. However, if you do need mutable objects and your code is not very simple, it’s usually better to use reference classes, as described in RC.

The power of closures is tightly coupled with the more advanced ideas in functionals and function operators, and you'll see many more closures in those two chapters. The following section explores a case study where a single function factory generates a whole family of numerical integration rules.

11.6 Numerical integration

A powerful use case for function factories is when you have a family of functions with flexible parameters, and some members of the family have special, known names. You can use a function factory to provide a general builder, and then use the factory to give the interesting special cases names.

The idea behind numerical integration is simple: find the area under a curve by approximating the curve with simpler components. The two simplest approaches are the midpoint and trapezoid rules. The midpoint rule approximates a curve with a rectangle. The trapezoid rule uses a trapezoid. Each takes the function we want to integrate, f, and a range of values, from a to b, to integrate over.
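A sketch of the two rules (implementations assumed):

```r
midpoint <- function(f, a, b) {
  (b - a) * f((a + b) / 2)
}

trapezoid <- function(f, a, b) {
  (b - a) / 2 * (f(a) + f(b))
}
```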

For this example, I’ll try to integrate sin x from 0 to \(\pi\). This is a good choice for testing because it has a simple answer: 2.
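Continuing the sketch:

```r
midpoint(sin, 0, pi)    # pi * sin(pi / 2) = pi, about 3.14
trapezoid(sin, 0, pi)   # (pi / 2) * (sin(0) + sin(pi)), about 0
```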

Neither of these functions gives a very good approximation. To make them more accurate, we can use the idea that underlies calculus: break up the range into smaller pieces and integrate each piece using one of the simple rules. This is called composite integration. I'll implement it using two new functions:
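A sketch of the two composite versions (implementations assumed):

```r
midpoint_composite <- function(f, a, b, n = 10) {
  points <- seq(a, b, length.out = n + 1)
  h <- (b - a) / n

  area <- 0
  for (i in seq_len(n)) {
    area <- area + h * f((points[i] + points[i + 1]) / 2)
  }
  area
}

trapezoid_composite <- function(f, a, b, n = 10) {
  points <- seq(a, b, length.out = n + 1)
  h <- (b - a) / n

  area <- 0
  for (i in seq_len(n)) {
    area <- area + h / 2 * (f(points[i]) + f(points[i + 1]))
  }
  area
}

midpoint_composite(sin, 0, pi, n = 100)
trapezoid_composite(sin, 0, pi, n = 100)
```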

You’ll notice that there’s a lot of duplication between midpoint_composite() and trapezoid_composite(). Apart from the internal rule used to integrate over a range, they are basically the same. From these specific functions you can extract a more general composite integration function:
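A sketch of the generalisation: the rule for a single sub-interval becomes an argument.

```r
composite <- function(f, a, b, n = 10, rule) {
  points <- seq(a, b, length.out = n + 1)

  area <- 0
  for (i in seq_len(n)) {
    area <- area + rule(f, points[i], points[i + 1])
  }
  area
}

composite(sin, 0, pi, n = 10, rule = midpoint)
composite(sin, 0, pi, n = 10, rule = trapezoid)
```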

This function takes two functions as arguments: the function to integrate and the integration rule. We can now add even better rules for integrating over smaller ranges:
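Sketches of two higher-order rules (implementations assumed):

```r
simpson <- function(f, a, b) {
  (b - a) / 6 * (f(a) + 4 * f((a + b) / 2) + f(b))
}

boole <- function(f, a, b) {
  pos <- function(i) a + i * (b - a) / 4
  fi <- function(i) f(pos(i))

  (b - a) / 90 *
    (7 * fi(0) + 32 * fi(1) + 12 * fi(2) + 32 * fi(3) + 7 * fi(4))
}

composite(sin, 0, pi, n = 10, rule = simpson)
composite(sin, 0, pi, n = 10, rule = boole)
```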

It turns out that the midpoint, trapezoid, Simpson, and Boole rules are all examples of a more general family called Newton-Cotes rules. (They are based on interpolating polynomials of increasing degree.) We can use this common structure to write a function that can generate any Newton-Cotes rule:
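A sketch of such a factory, limited to closed Newton-Cotes rules (the coefficients supply both the number of evenly spaced evaluation points and the weights; handling open rules like the midpoint would need an extra argument, and f is assumed to be vectorised):

```r
newton_cotes <- function(coef) {
  n <- length(coef) - 1   # number of sub-intervals between evaluation points

  function(f, a, b) {
    points <- a + (0:n) * (b - a) / n
    (b - a) / sum(coef) * sum(coef * f(points))
  }
}

trapezoid2 <- newton_cotes(c(1, 1))
simpson2   <- newton_cotes(c(1, 4, 1))
boole2     <- newton_cotes(c(7, 32, 12, 32, 7))

composite(sin, 0, pi, n = 10, rule = boole2)
```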

Mathematically, the next step in improving numerical integration is to move from a grid of evenly spaced points to a grid where the points are closer together near the ends of the range, such as Gaussian quadrature. That's beyond the scope of this case study, but you could implement it with similar techniques.

11.6.1 Exercises

  1. The trade-off between integration rules is that more complex rules are slower to compute, but need fewer pieces. For sin() in the range [0, \(\pi\)], determine the number of pieces needed so that each rule will be equally accurate. Illustrate your results with a graph. How do they change for different functions? sin(1 / x^2) is particularly challenging.

11.7 Function factories + functionals

Combining function factories with functionals makes it easy to create a bunch of functions at once.
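A sketch using the power() factory from the start of the chapter together with purrr::map() (the particular list of exponents is an assumption):

```r
exps <- list(square = 2, cube = 3, root = 1 / 2, reciprocal = -1)
funs <- purrr::map(exps, power)

funs$root(64)
#> [1] 8
funs$cube(2)
#> [1] 8
```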

If the function factory has two arguments, you could use map2(), and if it has three or more, you could use a data frame plus pmap().

An alternative approach, covered in the translation chapter, uses quasiquotation. It requires more knowledge, but it has the advantage of generating functions with more readable bodies, and it avoids accidentally capturing large objects in the enclosing scope. The following code is a quick preview of how we could rewrite power() to use quasiquotation instead of a function factory. You'll learn more about this in Section 19.6.4.
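A sketch of that preview (the helper name power_q() is an assumption): rlang::new_function() builds a function whose body has the exponent inlined with !!, so printing it shows the literal value.

```r
power_q <- function(exp) {
  rlang::new_function(
    rlang::exprs(x = ),        # one argument, x, with no default
    rlang::expr(x ^ !!exp),    # exp is inlined into the body
    rlang::caller_env()
  )
}

power_q(2)
#> function (x)
#> x^2
```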

11.7.1 Exercises

  1. Instead of creating individual functions (e.g., midpoint(), trapezoid(), simpson(), etc.), we could store them in a list. If we did that, how would that change the code? Can you create the list of functions from a list of coefficients for the Newton-Cotes formulae?

[26] In R, you usually use the negative log-likelihood because optimise() defaults to finding the minimum, not the maximum.