18 Evaluation

18.1 Introduction

The user-facing inverse of quotation is unquotation: it gives the user the ability to selectively evaluate parts of an otherwise quoted argument. The developer-facing complement of quotation is evaluation: this gives the developer the ability to evaluate quoted expressions in custom environments to achieve specific goals.

This chapter begins with a discussion of evaluation in its purest form. You’ll learn how rlang::eval_bare() evaluates an expression in an environment, and then how it can be used to implement a number of important base R functions. Next, we’ll circle back to base::eval() and friends to see how these ideas are expressed in base R.

Once you have the basics under your belt, you’ll learn extensions to evaluation that are needed for robustness. There are two big new ideas:

  • The quosure: a data structure that captures an expression along with its associated environment, as found in function arguments.

  • The data mask, which makes it easier to evaluate an expression in the context of a data frame. This introduces potential evaluation ambiguity which we’ll then resolve with data pronouns.

Together, quasiquotation, quosures, and data masks form what we call tidy evaluation, or tidy eval for short. Tidy eval provides a principled approach to non-standard evaluation that makes it possible to use such functions both interactively and embedded inside other functions. Tidy evaluation is the most important practical implication of all this theory, so we’ll spend a little time exploring it. The chapter finishes off with a discussion of the closest related approaches in base R, and how you can program around their drawbacks.

Outline

  • Section 18.2 discusses the basics of evaluation using eval(), and shows how you can use it to implement key functions like local() and source().

  • Section 18.3 introduces a new data structure, the quosure, which combines an expression with an environment. You’ll learn how to capture them from promises, and evaluate them using rlang::eval_tidy().

  • Section 18.4 extends evaluation with the “data mask”, which makes it trivial to intermingle symbols bound in an environment with variables found in a data frame.

  • Section 18.5 shows how to use tidy evaluation in practice, focussing on the common pattern of quoting and unquoting, and how to handle ambiguity with pronouns.

  • Section 18.6 circles back to evaluation in base R, discusses some of the downsides, and shows how to use quasiquotation and evaluation to wrap functions that use NSE.

Prerequisites

You’ll need to be familiar with the content of Chapter 16 and Chapter ??, as well as the environment data structure (Section 6.2) and caller environments (Section 6.5).

We’ll continue to use rlang and purrr.

library(rlang)
library(purrr)
#> 
#> Attaching package: 'purrr'
#> The following objects are masked from 'package:rlang':
#> 
#>     %@%, %||%, as_function, flatten, flatten_chr, flatten_dbl,
#>     flatten_int, flatten_lgl, invoke, list_along, modify, prepend,
#>     rep_along, splice

18.2 Evaluation basics

In the previous chapter, we briefly mentioned eval(). Here, however, we’re going to start with rlang::eval_bare() as it’s the purest evocation of the idea of evaluation. It has two arguments: expr and env. The first argument, expr, is the object to evaluate, which is typically either a symbol or an expression[71]. None of the evaluation functions quote their inputs, so you’ll usually use them with expr() or similar:

x <- 10
eval_bare(expr(x))
#> [1] 10

y <- 2
eval_bare(expr(x + y))
#> [1] 12

The second argument, env, gives the environment in which the expression should be evaluated, i.e. where the values of x, y, and + should be looked for. By default, this is the current environment, i.e. the calling environment of eval_bare(), but you can override it if you want:

eval_bare(expr(x + y), env(x = 1000))
#> [1] 1002

Because R looks up functions in the same way as variables, we can also override the meaning of functions. This is a very useful technique if you want to translate R code into something else, as you’ll learn about in Chapter 19.

eval_bare(
  expr(x + y), 
  env(`+` = function(x, y) paste0(x, " + ", y))
)
#> [1] "10 + 2"

Note that the first argument to eval_bare() (and to base::eval()) is evaluated, not quoted. This can lead to confusing results if you forget to quote the input:

eval_bare(x + y)
#> [1] 12
eval_bare(x + y, env(x = 1000))
#> [1] 12

Now that you’ve seen the basics, let’s explore some applications. We’ll focus primarily on base R functions that you might have used before; now you can learn how they work. To focus on the underlying principles, we’ll extract their essence and reimplement them using rlang. Once you’ve seen some applications, we’ll circle back and talk more about base::eval().

18.2.1 Application: local()

Sometimes you want to perform a chunk of calculation that creates some intermediate variables. The intermediate variables have no long-term use and could be quite large, so you’d rather not keep them around. One approach is to clean up after yourself using rm(); another approach is to wrap the code in a function, and just call it once. A more elegant approach is to use local():

# Clean up variables created earlier
rm(x, y)

foo <- local({
  x <- 10
  y <- 200
  x + y
})

foo
#> [1] 210
x
#> Error in eval(expr, envir, enclos):
#>   object 'x' not found
y
#> Error in eval(expr, envir, enclos):
#>   object 'y' not found

The essence of local() is quite simple. We capture the input expression, and create a new environment in which to evaluate it. This is a new environment (so assignment doesn’t affect the existing environment) with the caller environment as parent (so that expr can still access variables in that environment). This effectively emulates running expr as if it were inside a function (i.e. it’s lexically scoped, Section 5.4).

local2 <- function(expr) {
  env <- child_env(caller_env())
  eval_bare(enexpr(expr), env)
}

foo <- local2({
  x <- 10
  y <- 200
  x + y
})

foo
#> [1] 210
x
#> Error in eval(expr, envir, enclos):
#>   object 'x' not found
y
#> Error in eval(expr, envir, enclos):
#>   object 'y' not found

Understanding how base::local() works is harder, as it uses eval() and substitute() together in rather complicated ways. Figuring out exactly what’s going on is good practice if you really want to understand the subtleties of substitute() and the base eval() functions, so it’s included in the exercises below.
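For reference, here’s how base::local() is defined in R’s source; it packs the whole substitute() and eval() dance into a single line:

local <- function(expr, envir = new.env())
  eval.parent(substitute(eval(quote(expr), envir)))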

18.2.2 Application: source()

We can create a simple version of source() by combining eval_bare() with parse_exprs() from Section 16.4.3. We read in the file from disk, use parse_exprs() to parse the string into a list of expressions, and then use eval_bare() to evaluate each element in turn. This version evaluates the code in the caller environment, and invisibly returns the result of the last expression in the file, just like base::source().

source2 <- function(path, env = caller_env()) {
  file <- paste(readLines(path, warn = FALSE), collapse = "\n")
  exprs <- parse_exprs(file)

  res <- NULL
  for (i in seq_along(exprs)) {
    res <- eval_bare(exprs[[i]], env)
  }
  
  invisible(res)
}
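To check that it works, we can write a couple of expressions to a temporary file and source it:

tmp <- tempfile(fileext = ".R")
writeLines(c("z <- 10", "z * 2"), tmp)

(source2(tmp))  # parentheses make the invisible result visible
#> [1] 20
z               # the code ran in the caller environment, so z is bound here
#> [1] 10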

The real source() is considerably more complicated because it can echo input and output, and has many other settings that control its behaviour.

18.2.3 Gotcha: function()

There’s one small gotcha that you should be aware of if you’re using eval_bare() and expr() to generate functions:

x <- 10
y <- 20
f <- eval_bare(expr(function(x, y) !!x + !!y))
f
#> function(x, y) !!x + !!y

This function doesn’t look like it will work, but it does:

f()
#> [1] 30

This is because, if available, functions print their srcref attribute (Section 5.2.1), and because the srcref is a base R feature it’s unaware of quasiquotation. To work around this problem, either use new_function() (Section 17.7.4) or remove the srcref attribute:

attr(f, "srcref") <- NULL
f
#> function (x, y) 
#> 10 + 20

18.2.4 Base R

The closest base equivalent to eval_bare() is the two-argument form of eval(): eval(expr, envir):

eval(expr(x + y), env(x = 1000, y = 1))
#> [1] 1001

eval() has a third argument, enclos, which provides support for data masks, the topic of Section 18.4. eval() is paired with two helper functions:

  • evalq(x, env) quotes its first argument, and is hence a shortcut for eval(quote(x), env).

  • eval.parent(expr, n) is a shortcut for eval(expr, envir = parent.frame(n)). Both helpers are demonstrated below.
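A quick demonstration of both helpers:

e <- env(x = 1)
evalq(x, e)        # equivalent to eval(quote(x), e)
#> [1] 1

f <- function() eval.parent(quote(x))
g <- function() {
  x <- 2
  f()              # x is looked up in g()'s execution environment
}
g()
#> [1] 2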

In most cases, there is no reason to prefer rlang::eval_bare() over eval(); I just used it here because it has a more minimal interface.

18.2.5 Exercises

  1. Carefully read the documentation for source(). What environment does it use by default? What if you supply local = TRUE? How do you provide a custom environment?

  2. Predict the results of the following lines of code:

    eval(quote(eval(quote(eval(quote(2 + 2))))))
    eval(eval(quote(eval(quote(eval(quote(2 + 2)))))))
    quote(eval(quote(eval(quote(eval(quote(2 + 2)))))))
  3. Write an equivalent to get() using sym() and eval_bare(). Write an equivalent to assign() using sym(), expr(), and eval_bare(). (Don’t worry about the multiple ways of choosing an environment that get() and assign() support; assume that the user supplies it explicitly.)

    # name is a string
    get2 <- function(name, env) {}
    assign2 <- function(name, value, env) {}
  4. Modify source2() so it returns the result of every expression, not just the last one. Can you eliminate the for loop?

  5. We can make base::local() slightly easier to understand by spreading it out over multiple lines:

    local3 <- function(expr, envir = new.env()) {
      call <- substitute(eval(quote(expr), envir))
      eval(call, envir = parent.frame())
    }

    Explain how local() works in words. (Hint: you might want to print(call) to help understand what substitute() is doing, and read the documentation to remind yourself what environment new.env() will inherit from.)

18.3 Quosures

Almost every use of eval() involves both an expression and environment. This coupling is so important that we need a data structure that can hold both pieces. Base R[72] does not have such a structure so rlang fills the gap with the quosure, an object that contains an expression and an environment. The name is a portmanteau of quoting and closure, because a quosure both quotes the expression and encloses the environment. Quosures reify the internal promise object (Section ??) into something that you can program with.

In this section, you’ll learn how to create and manipulate quosures, and a little about how they are implemented.

18.3.1 Creating

There are three ways to create quosures:

  • Use enquo() and enquos() to capture user-supplied expressions, as shown above. The vast majority of quosures should be created this way.

    foo <- function(x) enquo(x)
    foo(a + b)
    #> <quosure>
    #> expr: ^a + b
    #> env:  global
  • quo() and quos() exist to match expr() and exprs(), but they are included only for the sake of completeness and are needed very rarely.

    quo(x + y + z)
    #> <quosure>
    #> expr: ^x + y + z
    #> env:  global
  • new_quosure() creates a quosure from its components: an expression and an environment. This is rarely needed in practice, but it’s useful for learning about the system, so it’s used a lot in this chapter.

    new_quosure(expr(x + y), env(x = 1, y = 10))
    #> <quosure>
    #> expr: ^x + y
    #> env:  0x614eb68

18.3.2 Evaluating

Quosures are paired with a new evaluation function, eval_tidy(), which takes an expression and environment bundled together into a quosure. It is straightforward to use:

q1 <- new_quosure(expr(x + y), env(x = 1, y = 10))
eval_tidy(q1)
#> [1] 11

For this simple case, eval_tidy(q1) is basically a shortcut for eval_bare(get_expr(q1), get_env(q1)). However, it has two important features that you’ll learn about later in the chapter: it supports nested quosures and pronouns.
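We can verify the equivalence directly:

eval_bare(get_expr(q1), get_env(q1))
#> [1] 11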

18.3.3 Dots

Quosures are typically just a convenience: they make code cleaner because you only have one object to pass around, instead of two. They are, however, essential when it comes to working with ... because it’s possible for each argument passed to ... to have a different environment associated with it. In the following example note that both quosures have the same expression, x, but a different environment:

f <- function(...) {
  x <- 1
  g(..., f = x)
}
g <- function(...) {
  enquos(...)
}

x <- 0
qs <- f(global = x)
qs
#> <listof<quosures>>
#> 
#> $global
#> <quosure>
#> expr: ^x
#> env:  global
#> 
#> $f
#> <quosure>
#> expr: ^x
#> env:  0x640e450

That means that when you evaluate them, you get the correct results:

map(qs, eval_tidy)
#> $global
#> [1] 0
#> 
#> $f
#> [1] 1

Correctly evaluating the elements of dots was one of the original motivations for the development of quosures.

18.3.4 Under the hood

Quosures were inspired by R’s formulas, because formulas capture an expression and an environment:

f <- ~runif(3)
str(f)
#> Class 'formula'  language ~runif(3)
#>   ..- attr(*, ".Environment")=<environment: R_GlobalEnv>

An early version of tidy evaluation used formulas instead of quosures, as an attractive feature of ~ is that it provides quoting with a single keystroke. Unfortunately, however, there is no clean way to make ~ a quasiquoting function.
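You can see the problem directly: inside a formula, !! is parsed as double negation rather than unquoting, so the captured expression still contains the literal !!q (a quick sketch):

q <- expr(x + y)
f <- ~!!q
f
#> ~!!q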

Quosures are, however, a subclass of formulas:

q4 <- new_quosure(expr(x + y + z))
class(q4)
#> [1] "quosure" "formula"

Under the hood, this means that a quosure is a call to ~:

is_call(q4)
#> [1] TRUE

q4[[1]]
#> `~`
q4[[2]]
#> x + y + z

With an attribute that stores the environment:

attr(q4, ".environent")
#> NULL

If you need to extract the expression or environment, don’t rely on these implementation details. Instead use the quo_get_ helpers:

quo_get_env(q4)
#> <environment: R_GlobalEnv>
quo_get_expr(q4)
#> x + y + z

18.3.5 Nested quosures

It’s possible to use quasiquotation to embed a quosure in an expression. This is an advanced tool, and most of the time you don’t need to think about it because it just works, but I talk about it here so you can spot nested quosures in the wild and not be confused. Take this example, which inlines two quosures into an expression:

q2 <- new_quosure(expr(x), env(x = 1))
q3 <- new_quosure(expr(x), env(x = 10))

x <- expr(!!q2 + !!q3)

It evaluates correctly with eval_tidy():

eval_tidy(x)
#> [1] 11

This works even though, when you print the expression, you only see the xs (and here their formula heritage leaks through):

x
#> (~x) + ~x

You can get a better display with rlang::expr_print() (Section 17.4.7):

expr_print(x)
#> (^x) + (^x)

When you use expr_print() in the console, quosures are coloured according to their environment, making it easier to spot when symbols are bound to different variables.

18.3.6 Exercises

  1. Predict what evaluating each of the following quosures will return.

    q1 <- new_quosure(expr(x), env(x = 1))
    q1
    #> <quosure>
    #> expr: ^x
    #> env:  0x5f29da8
    
    q2 <- new_quosure(expr(x + !!q1), env(x = 10))
    q2
    #> <quosure>
    #> expr: ^x + (^x)
    #> env:  0x60cf7c8
    
    q3 <- new_quosure(expr(x + !!q2), env(x = 100))
    q3
    #> <quosure>
    #> expr: ^x + (^x + (^x))
    #> env:  0x59deed0
  2. Write an enenv() function that captures the environment associated with an argument.

18.4 Data masks

So far, you’ve learned about quosures and eval_tidy(). In this section, you’ll learn about the data mask, a data frame where the evaluated code will look first for variable definitions. The data mask is the key idea that powers base functions like with(), subset() and transform(), and is used throughout the tidyverse in packages like dplyr and ggplot2.

18.4.1 Basics

The data mask allows you to mingle variables from an environment and a data frame in a single expression. You supply the data mask as the second argument to eval_tidy():

q1 <- new_quosure(expr(x * y), env(x = 100))
df <- data.frame(y = 1:10)

eval_tidy(q1, df)
#>  [1]  100  200  300  400  500  600  700  800  900 1000

This code is a little hard to follow because there’s so much syntax as we’re creating every object from scratch. It’s easier to see what’s going on if we make a little wrapper. I call this with2() because it’s equivalent to base::with().

with2 <- function(data, expr) {
  expr <- enquo(expr)
  eval_tidy(expr, data)
}

We can now rewrite the code above as below:

x <- 100
with2(df, x * y)
#>  [1]  100  200  300  400  500  600  700  800  900 1000

base::eval() has similar functionality, although it doesn’t call it a data mask. Instead you can supply a data frame to the envir argument and an environment to the enclos argument. That gives the following implementation of with():

with3 <- function(data, expr) {
  expr <- substitute(expr)
  eval(expr, data, caller_env())
}
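It behaves the same way as with2():

with3(df, x * y)
#>  [1]  100  200  300  400  500  600  700  800  900 1000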

18.4.2 Pronouns

The data mask introduces ambiguity. For example, in the following code you can’t know whether x will come from the data mask or the environment, unless you know what variables are found in df.

with2(df, x)

That makes code harder to reason about (because you need to know more context), and can introduce bugs. To resolve that issue, the data mask provides two pronouns: .data and .env.

  • .data$x always refers to x in the data mask, or dies trying.
  • .env$x always refers to x in the environment, or dies trying.

x <- 1
df <- data.frame(x = 2)

with2(df, .data$x)
#> [1] 2
with2(df, .env$x)
#> [1] 1

You can also subset using [[, as shown below. Otherwise the pronouns are special objects and you shouldn’t expect them to behave like data frames or environments. In particular, they throw an error if the object isn’t found:

with2(df, .data$y)
#> Error: Column `y` not found in `.data`
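Subsetting with [[ works the same way:

with2(df, .data[["x"]])
#> [1] 2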

Pronouns are particularly important when using tidy evaluation, and we’ll come back to them in Section 18.5.2.

18.4.3 Application: subset()

We’ll explore tidy evaluation in the context of base::subset(), because it’s a simple yet powerful function that encapsulates one of the central ideas that makes R so elegant for data analysis. If you haven’t used it before, subset(), like dplyr::filter(), provides a convenient way of selecting rows of a data frame. You give it some data, along with an expression that is evaluated in the context of that data. This considerably reduces the number of times you need to type the name of the data frame:

sample_df <- data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))

# Shorthand for sample_df[sample_df$a >= 4, ]
subset(sample_df, a >= 4)
#>   a b c
#> 4 4 2 4
#> 5 5 1 1

# Shorthand for sample_df[sample_df$b == sample_df$c, ]
subset(sample_df, b == c)
#>   a b c
#> 1 1 5 5
#> 5 5 1 1

The core of our version of subset(), subset2(), is quite simple. It takes two arguments: a data frame, data, and an expression, rows. We evaluate rows using data as a data mask, then use the results to subset the data frame with [. I’ve included a very simple check to ensure the result is a logical vector; real code would do more to create an informative error.

subset2 <- function(data, rows) {
  rows <- enquo(rows)
  
  rows_val <- eval_tidy(rows, data)
  stopifnot(is.logical(rows_val))
  
  data[rows_val, , drop = FALSE]
}

subset2(sample_df, b == c)
#>   a b c
#> 1 1 5 5
#> 5 5 1 1

18.4.4 Application: transform()

A more complicated situation is base::transform(), which allows you to add new variables to a data frame, evaluating their expressions in the context of the existing variables:

df <- data.frame(x = c(2, 3, 1), y = runif(3))
transform(df, x = -x, y2 = 2 * y)
#>    x      y    y2
#> 1 -2 0.0808 0.162
#> 2 -3 0.8343 1.669
#> 3 -1 0.6008 1.202

Implementing transform2() is again quite straightforward. We capture the unevaluated ... with enquos(...), and then evaluate each expression using a for loop. Real code would need to do more error checking, ensuring that each input is named and evaluates to a vector the same length as .data.

transform2 <- function(.data, ...) {
  dots <- enquos(...)
  
  for (i in seq_along(dots)) {
    name <- names(dots)[[i]]
    dot <- dots[[i]]
    
    .data[[name]] <- eval_tidy(dot, data = .data)
  }
  
  .data
}

transform2(df, x2 = x * 2, y = -y)
#>   x       y x2
#> 1 2 -0.0808  4
#> 2 3 -0.8343  6
#> 3 1 -0.6008  2

Note that I named the first argument .data. This avoids problems if the user tried to create a variable called data; this is the same reasoning that leads to map() having .x and .f arguments (Section 8.2.4).
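We can check that the dot does its job: since data is not a prefix of .data, R’s partial matching sends a user-supplied data argument into ..., where it simply becomes a new column (reusing df from above):

transform2(df, data = x * 2)
#>   x      y data
#> 1 2 0.0808    4
#> 2 3 0.8343    6
#> 3 1 0.6008    2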

18.4.5 Application: select()

Typically, the data mask will be a data frame. But it’s sometimes useful to provide a list filled with more exotic contents. This is basically how the select argument of base::subset() works. It allows you to refer to variables as if they were numbers:

df <- data.frame(a = 1, b = 2, c = 3, d = 4, e = 5)
subset(df, select = b:d)
#>   b c d
#> 1 2 3 4

The key idea is to create a named list where each component gives the position of the corresponding variable:

vars <- as.list(set_names(seq_along(df), names(df)))
str(vars)
#> List of 5
#>  $ a: int 1
#>  $ b: int 2
#>  $ c: int 3
#>  $ d: int 4
#>  $ e: int 5

Then it’s a straightforward application of enquos() and eval_tidy():

select2 <- function(data, ...) {
  dots <- enquos(...)
  
  vars <- as.list(set_names(seq_along(data), names(data)))
  cols <- unlist(map(dots, eval_tidy, data = vars))
  
  data[, cols, drop = FALSE]
}
select2(df, b:d)
#>   b c d
#> 1 2 3 4

dplyr::select() takes this idea and runs with it, providing a number of helpers that allow you to select variables based on their names (e.g. starts_with("x"), ends_with("_a")).

18.4.6 Exercises

  1. What’s the difference between using a for loop and a map function in transform2()? Consider transform2(df, x = x * 2, x = x * 2).

  2. Here’s an alternative implementation of subset2():

    subset3 <- function(data, rows) {
      rows <- enquo(rows)
      eval_tidy(expr(data[!!rows, , drop = FALSE]), data = data)
    }
    
    df <- data.frame(x = 1:3)
    subset3(df, x == 1)

    Compare and contrast subset3() to subset2(). What are its advantages and disadvantages?

  3. The following function implements the basics of dplyr::arrange().
    Annotate each line with a comment explaining what it does. Can you explain why !!.na.last is strictly correct, but omitting the !! is unlikely to cause problems?

    arrange2 <- function(.df, ..., .na.last = TRUE) {
      args <- enquos(...)
    
      order_call <- expr(order(!!!args, na.last = !!.na.last))
    
      ord <- eval_tidy(order_call, .df)
      stopifnot(length(ord) == nrow(.df))
    
      .df[ord, , drop = FALSE]
    }

18.5 Using tidy evaluation

While it’s useful to understand how eval_tidy() works, most of the time you won’t call it directly. Instead, you’ll usually use it indirectly by calling a function that uses eval_tidy(). Tidy evaluation is infectious: the root always involves a call to eval_tidy() but that may be several levels away.

In this section we’ll explore how tidy evaluation facilitates this division of responsibility, and you’ll learn how to create safe and useful wrapper functions.

18.5.1 Quoting and unquoting

Imagine we have written a function that bootstraps a data frame:

bootstrap <- function(df, n) {
  idx <- sample(nrow(df), n, replace = TRUE)
  df[idx, , drop = FALSE]
} 

And we want to create a new function that allows us to bootstrap and subset in a single step. Our naive approach doesn’t work:

bootset <- function(df, cond, n = nrow(df)) {
  df2 <- subset2(df, cond)
  bootstrap(df2, n)
}

df <- data.frame(x = c(1, 1, 1, 2, 2), y = 1:5)
bootset(df, x == 1)
#> Error: object 'x' not found

bootset() doesn’t quote any arguments, so cond is evaluated normally (not in a data mask), and we get an error when it tries to find a binding for x. To fix this problem we need to quote cond, and then unquote it when we pass it on to subset2():

bootset <- function(df, cond, n = nrow(df)) {
  cond <- enquo(cond)
  
  df2 <- subset2(df, !!cond)
  bootstrap(df2, n)
}

bootset(df, x == 1)
#>     x y
#> 3   1 3
#> 3.1 1 3
#> 3.2 1 3
#> 1   1 1
#> 1.1 1 1

This is a very common pattern; whenever you call a quoting function with arguments from the user, you need to quote them yourself and then unquote.

18.5.2 Handling ambiguity

In the case above, we needed to think about tidy eval because of quasiquotation. We also need to think about tidy evaluation even when the wrapper doesn’t need to quote any arguments. Take this wrapper around subset2():

threshold_x <- function(df, val) {
  subset2(df, x >= val)
}

This function can silently return an incorrect result in two situations:

  • When x exists in the calling environment, but not in df:

    x <- 10
    no_x <- data.frame(y = 1:3)
    threshold_x(no_x, 2)
    #>   y
    #> 1 1
    #> 2 2
    #> 3 3
  • When val exists in df:

    has_val <- data.frame(x = 1:3, val = 9:11)
    threshold_x(has_val, 2)
    #> [1] x   val
    #> <0 rows> (or 0-length row.names)

These failure modes arise because tidy evaluation is ambiguous: each variable can be found in either the data mask or the environment. To make this function safe we need to remove the ambiguity using the .data and .env pronouns:

threshold_x <- function(df, val) {
  subset2(df, .data$x >= .env$val)
}

x <- 10
threshold_x(no_x, 2)
#> Error: Column `x` not found in `.data`
threshold_x(has_val, 2)
#>   x val
#> 2 2  10
#> 3 3  11

Generally, whenever you use the .env pronoun, you can use unquoting instead:

threshold_x <- function(df, val) {
  subset2(df, .data$x >= !!val)
}

There are subtle differences in when val is evaluated. If you unquote, val is evaluated early, by enquo(); if you use a pronoun, val is evaluated lazily, by eval_tidy(). These differences are usually unimportant, so pick the form that looks most natural.

18.5.3 Quoting and ambiguity

To finish our discussion let’s consider the case where we have both quoting and potential ambiguity. I’ll generalise threshold_x() slightly so that the user can pick the variable used for thresholding.

threshold_var <- function(df, var, val) {
  var <- as_string(ensym(var))
  subset2(df, .data[[var]] >= !!val)
}

df <- data.frame(x = 1:10)
threshold_var(df, x, 8)
#>     x
#> 8   8
#> 9   9
#> 10 10

Note that it is not always the responsibility of the function author to avoid ambiguity. Imagine we generalise further to allow thresholding based on any expression:

threshold_expr <- function(df, expr, val) {
  expr <- enquo(expr)
  subset2(df, !!expr >= !!val)
}
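For example, the user can threshold on an arbitrary expression computed from the data:

threshold_expr(df, x * 2, 14)
#>     x
#> 7   7
#> 8   8
#> 9   9
#> 10 10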

It’s not possible to evaluate expr only in the data mask, because the data mask doesn’t include any functions like + or ==. Here, it’s the user’s responsibility to avoid ambiguity. As a general rule of thumb: as a function author it’s your responsibility to avoid ambiguity with any expressions that you create; it’s the user’s responsibility to avoid ambiguity in expressions that they create.

18.5.4 Exercises

  1. I’ve included an alternative implementation of threshold_var() below. What makes it different from the approach I used above? What makes it harder?

    threshold_var <- function(df, var, val) {
      var <- ensym(var)
      subset2(df, `$`(.data, !!var) >= !!val)
    }

18.6 Base evaluation

Now that you understand tidy evaluation, it’s time to come back to the alternative approaches taken by base R, a family of approaches collectively known as non-standard evaluation (NSE). Here I’ll explore the two most common techniques in base R:

  • substitute() and evaluation in the caller environment, as used by subset(). I’ll use subset() to show why this technique is not programming friendly, as warned about in its documentation.

  • match.call(), call manipulation, and evaluation in the caller environment, as used by write.csv() and lm(). I’ll use this technique to motivate how quasiquotation and (regular) evaluation can help you write wrappers around NSE functions.

18.6.1 substitute()

The most common form of NSE in base R is substitute() + eval(). The following code shows how you might write the core of subset() in this style, using substitute() and eval() rather than enquo() and eval_tidy(). I repeat the code introduced in Section 18.4.3 so you can compare easily. The main difference is the evaluation environment: in subset_base() the expression is evaluated in the caller environment; in subset_tidy(), it’s evaluated in the environment where it was defined.

subset_base <- function(data, rows) {
  rows <- substitute(rows)

  rows_val <- eval(rows, data, caller_env())
  stopifnot(is.logical(rows_val))
  
  data[rows_val, , drop = FALSE]
}

subset_tidy <- function(data, rows) {
  rows <- enquo(rows)
  
  rows_val <- eval_tidy(rows, data)
  stopifnot(is.logical(rows_val))
  
  data[rows_val, , drop = FALSE]
}

18.6.1.1 Programming with subset()

The documentation of subset() includes the following warning:

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.

There are three main problems:

  • base::subset() always evaluates rows in the calling environment, but if ... has been used, then the expression might need to be evaluated elsewhere:

    f1 <- function(df, ...) {
      xval <- 3
      subset_base(df, ...)
    }
    
    my_df <- data.frame(x = 1:3, y = 3:1)
    xval <- 1
    f1(my_df, x == xval)
    #>   x y
    #> 3 3 1

    This may seem like an esoteric concern, but it means that subset_base() cannot reliably work with functionals like map() or lapply():

    local({
      y <- 2
      dfs <- list(data.frame(x = 1:3), data.frame(x = 4:6))
      lapply(dfs, subset_base, x == y)
    })
    #> [[1]]
    #> [1] x
    #> <0 rows> (or 0-length row.names)
    #> 
    #> [[2]]
    #> [1] x
    #> <0 rows> (or 0-length row.names)
  • Calling subset() from another function requires some care: you have to use substitute() to capture the complete call to subset(), and then evaluate it. Because substitute() doesn’t use a syntactic marker for unquoting, it’s a little hard to predict exactly what substitute() does. Here I print the generated call to make it a little easier to follow.

    f2 <- function(df1, expr) {
      call <- substitute(subset_base(df1, expr))
      expr_print(call)
      eval(call, caller_env())
    }
    
    my_df <- data.frame(x = 1:3, y = 3:1)
    f2(my_df, x == 1)
    #> subset_base(my_df, x == 1)
    #>   x y
    #> 1 1 3
  • eval() doesn’t provide any pronouns, so there’s no way to require part of the expression to come from the data. As far as I can tell, there’s no way to make the following function safe except by manually checking for the presence of a z variable in df.

    f3 <- function(df) {
      call <- substitute(subset_base(df, z > 0))
      expr_print(call)
      eval(call, caller_env())
    }
    
    z <- -1
    f3(my_df)
    #> subset_base(my_df, z > 0)
    #> [1] x y
    #> <0 rows> (or 0-length row.names)

18.6.1.2 What about [?

Given that tidy evaluation is quite complex, why not simply use [ as ?subset recommends? Primarily, it seems unappealing to have functions that can only be used interactively, and never inside another function. Even the simple subset() function provides two useful features compared to [:

  • It sets drop = FALSE by default, so it’s guaranteed to return a data frame.

  • It drops rows where the condition evaluates to NA.

That means subset(df, x == y) is not equivalent to df[x == y,] as you might expect. Instead, it is equivalent to df[x == y & !is.na(x == y), , drop = FALSE]: that’s a lot more typing! Real-life alternatives to subset(), like dplyr::filter(), do even more. For example, dplyr::filter() can translate R expressions to SQL so that they can be executed in a database. This makes programming with filter() relatively more important (because it does more behind the scenes that you want to take advantage of).
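Both differences are easy to demonstrate (a quick sketch):

df <- data.frame(x = c(1, NA, 3))

df[df$x > 1, ]
#> [1] NA  3
subset(df, x > 1)
#>   x
#> 3 3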

It would be possible to pair subset_base() with a programmable version, say subset_prog() below. But I think this is unappealing because you now need twice as many functions.

subset_prog <- function(data, rows, env = caller_env()) {
  rows_val <- eval(rows, data, env)
  stopifnot(is.logical(rows_val))
  data[rows_val, , drop = FALSE]
}

18.6.2 match.call()

Another common form of NSE is to capture the complete call with match.call(), modify it, and then evaluate it. match.call() doesn’t have an equivalent in tidy evaluation: rather than capturing a single argument like substitute(), it captures the complete call:

g <- function(x, y, z) {
  match.call()
}
g(1, 2, z = 3)
#> g(x = 1, y = 2, z = 3)

One prominent user of match.call() is write.csv(), which basically works by transforming the call into a call to write.table() with the appropriate arguments set. The following code shows the heart of write.csv():

write.csv <- function(...) {
  call <- match.call(write.table, expand.dots = TRUE)
  
  call[[1]] <- quote(write.table)
  call$sep <- ","
  call$dec <- "."
  
  eval(call, parent.frame())
}

I don’t think this technique is a good idea because you can achieve the same result without NSE:

write.csv <- function(...) {
  write.table(..., sep = ",", dec = ".")
}

Nevertheless, it’s important to understand this technique because it’s commonly used in modelling functions. Modelling functions also prominently print the captured call, which poses some special challenges, as you’ll see next.

18.6.2.1 Wrapping modelling functions

To begin, consider the simplest possible wrapper around lm():

lm2 <- function(formula, data) {
  lm(formula, data)
}

This wrapper works, but is suboptimal because lm() captures its call, and displays it when printing.

lm2(mpg ~ disp, mtcars)
#> 
#> Call:
#> lm(formula = formula, data = data)
#> 
#> Coefficients:
#> (Intercept)         disp  
#>     29.5999      -0.0412

This is important because this call is the chief way that you see the model specification when printing the model. To overcome this problem, we need to capture the arguments, create the call to lm() using unquoting, then evaluate that call. To make it easier to see what’s going on, I’ll also print the expression we generate. This will become more useful as the calls get more complicated.

lm3 <- function(formula, data, env = caller_env()) {
  formula <- enexpr(formula)
  data <- enexpr(data)
  
  lm_call <- expr(lm(!!formula, data = !!data))
  expr_print(lm_call)
  eval(lm_call, env)
}

lm3(mpg ~ disp, mtcars)
#> lm(mpg ~ disp, data = mtcars)
#> 
#> Call:
#> lm(formula = mpg ~ disp, data = mtcars)
#> 
#> Coefficients:
#> (Intercept)         disp  
#>     29.5999      -0.0412

There are three pieces that you’ll use whenever wrapping a base NSE function in this way:

  • You capture the unevaluated arguments using enexpr(), and capture the caller environment using caller_env(). You have to accept that the function will not work correctly if the arguments are not defined in the caller environment.

  • You generate a new expression using expr() and unquoting.

  • You evaluate that expression in the caller environment. This is not guaranteed to be correct, but providing the env argument at least provides a hook that wrapper functions can use.

Note that the use of enexpr() has a nice side-effect: we can use unquoting to generate formulas dynamically:

resp <- expr(mpg)
disp1 <- expr(vs)
disp2 <- expr(wt)
lm3(!!resp ~ !!disp1 + !!disp2, mtcars)
#> lm(mpg ~ vs + wt, data = mtcars)
#> 
#> Call:
#> lm(formula = mpg ~ vs + wt, data = mtcars)
#> 
#> Coefficients:
#> (Intercept)           vs           wt  
#>       33.00         3.15        -4.44

18.6.2.2 The evaluation environment

What if you want to mingle objects supplied by the user with objects that you create in the function? For example, imagine you want to make an auto-bootstrapping version of lm(). You might write it like this:

boot_lm0 <- function(formula, data, env = caller_env()) {
  formula <- enexpr(formula)
  boot_data <- bootstrap(data, n = nrow(data))
  
  lm_call <- expr(lm(!!formula, data = boot_data))
  expr_print(lm_call)
  eval(lm_call, env)
}

df <- data.frame(x = 1:10, y = 5 + 3 * (1:10) + rnorm(10))
boot_lm0(y ~ x, data = df)
#> lm(y ~ x, data = boot_data)
#> Error in is.data.frame(data):
#>   object 'boot_data' not found

Why doesn’t this code work? We’re evaluating lm_call in the caller environment, but boot_data exists in the execution environment. We could instead evaluate in the execution environment of boot_lm0(), but there’s no guarantee that formula could be evaluated in that environment.

There are two basic ways to overcome this challenge:

  1. Unquote the data frame into the call. This means that no lookup has to occur, but it has all the problems of inlining expressions. For modelling functions this means that the captured call is suboptimal:

    boot_lm1 <- function(formula, data, env = caller_env()) {
      formula <- enexpr(formula)
      boot_data <- bootstrap(data, n = nrow(data))
    
      lm_call <- expr(lm(!!formula, data = !!boot_data))
      expr_print(lm_call)
      eval(lm_call, env)
    }
    boot_lm1(y ~ x, data = df)$call
    #> lm(y ~ x, data = <data.frame>)
    #> lm(formula = y ~ x, data = list(x = c(7L, 1L, 8L, 8L, 10L, 10L, 
    #> 4L, 5L, 4L, 2L), y = c(26.6480432853289, 7.53337955186315, 29.0758039597908, 
    #> 29.0758039597908, 34.2464592942242, 34.2464592942242, 18.9694248434104, 
    #> 20.4631747068887, 18.9694248434104, 10.1428095974415)))
  2. Alternatively you can create a new environment that inherits from the caller, and you can bind variables that you’ve created inside the function to that environment.

    boot_lm2 <- function(formula, data, env = caller_env()) {
      formula <- enexpr(formula)
      boot_data <- bootstrap(data, n = nrow(data))
    
      lm_env <- env(env, boot_data = boot_data)
      lm_call <- expr(lm(!!formula, data = boot_data))
      expr_print(lm_call)
      eval(lm_call, lm_env)
    }
    boot_lm2(y ~ x, data = df)
    #> lm(y ~ x, data = boot_data)
    #> 
    #> Call:
    #> lm(formula = y ~ x, data = boot_data)
    #> 
    #> Coefficients:
    #> (Intercept)            x  
    #>        4.14         3.12

    This is more work, but gives the cleanest specification.

18.6.3 Exercises

  1. Why does this function fail?

    lm3a <- function(formula, data) {
      formula <- enexpr(formula)
    
      lm_call <- expr(lm(!!formula, data = data))
      eval(lm_call, caller_env())
    }
    lm3a(mpg ~ disp, mtcars)$call
  2. When model building, typically the response and data are relatively constant while you rapidly experiment with different predictors. Write a small wrapper that allows you to reduce duplication in this situation.

    pred_mpg <- function(resp, ...) {
    
    }
    pred_mpg(~ disp)
    pred_mpg(~ I(1 / disp))
    pred_mpg(~ disp * cyl)
  3. Another way to write boot_lm() would be to include the bootstrapping expression (data[sample(nrow(data), replace = TRUE), , drop = FALSE]) in the data argument. Implement that approach. What are the advantages? What are the disadvantages?

  4. To make these functions somewhat more robust, instead of always using the caller_env() we could capture a quosure, and then use its environment. However, if there are multiple arguments, they might be associated with different environments. Write a function that takes a list of quosures, and returns the common environment, if they have one, or otherwise throws an error.

  5. Write a function that takes a data frame and a list of formulas, fitting a linear model with each formula, generating a useful model call.

  6. Create a formula generation function that allows you to optionally supply a transformation function (e.g. log()) to the response or the predictors.


  71. All objects yield themselves when evaluated; i.e. eval_bare(x) yields x, except when x is a symbol or expression.

  72. That’s a bit of a simplification, because technically a formula combines an expression and an environment. However, formulas are tightly coupled to modelling, so a new data structure makes sense.