21 Evaluation

21.1 Introduction

The user-facing opposite of quotation is unquotation: it gives the user the ability to selectively evaluate parts of an otherwise quoted argument. The developer-facing complement of quotation is evaluation: this gives the developer of the function the ability to evaluate quoted expressions in special ways to create domain specific languages for data analysis like ggplot2 and dplyr.

Tidy evaluation is one of the principles underlying the tidyverse. We teach it here, even though it is not part of base R, because it packages the evaluation techniques of this chapter into a small set of coherent tools.

Tidy evaluation is the combination of four big ideas:

  • Quasiquotation to give the user control of quoting.

  • Quosures to capture argument expressions along with their evaluation environment.

  • Data masks to enable expressions to mingle variables from an environment and from a data source.

  • Pronouns to eliminate ambiguity between environment and data source when needed.

library(rlang)

Prerequisites

Environments play a big role in evaluation, so make sure you’re familiar with Environments before continuing.

21.2 Evaluation basics

In the previous chapter, we briefly mentioned eval(). Here, rather than starting with eval(), we’re going to start with rlang::eval_bare(), which is the purest evocation of the idea of evaluation. Its first argument, expr, is the expression to evaluate; this will usually be either a symbol or a call:

x <- 10
eval_bare(expr(x))
#> [1] 10

y <- 2
eval_bare(expr(x + y))
#> [1] 12

(As well as being self-quoting, constants are also self-evaluating.)

The second argument, env, gives the environment in which the expression should be evaluated, i.e. where should the values of x, y, and + be looked for? By default, this is the current environment, i.e. the calling environment of eval_bare(), but you can override it if you want:

eval_bare(expr(x + y), env(x = 1000))
#> [1] 1002

Because R looks up functions in the same way as variables, we can also override the meaning of functions. This is a key technique for generating DSLs, as discussed in the next chapter.

eval_bare(expr(x + y), env(`+` = function(x, y) paste0(x, " + ", y)))
#> [1] "10 + 2"

If passed an object other than a symbol or expression, the evaluation functions simply return the input as is (because it’s already evaluated). This can lead to confusing results if you forget to quote the input: eval_bare() doesn’t quote expr, so it is passed by value:

eval_bare(x + y)
#> [1] 12
eval_bare(x + y, env(x = 1000))
#> [1] 12

Now that you’ve seen the basics, let’s explore some applications. We’ll focus primarily on base R functions that you might have used before; now you can learn how they work. To focus on the underlying principles, we’ll extract their essence and rewrite them using functions from rlang. We’ll then circle back and talk about the base R functions most important for evaluation.

21.2.1 Application: local()

Sometimes you want to perform a chunk of computation that creates a bunch of intermediate variables. These variables have no long-term use and could be quite large, so you’d rather not keep them around. One approach is to clean up after yourself using rm(); another is to wrap the code in a function and call it once.
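
The function approach looks like this (a quick sketch):

foo <- (function() {
  x <- 10
  y <- 200
  x + y
})()
foo
#> [1] 210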

A more elegant approach is to use local():

# Clean up variables created earlier
rm(x, y)

foo <- local({
  x <- 10
  y <- 200
  x + y
})

foo
#> [1] 210
x
#> Error in eval(expr, envir, enclos): object 'x' not found
y
#> Error in eval(expr, envir, enclos): object 'y' not found

The essence of local() is quite simple: we capture the expression, and create a new environment in which to evaluate it. This environment inherits from the caller environment, so it can access the current lexical scope.

local2 <- function(expr, env = child_env(caller_env())) {
  eval_bare(enexpr(expr), env)
}

foo <- local2({
  x <- 10
  y <- 200
  x + y
})

env_has(nms = c("x", "y"))
#> [1] FALSE FALSE

It’s a bit harder to understand how base::local() works, as it uses eval() and substitute() together in rather complicated ways. Figuring out exactly what’s going on is good practice if you really want to understand the subtleties of substitute() and the base eval() functions.
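
For reference, the body of base::local() really is just one line, combining those two functions:

# The definition of base::local()
local <- function(expr, envir = new.env())
  eval.parent(substitute(eval(quote(expr), envir)))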

21.2.2 Application: source()

We can create a simple version of source() by combining parse_exprs() and eval_bare(). We read the file from disk, use parse_exprs() to parse the string into a list of expressions, and then use eval_bare() to evaluate each one in turn. This version evaluates the code in the caller environment, and invisibly returns the result of the last expression in the file, just like source().

source2 <- function(file, env = caller_env()) {
  lines <- readLines(file, warn = FALSE)
  code <- paste(lines, collapse = "\n")
  exprs <- parse_exprs(code)

  res <- NULL
  for (i in seq_along(exprs)) {
    res <- eval_bare(exprs[[i]], env)
  }
  
  invisible(res)
}

The real source() is considerably more complicated because it can echo input and output, and also has many additional settings to control behaviour.

21.2.3 Gotcha: function()

There’s one gotcha to be aware of: if you use evaluation to create a function, the printed source can be misleading:

x <- 10
y <- 20
f <- eval_bare(expr(function(x, y) !!x + !!y))
f
#> function(x, y) !!x + !!y

It looks like the body still contains unquoting operators, but the function works:

f()
#> [1] 30

What’s going on? The answer is srcrefs: when R prints a function it uses the srcref attribute, a snapshot of the source code as originally typed, rather than deparsing the AST, so you see the code before the unquoting took place.

There are two ways around this: remove the srcref by setting the attribute to NULL, or construct the function with rlang::new_function().
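
Dropping the srcref forces R to deparse the actual body (a quick sketch):

attr(f, "srcref") <- NULL
f
#> function (x, y) 
#> 10 + 20

new_function() avoids the problem entirely, because it builds the function directly from its formals and body: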

new_function(
  exprs(x = , y = ),
  expr({!!x + !!y})
)
#> function (x, y) 
#> {
#>     10 + 20
#> }

21.2.4 Base R

The base function equivalent to eval_bare() is the two-argument form of eval(): eval(expr, envir):

eval(expr(x + y), env(x = 1000, y = 1))
#> [1] 1001

The final argument, enclos, provides support for data masks, which you’ll learn about in the section on tidy evaluation below.

eval() is paired with two helper functions:

  • evalq(x, env) quotes its first argument, and is hence a shortcut for eval(quote(x), env).

  • eval.parent(expr, n) is a shortcut for eval(expr, envir = parent.frame(n)).
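
For example, we can check evalq()’s quoting behaviour directly (a quick sketch):

e <- env(x = 100)
evalq(x, e)
#> [1] 100
eval(quote(x), e)
#> [1] 100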

base::eval() has special behaviour for expression objects, evaluating each component in turn. This makes for a very compact implementation of source2() because base::parse() also returns an expression object:

source3 <- function(file, env = parent.frame()) {
  lines <- parse(file)
  res <- eval(lines, envir = env)
  invisible(res)
}

While source3() is considerably more concise than source2(), this one use case is the strongest argument for expression objects, and overall we don’t believe this one benefit outweighs the cost of introducing a new data structure.

21.2.5 Exercises

  1. Carefully read the documentation for source(). What environment does it use by default? What happens if you supply local = TRUE? How do you supply a custom environment?

  2. Predict the results of the following lines of code:

    eval(quote(eval(quote(eval(quote(2 + 2))))))
    eval(eval(quote(eval(quote(eval(quote(2 + 2)))))))
    quote(eval(quote(eval(quote(eval(quote(2 + 2)))))))
  3. Write an equivalent to get() using sym() and eval_bare(). Write an equivalent to assign() using sym(), expr(), and eval_bare(). (Don’t worry about the multiple ways of choosing an environment that get() and assign() support; assume that the user supplies it explicitly.)

    # name is a string
    get2 <- function(name, env) {}
    assign2 <- function(name, value, env) {}
  4. Modify source2() so it returns the result of every expression, not just the last one. Can you eliminate the for loop?

  5. The code generated by source2() lacks source references. Read the source code for sys.source() and the help for srcfilecopy(), then modify source2() to preserve source references. You can test your code by sourcing a function that contains a comment. If successful, when you look at the function, you’ll see the comment and not just the source code.

  6. The third argument in subset() allows you to select variables. It treats variable names as if they were positions. This allows you to do things like subset(mtcars, , -cyl) to drop the cylinder variable, or subset(mtcars, , disp:drat) to select all the variables between disp and drat. How does this work? I’ve made this easier to understand by extracting it out into its own function that uses tidy evaluation.

    select <- function(df, vars) {
      vars <- enexpr(vars)
      var_pos <- set_names(as.list(seq_along(df)), names(df))
    
      cols <- eval_tidy(vars, var_pos)
      df[, cols, drop = FALSE]
    }
    select(mtcars, -cyl)
  7. We can make base::local() slightly easier to understand by spreading out over multiple lines:

    local3 <- function(expr, envir = new.env()) {
      call <- substitute(eval(quote(expr), envir))
      eval(call, envir = parent.frame())
    }

    Explain how local() works in words. (Hint: you might want to print(call) to help understand what substitute() is doing, and read the documentation to remind yourself what environment new.env() will inherit from.)

21.3 Quosures

The simplest form of evaluation combines an expression and an environment. This coupling is sufficiently important that we need a data structure that captures both pieces. We call this data structure a quosure, a portmanteau of quoting and closure.

You almost always want to capture a quosure rather than an expression because it gives you uniformly more information. Once we’ve discussed its primary use case of tidy evaluation, we’ll come back to the few cases where you should prefer expressions.

21.3.1 Motivation

Quosures are needed when the expression to be evaluated mixes variables from a data frame with variables from the environment. For example, the following mutate() call creates a new variable called log, with a calculation that involves a variable from the data frame, z, and a variable from the environment, x:

df <- data.frame(z = runif(5))
x <- 10
dplyr::mutate(df, log = log(z, base = x))
#>        z     log
#> 1 0.0808 -1.0929
#> 2 0.8343 -0.0787
#> 3 0.6008 -0.2213
#> 4 0.1572 -0.8035
#> 5 0.0074 -2.1308

Also remember that log() itself is found in the global environment, but there’s no confusion about where functions come from because (without gymnastics) you can’t put a function in a data frame.

Worrying about the evaluation environment of an argument becomes important when you write your own quoting functions. Take this simple example:

compute_mean <- function(df, x) {
  x <- enexpr(x)
  dplyr::summarise(df, mean = mean(!!x))
}

It works correctly for simple inputs:

compute_mean(df, z)
#>    mean
#> 1 0.336

It contains a subtle bug, which we can illustrate with this slightly forced example:

x <- 10
compute_mean(df, log(z, base = x))
#> Error in summarise_impl(.data, dots): Evaluation error: non-numeric argument to mathematical function.

We get this error because we have lost the evaluation environment associated with log(z, base = x), so it is evaluated inside compute_mean(), where x is bound to the captured expression rather than a number. This type of bug is pernicious because it will happen rarely, and the error message will be inscrutable.

We can avoid the bug by capturing the expression along with its evaluation environment. That’s the job of enquo(), which otherwise works identically to enexpr():

compute_mean <- function(df, x) {
  x <- enquo(x)
  dplyr::summarise(df, mean = mean(!!x))
}

compute_mean(mtcars, log(mpg, base = x))
#>   mean
#> 1 1.28

21.3.2 Creating and manipulating

To create a quosure you will typically use one of the quosure equivalents of the expr() functions that you learned about in the previous chapter:

  • Use quo() and quos() for experimenting interactively and for capturing fixed expressions (possibly with unquoting) inside a function.

  • Use enquo() and enquos() to capture user-supplied arguments to a function.

Alternatively, you can use new_quosure() to create a quosure from its components: an expression and an environment.

x <- new_quosure(expr(x + y), env(x = 1, y = 10))
x
#> <quosure>
#>   expr: ^x + y
#>   env:  0x4d86c90

Note how quosures are printed: each quosure starts with ^, and in a console that supports it, each quosure gets a different colour to remind you that it has its own environment attached. This is easiest to see when you unquote one quosure inside another:

q2 <- quo(x + !!x)
q2
#> <quosure>
#>   expr: ^x + (^x + y)
#>   env:  global

(Note that because quosures capture the complete environment, you need to be a little careful if your function returns quosures: any large temporary objects will not be garbage collected until the quosure itself has been. See XXXXXXX for more details.)

You can evaluate a quosure with eval_tidy(), which we’ll study in depth in the next section:

eval_tidy(x)
#> [1] 11

You can extract its components with the quo_get_ helpers:

quo_get_env(x)
#> <environment: 0x4d86c90>
quo_get_expr(x)
#> x + y

And if you need to turn a quosure into text for output to the console you can use quo_name(), quo_label(), or quo_text(). quo_name() and quo_label() are guaranteed to be short; quo_text() may span multiple lines.

# https://github.com/tidyverse/rlang/issues/367
y <- quo(long_function_name(
  argument_1 = long_argument_value,
  argument_2 = long_argument_value,
  argument_3 = long_argument_value,
  argument_4 = long_argument_value
))
quo_name(y)   # e.g. for data frames
#> [1] "long_function_name(...)"
quo_label(y)  # e.g. for error messages
#> [1] "`long_function_name(...)`"
quo_text(y)   # for longer messages
#> [1] "long_function_name(argument_1 = long_argument_value, argument_2 = long_argument_value, \n    argument_3 = long_argument_value, argument_4 = long_argument_value)"

21.3.3 Implementation

Quosures are possible because internally R represents function arguments with a special type of object called a promise. A promise captures the expression needed to compute the value and the environment in which to compute it. You’re not normally aware of promises because the first time you access a promise its code is evaluated in its environment, yielding a value. This is what powers lazy evaluation.
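
Lazy evaluation is easy to observe: if an argument is never accessed, its promise is never forced (a minimal sketch):

f <- function(x) 10
f(stop("this is never evaluated"))
#> [1] 10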

However, you cannot manipulate promises with R code: they’re sort of quantum; if you attempt to touch one with R code, it is immediately evaluated and the promise nature goes away. To work around this, rlang manipulates the promise in C, reifying it into an R object that you can work with.

There is one big difference between promises and quosures. An argument is evaluated implicitly the first time you access it; every subsequent access returns the same value. A quosure must be evaluated explicitly, and each evaluation is independent of the previous ones.

# The argument x_arg is evaluated once, then reused
foo <- function(x_arg) {
  list(x1 = x_arg, x2 = x_arg)
}
foo(runif(3))
#> $x1
#> [1] 0.466 0.498 0.290
#> 
#> $x2
#> [1] 0.466 0.498 0.290

# The quosure x_quo is evaluated afresh each time
x_quo <- quo(runif(3))
eval_tidy(x_quo)
#> [1] 0.733 0.773 0.875
eval_tidy(x_quo)
#> [1] 0.1749 0.0342 0.3204

Quosures are inspired by the formula operator, ~, which also captures both the expression and its environment, and is used extremely heavily in R’s modelling functions:

f <- ~runif(3)
f
#> ~runif(3)

str(f)
#> Class 'formula'  language ~runif(3)
#>   ..- attr(*, ".Environment")=<environment: R_GlobalEnv>

Initial versions of rlang used formulas as quosures: an attractive feature of ~ is that it provides quoting with a single keystroke. Unfortunately, however, there is no way to add quasiquotation to ~, so we decided to use a new function, quo(), instead.
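
You can see the difference for yourself: ~ keeps !!x as a literal call, while quo() performs the unquoting (a quick sketch):

x <- 10
f <- ~ !!x
f[[2]]                   # the literal call, untouched
#> !!x
quo_get_expr(quo(!!x))   # quo() unquotes, capturing the value
#> [1] 10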

21.3.4 Multiple environments

Quosures are particularly important when used with ... because each argument can potentially have a different environment associated with it:

f <- function(...) {
  x <- 1
  g(..., x1 = x)
}
g <- function(...) {
  x <- 2
  h(..., x2 = x)
}
h <- function(...) {
  enquos(...)
}

x <- 0
qs <- f(x0 = x)
qs
#> $x0
#> <quosure>
#>   expr: ^x
#>   env:  global
#> 
#> $x1
#> <quosure>
#>   expr: ^x
#>   env:  0x48f86f0
#> 
#> $x2
#> <quosure>
#>   expr: ^x
#>   env:  0x48f8958

purrr::map_dbl(qs, eval_tidy)
#> x0 x1 x2 
#>  0  1  2

21.3.5 Embedded quosures

Quosures capture their environment even when unquoted inside other quosures:

make_x <- function(x) quo(x)
thirty <- quo(!!make_x(0) + !!make_x(10) + !!make_x(20))
thirty
#> <quosure>
#>   expr: ^(^x) + (^x) + (^x)
#>   env:  global

If you’re viewing this in the console, you’ll see that each quosure is coloured: the point of the colours is to emphasise that the quosures have different environments attached, even though the expressions are the same.

eval_tidy(thirty)
#> [1] 30

This was a lot of work to get right, but it means that quosures just work, even when embedded inside other quosures.

Note that this code doesn’t make sense at all if we use expressions instead of quosures: the environment is never captured, so every x is looked up in the same place:

make_x <- function(x) expr(x)
thirty <- expr(!!make_x(0) + !!make_x(10) + !!make_x(20))

thirty
#> x + x + x
eval_tidy(thirty)
#> [1] 0

21.3.6 When not to use quosures

  • In code generation.

  • When the expression will be evaluated entirely in the data context.

  • To call functions that don’t use tidy eval; there’s a fuller example in the case study at the end of this chapter.

Sometimes you can avoid using a quosure altogether by inlining (unquoting) values:

base <- 2
quo(log(x, base = base))
#> <quosure>
#>   expr: ^log(x, base = base)
#>   env:  global
expr(log(x, base = !!base))
#> log(x, base = 2)

21.3.7 Exercises

  1. Predict what evaluating each of the following quosures will return.

    q1 <- new_quosure(expr(x), env(x = 1))
    q2 <- new_quosure(expr(x + !!q1), env(x = 10))
    q3 <- new_quosure(expr(x + !!q2), env(x = 100))
  2. Run this code in your head and predict what it will print. Confirm or refute your prediction by running the code in R.

    f <- function(...) {
      x <- "f"
      g(f = x, ...)
    }
    g <- function(...) {
      x <- "g"
      h(g = x, ...)
    }
    h <- function(...) {
      enquos(...)
    }
    x <- "top"
    
    out <- f(top = x)
    out
    purrr::map_chr(out, eval_tidy)

21.4 Wrapping quoting functions

One downside of quoting functions is that when you call a lot of them, it is harder to wrap them in another function in order to reduce duplication. For example, if you see the following code:

(x - min(x)) / (max(x) - min(x))
(y - min(y)) / (max(y) - min(y))
((x + y) - min(x + y)) / (max(x + y) - min(x + y))

You can eliminate the repetition with a rescale01() function:

rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
rescale01(x)
rescale01(y)
rescale01(x + y)

Reducing the duplication here is a very good idea, because when you realise that your rescaling technique doesn’t handle missing values gracefully, you only have one place to fix it. It is easy to create rescale01() because min(), max(), and - are regular functions: they don’t quote any of their arguments.

It’s harder to reduce duplication when a function quotes one or more arguments. For example, if you notice this repeated code:

df %>% group_by(x1) %>% summarise(mean = mean(y1))
df %>% group_by(x2) %>% summarise(mean = mean(y2))
df %>% group_by(x3) %>% summarise(mean = mean(y3))

This naive function will not work:

grouped_mean <- function(df, x, y) {
  df %>% group_by(x) %>% summarise(mean = mean(y))
}

Because regardless of the input, grouped_mean() will always group by x and compute the mean of y. However, because group_by() and summarise() (like all quoting functions in the tidyverse) use quasiquotation, there’s a standard way to wrap these functions: you quote and then unquote:

grouped_mean <- function(df, x, y) {
  x <- enexpr(x)
  y <- enexpr(y)
  
  df %>% 
    group_by(!!x) %>% 
    summarise(mean = mean(!!y))
}

mtcars %>% grouped_mean(cyl, mpg)
#> # A tibble: 3 x 2
#>     cyl  mean
#>   <dbl> <dbl>
#> 1  4.00  26.7
#> 2  6.00  19.7
#> 3  8.00  15.1

(In the next chapter, we’ll learn why enexpr() is not quite general enough and how and why to use quo(), enquo(), and enquos() instead.)

This is a powerful pattern that allows you to reduce duplication in code that uses quasiquoting functions.

21.4.1 Tangling with dots

In our grouped_mean() example above, the user selects one grouping variable and one summary variable. What if we wanted to allow the user to select more than one? One option would be to use .... There are three ways we could use it:

  • Pass ... on to mean(). That would make it easy to set na.rm = TRUE. This is the easiest to implement.

  • Allow the user to select multiple grouping variables.

  • Allow the user to select multiple variables to summarise.

Implementing each one of these is relatively straightforward, but what if we want to be able to group by multiple variables, summarise multiple variables, and pass extra arguments on to mean()? Generally, I think it is better to avoid this sort of API (instead relying on multiple functions that each do one thing), but sometimes it is the lesser of two evils, so it is useful to have a technique in your back pocket to handle it.

grouped_mean <- function(df, groups, vars, args) {

  var_means <- purrr::map(vars, function(var) expr(mean(!!var, !!!args)))
  names(var_means) <- purrr::map_chr(vars, expr_name)
  
  df %>%
    dplyr::group_by(!!!groups) %>%
    dplyr::summarise(!!!var_means)
}

grouped_mean(mtcars, exprs(vs, am), exprs(hp, drat, wt), list(na.rm = TRUE))
#> # A tibble: 4 x 5
#> # Groups:   vs [?]
#>      vs    am    hp  drat    wt
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  0     0    194    3.12  4.10
#> 2  0     1.00 181    3.94  2.86
#> 3  1.00  0    102    3.57  3.19
#> 4  1.00  1.00  80.6  4.15  2.03

If you use this design a lot, you may also want to provide an alias to exprs() with a better name. For example, dplyr provides the vars() wrapper to support the scoped verbs (e.g. summarise_if(), mutate_at()). aes() in ggplot2 is similar, although it does a little more: it requires all arguments to be named, names the first two arguments (x and y) by default, and automatically standardises argument names so you can use base names for aesthetics (e.g. pch instead of shape).
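
A minimal version of such an alias is just a renamed exprs() (a sketch; the real dplyr::vars() does a little more):

vars <- function(...) exprs(...)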

grouped_mean(mtcars, vars(vs, am), vars(hp, drat, wt), list(na.rm = TRUE))

21.4.2 Exercises

  1. Implement the three variants of grouped_mean() described above:

    # ... passed on to mean
    grouped_mean <- function(df, group_by, summarise, ...) {}
    # ... selects variables to summarise
    grouped_mean <- function(df, group_by, ...) {}
    # ... selects variables to group by
    grouped_mean <- function(df, ..., summarise) {}

21.5 Tidy evaluation

In the previous section, you learned how to capture quosures and why they are important when calling existing functions that use tidy evaluation. In this section, you’ll learn how to create your own functions that use tidy evaluation. There are two big new concepts, both related to evaluating code in the context of a data frame:

  • A data mask is a data frame where the evaluated code will look first for variable definitions.

  • A data mask introduces ambiguity, so to remove that ambiguity when necessary we introduce pronouns.

We’ll explore tidy evaluation in the context of base::subset(), because it’s a simple yet powerful function that encapsulates one of the central ideas that makes R so elegant for data analysis. Once we’ve seen the tidy implementation, we’ll return to the base R implementation, learn how it works, and explore the downsides which make subset() suitable only for interactive usage.

21.5.1 eval_tidy()

Once you have a quosure, you will need to use eval_tidy() instead of eval_bare():

x <- 2

# These two calls are equivalent
eval_bare(expr(x), globalenv())
#> [1] 2
eval_tidy(quo(x))
#> [1] 2

Like eval_bare(), eval_tidy() has an env argument, but you will generally not use it because the environment is already captured by the quosure. Instead, you will typically use the second argument, data. This lets you supply a data mask, in which variables from the environment are potentially masked by variables in a data frame. This allows you to mingle variables from the environment and variables from a data frame:

df <- data.frame(y = 1:10)

eval_tidy(quo(x * y), df)
#>  [1]  2  4  6  8 10 12 14 16 18 20

This is the key idea that powers base R functions like with(), subset() and transform(), and that is used throughout tidyverse packages like dplyr.
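
For example, base::with() relies on exactly this pattern:

with(df, x * y)
#>  [1]  2  4  6  8 10 12 14 16 18 20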

21.5.2 Data masks

How does this work, given that data frames, unlike environments, don’t have parents? eval_tidy() effectively creates a new environment that contains the values of data and whose parent is the environment of the quosure:

df <- data.frame(y = 1:10)
x <- 2
q1 <- quo(x * y)

# eval_tidy(q1, df) is equivalent to:
df_env <- as_env(df, parent = quo_get_env(q1))
q2 <- quo_set_env(q1, df_env)

eval_tidy(q2)
#>  [1]  2  4  6  8 10 12 14 16 18 20

base::eval() has similar functionality: if its second argument is a data frame, it becomes a data mask, and you supply the environment in the third argument:

eval(quo_get_expr(q1), df, quo_get_env(q1))
#>  [1]  2  4  6  8 10 12 14 16 18 20

21.5.3 Application: subset()

To see why the data mask is so important, let’s implement our own version of subset(). If you haven’t used it before, subset() (like dplyr::filter()) provides a convenient way of selecting rows of a data frame, using an expression that is evaluated in the context of the data frame. It allows you to subset without repeatedly referring to the name of the data frame:

sample_df <- data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))

# Shorthand for sample_df[sample_df$a >= 4, ]
subset(sample_df, a >= 4)
#>   a b c
#> 4 4 2 4
#> 5 5 1 1

# Shorthand for sample_df[sample_df$b == sample_df$c, ]
subset(sample_df, b == c)
#>   a b c
#> 1 1 5 5
#> 5 5 1 1

The core of subset2() is quite simple. It takes two arguments: a data frame, data, and a quoted expression, rows. We evaluate rows using data as a data mask, then use the result to subset the data frame with [. I’ve included a very simple check to ensure the result is a logical vector; real code would do more work to create an informative error.

subset2 <- function(data, rows) {
  rows <- enquo(rows)
  
  rows_val <- eval_tidy(rows, data)
  stopifnot(is.logical(rows_val))
  
  data[rows_val, , drop = FALSE]
}

subset2(sample_df, b == c)
#>   a b c
#> 1 1 5 5
#> 5 5 1 1

21.5.4 Application: arrange()

A slightly more complicated exercise is to implement a basic version of dplyr::arrange(). The goal of arrange() is to allow you to sort a data frame by multiple variables, each evaluated in the context of the data frame.

arrange2 <- function(data, ..., na.last = TRUE) {
  # Capture all dots
  args <- enquos(...)
  
  # Create a call to order(), using `!!!` to splice in the
  # individual expressions, and `!!` to inline na.last
  order_call <- quo(order(!!!args, na.last = !!na.last))
  
  # Evaluate the call to order() with the data mask
  ord <- eval_tidy(order_call, data)
  
  data[ord, , drop = FALSE]
}

df <- data.frame(x = c(2, 3, 1), y = runif(3))

arrange2(df, x)
#>   x     y
#> 3 1 0.404
#> 1 2 0.402
#> 2 3 0.196
arrange2(df, -y)
#>   x     y
#> 3 1 0.404
#> 1 2 0.402
#> 2 3 0.196

Next we’ll talk about a problem introduced by the data mask and how to fix it. Then we’ll come back to base::subset(), discuss why its documentation strongly advises against using it inside a function, and show how tidy evaluation overcomes each challenge.

21.5.5 Ambiguity and pronouns

One of the downsides of the data mask is that it introduces ambiguity: when you say x, are you referring to a variable in the data or in the environment? This ambiguity is fine for interactive data analysis because you are familiar with the variables, and if there are problems you’ll spot them quickly because you are looking at the data frequently. However, ambiguity becomes a problem when you start programming with functions that use tidy evaluation. For example, take this simple wrapper:

threshold_x <- function(df, val) {
  subset2(df, x >= val)
}

This function can silently return an incorrect result in two ways:

  • If df does not contain a variable called x, threshold_x() will silently return an incorrect result if x exists in the calling environment:

    x <- 10
    no_x <- data.frame(y = 1:3)
    threshold_x(no_x, 2)
    #>   y
    #> 1 1
    #> 2 2
    #> 3 3
  • If df contains a variable called val, the function will always return an incorrect answer:

    has_val <- data.frame(x = 1:3, val = 9:11)
    threshold_x(has_val, 2)
    #> [1] x   val
    #> <0 rows> (or 0-length row.names)

These failure modes arise because tidy evaluation is ambiguous: each variable can be found in either the data mask or the environment. To make this function work we need to remove the ambiguity and ensure that x is always found in the data and val in the environment. To make this possible, eval_tidy() provides the .data and .env pronouns:

threshold_x <- function(df, val) {
  subset2(df, .data$x >= .env$val)
}

x <- 10
threshold_x(no_x, 2)
#> Error: Column `x` not found in `.data`
threshold_x(has_val, 2)
#>   x val
#> 2 2  10
#> 3 3  11

(NB: unlike indexing an ordinary list or environment with $, these pronouns throw an error if the variable is not found.)

Generally, whenever you use the .env pronoun, you can use unquoting instead:

threshold_x <- function(df, val) {
  subset2(df, .data$x >= !!val)
}

There are subtle differences in when val is evaluated. If you unquote, val will be evaluated by enquo(); if you use a pronoun, val will be evaluated by eval_tidy(). These differences are usually unimportant, so pick the form that looks most natural.

What if we generalise threshold_x() slightly, so that the user can pick the variable used for thresholding? There are two basic approaches. Both start by capturing a symbol:

threshold_var1 <- function(df, var, val) {
  var <- ensym(var)
  subset2(df, `$`(.data, !!var) >= !!val)
}

threshold_var2 <- function(df, var, val) {
  var <- as.character(ensym(var))
  subset2(df, .data[[!!var]] >= !!val)
}

In threshold_var1() we need to use the prefix form of $, because .data$!!var is not valid syntax. Alternatively, in threshold_var2() we convert the symbol to a string and use [[.
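
Both versions behave identically (a quick check, reusing has_val from above):

threshold_var1(has_val, x, 2)
#>   x val
#> 2 2  10
#> 3 3  11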

Note that it is not always the responsibility of the function author to avoid ambiguity. Imagine we generalise further to allow thresholding based on any expression:

threshold_expr <- function(df, expr, val) {
  expr <- enquo(expr)
  subset2(df, !!expr >= !!val)
}

There’s no way to ensure that expr is only evaluated in the data, and indeed that would not be desirable, because the data will not include functions (like + or <). In this case it is the user’s responsibility to avoid ambiguity. As a rule of thumb: as a function author it’s your responsibility to avoid ambiguity in expressions that you create; it’s the user’s responsibility to avoid ambiguity in expressions that they create.

21.5.6 Base subset()

The documentation of subset() includes the following warning:

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.

Why is subset() dangerous for programming, and how does tidy evaluation help us avoid those dangers? First, let’s implement the key parts of subset(), following the same structure as subset2(): we replace enquo() with substitute(), and eval_tidy() with eval(). We also need to supply a backup environment to eval(). There’s no way to access the environment associated with an argument in base R, so we take the best approximation: the caller environment (aka the parent frame):

subset_base <- function(data, rows) {
  rows <- substitute(rows)
  
  rows_val <- eval(rows, data, caller_env())
  stopifnot(is.logical(rows_val))
  
  data[rows_val, , drop = FALSE]
}

There are three problems with this implementation:

  • subset() doesn’t support unquoting, so wrapping the function is hard. You must use substitute() to capture the complete call, then evaluate it. Because substitute() doesn’t use a syntactic marker for unquoting, it is hard to see exactly what’s happening here:

    f1a <- function(df, expr) {
      eval(substitute(subset(df, expr)), caller_env())
    }
    
    df <- data.frame(x = 1:3, y = 3:1)
    f1a(df, x == 1)
    #>   x y
    #> 1 1 3

    I think the tidy evaluation equivalent is easier to understand because the quoting and unquoting is explicit:

    f1b <- function(df, expr) {
      expr <- enquo(expr)
      subset2(df, !!expr)
    }
    f1b(df, x == 1)
    #>   x y
    #> 1 1 3
  • base::subset() always evaluates rows in the parent frame, but if ... has been used, then the expression might need to be evaluated elsewhere:

    f <- function(df, ...) {
      xval <- 3
      subset(df, ...)
    }
    
    xval <- 1
    f(df, x == xval)
    #>   x y
    #> 3 3 1

    Because enquo() captures the environment of the argument as well as its expression, this is not a problem with subset2():

    f <- function(df, ...) {
      xval <- 10
      subset2(df, ...)
    }
    
    xval <- 1
    f(df, x == xval)
    #>   x y
    #> 1 1 3
  • Finally, subset() doesn’t have any pronouns so there’s no way to write a safe version of threshold_x().

You might wonder if all this rigmarole is worth it when you can just use [. Firstly, it seems unappealing to have functions that can only be used safely in an interactive context, because then every interactive function needs to be paired with a programming function that behaves slightly differently. Secondly, even the simple subset() function provides two useful features:

  • It sets drop = FALSE by default, so it’s guaranteed to return a data frame.
  • It drops rows where the conditional evaluates to NA.

That means subset(df, x == y) is not equivalent to df[x == y,] as you might naively expect. Instead, it is equivalent to df[x == y & !is.na(x == y), , drop = FALSE]: that’s a lot more typing!
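
You can see the difference with a small data frame containing a missing value (a quick sketch):

df_na <- data.frame(x = c(1, NA, 3), y = c(1, NA, 4))
subset(df_na, x == y)        # drops the row where the condition is NA
#>   x y
#> 1 1 1
df_na[df_na$x == df_na$y, ]  # naive [ keeps a row of NAs
#>      x  y
#> 1    1  1
#> NA  NA NA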

21.5.7 Performance

Note that there is some performance overhead when evaluating a quosure compared to evaluating an expression:

n <- 1000
x1 <- expr(runif(n))
e1 <- globalenv()
q1 <- quo(runif(n))

microbenchmark::microbenchmark(
  runif(n),
  eval_bare(x1, e1),
  eval_tidy(q1),
  eval_tidy(q1, mtcars)
)
#> Unit: microseconds
#>                   expr  min   lq  mean median    uq   max neval
#>               runif(n) 37.5 38.7  42.3   39.7  43.8  93.0   100
#>      eval_bare(x1, e1) 38.8 40.1  43.7   42.3  45.5  76.6   100
#>          eval_tidy(q1) 41.7 44.2  47.3   46.2  48.8  73.9   100
#>  eval_tidy(q1, mtcars) 92.5 95.3 100.5   96.8 101.6 200.1   100

However, most of the overhead is due to setting up the data mask, so if you need to evaluate code repeatedly it’s a good idea to build the data mask once and then reuse it:

d_mtcars <- as_data_mask(mtcars)

microbenchmark::microbenchmark(
  as_data_mask(mtcars), 
  eval_tidy(q1, mtcars),
  eval_tidy(q1, d_mtcars)
)
#> Unit: microseconds
#>                     expr   min    lq mean median   uq   max neval
#>     as_data_mask(mtcars)  6.66  8.73 10.5   10.1 11.3  62.7   100
#>    eval_tidy(q1, mtcars) 90.85 93.81 99.0   95.0 96.4 176.8   100
#>  eval_tidy(q1, d_mtcars) 38.83 40.75 42.8   41.9 42.9  74.4   100

(The size of the savings is surprising given that eval_tidy() also calls data_mask_clean(). Whether that should be the default is currently under discussion at https://github.com/tidyverse/rlang/issues/372.)

21.5.8 Exercises

  1. Improve subset2() to make it more like base::subset():

    • Drop rows where subset evaluates to NA.
    • Give a clear error message if subset doesn’t yield a logical vector.
    • What happens if subset doesn’t yield a logical vector with length equal to the number of rows in data? What do you think should happen?
  2. Here’s an alternative implementation of arrange():

    invoke <- function(fun, ...) do.call(fun, dots_list(...))
    arrange3 <- function(.data, ..., .na.last = TRUE) {
      args <- enquos(...)
    
      ords <- purrr::map(args, eval_tidy, data = .data)
      ord <- invoke(order, !!!ords, na.last = .na.last)
    
      .data[ord, , drop = FALSE]
    }

    Describe the primary difference in approach compared to the function defined in the text.

    One advantage of this approach is that you could check each element of ... to make sure that input is correct. What property should each element of ords have?

  3. Here’s an alternative implementation of subset2():

    subset3 <- function(data, rows) {
      eval_tidy(quo(data[!!enquo(rows), , drop = FALSE]))
    }

    Rewrite the function to improve clarity, then explain how this approach differs from the approach in the text.

  4. Implement a form of arrange() where you can request a variable to be sorted in descending order using named arguments:

    arrange(mtcars, cyl, desc = mpg, vs)

    (Hint: The decreasing argument to order() will not help you. Instead, look at the definition of dplyr::desc(), and read the help for xtfrm().)

  5. Why do you not need to worry about ambiguity in arrange()?

  6. What does transform() do? Read the documentation. How does it work? Read the source code for transform.data.frame(). What does substitute(list(...)) do?

  7. Use tidy evaluation to implement your own version of transform(). Extend it so that a calculation can refer to variables created by transform, i.e. make this work:

    df <- data.frame(x = 1:3)
    transform(df, x1 = x + 1, x2 = x1 + 1)
    #> Error in x1 + 1: non-numeric argument to binary operator
  8. What does with() do? How does it work? Read the source code for with.default(). What does within() do? How does it work? Read the source code for within.data.frame(). Why is the code so much more complex than with()?

  9. Implement a version of within.data.frame() that uses tidy evaluation. Read the documentation and make sure that you understand what within() does, then read the source code.

21.6 Case study: calling base NSE functions

To finish up this chapter, we’re going to show how to wrap base NSE functions. We’ll focus on wrapping models, because this is a common need, and it illustrates the spectrum of challenges you’ll need to overcome for other base functions.

Unfortunately it’s not possible to use tidy evaluation in these wrappers, because the semantics of base NSE functions are not quite rich enough. This means that the wrappers we create cannot in turn be easily wrapped. That makes them useful for reducing duplication in your analysis code, but not suitable for inclusion in a package.

21.6.1 Basics

Let’s start with a very simple wrapper around lm():

lm2 <- function(formula, data) {
  lm(formula, data)
}

This wrapper works, but it is suboptimal because lm() captures its call, and displays it when printing:

lm2(mpg ~ disp, mtcars)
#> 
#> Call:
#> lm(formula = formula, data = data)
#> 
#> Coefficients:
#> (Intercept)         disp  
#>     29.5999      -0.0412

This is important because this call is the chief way that you see the model specification when printing the model. To overcome this problem, we need to capture the arguments, create the call to lm() using unquoting, then evaluate that call:

lm3 <- function(formula, data) {
  formula <- enexpr(formula)
  data <- enexpr(data)
  
  lm_call <- expr(lm(!!formula, data = !!data))
  eval_bare(lm_call, caller_env())
}
lm3(mpg ~ disp, mtcars)$call
#> lm(formula = mpg ~ disp, data = mtcars)

Note that we manually supply an evaluation environment, caller_env(). We’ll discuss that in more detail shortly.

Note that this technique works for all the arguments, even those that use NSE, like lm()’s subset argument:

lm4 <- function(formula, data, subset = NULL) {
  formula <- enexpr(formula)
  data <- enexpr(data)
  subset <- enexpr(subset)
  
  lm_call <- expr(lm(!!formula, data = !!data, subset = !!subset))
  eval_bare(lm_call, caller_env())
}
coef(lm4(mpg ~ disp, mtcars))
#> (Intercept)        disp 
#>     29.5999     -0.0412
coef(lm4(mpg ~ disp, mtcars, subset = cyl == 4))
#> (Intercept)        disp 
#>      40.872      -0.135

Note that I’ve supplied a default argument to subset. I think this is good practice because it clearly indicates that subset is optional: arguments with no default are usually required. NULL has two nice properties here:

  1. lm() already knows how to handle subset = NULL: it treats it the same way as a missing subset.

  2. expr(NULL) is NULL, which makes it easy to detect programmatically.
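
    We can check the second property directly:

    is.null(expr(NULL))
    #> [1] TRUE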

However, the current approach has one small downside: subset = NULL is shown in the call.

lm4(mpg ~ disp, mtcars)$call
#> lm(formula = mpg ~ disp, data = mtcars, subset = NULL)

It’s possible, with a little more work, to generate a call where subset is simply absent. There are two tricks needed to do this:

  1. We use the %||% helper to replace a NULL subset with missing_arg().

  2. We use maybe_missing() in expr(): if we don’t do that, the essential weirdness of the missing argument crops up and generates an error.

This leads to lm5():

lm5 <- function(formula, data, subset = NULL) {
  formula <- enexpr(formula)
  data <- enexpr(data)
  subset <- enexpr(subset) %||% missing_arg()
  
  lm_call <- expr(lm(!!formula, data = !!data, subset = !!maybe_missing(subset)))
  eval_bare(lm_call, caller_env())
}
lm5(mpg ~ disp, mtcars)$call
#> lm(formula = mpg ~ disp, data = mtcars)

Note that all these wrappers have one small advantage over lm(): we can use unquoting.

f <- mpg ~ disp
lm5(!!f, mtcars)$call
#> lm(formula = mpg ~ disp, data = mtcars)

resp <- expr(mpg)
lm5(!!resp ~ disp, mtcars)$call
#> lm(formula = mpg ~ disp, data = mtcars)

21.6.2 The evaluation environment

What if you want to mingle objects supplied by the user with objects that you create in the function? For example, imagine you want to make an auto-bootstrapping version of lm(). You might write it like this:

boot_lm0 <- function(formula, data) {
  formula <- enexpr(formula)
  boot_data <- data[sample(nrow(data), replace = TRUE), , drop = FALSE]
  
  lm_call <- expr(lm(!!formula, data = boot_data))
  eval_bare(lm_call, caller_env())
}

df <- data.frame(x = 1:10, y = 5 + 3 * (1:10) + rnorm(10))
boot_lm0(y ~ x, data = df)
#> Error in is.data.frame(data): object 'boot_data' not found

Why doesn’t this code work? Because we’re evaluating lm_call in the caller environment, but boot_data only exists in the execution environment of boot_lm0(). We could instead evaluate in the execution environment, but there’s no guarantee that formula could be evaluated in that environment.

There are two basic ways to overcome this challenge:

  1. Unquote the data frame into the call. This means that no lookup has to occur, but it has all the problems of inlining expressions. For modelling functions this means that the captured call is suboptimal:

    boot_lm1 <- function(formula, data) {
      formula <- enexpr(formula)
      boot_data <- data[sample(nrow(data), replace = TRUE), , drop = FALSE]
    
      lm_call <- expr(lm(!!formula, data = !!boot_data))
      eval_bare(lm_call, caller_env())
    }
    boot_lm1(y ~ x, data = df)
    #> 
    #> Call:
    #> lm(formula = y ~ x, data = structure(list(x = c(9L, 4L, 1L, 7L, 
    #> 3L, 8L, 6L, 3L, 2L, 10L), y = c(31.8432299491039, 17.2821895330579, 
    #> 9.81418051082896, 25.5806797154872, 15.6196693921342, 29.2009109824721, 
    #> 21.6696573392776, 15.6196693921342, 11.1735930555292, 35.807326619964
    #> )), .Names = c("x", "y"), row.names = c("9", "4", "1", "7", "3", 
    #> "8", "6", "3.1", "2", "10"), class = "data.frame"))
    #> 
    #> Coefficients:
    #> (Intercept)            x  
    #>        6.31         2.84
  2. Alternatively, you can create a new environment that inherits from the caller, and bind variables that you’ve created inside the function into that environment:

    boot_lm2 <- function(formula, data) {
      formula <- enexpr(formula)
      boot_data <- data[sample(nrow(data), replace = TRUE), , drop = FALSE]
    
      lm_env <- child_env(caller_env(), boot_data = boot_data)
      lm_call <- expr(lm(!!formula, data = boot_data))
      eval_bare(lm_call, lm_env)
    }
    boot_lm2(y ~ x, data = df)
    #> 
    #> Call:
    #> lm(formula = y ~ x, data = boot_data)
    #> 
    #> Coefficients:
    #> (Intercept)            x  
    #>        6.50         2.75

21.6.3 Making formulas

One final aspect of wrapping modelling functions is generating formulas. You just need to learn about one small wrinkle, and then you can use the techniques you learned in Quotation. Formulas print the same whether they are evaluated or unevaluated:

y ~ x
#> y ~ x
expr(y ~ x)
#> y ~ x

Instead, check the class to make sure you have an actual formula:

class(y ~ x)
#> [1] "formula"
class(expr(y ~ x))
#> [1] "call"
class(eval_bare(expr(y ~ x)))
#> [1] "formula"

Once you understand this, you can generate formulas with unquoting and reduce(). Just remember to evaluate the result before returning it. As in any other base NSE wrapper, you should use caller_env() as the evaluation environment.

Here’s a simple example that generates a formula by combining a response variable with a set of predictors.

build_formula <- function(resp, ...) {
  resp <- enexpr(resp)
  preds <- enexprs(...)
  
  pred_sum <- purrr::reduce(preds, ~ expr(!!.x + !!.y))
  eval_bare(expr(!!resp ~ !!pred_sum), caller_env())
}
build_formula(y, a, b, c)
#> y ~ a + b + c

21.6.4 Exercises

  1. When building models, typically the response and data are relatively constant while you rapidly experiment with different predictors. Write a small wrapper that allows you to reduce duplication in this situation.

    pred_mpg <- function(resp, ...) {
    
    }
    pred_mpg(~ disp)
    pred_mpg(~ I(1 / disp))
    pred_mpg(~ disp * cyl)
  2. Another way to write boot_lm() would be to include the bootstrapping expression (data[sample(nrow(data), replace = TRUE), , drop = FALSE]) in the data argument. Implement that approach. What are the advantages? What are the disadvantages?

  3. We could capture quosures, and then extract the environment from them. However, a list of quosures can have multiple associated environments, and eval_bare() can only use one. Write a function that takes a list of quosures and returns the common environment if they have one, or otherwise throws an error.

  4. Write a function that takes a data frame and a list of formulas, fitting a linear model with each formula, generating a useful model call.

  5. Create a formula generation function that allows you to optionally supply a transformation function (e.g. log()) to the response or the predictors.