# 21 Evaluation

## 21.1 Introduction

The user-facing opposite of quotation is unquotation: it gives the *user* the ability to selectively evaluate parts of an otherwise quoted argument. The developer-facing complement of quotation is evaluation: this gives the *developer* of the function the ability to evaluate quoted expressions in special ways to create domain specific languages for data analysis like ggplot2 and dplyr.

This chapter also introduces tidy evaluation, because it's one of the principles underlying the tidyverse. Why teach it here when it's not base R? Because it gives us the following good things.

Tidy evaluation is the combination of four big ideas:

- Quasiquotation to give the user control of quoting.
- Quosures to capture argument expressions and their evaluation environment.
- Data masks to enable expressions to mingle variables from an environment and from a data source.
- Pronouns to eliminate ambiguity between environment and data source when needed.

`library(rlang)`

### Outline

### Prerequisites

Environments play a big role in evaluation, so make sure you’re familiar with Environments before continuing.

## 21.2 Evaluation basics

In the previous chapter, we briefly mentioned `eval()`. Here, rather than starting with `eval()`, we're going to start with `rlang::eval_bare()`, which is the purest evocation of the idea of evaluation. The first argument, `expr`, is an expression to evaluate. This will usually be either a symbol or expression:

```
x <- 10
eval_bare(expr(x))
#> [1] 10
y <- 2
eval_bare(expr(x + y))
#> [1] 12
```

(As well as being self-quoting, constants are also self-evaluating.)

The second argument, `env`, gives the environment in which the expression should be evaluated, i.e. where should the values of `x`, `y`, and `+` be looked for? By default, this is the current environment, i.e. the calling environment of `eval_bare()`, but you can override it if you want:

```
eval_bare(expr(x + y), env(x = 1000))
#> [1] 1002
```

Because R looks up functions in the same way as variables, we can also override the meaning of functions. This is a key technique for generating DSLs, as discussed in the next chapter.

```
eval_bare(expr(x + y), env(`+` = function(x, y) paste0(x, " + ", y)))
#> [1] "10 + 2"
```

If passed an object other than a symbol or expression, the evaluation functions will simply return the input as is (because it's already evaluated). This can lead to confusing results if you forget to `quote()` the input: `eval_bare()` doesn't quote `expr` so it is passed by value.

```
eval_bare(x + y)
#> [1] 12
eval_bare(x + y, env(x = 1000))
#> [1] 12
```
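To make the difference concrete, here is a small sketch that evaluates a quoted and an unquoted input against the same overriding environment. The names `e`, `quoted`, and `unquoted` are illustrative only and do not appear elsewhere in the chapter.

```r
library(rlang)

x <- 10
y <- 2
e <- env(x = 1000)  # illustrative environment that overrides x only

# Quoted: x is looked up in e when the expression is evaluated,
# and y is found in e's parent
quoted <- eval_bare(expr(x + y), e)

# Not quoted: x + y is computed before eval_bare() ever sees it,
# so the overriding environment has no effect
unquoted <- eval_bare(x + y, e)

quoted
unquoted
```

Only the quoted form picks up the overridden `x`.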

Now that you've seen the basics, let's explore some applications. We'll focus primarily on base R functions that you might have used before; now you can learn how they work. To focus on the underlying principles, we'll extract their essence and rewrite them to use functions from rlang. We'll then circle back and talk about the base R functions most important for evaluation.

### 21.2.1 Application: `local()`

Sometimes you want to perform a chunk of calculation that creates a bunch of intermediate variables. The intermediate variables have no long term use and could be quite large, so you'd rather not keep them around. One approach is to clean up after yourself using `rm()`. Another approach is to wrap the code in a function, and just call it once.

A more elegant approach is to use `local()`:

```
# Clean up variables created earlier
rm(x, y)
foo <- local({
  x <- 10
  y <- 200
  x + y
})
foo
#> [1] 210
x
#> Error in eval(expr, envir, enclos): object 'x' not found
y
#> Error in eval(expr, envir, enclos): object 'y' not found
```

The essence of `local()` is quite simple. We capture the expression, and create a new environment in which to evaluate it. This environment inherits from the caller environment so it can access the current lexical scope.

```
local2 <- function(expr, env = child_env(caller_env())) {
  eval_bare(enexpr(expr), env)
}
foo <- local2({
  x <- 10
  y <- 200
  x + y
})
env_has(nms = c("x", "y"))
#> [1] FALSE FALSE
```

It's a bit harder to understand how `base::local()` works, as it uses `eval()` and `substitute()` together in rather complicated ways. Figuring out exactly what's going on is good practice if you really want to understand the subtleties of `substitute()` and the base `eval()` functions.

### 21.2.2 Application: `source()`

We can create a simple version of `source()` by combining `parse_exprs()` and `eval_bare()`. We read in the file from disk, use `parse_exprs()` to parse the string into a list of expressions, and then use `eval_bare()` to evaluate each component. This version evaluates the code in the caller environment, and invisibly returns the result of the last expression in the file (like `source()`).

```
source2 <- function(file, env = caller_env()) {
  lines <- readLines(file, warn = FALSE)
  code <- paste(lines, collapse = "\n")
  exprs <- parse_exprs(code)
  res <- NULL
  for (i in seq_along(exprs)) {
    res <- eval_bare(exprs[[i]], env)
  }
  invisible(res)
}
```

The real `source()` is considerably more complicated because it can `echo` input and output, and also has many additional settings to control behaviour.
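As a quick check of the simplified version, the sketch below writes a three-line script to a temporary file and sources it into a fresh environment. The file contents and the names `tmp` and `target` are made up for illustration; `source2()` is repeated so the block runs on its own.

```r
library(rlang)

source2 <- function(file, env = caller_env()) {
  lines <- readLines(file, warn = FALSE)
  code <- paste(lines, collapse = "\n")
  exprs <- parse_exprs(code)
  res <- NULL
  for (i in seq_along(exprs)) {
    res <- eval_bare(exprs[[i]], env)
  }
  invisible(res)
}

# A throwaway script (illustrative contents)
tmp <- tempfile(fileext = ".R")
writeLines(c("a <- 1", "b <- a + 1", "b * 10"), tmp)

# Evaluate it in a dedicated environment so nothing leaks out
target <- env()
res <- source2(tmp, target)

res                            # value of the last expression
env_has(target, c("a", "b"))   # both variables were created in target
unlink(tmp)
```

Supplying an explicit environment keeps the sourced variables out of the calling scope, which makes the behaviour easy to inspect.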

### 21.2.3 Gotcha: `function()`

```
x <- 10
y <- 20
f <- eval_bare(expr(function(x, y) !!x + !!y))
f
#> function(x, y) !!x + !!y
```

The printed function looks as if the unquoting failed. But it works!

```
f()
#> [1] 30
```

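What is going on? Srcrefs! When a function has a `srcref` attribute, printing it shows the original source text rather than the body that was actually created.

There are two options: remove the srcref by setting the `srcref` attribute to `NULL`, or use `new_function()`.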

```
new_function(
  exprs(x = , y = ),
  expr({!!x + !!y})
)
#> function (x, y)
#> {
#> 10 + 20
#> }
```
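The other option, dropping the srcref, might look like the following sketch. It assumes the function was created with `eval_bare()` as above; whether a srcref is present at all depends on the `keep.source` option, so the assignment is harmless when there is none.

```r
library(rlang)

x <- 10
y <- 20
f <- eval_bare(expr(function(x, y) !!x + !!y))

# Remove the stored source text (if any) so that printing
# falls back to deparsing the real body
attr(f, "srcref") <- NULL

body(f)
f()
```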

### 21.2.4 Base R

The base function equivalent to `eval_bare()` is the two-argument form of `eval()`, `eval(expr, envir)`:

```
eval(expr(x + y), env(x = 1000, y = 1))
#> [1] 1001
```

The final argument, `enclos`, provides support for data masks, which you'll learn about in tidy evaluation.

`eval()` is paired with two helper functions:

- `evalq(x, env)` quotes its first argument, and is hence a shortcut for `eval(quote(x), env)`.
- `eval.parent(expr, n)` is a shortcut for `eval(expr, envir = parent.frame(n))`.
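Both helpers can be tried out with base R alone. In this sketch, `e`, `peek_v()`, and `caller()` are hypothetical names introduced purely for illustration.

```r
e <- new.env()
assign("x", 5, envir = e)

# evalq() quotes its first argument for you
a <- evalq(x + 1, e)
b <- eval(quote(x + 1), e)
identical(a, b)

# eval.parent() evaluates in the frame of the function's caller
peek_v <- function() eval.parent(quote(v))
caller <- function() {
  v <- "from caller"
  peek_v()
}
caller()
```

Here `peek_v()` has no `v` of its own; the value comes from the frame of `caller()`.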

`base::eval()` has special behaviour for expression **objects**, evaluating each component in turn. This makes for a very compact implementation of `source2()` because `base::parse()` also returns an expression object:

```
source3 <- function(file, env = parent.frame()) {
  lines <- parse(file)
  res <- eval(lines, envir = env)
  invisible(res)
}
```

While `source3()` is considerably more concise than `source2()`, this one use case is the strongest argument for expression objects, and overall we don't believe this one benefit outweighs the cost of introducing a new data structure.
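To see the special treatment of expression objects without touching the file system, this base R sketch parses a small string with `parse(text = )`. The code string and the environment `e` are illustrative assumptions.

```r
# parse() returns an expression object holding one element per top-level expression
exprs <- parse(text = "a <- 1\nb <- a + 1\nb * 10")
is.expression(exprs)
length(exprs)

# eval() evaluates each element in turn and returns the last value
e <- new.env()
res <- eval(exprs, envir = e)
res
ls(e)
```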

### 21.2.5 Exercises

- Carefully read the documentation for `source()`. What environment does it use by default? What if you supply `local = TRUE`? How do you provide a custom environment?

- Predict the results of the following lines of code:

  - `eval(quote(eval(quote(eval(quote(2 + 2))))))`
  - `eval(eval(quote(eval(quote(eval(quote(2 + 2)))))))`
  - `quote(eval(quote(eval(quote(eval(quote(2 + 2)))))))`

- Write an equivalent to `get()` using `sym()` and `eval_bare()`. Write an equivalent to `assign()` using `sym()`, `expr()`, and `eval_bare()`. (Don't worry about the multiple ways of choosing an environment that `get()` and `assign()` support; assume that the user supplies it explicitly.) In both skeletons, `name` is a string:

  - `get2 <- function(name, env) {}`
  - `assign2 <- function(name, value, env) {}`

- Modify `source2()` so it returns the result of *every* expression, not just the last one. Can you eliminate the for loop?

- The code generated by `source2()` lacks source references. Read the source code for `sys.source()` and the help for `srcfilecopy()`, then modify `source2()` to preserve source references. You can test your code by sourcing a function that contains a comment. If successful, when you look at the function, you'll see the comment and not just the source code.

- The third argument in `subset()` allows you to select variables. It treats variable names as if they were positions. This allows you to do things like `subset(mtcars, , -cyl)` to drop the cylinder variable, or `subset(mtcars, , disp:drat)` to select all the variables between `disp` and `drat`. How does this work? I've made this easier to understand by extracting it out into its own function that uses tidy evaluation:

  - `select <- function(df, vars) { vars <- enexpr(vars); var_pos <- set_names(as.list(seq_along(df)), names(df)); cols <- eval_tidy(vars, var_pos); df[, cols, drop = FALSE] }`
  - `select(mtcars, -cyl)`

- We can make `base::local()` slightly easier to understand by spreading it out over multiple lines:

  - `local3 <- function(expr, envir = new.env()) { call <- substitute(eval(quote(expr), envir)); eval(call, envir = parent.frame()) }`

  Explain how `local()` works in words. (Hint: you might want to `print(call)` to help understand what `substitute()` is doing, and read the documentation to remind yourself what environment `new.env()` will inherit from.)

## 21.3 Quosures

The simplest form of evaluation combines an expression and an environment. This coupling is sufficiently important that we need a data structure that captures both pieces. We call this data structure a **quosure**, a portmanteau of quoting and closure.

You almost always want to capture a quosure rather than an expression because it gives you uniformly more information. Once we’ve discussed its primary use case of tidy evaluation, we’ll come back to the few cases where you should prefer expressions.

### 21.3.1 Motivation

Quosures are needed when expressions to be evaluated mix variables from a data frame and variables in the environment. For example, the following `mutate()` call creates a new variable called `log` with a calculation that involves a variable in the dataset, `z`, and a variable in the environment, `x`:

```
df <- data.frame(z = runif(5))
x <- 10
dplyr::mutate(df, log = log(z, base = x))
#> z log
#> 1 0.0808 -1.0929
#> 2 0.8343 -0.0787
#> 3 0.6008 -0.2213
#> 4 0.1572 -0.8035
#> 5 0.0074 -2.1308
```

Also remember that `log()` itself is found by looking up from the global environment, but there's no confusion about where functions come from because (without gymnastics) you can't put a function in a data frame.

Worrying about the execution environment of an argument is important when you write quoting functions. Take this simple example:

```
compute_mean <- function(df, x) {
  x <- enexpr(x)
  dplyr::summarise(df, mean = mean(!!x))
}
```

It works correctly for simple inputs:

```
compute_mean(df, z)
#> mean
#> 1 0.336
```

It contains a subtle bug, which we can illustrate with this slightly forced example:

```
x <- 10
compute_mean(df, log(z, base = x))
#> Error in summarise_impl(.data, dots): Evaluation error: non-numeric argument to mathematical function.
```

We get this error because we have lost the evaluation environment associated with `log(z, base = x)`, so it is evaluated inside `compute_mean()`, where `x` is an AST rather than a number. This type of bug is pernicious because it will happen rarely and the error message will be inscrutable.

We can avoid the bug by capturing the expression along with its evaluation environment. That's the job of `enquo()`, which otherwise works identically to `enexpr()`:

```
compute_mean <- function(df, x) {
  x <- enquo(x)
  dplyr::summarise(df, mean = mean(!!x))
}
compute_mean(mtcars, log(mpg, base = x))
#> mean
#> 1 1.28
```

### 21.3.2 Creating and manipulating

To create a quosure you will typically use one of the equivalents of the `expr()` functions that you learned about in the previous chapter:

- Use `quo()` and `quos()` for experimenting interactively and for unquoting with fixed expressions inside a function.
- Use `enquo()` and `enquos()` to capture user-supplied arguments to a function.

Alternatively, you can use `new_quosure()` to create a quosure from its components: an expression and an environment.

```
x <- new_quosure(expr(x + y), env(x = 1, y = 10))
x
#> <quosure>
#> expr: ^x + y
#> env: 0x4d86c90
```

Note how quosures are printed. If you unquote a quosure inside another quosure, each quosure starts with `^`, and if you're in a console that supports it, each quosure gets a different colour to help remind you that it has a different environment attached to it.

```
q2 <- quo(x + !!x)
q2
#> <quosure>
#> expr: ^x + (^x + y)
#> env: global
```

(Note that because quosures capture the complete environment you need to be a little careful if your function returns quosures. If you have large temporary objects they will not get gc’d until the quosure has been gc’d. See XXXXXXX for more details.)

You can evaluate a quosure with `eval_tidy()`, which we'll study in depth in the next section:

```
eval_tidy(x)
#> [1] 11
```

You can extract its components with the `quo_get_` helpers:

```
quo_get_env(x)
#> <environment: 0x4d86c90>
quo_get_expr(x)
#> x + y
```

And if you need to turn a quosure into text for output to the console you can use `quo_name()`, `quo_label()`, or `quo_text()`. `quo_name()` and `quo_label()` are guaranteed to be short; `quo_text()` may span multiple lines.

```
# https://github.com/tidyverse/rlang/issues/367
y <- quo(long_function_name(
  argument_1 = long_argument_value,
  argument_2 = long_argument_value,
  argument_3 = long_argument_value,
  argument_4 = long_argument_value
))
quo_name(y) # e.g. for data frames
#> [1] "long_function_name(...)"
quo_label(y) # e.g. for error messages
#> [1] "`long_function_name(...)`"
quo_text(y) # for longer messages
#> [1] "long_function_name(argument_1 = long_argument_value, argument_2 = long_argument_value, \n argument_3 = long_argument_value, argument_4 = long_argument_value)"
```

### 21.3.3 Implementation

Quosures are possible because internally R represents function arguments with a special type of object called a **promise**. A promise captures the expression needed to compute the value and the environment in which to compute it. You’re not normally aware of promises because the first time you access a promise its code is evaluated in its environment, yielding a value. This is what powers lazy evaluation.

However, you cannot manipulate promises with R code: they’re sort of quantum; if you attempt to manipulate with R code, they are immediately evaluated, and the promise nature goes away. To work around this, rlang manipulates the promise in C, reifying it into an R object that you can work with.

There is one big difference between promises and quosures. An argument is evaluated implicitly when you access it for the first time. Every time you access it subsequently it will return the same value. A quosure must be evaluated explicitly, and each evaluation is independent of the previous evaluations.

```
# The argument x_arg is evaluated once, then reused
foo <- function(x_arg) {
  list(x1 = x_arg, x2 = x_arg)
}
foo(runif(3))
#> $x1
#> [1] 0.466 0.498 0.290
#>
#> $x2
#> [1] 0.466 0.498 0.290
# The quosure x is evaluated afresh each time
x_quo <- quo(runif(3))
eval_tidy(x_quo)
#> [1] 0.733 0.773 0.875
eval_tidy(x_quo)
#> [1] 0.1749 0.0342 0.3204
```

Quosures are inspired by the formula operator, `~`, which also captures both the expression and its environment, and is used extremely heavily in R's modelling functions:

```
f <- ~runif(3)
f
#> ~runif(3)
str(f)
#> Class 'formula' language ~runif(3)
#> ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
```

Initial versions of rlang used formulas as quosures: an attractive feature of `~` is that it provides quoting with just a single keystroke. Unfortunately, however, there is no way to add quasiquotation to `~`, so we decided to use a new function, `quo()`, instead.
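The environment-capturing behaviour of `~` is easy to observe in base R. In this sketch, `make_formula()` is a hypothetical helper whose only job is to create a formula inside a function, so that the captured environment is not the global one.

```r
make_formula <- function() {
  n <- 3        # local variable, visible only inside this call
  ~ runif(n)
}

f <- make_formula()
f_env <- environment(f)

# The formula remembers the environment where it was created
get("n", envir = f_env)

# So its right-hand side can be evaluated there later on
vals <- eval(f[[2]], f_env)
length(vals)
```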

### 21.3.4 Multiple environments

Quosures are particularly important when used with `...` because each argument can potentially have a different environment associated with it:

```
f <- function(...) {
  x <- 1
  g(..., x1 = x)
}
g <- function(...) {
  x <- 2
  h(..., x2 = x)
}
h <- function(...) {
  enquos(...)
}
x <- 0
qs <- f(x0 = x)
qs
#> $x0
#> <quosure>
#> expr: ^x
#> env: global
#>
#> $x1
#> <quosure>
#> expr: ^x
#> env: 0x48f86f0
#>
#> $x2
#> <quosure>
#> expr: ^x
#> env: 0x48f8958
purrr::map_dbl(qs, eval_tidy)
#> x0 x1 x2
#> 0 1 2
```

### 21.3.5 Embedded quosures

```
make_x <- function(x) quo(x)
thirty <- quo(!!make_x(0) + !!make_x(10) + !!make_x(20))
thirty
#> <quosure>
#> expr: ^(^x) + (^x) + (^x)
#> env: global
```

If you’re viewing from the console, you’ll see that each quosure is coloured - the point of the colours is to emphasise that the quosures have different environments associated with them even though the expressions are the same.

```
eval_tidy(thirty)
#> [1] 30
```

This was a lot of work to get right. But it means that quosures just work, even when embedded inside other quosures.

Note that this code doesn't make any sense at all if we use expressions instead of quosures: the environment is never captured, so all we have is the expression `x + x + x`, with every `x` looked up in the same place:

```
make_x <- function(x) expr(x)
thirty <- expr(!!make_x(0) + !!make_x(10) + !!make_x(20))
thirty
#> x + x + x
eval_tidy(thirty)
#> [1] 0
```

### 21.3.6 When not to use quosures

Expressions are preferable to quosures in a few situations:

- In code generation.
- When the expression will be evaluated completely in a data context.
- To call functions that don't use tidy evaluation; a fuller example comes next.

Sometimes you can avoid using a quosure by inlining/unquoting values.

```
base <- 2
quo(log(x, base = base))
#> <quosure>
#> expr: ^log(x, base = base)
#> env: global
expr(log(x, base = !!base))
#> log(x, base = 2)
```

### 21.3.7 Exercises

- Predict what evaluating each of the following quosures will return.

  - `q1 <- new_quosure(expr(x), env(x = 1))`
  - `q2 <- new_quosure(expr(x + !!q1), env(x = 10))`
  - `q3 <- new_quosure(expr(x + !!q2), env(x = 100))`

- Run this code in your head and predict what it will print. Confirm or refute your prediction by running the code in R.

  - `f <- function(...) { x <- "f"; g(f = x, ...) }`
  - `g <- function(...) { x <- "g"; h(g = x, ...) }`
  - `h <- function(...) { enquos(...) }`
  - `x <- "top"`
  - `out <- f(top = x)`
  - `out`
  - `purrr::map_chr(out, eval_tidy)`

## 21.4 Wrapping quoting functions

One downside of quoting functions is that when you call a lot of them, it is harder to wrap them in another function in order to reduce duplication. For example, if you see the following code:

```
(x - min(x)) / (max(x) - min(x))
(y - min(y)) / (max(y) - min(y))
((x + y) - min(x + y)) / (max(x + y) - min(x + y))
```

You can eliminate the repetition with a `rescale01()` function:

```
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
rescale01(x)
rescale01(y)
rescale01(x + y)
```

Reducing the duplication here is a very good idea, because when you realise that your rescaling technique doesn't handle missing values gracefully, you only have one place to fix it. It is easy to create `rescale01()` because `min()`, `max()`, and `-` are regular functions: they don't quote any of their arguments.
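For instance, one possible fix for missing values touches only the function body. This is a sketch of one reasonable choice (ignoring `NA`s when computing the range), not the only way to handle them.

```r
# Variant of rescale01() that ignores missing values when finding the range
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(1, NA, 3))
```

Every call site benefits from the change without being edited, and missing inputs stay missing in the output.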

It’s harder to reduce duplication when a function quotes one or more arguments. For example, if you notice this repeated code:

```
df %>% group_by(x1) %>% summarise(mean = mean(y1))
df %>% group_by(x2) %>% summarise(mean = mean(y2))
df %>% group_by(x3) %>% summarise(mean = mean(y3))
```

This naive function will not work:

```
grouped_mean <- function(df, x, y) {
  df %>% group_by(x) %>% summarise(mean = mean(y))
}
```

This is because, regardless of the input, `grouped_mean()` will always group by `x` and compute the mean of `y`. However, because `group_by()` and `summarise()` (like all quoting functions in the tidyverse) use quasiquotation, there's a standard way to wrap these functions: you quote and then unquote:

```
grouped_mean <- function(df, x, y) {
  x <- enexpr(x)
  y <- enexpr(y)
  df %>%
    group_by(!!x) %>%
    summarise(mean = mean(!!y))
}
mtcars %>% grouped_mean(cyl, mpg)
#> # A tibble: 3 x 2
#> cyl mean
#> <dbl> <dbl>
#> 1 4.00 26.7
#> 2 6.00 19.7
#> 3 8.00 15.1
```

(In the next chapter, we'll learn why `enexpr()` is not quite general enough and how and why to use `quo()`, `enquo()`, and `enquos()` instead.)

This is a powerful pattern that allows you to reduce duplication in your code when it uses quasiquoting functions.

### 21.4.1 Tangling with dots

In our `grouped_mean()` example above, we allow the user to select one grouping variable, and one summary variable. What if we wanted to allow the user to select more than one? One option would be to use `...`. There are three possible ways we could use `...`:

- Pass `...` onto the `mean()` function. That would make it easy to set `na.rm = TRUE`. This is easiest to implement.
- Allow the user to select multiple groups.
- Allow the user to select multiple variables to summarise.

Implementing each one of these is relatively straightforward, but what if we want to be able to group by multiple variables, summarise multiple variables, and pass extra args on to `mean()`? Generally, I think it is better to avoid this sort of API (instead relying on multiple functions that each do one thing) but sometimes it is the lesser of two evils, so it is useful to have a technique in your back pocket to handle it.

```
grouped_mean <- function(df, groups, vars, args) {
  var_means <- purrr::map(vars, function(var) expr(mean(!!var, !!!args)))
  names(var_means) <- purrr::map_chr(vars, expr_name)
  df %>%
    dplyr::group_by(!!!groups) %>%
    dplyr::summarise(!!!var_means)
}
grouped_mean(mtcars, exprs(vs, am), exprs(hp, drat, wt), list(na.rm = TRUE))
#> # A tibble: 4 x 5
#> # Groups: vs [?]
#> vs am hp drat wt
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0 0 194 3.12 4.10
#> 2 0 1.00 181 3.94 2.86
#> 3 1.00 0 102 3.57 3.19
#> 4 1.00 1.00 80.6 4.15 2.03
```

If you use this design a lot, you may also want to provide an alias to `exprs()` with a better name. For example, dplyr provides the `vars()` wrapper to support the scoped verbs (e.g. `summarise_if()`, `mutate_at()`). `aes()` in ggplot2 is similar, although it does a little more: it requires all arguments to be named, names the first two arguments (`x` and `y`) by default, and automatically renames so you can use the base names for aesthetics (e.g. `pch` vs `shape`).

`grouped_mean(mtcars, vars(vs, am), vars(hp, drat, wt), list(na.rm = TRUE))`

### 21.4.2 Exercises

- Implement the three variants of `grouped_mean()` described above:

  - `...` passed on to `mean()`: `grouped_mean <- function(df, group_by, summarise, ...) {}`
  - `...` selects variables to summarise: `grouped_mean <- function(df, group_by, ...) {}`
  - `...` selects variables to group by: `grouped_mean <- function(df, ..., summarise) {}`

## 21.5 Tidy evaluation

In the previous section, you learned how to capture quosures and why they are important when calling existing functions that use tidy evaluation. In this section, you'll learn how to create your own functions that use tidy evaluation. There are two big new concepts, both related to evaluating code in the context of a data frame:

- A **data mask** is a data frame where the evaluated code will look first for variable definitions.
- A data mask introduces ambiguity, so to remove that ambiguity when necessary we introduce **pronouns**.

We'll explore tidy evaluation in the context of `base::subset()`, because it's a simple yet powerful function that encapsulates one of the central ideas that makes R so elegant for data analysis. Once we've seen the tidy implementation, we'll return to the base R implementation, learn how it works, and explore the downsides which make `subset()` suitable only for interactive usage.

### 21.5.1 `eval_tidy()`

Once you have a quosure, you will need to use `eval_tidy()` instead of `eval_bare()`:

```
x <- 2
# These two calls are equivalent
eval_bare(expr(x), globalenv())
#> [1] 2
eval_tidy(quo(x))
#> [1] 2
```

Like `eval_bare()`, `eval_tidy()` has an `env` argument, but generally you will not use it, because the environment is captured by the quosure. Instead, you will typically use the second argument, `data`. This lets you set up a **data mask**, where variables in the environment are potentially masked by variables in a data frame. This allows you to mingle variables from the environment and variables from a data frame:

```
df <- data.frame(y = 1:10)
eval_tidy(quo(x * y), df)
#> [1] 2 4 6 8 10 12 14 16 18 20
```

This is the key idea that powers base R functions like `with()`, `subset()` and `transform()`, and that is used throughout tidyverse packages like dplyr.
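The same mingling can be seen with `with()` from base R. In this sketch the data frames `df` and `df2` are illustrative; the second one adds a column named `x` to show the data frame taking priority over the environment.

```r
x <- 2
df <- data.frame(y = 1:10)

# y comes from the data frame, x from the surrounding environment
mixed <- with(df, x * y)
mixed

# If the data frame also has a column called x, that column wins
df2 <- data.frame(x = 10, y = 1:3)
masked <- with(df2, x * y)
masked
```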

### 21.5.2 Data masks

Unlike environments, data frames don't have parents. This is what allows data masks to work: `eval_tidy()` effectively creates a new environment that contains the values of `data` and has a parent of `env`:

```
df <- data.frame(y = 1:10)
x <- 2
q1 <- quo(x * y)
# eval_tidy(q1, df) is equivalent to:
df_env <- as_environment(df, parent = quo_get_env(q1))
q2 <- quo_set_env(q1, df_env)
eval_tidy(q2)
#> [1] 2 4 6 8 10 12 14 16 18 20
```

`base::eval()` has similar functionality. If the 2nd argument is a data frame it becomes a data mask, and you provide the environment in the 3rd argument:

```
eval(quo_get_expr(q1), df, quo_get_env(q1))
#> [1] 2 4 6 8 10 12 14 16 18 20
```
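What `eval()` does with a data frame can be imitated by hand in base R. The sketch below assumes `list2env()` as a stand-in for building the mask environment; the names `ex`, `mask`, `by_hand`, and `built_in` are illustrative.

```r
df <- data.frame(y = 1:10)
x <- 2
ex <- quote(x * y)

# Build the mask manually: the columns of df become variables in a new
# environment whose parent is the environment that holds x
mask <- list2env(df, parent = environment())
by_hand <- eval(ex, mask)

# Let eval() construct the mask itself, with the enclosure as 3rd argument
built_in <- eval(ex, df, environment())

identical(by_hand, built_in)
```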

### 21.5.3 Application: `subset()`

To see why the data mask is so important, let's implement our own version of `subset()`. If you haven't used it before, `subset()` (like `dplyr::filter()`) provides a convenient way of selecting rows of a data frame using an expression that is evaluated in the context of the data frame. It allows you to subset without repeatedly referring to the name of the data frame:

```
sample_df <- data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))
# Shorthand for sample_df[sample_df$a >= 4, ]
subset(sample_df, a >= 4)
#> a b c
#> 4 4 2 4
#> 5 5 1 1
# Shorthand for sample_df[sample_df$b == sample_df$c, ]
subset(sample_df, b == c)
#> a b c
#> 1 1 5 5
#> 5 5 1 1
```

The core of `subset2()` is quite simple. It takes two arguments: a data frame, `data`, and a quoted expression, `rows`. We evaluate `rows` using `data` as a data mask, then use the results to subset the data frame with `[`. I've included a very simple check to ensure the result is a logical vector; real code should do more work to create an informative error.

```
subset2 <- function(data, rows) {
  rows <- enquo(rows)
  rows_val <- eval_tidy(rows, data)
  stopifnot(is.logical(rows_val))
  data[rows_val, , drop = FALSE]
}
subset2(sample_df, b == c)
#> a b c
#> 1 1 5 5
#> 5 5 1 1
```

### 21.5.4 Application: `arrange()`

A slightly more complicated exercise is to implement a basic version of `dplyr::arrange()`. The goal of `arrange()` is to allow you to sort a data frame by multiple variables, each evaluated in the context of the data frame.

```
arrange2 <- function(data, ..., na.last = TRUE) {
  # Capture all dots
  args <- enquos(...)
  # Create a call to order(), using `!!!` to splice in the
  # individual expressions, and `!!` to unquote na.last
  order_call <- quo(order(!!!args, na.last = !!na.last))
  # Evaluate the call to order() with data as the data mask
  ord <- eval_tidy(order_call, data)
  data[ord, , drop = FALSE]
}
df <- data.frame(x = c(2, 3, 1), y = runif(3))
arrange2(df, x)
#> x y
#> 3 1 0.404
#> 1 2 0.402
#> 2 3 0.196
arrange2(df, -y)
#> x y
#> 3 1 0.404
#> 1 2 0.402
#> 2 3 0.196
```

Next we'll talk about a problem introduced by the data mask and how to fix it. Then we'll come back to `base::subset()` and discuss why its documentation strongly advises against putting it in a function, and show how tidy evaluation overcomes each challenge.

### 21.5.5 Ambiguity and pronouns

One of the downsides of the data mask is that it introduces ambiguity: when you say `x`, are you referring to a variable in the data or in the environment? This ambiguity is ok when doing interactive data analysis because you are familiar with the variables, and if there are problems, you spot them quickly because you are looking at the data frequently. However, ambiguity becomes a problem when you start programming with functions that use tidy evaluation. For example, take this simple wrapper:

```
threshold_x <- function(df, val) {
  subset2(df, x >= val)
}
```

This function can silently return an incorrect result in two ways:

- If `df` does not contain a variable called `x`, `threshold_x()` will silently return an incorrect result if `x` exists in the calling environment. For example, after `x <- 10` and `no_x <- data.frame(y = 1:3)`, the call `threshold_x(no_x, 2)` returns all three rows of `no_x`.

- If `df` contains a variable called `val`, the function will always return an incorrect answer. For example, with `has_val <- data.frame(x = 1:3, val = 9:11)`, the call `threshold_x(has_val, 2)` returns a data frame with zero rows.

These failure modes arise because tidy evaluation is ambiguous: each variable can be found in **either** the data mask **or** the environment. To make this function work we need to remove that ambiguity and ensure that `x` is always found in the data and `val` in the environment. To make this possible `eval_tidy()` provides the `.data` and `.env` pronouns:

```
threshold_x <- function(df, val) {
  subset2(df, .data$x >= .env$val)
}
x <- 10
threshold_x(no_x, 2)
#> Error: Column `x` not found in `.data`
threshold_x(has_val, 2)
#> x val
#> 2 2 10
#> 3 3 11
```

(NB: unlike indexing an ordinary list or environment with `$`, if the variable is not found then these pronouns will throw an error.)
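The pronouns can also be tried directly with `eval_tidy()`, without any wrapper function. In this sketch the data frame and the column name `nope` are made up for illustration, and the quosures are created and evaluated in the same environment to keep the lookup rules simple.

```r
library(rlang)

df <- data.frame(x = 1:3, val = 9:11)
val <- 2

# Ambiguous: val is found in the data frame first, so the threshold
# in the environment is silently ignored
ambiguous <- eval_tidy(quo(x >= val), df)

# Explicit: x must come from the data, val from the environment
explicit <- eval_tidy(quo(.data$x >= .env$val), df)

# A missing column is an error rather than a silent NULL
missing_col <- tryCatch(
  eval_tidy(quo(.data$nope), df),
  error = function(e) "error"
)

ambiguous
explicit
missing_col
```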

Generally, whenever you use the `.env` pronoun, you can use unquoting instead:

```
threshold_x <- function(df, val) {
  subset2(df, .data$x >= !!val)
}
```

There are subtle differences in when `val` is evaluated. If you unquote, `val` will be evaluated by `enquo()`; if you use a pronoun, `val` will be evaluated by `eval_tidy()`. These differences are usually unimportant, so pick the form that looks most natural.

What if we generalise `threshold_x()` slightly so that the user can pick the variable used for thresholding? There are two basic approaches. Both start by capturing a *symbol*:

```
threshold_var1 <- function(df, var, val) {
  var <- ensym(var)
  subset2(df, `$`(.data, !!var) >= !!val)
}
threshold_var2 <- function(df, var, val) {
  var <- as.character(ensym(var))
  subset2(df, .data[[!!var]] >= !!val)
}
```

In `threshold_var1()` we need to use the prefix form of `$`, because `.data$!!var` is not valid syntax. Alternatively, we can convert the symbol to a string, and use `[[`.

Note that it is not always the responsibility of the function author to avoid ambiguity. Imagine we generalise further to allow thresholding based on any expression:

```
threshold_expr <- function(df, expr, val) {
  expr <- enquo(expr)
  subset2(df, !!expr >= !!val)
}
```

There's no way to ensure that `expr` is only evaluated in the `data`, and indeed that would not be desirable because `data` will not include any functions (like `+` or `<`). In this case, it is now the user's responsibility to avoid ambiguity. As a function author it's your responsibility to avoid ambiguity with any expressions that you create; it's the user's responsibility to avoid ambiguity in expressions that they create.

### 21.5.6 Base `subset()`

The documentation of `subset()` includes the following warning:

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like `[`, and in particular the non-standard evaluation of argument `subset` can have unanticipated consequences.

Why is `subset()` dangerous for programming and how does tidy evaluation help us avoid those dangers? First, let's implement the key parts of `subset()` following the same structure as `subset2()`. We convert `enquo()` to `substitute()` and `eval_tidy()` to `eval()`. We also need to supply a backup environment to `eval()`. There's no way to access the environment associated with an argument in base R, so we take the best approximation: the caller environment (aka parent frame):

```
subset_base <- function(data, rows) {
  rows <- substitute(rows)
  rows_val <- eval(rows, data, caller_env())
  stopifnot(is.logical(rows_val))
  data[rows_val, , drop = FALSE]
}
```

There are three problems with this implementation:

1. `subset()` doesn’t support unquoting, so wrapping the function is hard. First, you use `substitute()` to capture the complete expression, then you evaluate it. Because `substitute()` doesn’t use a syntactic marker for unquoting, it is hard to see exactly what’s happening here.

   ```
   f1a <- function(df, expr) {
     eval(substitute(subset(df, expr)), caller_env())
   }
   df <- data.frame(x = 1:3, y = 3:1)
   f1a(df, x == 1)
   #>   x y
   #> 1 1 3
   ```

   I think the tidy evaluation equivalent is easier to understand because the quoting and unquoting is explicit:

   ```
   f1b <- function(df, expr) {
     expr <- enquo(expr)
     subset2(df, !!expr)
   }
   f1b(df, x == 1)
   #>   x y
   #> 1 1 3
   ```

2. `base::subset()` always evaluates `rows` in the parent frame, but if `...` has been used, then the expression might need to be evaluated elsewhere:

   ```
   f <- function(df, ...) {
     xval <- 3
     subset(df, ...)
   }
   xval <- 1
   f(df, x == xval)
   #>   x y
   #> 3 3 1
   ```

   Because `enquo()` captures the environment of the argument as well as its expression, this is not a problem with `subset2()`:

   ```
   f <- function(df, ...) {
     xval <- 10
     subset2(df, ...)
   }
   xval <- 1
   f(df, x == xval)
   #>   x y
   #> 1 1 3
   ```

3. Finally, `subset()` doesn’t have any pronouns so there’s no way to write a safe version of `threshold_x()`.

You might wonder if all this rigmarole is worth it when you can just use `[`. Firstly, it seems unappealing to have functions that can only be used safely in an interactive context. Then every interactive function needs to be paired with a programming function which behaves slightly differently. Secondly, even the simple `subset()` function provides two useful features:

- It sets `drop = FALSE` by default, so it’s guaranteed to return a data frame.
- It drops rows where the conditional evaluates to `NA`.

That means `subset(df, x == y)` is not equivalent to `df[x == y,]` as you might naively expect. Instead, it is equivalent to `df[x == y & !is.na(x == y), , drop = FALSE]`: that’s a lot more typing!
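A small base R sketch makes the difference concrete. The data frame below is made up for illustration; it has one row where the comparison is `NA`:

```r
# Made-up data: the comparison x == y is TRUE, FALSE, NA
df <- data.frame(x = c(1, 2, NA), y = c(1, 3, 3))

# `[` turns the NA comparison into a row filled with missing values
by_bracket <- df[df$x == df$y, ]
nrow(by_bracket)
#> [1] 2

# subset() silently drops that row
by_subset <- subset(df, x == y)
nrow(by_subset)
#> [1] 1

# The long-hand version with `[` needs an explicit NA check
cond <- df$x == df$y
by_hand <- df[cond & !is.na(cond), , drop = FALSE]
nrow(by_hand)
#> [1] 1
```

Note that the long-hand version has to spell out `df$` for every variable, which is exactly the typing that `subset()` saves.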

### 21.5.7 Performance

Note that there is some performance overhead when evaluating a quosure compared to evaluating an expression:

```
n <- 1000
x1 <- expr(runif(n))
e1 <- globalenv()
q1 <- quo(runif(n))
microbenchmark::microbenchmark(
  runif(n),
  eval_bare(x1, e1),
  eval_tidy(q1),
  eval_tidy(q1, mtcars)
)
#> Unit: microseconds
#>                   expr  min   lq  mean median    uq   max neval
#>               runif(n) 37.5 38.7  42.3   39.7  43.8  93.0   100
#>      eval_bare(x1, e1) 38.8 40.1  43.7   42.3  45.5  76.6   100
#>          eval_tidy(q1) 41.7 44.2  47.3   46.2  48.8  73.9   100
#>  eval_tidy(q1, mtcars) 92.5 95.3 100.5   96.8 101.6 200.1   100
```

However, most of the overhead is due to setting up the data mask, so if you need to evaluate code repeatedly, it’s a good idea to create the data mask once and then reuse it:

```
d_mtcars <- as_data_mask(mtcars)
microbenchmark::microbenchmark(
  as_data_mask(mtcars),
  eval_tidy(q1, mtcars),
  eval_tidy(q1, d_mtcars)
)
#> Unit: microseconds
#>                     expr   min    lq mean median   uq   max neval
#>     as_data_mask(mtcars)  6.66  8.73 10.5   10.1 11.3  62.7   100
#>    eval_tidy(q1, mtcars) 90.85 93.81 99.0   95.0 96.4 176.8   100
#>  eval_tidy(q1, d_mtcars) 38.83 40.75 42.8   41.9 42.9  74.4   100
```

(The amount of savings is surprising because `eval_tidy()` also calls `data_mask_clean()`. Whether that should be the default is currently being discussed at https://github.com/tidyverse/rlang/issues/372.)

### 21.5.8 Exercises

1. Improve `subset2()` to make it more like `base::subset()`:

   - Drop rows where `subset` evaluates to `NA`.
   - Give a clear error message if `subset` doesn’t yield a logical vector.
   - What happens if `subset` doesn’t yield a logical vector with length equal to the number of rows in `data`? What do you think should happen?

2. Here’s an alternative implementation of `arrange()`:

   ```
   invoke <- function(fun, ...) do.call(fun, dots_list(...))
   arrange3 <- function(.data, ..., .na.last = TRUE) {
     args <- enquos(...)
     ords <- purrr::map(args, eval_tidy, data = .data)
     ord <- invoke(order, !!!ords, na.last = .na.last)
     .data[ord, , drop = FALSE]
   }
   ```

   Describe the primary difference in approach compared to the function defined in the text.

   One advantage of this approach is that you could check each element of `...` to make sure that input is correct. What property should each element of `ords` have?

3. Here’s an alternative implementation of `subset2()`:

   ```
   subset3 <- function(data, rows) {
     eval_tidy(quo(data[!!enquo(rows), , drop = FALSE]))
   }
   ```

   Rewrite the function to improve clarity then explain how this approach differs to the approach in the text.

4. Implement a form of `arrange()` where you can request a variable to be sorted in descending order using named arguments:

   ```
   arrange(mtcars, cyl, desc = mpg, vs)
   ```

   (Hint: The `decreasing` argument to `order()` will not help you. Instead, look at the definition of `dplyr::desc()`, and read the help for `xtfrm()`.)

5. Why do you not need to worry about ambiguity in `arrange()`?

6. What does `transform()` do? Read the documentation. How does it work? Read the source code for `transform.data.frame()`. What does `substitute(list(...))` do?

7. Use tidy evaluation to implement your own version of `transform()`. Extend it so that a calculation can refer to variables created by transform, i.e. make this work:

   ```
   df <- data.frame(x = 1:3)
   transform(df, x1 = x + 1, x2 = x1 + 1)
   #> Error in x1 + 1: non-numeric argument to binary operator
   ```

8. What does `with()` do? How does it work? Read the source code for `with.default()`. What does `within()` do? How does it work? Read the source code for `within.data.frame()`. Why is the code so much more complex than `with()`?

9. Implement a version of `within.data.frame()` that uses tidy evaluation. Read the documentation and make sure that you understand what `within()` does, then read the source code.

## 21.6 Case study: calling base NSE functions

To finish up this chapter we’re going to show how to wrap base NSE functions. We’ll focus on wrapping models because this is a common need, and illustrates the spectrum of challenges you’ll need to overcome for any other base function.

Unfortunately it’s not possible to use tidy evaluation in our wrappers, because the semantics of NSE functions are not quite rich enough. This means that the wrappers we will create cannot in turn be easily wrapped. This makes them useful for reducing duplication in your analysis code, but not suitable for inclusion in a package.

### 21.6.1 Basics

Let’s start with a very simple wrapper around `lm()`:

```
lm2 <- function(formula, data) {
  lm(formula, data)
}
```

This wrapper works, but is suboptimal because `lm()` captures its call, and displays it when printing:

```
lm2(mpg ~ disp, mtcars)
#>
#> Call:
#> lm(formula = formula, data = data)
#>
#> Coefficients:
#> (Intercept)         disp
#>     29.5999      -0.0412
```
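The printed call shows `formula = formula` because `lm()` records the call exactly as it was written, using `match.call()`. The sketch below reproduces that behaviour with a toy function. `show_call()` and `wrapper()` are hypothetical names used only for this illustration, and `show_call()` is a stand-in for `lm()`, not a copy of its internals:

```r
# Hypothetical stand-in for lm(): it simply returns the call it received
show_call <- function(formula, data) {
  match.call()
}

# A naive wrapper passes its own argument names along
wrapper <- function(formula, data) {
  show_call(formula, data)
}

captured <- deparse(wrapper(mpg ~ disp, mtcars))
captured
#> [1] "show_call(formula = formula, data = data)"
```

The inner function only ever sees the symbols `formula` and `data`, so that is all it can record.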

This is important because this call is the chief way that you see the model specification when printing the model. To overcome this problem, we need to capture the arguments, create the call to `lm()` using unquoting, then evaluate that call:

```
lm3 <- function(formula, data) {
  formula <- enexpr(formula)
  data <- enexpr(data)
  lm_call <- expr(lm(!!formula, data = !!data))
  eval_bare(lm_call, caller_env())
}
lm3(mpg ~ disp, mtcars)$call
#> lm(formula = mpg ~ disp, data = mtcars)
```

Note that we manually supply an evaluation environment, `caller_env()`. We’ll discuss that in more detail shortly.

Note that this technique works for all the arguments, even those that use NSE, like `subset`:

```
lm4 <- function(formula, data, subset = NULL) {
  formula <- enexpr(formula)
  data <- enexpr(data)
  subset <- enexpr(subset)
  lm_call <- expr(lm(!!formula, data = !!data, subset = !!subset))
  eval_bare(lm_call, caller_env())
}
coef(lm4(mpg ~ disp, mtcars))
#> (Intercept)        disp
#>     29.5999     -0.0412
coef(lm4(mpg ~ disp, mtcars, subset = cyl == 4))
#> (Intercept)        disp
#>      40.872      -0.135
```

Note that I’ve supplied a default argument to `subset`. I think this is good practice because it clearly indicates that `subset` is optional: arguments with no default are usually required. `NULL` has two nice properties here:

- `lm()` already knows how to handle `subset = NULL`: it treats it the same way as a missing `subset`.
- `expr(NULL)` is `NULL`, which makes it easier to detect programmatically.

However, the current approach has one small downside: `subset = NULL` is shown in the call.

```
lm4(mpg ~ disp, mtcars)$call
#> lm(formula = mpg ~ disp, data = mtcars, subset = NULL)
```

It’s possible, if a little more work, to generate a call where `subset` is simply absent. There are two tricks needed to do this:

- We use the `%||%` helper to replace a `NULL` subset with `missing_arg()`.
- We use `maybe_missing()` in `expr()`: if we don’t do that the essential weirdness of the missing argument crops up and generates an error.

This leads to `lm5()`:

```
lm5 <- function(formula, data, subset = NULL) {
  formula <- enexpr(formula)
  data <- enexpr(data)
  subset <- enexpr(subset) %||% missing_arg()
  lm_call <- expr(lm(!!formula, data = !!data, subset = !!maybe_missing(subset)))
  eval_bare(lm_call, caller_env())
}
lm5(mpg ~ disp, mtcars)$call
#> lm(formula = mpg ~ disp, data = mtcars)
```

Note that all these wrappers have one small advantage over `lm()`: we can use unquoting.

```
f <- mpg ~ disp
lm5(!!f, mtcars)$call
#> lm(formula = mpg ~ disp, data = mtcars)
resp <- expr(mpg)
lm5(!!resp ~ disp, mtcars)$call
#> lm(formula = mpg ~ disp, data = mtcars)
```

### 21.6.2 The evaluation environment

What if you want to mingle objects supplied by the user with objects that you create in the function? For example, imagine you want to make an auto-bootstrapping version of `lm()`. You might write it like this:

```
boot_lm0 <- function(formula, data) {
  formula <- enexpr(formula)
  boot_data <- data[sample(nrow(data), replace = TRUE), , drop = FALSE]
  lm_call <- expr(lm(!!formula, data = boot_data))
  eval_bare(lm_call, caller_env())
}
df <- data.frame(x = 1:10, y = 5 + 3 * (1:10) + rnorm(10))
boot_lm0(y ~ x, data = df)
#> Error in is.data.frame(data): object 'boot_data' not found
```

Why doesn’t this code work? It’s because we’re evaluating `lm_call` in the caller environment, but `boot_data` exists in the execution environment. We could instead evaluate in the execution environment of `boot_lm0()`, but there’s no guarantee that `formula` could be evaluated in that environment.
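To see why the execution environment is not a safe choice either, here is a rough sketch. It swaps in base R’s `substitute()`, `bquote()`, and `eval()` for the rlang functions so that it runs without any packages. The names `boot_lm_exec()`, `analyse()`, and `scale_local` are hypothetical, and `df_demo` is made-up data:

```r
# Hypothetical variant that evaluates the call in its own execution
# environment, where boot_data lives
boot_lm_exec <- function(formula, data) {
  formula <- substitute(formula)
  boot_data <- data[sample(nrow(data), replace = TRUE), , drop = FALSE]
  lm_call <- bquote(lm(.(formula), data = boot_data))
  eval(lm_call, environment())
}

df_demo <- data.frame(x = 1:10, y = 5 + 3 * (1:10) + rnorm(10))

# Works: every variable in the formula is a column of the data
fit <- boot_lm_exec(y ~ x, df_demo)

# Fails: scale_local exists only in the environment of analyse(),
# which the execution environment of boot_lm_exec() cannot see
analyse <- function() {
  scale_local <- 2
  boot_lm_exec(y ~ I(x * scale_local), df_demo)
}
msg <- tryCatch(analyse(), error = function(e) conditionMessage(e))
msg
```

The first call succeeds and the second reports that `scale_local` cannot be found, which is the mirror image of the `boot_data` error above: whichever single environment we pick, one side’s variables may be out of reach.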

There are two basic ways to overcome this challenge:

1. Unquote the data frame into the call. This means that no look up has to occur, but has all the problems of inlining expressions. For modelling functions this means that the captured call is suboptimal:

   ```
   boot_lm1 <- function(formula, data) {
     formula <- enexpr(formula)
     boot_data <- data[sample(nrow(data), replace = TRUE), , drop = FALSE]
     lm_call <- expr(lm(!!formula, data = !!boot_data))
     eval_bare(lm_call, caller_env())
   }
   boot_lm1(y ~ x, data = df)
   #>
   #> Call:
   #> lm(formula = y ~ x, data = structure(list(x = c(9L, 4L, 1L, 7L,
   #> 3L, 8L, 6L, 3L, 2L, 10L), y = c(31.8432299491039, 17.2821895330579,
   #> 9.81418051082896, 25.5806797154872, 15.6196693921342, 29.2009109824721,
   #> 21.6696573392776, 15.6196693921342, 11.1735930555292, 35.807326619964
   #> )), .Names = c("x", "y"), row.names = c("9", "4", "1", "7", "3",
   #> "8", "6", "3.1", "2", "10"), class = "data.frame"))
   #>
   #> Coefficients:
   #> (Intercept)            x
   #>        6.31         2.84
   ```

2. Alternatively you can create a new environment that inherits from the caller, and you can bind variables that you’ve created inside the function to that environment.

   ```
   boot_lm2 <- function(formula, data) {
     formula <- enexpr(formula)
     boot_data <- data[sample(nrow(data), replace = TRUE), , drop = FALSE]
     lm_env <- child_env(caller_env(), boot_data = boot_data)
     lm_call <- expr(lm(!!formula, data = boot_data))
     eval_bare(lm_call, lm_env)
   }
   boot_lm2(y ~ x, data = df)
   #>
   #> Call:
   #> lm(formula = y ~ x, data = boot_data)
   #>
   #> Coefficients:
   #> (Intercept)            x
   #>        6.50         2.75
   ```

### 21.6.3 Making formulas

One final aspect to wrapping modelling functions is generating formulas. You just need to learn about one small wrinkle and then you can use the techniques you learned in Quotation. Formulas print the same when evaluated and unevaluated:

```
y ~ x
#> y ~ x
expr(y ~ x)
#> y ~ x
```

Instead, check the class to make sure you have an actual formula:

```
class(y ~ x)
#> [1] "formula"
class(expr(y ~ x))
#> [1] "call"
class(eval_bare(expr(y ~ x)))
#> [1] "formula"
```

Once you understand this, you can generate formulas with unquoting and `reduce()`. Just remember to evaluate the result before returning it. As with any other base NSE wrapper, you should use `caller_env()` as the evaluation environment.

Here’s a simple example that generates a formula by combining a response variable with a set of predictors.

```
build_formula <- function(resp, ...) {
  resp <- enexpr(resp)
  preds <- enexprs(...)
  pred_sum <- purrr::reduce(preds, ~ expr(!!.x + !!.y))
  eval_bare(expr(!!resp ~ !!pred_sum), caller_env())
}
build_formula(y, a, b, c)
#> y ~ a + b + c
```
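If the reduction step feels opaque, it may help to see the same fold written out with base R only. This is a sketch of the idea, not a replacement for `build_formula()`: it uses `Reduce()` and `call()` in place of `purrr::reduce()` and unquoting, and the symbols `y`, `a`, `b`, and `c` are placeholders:

```r
# Placeholder predictors as a list of symbols
preds <- list(quote(a), quote(b), quote(c))

# Fold the list into nested calls to `+`: first a + b, then (a + b) + c
pred_sum <- Reduce(function(lhs, rhs) call("+", lhs, rhs), preds)
identical(pred_sum, quote(a + b + c))
#> [1] TRUE

# Building the `~` call gives an unevaluated call, not yet a formula
f_call <- call("~", quote(y), pred_sum)
class(f_call)
#> [1] "call"

# Evaluating it produces the actual formula object
f <- eval(f_call)
class(f)
#> [1] "formula"
```

Each step of the fold wraps the accumulated expression in one more call to `+`, which is what `expr(!!.x + !!.y)` does inside `reduce()`.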

### 21.6.4 Exercises

1. When model building, typically the response and data are relatively constant while you rapidly experiment with different predictors. Write a small wrapper that allows you to reduce duplication in this situation.

   ```
   pred_mpg <- function(resp, ...) {
   }
   pred_mpg(~ disp)
   pred_mpg(~ I(1 / disp))
   pred_mpg(~ disp * cyl)
   ```

2. Another way to write `boot_lm()` would be to include the bootstrapping expression (`data[sample(nrow(data), replace = TRUE), , drop = FALSE]`) into the data argument. Implement that approach. What are the advantages? What are the disadvantages?

3. We could capture quosures, and then extract the environment from them. There are multiple environments associated with a quosure, but `eval_bare()` can only use one. Write a function that takes a list of quosures and returns the common environment, if they have one, or otherwise throws an error.

4. Write a function that takes a data frame and a list of formulas, fitting a linear model with each formula, generating a useful model call.

5. Create a formula generation function that allows you to optionally supply a transformation function (e.g. `log()`) to the response or the predictors.