Introduction

One of the most intriguing things about R is its capability for metaprogramming: the idea that code is itself data, and can be inspected and modified programmatically. This is a powerful idea that deeply influences much R code. At a simple level this tooling allows you to write library(purrr) instead of library("purrr") and enables plot(x, sin(x)) to label the axes with x and sin(x). At a deeper level it allows y ~ x1 + x2 to represent a model that predicts the value of y from x1 and x2. It allows subset(df, x == y) to be translated to df[df$x == df$y, , drop = FALSE], and for dplyr::filter(db, is.na(x)) to generate the SQL WHERE x IS NULL when db is a remote database table.

Closely related to metaprogramming is non-standard evaluation, or NSE for short. This is a term that’s commonly used to describe the behaviour of R functions, but there are two problems with the term that lead me to avoid it. Firstly, NSE is actually a property of an argument (or arguments) of a function, so talking about NSE functions is a little sloppy. Secondly, it’s confusing to define something by what it is not (standard), so in this book I’ll teach you more precise vocabulary. In particular, this book focusses on tidy evaluation, or tidy eval for short. Tidy eval is made up of three major ideas: quasiquotation, quosures, and data masks. This book focusses on the theoretical side of tidy evaluation, so you can fully understand how it works from the ground up. If you are looking for a practical introduction, I recommend the “tidy evaluation book”, https://tidyeval.tidyverse.org.

Metaprogramming is the hardest topic in this book because it forces you to grapple with issues that you haven’t thought about before. Don’t be surprised if you’re frustrated or confused at first; this is a natural part of the process that happens to everyone!

15.4 Big ideas

Before you dive into the details, I want to give you an overview of the most important ideas and vocabulary of metaprogramming:

  • Code is data; captured code is called an expression.
  • Code has a tree-like structure called an abstract syntax tree.
  • Expressions can be generated by code.
  • Evaluation executes an expression in an environment.
  • Evaluation can be customised by modifying or overriding the environment.
  • Data masks blur environments and data frames.
  • A quosure captures an expression with its environment.

Below, I’ll use tools primarily from the rlang package, as it allows you to focus on the big ideas, rather than implementation quirks that arise from R’s history. This approach seems backward to some, but it’s analogous to learning how to drive an automatic transmission before a manual transmission so you can focus on the big picture before learning the details.

15.4.1 Code is data

The first big idea is that code is data: you can capture code and compute on it like any other type of data. To compute on code, you first need some way to capture it. The first function that captures code is rlang::expr(). You can think of it as returning exactly what you pass in:
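A minimal sketch of capturing code with expr(), assuming the rlang package is installed; the two expressions are illustrative choices, and x does not need to exist because nothing is evaluated:

```r
library(rlang)

# expr() returns the code itself, not the result of running it.
e1 <- expr(mean(x, na.rm = TRUE))
e1
#> mean(x, na.rm = TRUE)

e2 <- expr(10 + 100 + 1000)
e2
#> 10 + 100 + 1000
```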

More formally, captured code is called an expression. An expression isn’t a single type of object, but is a collective term for any of four types (call, symbol, constant, or pairlist), which you’ll learn more about in Chapter 16.

expr() lets you capture code that you’ve typed. You need a different tool to capture code passed to a function because expr() doesn’t work:

Here you need to use a function specifically designed to capture user input in a function argument: enexpr().
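The difference can be sketched as follows, assuming rlang is installed; capture_with_expr() and capture_with_enexpr() are hypothetical helper names used only for illustration:

```r
library(rlang)

# Inside a function, expr() captures exactly what is typed there:
# the argument name `x`, not what the caller supplied.
capture_with_expr <- function(x) expr(x)
capture_with_expr(a + b + c)
#> x

# enexpr() captures the code the caller passed in for the argument.
capture_with_enexpr <- function(x) enexpr(x)
capture_with_enexpr(a + b + c)
#> a + b + c
```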

Once you have captured an expression, you can inspect and modify it. Complex expressions behave much like lists. That means you can modify them using [[ and $:
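For instance, a small sketch (assuming rlang is installed; the call f(x = 1, y = 2) is an arbitrary example, and f need not exist):

```r
library(rlang)

f <- expr(f(x = 1, y = 2))

# Add a new argument by name with $
f$z <- 3
f
#> f(x = 1, y = 2, z = 3)

# Remove an element by position with [[
# (position 1 holds the function name, so position 2 is the argument x)
f[[2]] <- NULL
f
#> f(y = 2, z = 3)
```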

Note that the first element of the call is the function to be called, which means the first argument is in the second position. You’ll learn about the full details in Section 16.3.3.

15.4.2 Code is a tree

To do more complex manipulation with code, you need to fully understand its structure. Behind the scenes, almost every programming language represents code as a tree, often called the abstract syntax tree, or AST for short. R is unusual in that you can actually inspect and manipulate this tree.

A very convenient tool for understanding the tree-like structure is lobstr::ast(). Given some code, it will display the underlying tree structure. Function calls form the branches of the tree, and are shown by rectangles. The leaves of the tree are symbols (like a) and constants (like "b").
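A minimal sketch, assuming the lobstr package is installed; f and a are placeholder names that need not exist, because ast() only inspects the code:

```r
# The call to f is the branch; the symbol a and the constant "b" are leaves.
lobstr::ast(f(a, "b"))
#> █─f
#> ├─a
#> └─"b"
```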

Nested function calls create more deeply branching trees:
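As a sketch (assuming lobstr is installed; f, g, h, and i are placeholder names):

```r
# Each nested call becomes its own sub-branch of the tree.
lobstr::ast(f(g(1, 2), h(3, 4, i())))
#> █─f
#> ├─█─g
#> │ ├─1
#> │ └─2
#> └─█─h
#>   ├─3
#>   ├─4
#>   └─█─i
```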

Because all function forms can be written in prefix form (Section 5.8.2), every R expression can be displayed in this way:
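For example, an infix arithmetic expression is a pair of nested calls to `+` and `*`. The sketch below assumes lobstr is installed, and uses base R's quote() and identical() to check that the infix and prefix spellings give the same tree:

```r
lobstr::ast(1 + 2 * 3)
#> █─`+`
#> ├─1
#> └─█─`*`
#>   ├─2
#>   └─3

# The infix and prefix forms parse to the same call.
same_tree <- identical(quote(1 + 2 * 3), quote(`+`(1, `*`(2, 3))))
same_tree
#> [1] TRUE
```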

Displaying the code tree in this way provides useful tools for exploring R’s grammar, the topic of Section 16.4.

15.4.3 Code can generate code

As well as seeing the tree from code typed by a human, you can also use code to create new trees. There are two main tools: call2() and unquoting.

rlang::call2() constructs a function call from its components: the function to call, and the arguments to call it with.
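A short sketch, assuming rlang is installed; the function names and arguments are arbitrary:

```r
library(rlang)

# The first argument is the function to call; the rest are its arguments.
c1 <- call2("f", 1, 2, 3)
c1
#> f(1, 2, 3)

# Calls can be nested to build larger trees.
c2 <- call2("+", 1, call2("*", 2, 3))
c2
#> 1 + 2 * 3
```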

This is often convenient to program with, but is a bit clunky for interactive use. An alternative technique is to build complex code trees by combining simpler code trees with a template. expr() and enexpr() have built-in support for this idea via !! (pronounced bang-bang), the unquote operator.

The precise details are the topic of Chapter 17, but basically !!x inserts the code tree stored in x. This makes it easy to build complex trees from simple fragments:
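A minimal sketch, assuming rlang is installed; xx, yy, and ratio are illustrative variable names:

```r
library(rlang)

xx <- expr(x + x)
yy <- expr(y + y)

# !! inserts each stored tree into the template as a single unit.
ratio <- expr(!!xx / !!yy)
ratio
#> (x + x)/(y + y)
```

Evaluating ratio with x = 1 and y = 2 gives 2 / 4 = 0.5, which confirms that the two sums are kept together as the numerator and the denominator.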

Notice that the output preserves the operator precedence so we get (x + x) / (y + y) not x + x / y + y (i.e. x + (x / y) + y). This is important to note, particularly if you’ve been thinking “wouldn’t this be easier to do by pasting strings?”.

Unquoting gets even more useful when you wrap it up into a function, first using enexpr() to capture the user’s expression, then expr() and !! to create a new expression using a template. The example below shows how you might generate an expression that computes the coefficient of variation:

Importantly, this works even when given weird variable names:
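One way to sketch such a function, assuming rlang is installed; the name cv() and the inputs are illustrative choices. The last call passes the non-syntactic name `)` to show that it is carried through intact:

```r
library(rlang)

cv <- function(var) {
  var <- enexpr(var)               # capture the caller's expression
  expr(sd(!!var) / mean(!!var))    # insert it into a template
}

cv(x)
#> sd(x)/mean(x)
cv(x + y)
#> sd(x + y)/mean(x + y)

# A non-syntactic variable name stays correctly quoted.
cv(`)`)
#> sd(`)`)/mean(`)`)
```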

Dealing with non-syntactic variable names is another good reason to avoid paste() when generating R code. You might think this is an esoteric concern, but not worrying about it when generating SQL code in web applications has led to SQL injection attacks that have collectively cost billions of dollars.

These techniques become yet more powerful when combined with functional programming. You’ll explore these ideas in detail in Section ?? but the teaser below shows how you might generate a complex model specification from simple inputs.
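A rough sketch of the idea, assuming rlang is installed and using base R's Map() and Reduce() as the functional tools; the intercept, coefficient values, and variable names are made up for illustration:

```r
library(rlang)

intercept <- 10
coefs <- c(x1 = 5, x2 = -4)

# Turn the coefficient names into symbols, then build one term per coefficient.
coef_syms <- syms(names(coefs))
summands <- Map(
  function(sym, coef) expr((!!sym * !!coef)),
  coef_syms, unname(coefs)
)

# Fold the intercept and the terms together with `+`.
eq <- Reduce(
  function(lhs, rhs) expr(!!lhs + !!rhs),
  c(list(intercept), summands)
)
eq
#> 10 + (x1 * 5) + (x2 * -4)
```

Evaluating eq with x1 = 1 and x2 = 2 gives 10 + 5 - 8 = 7.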

15.4.4 Evaluation executes an expression in an environment

Inspecting and modifying code gives you one set of powerful tools. You get another set of powerful tools when you evaluate, i.e. execute, an expression. Evaluating an expression requires an environment. This tells R what the symbols (found in the leaves of the tree) mean. You’ll learn the details of evaluation in Chapter 18.

The primary tool for evaluating expressions is base::eval(), which takes an expression and an environment:
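A minimal sketch, assuming rlang is installed for expr() and env(); the variable values are arbitrary:

```r
library(rlang)

# The same expression gives different results in different environments.
r1 <- eval(expr(x + y), env(x = 1, y = 10))
r1
#> [1] 11

r2 <- eval(expr(x + y), env(x = 2, y = 100))
r2
#> [1] 102
```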

If you omit the environment, it will use the current environment. Here that’s the global environment:
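For instance (a sketch assuming rlang is installed and the code is run at the top level of a fresh session):

```r
library(rlang)

x <- 10
y <- 100

# With no environment supplied, x and y are looked up where eval() is called.
r <- eval(expr(x + y))
r
#> [1] 110
```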

One of the big advantages of evaluating code manually is that you can tweak the execution environment. There are two main reasons to do this:

  • To temporarily override functions to implement a domain specific language.
  • To add a data mask so you can refer to variables in a data frame as if they are variables in an environment.

15.4.5 You can override functions to make a DSL

It’s fairly straightforward to understand customising the environment with different variable values. It’s less obvious that you can also rebind functions to do different things. This is a big idea that we’ll come back to in Chapter ??, but I wanted to show a small example here.

The example below evaluates code in a special environment where the basic algebraic operators (+, -, *, /) have been overridden to work with strings instead of numbers:
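A minimal sketch of the technique, assuming rlang is installed. For brevity it overrides only + and *; string_math() is an illustrative name, and - and / could be rebound in the same way:

```r
library(rlang)

string_math <- function(x) {
  # A child of the caller's environment in which + and * act on strings.
  e <- env(
    caller_env(),
    `+` = function(x, y) paste0(x, y),
    `*` = function(x, y) strrep(x, y)
  )
  eval(enexpr(x), e)
}

name <- "Hadley"
string_math("Hello " + name)
#> [1] "Hello Hadley"
string_math(("x" * 2 + "-y") * 3)
#> [1] "xx-yxx-yxx-y"
```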

dplyr takes this idea to the extreme, running code in an environment that generates SQL for execution in a remote database:

15.4.6 Data masks blur the line between data frames and environments

Rebinding functions is an extremely powerful technique, but it tends to require a lot of investment. A more immediately practical application is modifying evaluation to look for variables in a data frame instead of an environment. This idea powers the base subset() and transform() functions, as well as many tidyverse functions like ggplot2::aes() and dplyr::mutate(). It’s possible to use eval() for this, but there are a few potential pitfalls, so we’ll use rlang::eval_tidy() instead.

As well as expression and environment, eval_tidy() also takes a data mask, which is typically a data frame:
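A short sketch, assuming rlang is installed; the data frame is made up for illustration:

```r
library(rlang)

df <- data.frame(x = 1:5, y = c(2, 4, 1, 5, 3))

# x and y are found in the data frame rather than in an environment.
res <- eval_tidy(expr(x + y), df)
res
#> [1] 3 6 4 9 8
```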

Evaluating with a data mask is a useful technique for interactive analysis because it allows you to write x + y rather than df$x + df$y. However, that convenience comes at a cost: ambiguity. In Section 18.4.2 you’ll learn how to deal with ambiguity using special .data and .env pronouns.

We can wrap this pattern up into a function by using enexpr(). This gives us a function very similar to base::with():
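A minimal sketch of such a wrapper, assuming rlang is installed; the data frame is illustrative:

```r
library(rlang)

with2 <- function(df, expr) {
  # Capture the caller's expression, then evaluate it with df as the data mask.
  eval_tidy(enexpr(expr), df)
}

df <- data.frame(x = 1:5, y = c(2, 4, 1, 5, 3))
res <- with2(df, x + y)
res
#> [1] 3 6 4 9 8
```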

Unfortunately, however, this function has a subtle bug, and we need a new data structure to deal with it.

15.4.7 Quosures capture an expression with its environment

To make the problem more obvious, I’m going to modify with2():

(The problem occurs without this modification, but it’s subtler and creates error messages that are harder to understand.)

We can see the problem if we attempt to use with2() mingling a variable from the data frame, and a variable called a in the current environment:
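A sketch of the failure, assuming rlang is installed. Here with2() is given a local variable a set to 1000, and the data frame and the caller's a are illustrative:

```r
library(rlang)

with2 <- function(df, expr) {
  a <- 1000                       # local variable that shadows the caller's a
  eval_tidy(enexpr(expr), df)     # a bare expression carries no environment
}

df <- data.frame(x = 1:3)
a <- 10

# x comes from df, but a is found inside with2(), not where the call was written.
res <- with2(df, x + a)
res
#> [1] 1001 1002 1003
```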

That’s because we really want to evaluate the captured expression in the environment where it was written (where a is 10), not the environment inside of with2() (where a is 1000).

Fortunately we can solve this problem by using a new data structure: the quosure, which bundles an expression with an environment. eval_tidy() knows how to work with quosures so all we need to do is switch out enexpr() for enquo():
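A sketch of the corrected function, assuming rlang is installed and reusing the same illustrative setup (a local a of 1000 inside with2(), and a caller's a of 10):

```r
library(rlang)

with2 <- function(df, expr) {
  a <- 1000
  # enquo() records the environment where the expression was written.
  eval_tidy(enquo(expr), df)
}

df <- data.frame(x = 1:3)
a <- 10

# a is now looked up in the caller's environment, where it is 10.
res <- with2(df, x + a)
res
#> [1] 11 12 13
```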

Whenever you use a data mask, you must always use enquo() instead of enexpr(). This is the topic of Chapter 18.

Overview

In the following chapters, you’ll learn about the three pieces that underpin metaprogramming:

  • In Expressions, Chapter 16, you’ll learn that all R code forms a tree. You’ll learn how to visualise that tree, how the rules of R’s grammar convert linear sequences of characters into a tree, and how to use recursive functions to work with code trees.

  • In Quasiquotation, Chapter 17, you’ll learn to use tools from rlang to capture (“quote”) unevaluated function arguments. You’ll also learn about quasiquotation, which provides a set of techniques for “unquoting” input that makes it possible to easily generate new trees from code fragments.

  • In Evaluation, Chapter 18, you’ll learn about the inverse of quotation: evaluation. Here you’ll learn about an important data structure, the quosure, which ensures correct evaluation by capturing both the code to evaluate, and the environment in which to evaluate it. This chapter will show you how to put all the pieces together to understand how NSE in base R works, and how to write your own functions that work like subset().

  • Finally, in Translating R code, Chapter 19, you’ll see how to combine first-class environments, lexical scoping, and metaprogramming to translate R code into other languages, namely HTML and LaTeX.

Each chapter follows the same basic structure. You’ll get the lay of the land in the introduction, then see a motivating example. Next you’ll learn the big ideas using functions from the rlang package (Henry and Wickham 2018), and then we’ll circle back to talk about how those ideas are expressed in base R.

References

Henry, Lionel, and Hadley Wickham. 2018. Rlang: Functions for Base Types and Core R and ’Tidyverse’ Features. https://rlang.r-lib.org.


  1. The tidy evaluation book is a work-in-progress at the time I wrote this chapter, but will hopefully be finished by the time you read it!