This chapter summarises the most important data types in base R: the vector types. You’ve probably used many (if not all) of the vectors before, but you may not have thought deeply about how they are interrelated. In this brief overview, I won’t discuss individual types in depth. Instead, I’ll show you how they fit together as a whole. If you need more details, you can find them in R’s documentation.
Vectors come in two flavours: atomic vectors and lists. They differ in the types of their elements: all elements of an atomic vector must be the same type, whereas the elements of a list can have different types. Closely related to vectors is
NULL is not a vector, but often serves the role of a generic 0-length vector.
Every vector can also have attributes, which you can think of as a named list containing arbitrary metadata. Two attributes are particularly important because they create important vector variants. The dimension attribute turns vectors into matrices and arrays. The class attribute powers the S3 object system, which you’ll learn how to use in Chapter 14. Here, you’ll learn about a handful of the most important S3 vectors: factors, date/times, data frames, and tibbles.
Take this short quiz to determine if you need to read this chapter. If the answers quickly come to mind, you can comfortably skip this chapter. You can check your answers in answers.
What are the four common types of atomic vectors? What are the two rare types?
What are attributes? How do you get them and set them?
How is a list different from an atomic vector? How is a matrix different from a data frame?
Can you have a list that is a matrix? Can a data frame have a column that is a matrix?
How do tibbles behave differently from data frames?
Vectors introduces you to atomic vectors and lists, R’s 1d data structures.
Attributes takes a small detour to discuss attributes, R’s flexible metadata specification. Here you’ll learn about factors, an important data structure created by setting attributes of an atomic vector.
Matrices and arrays introduces matrices and arrays, data structures for storing 2d and higher dimensional data.
Data frames teaches you about the data frame, the most important data structure for storing data in R. Data frames combine the behaviour of lists and matrices to make a structure ideally suited for the needs of statistical data.
4.2 Atomic vectors
There are four common types of atomic vectors: logical, integer, double, and character, and collectively integer and double vectors are known as numeric8. There are two rare types that I will not discuss further: complex and raw. Complex numbers are rarely need for statistics, and raw vectors are special type needed to work with binary data.
Each of the four primiary atomic vectors has some special syntax to create individual pieces:
Strings are surrounded by
Doubles can be specific in decimal form (
0.1234), scientific form (
1.23e4), or hex form (
0xcafe). Additionally, doubles have three special values:
Integers are written similarly to doubles but must be followed by
1e4L, or ), and can not include decimals.
Logicals can be spelled out (
FALSE), or abbreviated.
Each atomic vector also has its own missing value:
NA_real_ (double), and
NA_character_ (charater). You don’t usually need to know about the special types since, as you’ll learn shortly, whenver you use
NA with other values, it will be coerced to the correct type.
I’ll often use scalar to refer to indiviudal values, like
1. This is a bit of a simplification: they’re really vectors of length 1. However, scalar is a convenient word, even if it’s not completely true to the underlying relatively.
You’ll often use
c() vectors of length greater than one with
Throughout the book, I’ll draw vectors as connected rectangles, so the above code could be drawn as follows:
4.2.2 Testing and coercing
Given a vector, you can determine its type with
typeof(), and its length with
Use “is” functions with care.
is.logical() are ok. But the following may be surprising:
is.vector()tests for vectors that have no attributes apart from names.
is.atomic()tests for atomic vectors or NULL.
is.numeric()tests for the numerical-ness of a vector, not whether it’s built on top of an integer or double.
See Section 13.4 for more detail.
All elements of an atomic vector must be the same type, so when you attempt to combine different types they will be coerced to the most flexible one (logical << integer << double << character). For example, combining a character and an integer yields a character:
Coercion often happens automatically. Most mathematical functions (
abs, etc.) will coerce to numeric. This particularly useful for logical vectors because
TRUE becomes 1 and
FALSE becomes 0.
Vectorised logical operations (
any, etc) will coerce to a logical, but since this might lose information, it’s always accompanied a warning.
You can coerce deliberately by using
as.logical(). Failed coercions from strings generate a warning and a missing value:
How do you create vectors of type raw and complex? (See
Test your knowledge of vector coercion rules by predicting the output of the following uses of
1 == "1"true? Why is
-1 < FALSEtrue? Why is
"one" < 2false?
is.numeric()fundamentally different to
Why is the default missing value,
NA, a logical vector? What’s special about logical vectors? (Hint: think about
You might have noticed that the set of atomic vectors does not include a number important vectors like factors and date/times. These types are built on top of atomic vectors by adding attributes to them. In this section, you’ll learn the basics of attributes, and in the next section you’ll learn about factors and date/times.
4.3.1 Getting and setting
You can think of attributes as a named list9 used to attach metadata to an object. Individual attributes can be retrieved and modified with
attr(), or you can retrieved en masse with
attributes(), and set en masse with
Attributes should generally be thought of as ephemeral (unless they’re formalised into an S3 class). For example, most attributes will be lost when you operate on a vector:
There are three exceptions for particularly important attributes:
- Names, a character vector giving each element a name.
- Dimensions, an integer vector, used to turn vectors into matrices and arrays.
- Class, a character vector, used to implement the S3 object system.
Each of these three attributes is described in more detail below.
You can name a vector in three ways:
You can remove names from a vector by using
names(x) <- NULL.
You should avoid using
attr(x, "names) as it more typing and less readable than
To be technically correct, when drawing the named vector
x, I should draw it like so:
However, names are so special and so important, that unless I’m trying specifically to draw attention to the attributes data structure, I’ll use them to label the vector directly:
To be maximally useful for character subsetting (e.g. Section 5.5.1) names should be unique, and non-missing, but this is not enforced by R. Depending on how the names are set, missing names may be either
NA_character_. If all names are missing,
names() will return
dim attribute to a vector allows it to behave like a 2-dimensional matrix or multi-dimensional array. Matrices and arrays are primarily a mathematical/statistics tool, not a programming tool, so will be used infrequently in this book. You can create matrices and arrays with
array(), by using the assignment form of
dim(), or with
# Two scalar arguments specify row and column sizes a <- matrix(1:6, nrow = 2, ncol = 3) a #> [,1] [,2] [,3] #> [1,] 1 3 5 #> [2,] 2 4 6 # One vector argument to describe all dimensions b <- array(1:12, c(2, 3, 2)) b #> , , 1 #> #> [,1] [,2] [,3] #> [1,] 1 3 5 #> [2,] 2 4 6 #> #> , , 2 #> #> [,1] [,2] [,3] #> [1,] 7 9 11 #> [2,] 8 10 12 # You can also modify an object in place by setting dim() c <- 1:6 dim(c) <- c(3, 2) c #> [,1] [,2] #> [1,] 1 4 #> [2,] 2 5 #> [3,] 3 6 # Or with structure structure(1:6, dim = c(2, 3)) #> [,1] [,2] [,3] #> [1,] 1 3 5 #> [2,] 2 4 6
Many of the functions for working with vectors have generalisations for matrices and arrays:
A vector without
dim attribute set is often thought of as 1-dimensional, but actually have a
NULL dimensions. You also can have matrices with a single row or single column, or arrays with a single dimension. They may print similarly, but will behave differently. The differences aren’t too important, but it’s useful to know they exist in case you get strange output from a function (
tapply() is a frequent offender). As always, use
str() to reveal the differences.
The final attribute,
class, defines the S3 object oriented programming system. Having a class attribute makes an object an S3 object, which means that it will behave differently when passed to a generic function. Every S3 object is built on top of a base type, and often stores additional information in other attributes.
Having a class attribute and special defined methods makes attributes persistent. This, for example, is what ensures that factors keep the same levels even when you subset them:
You’ll learn the details of the S3 object system, and how to create your own S3 classes, in Chapter 14. In the following section, you’ll learn more about the most important S3 atomic vectors in base R.
setNames()implemented? How is
dim()return when applied to a vector? When might you use
How would you describe the following three objects? What makes them different to
An early draft used this code to illustrate
But when you print that object you don’t see the comment attribute. Why? Is the attribute missing, or is there something else special about it? (Hint: try using help.)
4.4 S3 atomic vectors
There are four important S3 vectors used in base R to represent three important types of data:
Categorical data, where values can only come from a fixed set of levels,
are recorded in factor vectors.
Dates (with day resolution) are recorded are Date vectors.
Date-times (with second or sub-second) resolution can be stored in POSIXct or POSIXlt vectors.
A factor is a vector that can contain only predefined values, and is used to store categorical data. Factors are built on top of integer vectors with two attributes: the
class, “factor”, which makes them behave differently from regular integer vectors, and the
levels, which defines the set of allowed values.
We can describe the the levels as a parameter of the factor type, since it does not make sense to compare a factor with levels “male” and “female”, to a factor with levels with “red” and “blue”.
Factors are useful when you know the possible values a variable may take, even if you don’t see all values in a given dataset. Using a factor instead of a character vector makes it obvious when some groups contain no observations:
Unfortunately, many base R functions (like
data.frame()) automatically convert character vectors to factors. This is suboptimal, because there’s no way for those functions to know the set of all possible levels or their optimal order: the levels are a property of the experimental design, not the data. Instead, use the argument
stringsAsFactors = FALSE to suppress this behaviour, and then manually convert character vectors to factors using your knowledge of the data.
A global option,
options(stringsAsFactors = FALSE), is available to control this behaviour, but I don’t recommend using it. Changing a global option may have unexpected consequences when combined with other code (either from packages, or code that you’re
source()ing), and global options make code harder to understand because they increase the number of lines you need to read to understand how a single line of code will behave. Instead you might want to consider packages from the tidyverse: they never automatically convert strings to factors.
While factors look like (and often behave like) character vectors, they are built on top of integers. Be careful when treating them like strings. Some string methods (like
grepl()) will coerce factors to strings automatically, while others (like
nchar()) will throw an error, and still others (like
c()) will use the underlying integer values. For this reason, it’s usually best to explicitly convert factors to character vectors if you need string-like behaviour.
The tidyverse provides the forcats package.
Date vectors are built on top of double vectors. They have class “Date” and no other attributes:
There are two ways of storing date-time information, POSIXct, and POSIXlt. These are odd names. “POSIX” is short for Portable Operating System Interface which is a family of cross-platform standards. “ct” standards for calendar time (the
time_t type in C), and “lt” for local time (the
struct tm type in C).
POSIXct is the simplest, and most appropriate for use in data frames. POSIXct vectors are built on top of double vectors, where the value represents the number of days since 1970-01-01.
The tzone attribute controls how the date-time is formated. Note that the time is not printed if it is midnight.
POSIXlt is built on top of a list, where the components of the list are components of the date time.
now_lt <- as.POSIXlt(today) typeof(now_lt) #>  "list" attributes(now_lt) #> $names #>  "sec" "min" "hour" "mday" "mon" "year" "wday" "yday" "isdst" #> #> $class #>  "POSIXlt" "POSIXt" #> #> $tzone #>  "UTC" str(unclass(now_lt)) #> List of 9 #> $ sec : num 0 #> $ min : int 0 #> $ hour : int 0 #> $ mday : int 10 #> $ mon : int 6 #> $ year : int 118 #> $ wday : int 2 #> $ yday : int 190 #> $ isdst: int 0 #> - attr(*, "tzone")= chr "UTC"
Date/times are a little idiosyncratic:
There’s are no
POSIXlt()functions to create date/times
The values of POSIXlt are not usable as is10)
So I recommend using the helper functions in the lubridate (Grolemund and Wickham 2011) package,
What sort of object does
table()return? What is its type? What attributes does it have? How does the dimensionality change as you tabulate more variables?
What happens to a factor when you modify its levels?
What does this code do? How do
Lists are different from atomic vectors because their elements can be of any type, including lists.
Construct lists with
Lists can contain complex objects so it’s not possible to pick one visual style that works for every list. Generally I’ll draw lists like vectors, using colour to remind you of the hierarchy.
Lists are sometimes called recursive vectors, because a list can contain other lists. This makes them fundamentally different from atomic vectors.
c() will combine several lists into one. If given a combination of atomic vectors and lists,
c() will coerce the vectors to lists before combining them. Compare the results of
4.5.2 Testing and coercing
typeof() a list is
list. You can test for a list with
A coerce to a list with
You can turn a list into an atomic vector with
unlist(). When combining S3 vectors, note that it uses a somewhat different set of coercion rules than
4.5.3 Matrices and arrays
While atomic vectors are most commonly turned into matrices, the dimension attribute can also be set on lists to make list-matrices or list-arrays:
These are relatively esoteric data structures, but can be useful if you want to arrange objects into a grid-like structure. For example, if you’re running models on a spatio-temporal grid, it might be natural to preserve the grid structure by storing the models in a 3d array.
How does a list differ from an atomic vector?
Why do you need to use
unlist()to convert a list to an atomic vector? Why doesn’t
Compare and contrast
unlist()when you combining a date and date-time into a single vector.
4.6 Data frames and tibbles
There are two important S3 vectors that are built on top of lists: data frames and tibbles.
A data frame is the most common way of storing data in R, and crucial for effective data analysis. A data frames is a named list of equal-length vectors. It has attributes providing the (column)
row.names, and a class of “data.frame”:
Because each element of the list has the same length, data frames have a rectangular structure, and hence shares properties of both the matrix and the list:
A data frame has 1d
names(), but also 2d
A data frame has 1d
length(), but also 2d
length()is the number of columns.
Data frames are one of the biggest and most powerful ideas of R, and one of the things that makes R different from other programming languages. However, in the over 20 years since their creation, the ways people use R have changed, and some of the design decisions that made sense at the time data frames were created now cause frustration.
This frustration lead to the creation of the tibble (Müller and Wickham 2018), a modern reimagining of the data frame. Tibbles have been designed to be (as much as possible) drop-in replacements for data frames, while still fixing the greatest frustrations. A concise, and fun way to summarise the main difference is that tibbles are lazy and surly: they tend to do less and complain more. You’ll see what that means as you work through this section.
Tibbles are provided by the tibble package and share the the same structure as a data frame:
The only difference is that the class vector is longer, and includes
tbl_df. This allows tibbles to behave differently in the key ways which we’ll discuss below.
When drawing data frames and tibbles, rather than focussing on the implementation details, i.e. the attributes:
I’ll draw them in the same way as a named list, but arranged to emphasised their columnar structure.
You create a data frame by supplying name-vector pairs to
Beware the default conversion of strings to factors. Use
stringsAsFactors = FALSE to suppress it and keep character vectors as character vectors:
Tibbles never coerce their input:
Generally tibbles do less on creation (this is what makes them lazy). Data frames automatically transform non-syntactic names (unless
check.names = FALSE); whereas tibbles do not (although they do print non-syntactic names surrounded by
While every element of a data frame (or tibble) must have the same length, both
tibble() can recycle their inputs. Data frames automatically recycle columns that are an integer multiple of the longest column; tibbles only ever recycle vectors of length 1.
data.frame(x = 1:4, y = 1:2) #> x y #> 1 1 1 #> 2 2 2 #> 3 3 1 #> 4 4 2 data.frame(x = 1:4, y = 1:3) #> Error in data.frame(x = 1:4, y = 1:3): arguments imply differing number of rows: 4, 3 tibble(x = 1:4, y = 1) #> # A tibble: 4 x 2 #> x y #> <int> <dbl> #> 1 1 1. #> 2 2 1. #> 3 3 1. #> 4 4 1. tibble(x = 1:4, y = 1:2) #> Error: Column `y` must be length 1 or 4, not 2
4.6.2 Row names
Data frames allow you to label each row with a “name”, a character vector containing only unique values:
You can get and set row names with
row.names()11, and you can use them to subset rows:
Row names arise naturally if you think of data frames as 2d structures like matrices: the columns (variables) have names so the rows (observations) should too. Most matrices are numeric, so having a place to store character labels is important. But this analogy to matrices is misleading because matrices possess an important property that data frames do not: they are transposable. In matrices the rows and columns are interchangeable, and transposing a matrix gives you another matrix (and transposing again gives you back the original matrix). With data frames, however, the rows and columns are not interchangeable, and the transpose of a data frame is not a data frame.
There are three other components of row names that make them suboptimal for use with data frames:
Metadata is data, so storing it in a different way to the rest of the data is fundamentally a bad idea. It also means that you need to learn a new set of tools to work with row names; you can’t use what you already know about manipulating variables.
Row names are poor abstraction for labelling rows because they only work when a row can be identified by a single string. This fails in many cases, for example when you want to identify a row by a non-character vector (e.g. a time point), or with multiple vectors (e.g. position, encoded by latitidue and longitude).
Row names must be unique, so any replication of rows (e.g. from bootstrapping) will create new row names. If you want to match rows from before and after the transformation you’ll need to perform complicated string surgery.
For these reasons, tibbles do not support row names. Instead tibble package provides tools to easily convert row names into a regular column with either
rownames_to_column(), or the
rownames argument to
One of the most obvious differences between tibbles and data frames is how they are printed. You’re already familiar with how data frames are printed, so here I’ll highlight some of the biggest differences using an example dataset included in the dplyr package:
dplyr::starwars #> # A tibble: 87 x 13 #> name height mass hair_color skin_color eye_color birth_year gender #> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> #> 1 Luke Sk… 172 77. blond fair blue 19.0 male #> 2 C-3PO 167 75. <NA> gold yellow 112. <NA> #> 3 R2-D2 96 32. <NA> white, bl… red 33.0 <NA> #> 4 Darth V… 202 136. none white yellow 41.9 male #> 5 Leia Or… 150 49. brown light brown 19.0 female #> 6 Owen La… 178 120. brown, gr… light blue 52.0 male #> 7 Beru Wh… 165 75. brown light blue 47.0 female #> 8 R5-D4 97 32. <NA> white, red red NA <NA> #> 9 Biggs D… 183 84. black light brown 24.0 male #> 10 Obi-Wan… 182 77. auburn, w… fair blue-gray 57.0 male #> # ... with 77 more rows, and 5 more variables: homeworld <chr>, #> # species <chr>, films <list>, vehicles <list>, starships <list>
Only show the first 10 rows and all the columns that will fit on screen. Additional columns are shown at the bottom.
Each column is labelled with its type, abbreviated to three or four letters.
Wide columns are truncated to avoid a single long string occupying an entire row. (This is still a work in progress: it’s tricky to get the tradeoff right between showing as many columns as possible and showing a single wide column fully.)
When used in console environments that support it, colour is used judiciously to highlight important information, and de-emphasise supplemental details.
As you will learn in Chapter 5, you can subset a data frame or a tibble like a 1d structure (where it behaves like a list), or a 2d structure (where it behaves like a matrix).
In my opinion, data frames have two suboptimal subsetting behaviours:
When you subset columns with
df[, vars], you will get a vector if
varsselects one variable, otherwise you’ll get a data frame. This is a frequent source of bugs when using
[in a function, unless you always remember to do
df[, vars, drop = FALSE].
When you extract a single column with
df$xand there is no column
x, a data frame will instead select any variable that starts with
x. If no variable starts with
df$xwill return a
NULL. This makes it easier to select the wrong variable, or to select a variable that doesn’t exist.
Tibbles tweak these behaviours so that
[ always returns a tibble, and
$ doesn’t partial match, and warns if it can’t find a variable (this is what makes tibbles surly).
A tibble’s insistence on returning a data frame from
[ can cause problems with legacy code, which often uses
df[, "col"] to extract a single column. To fix this, use
df[["col"]] instead. This is more expressive (since
[[ always extracts a single element) and works with both data frames and tibbles.
4.6.5 Testing and coercing
To check if an object is a data frame or tibble, use
Typically, it should not matter if you have a tibble or data frame, but if you do need to distinguish, use
You can coerce an object to a data frame with
as.data.frame() or to as tibble with
4.6.6 List columns
Since a data frame is a list of vectors, it is possible for a data frame to have a column that is a list. This very useful. Since a list can contain any other R object, this means that you can have a column of data frames or of models. This allows you to keep related objects together in a row, no matter how complex the individual objects are. You can see an application of this in the Many Models chapter of “R for Data Sicence”, http://r4ds.had.co.nz/many-models.html.
List-columns are allowed in data frames but you have to do a little extra work, either adding the list-column after creation, or wrapping the list in
List columns are easier to use with tibbles because you can provide them inside
tibble(), are they are handled specially when printing:
4.6.7 Matrix and data frame columns
It’s also possible to have a column of a data frame that’s a matrix or array, as long as the number of rows matches the data frame. (This requires a slight extention to our definition of a data frame: it’s not the
length() of each column that must be equal; but the
NROW().) Like with list-columns, you must either add after creation, or wrap in
dfm <- data.frame( x = 1:3 * 10 ) dfm$y <- matrix(1:9, nrow = 3) dfm$z <- data.frame(a = 3:1, b = letters[1:3], stringsAsFactors = FALSE) str(dfm) #> 'data.frame': 3 obs. of 3 variables: #> $ x: num 10 20 30 #> $ y: int [1:3, 1:3] 1 2 3 4 5 6 7 8 9 #> $ z:'data.frame': 3 obs. of 2 variables: #> ..$ a: int 3 2 1 #> ..$ b: chr "a" "b" "c"
Matrix and data frame columns require a little caution. Many functions that work with data frames assume that all columns are vectors, and the printed display can be confusing.
At the time of writing, tibbles did not support matrix or data frame columns, but the tidyverse team is working on adding support, and hopefully they’ll work by the time you read this.
Can you have a data frame with 0 rows? What about 0 columns?
What happens if you attempt to set rownames that are not unique?
dfis a data frame, what can you say about
t(t(df))? Perform some experiments, making sure to try different column types.
as.matrix()do when applied to a data frame with columns of different types? How does it differ from
To finish up the chapter, I wanted to talk about a final important data struture that’s closely related to vectors:
NULL is special because it has a unique type, is always length 0, and can’t have any attributes:
You can test for
There are two common uses of
To represent an empty vector (a vector of length 0) of arbitrary type. For example, if you use
c()but don’t include any arguments, you get
NULL, and concatenating
NULLto a vector leaves it unchanged12:
To represent an absent vector. For example,
NULLis often used as a default function argument, when the argument is optional but the default value requires some computation (see Section 6.4.4 for more on this idea). Contrast this with
NAwhich is used to indicate that an element of a vector is absent.
The four common types of atomic vector are logical, integer, double (sometimes called numeric), and character. The two rarer types are complex and raw.
Attributes allow you to associate arbitrary additional metadata to any object. You can get and set individual attributes with
attr(x, "y") <- value; or get and set all attributes at once with
The elements of a list can be any type (even a list); the elements of an atomic vector are all of the same type. Similarly, every element of a matrix must be the same type; in a data frame, the different columns can have different types.
You can make “list-array” by assigning dimensions to a list. You can make a matrix a column of a data frame with
df$x <- matrix(), or using
I()when creating a new data frame
data.frame(x = I(matrix())).
Tibbles have an enhanced print method, never coerce strings to factors, and provide stricter subsetting methods.
Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3):1–25. http://www.jstatsoft.org/v40/i03/.
Müller, Kirill, and Hadley Wickham. 2018. Tibble: Simple Data Frames. https://CRAN.R-project.org/package=tibble.
The reality is a little more complicated: attributes are actually stored in pairlists, which you can learn more about in pairlists. Pairlists are a profoundly different data structure under the hood but behave the same as lists.↩
There are the
ISOdatetime()functions but note that
ISOdate()returns a POSIXct (not a date), and
ISOdatetime()works by pasting together all it’s inputs, so it’s no faster than calling
You can use
row.names(), but its documentation suggests that
Technically, this makes
NULLthe identity element under vector concatenation.↩