Visualization, modeling and inference in R are simplest when the data is
collected into a single “tidy” data.frame. What constitutes a tidy
data.frame depends somewhat on the context, but at a minimum it
requires that each row can be interpreted as an observation, each column
as a variable, and each cell as a single value.
The original data set may vary from this ideal tidy format in several
ways, and the tidyr package provides tools to convert data
from messy to tidy. We walk through several common issues that
need to be addressed to make data tidy, focusing on three main verbs
for tidying data - gather, spread and separate. We also
briefly discuss what to do when the data is originally distributed over
several files.
One common issue is that the values in a single column are actually a
combination of many variables. For example, there may be a “description”
field that combines site ID, patient ID and date. Tidy data requires
that each column represents a single variable, and we need to
separate the variables.
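A minimal sketch of separate, assuming a made-up data.frame in which a hypothetical description column joins site ID, patient ID and date with underscores (the column names and values here are illustrative, not from the original data):

```r
library(tidyr)

# Hypothetical data: 'description' packs three variables into one string
df <- data.frame(
  description = c("S01_P001_2017-01-15",
                  "S01_P002_2017-01-16",
                  "S02_P001_2017-02-01"),
  value = c(5.1, 4.8, 6.3),
  stringsAsFactors = FALSE
)

# Split 'description' on the underscore into three new columns;
# the original column is removed by default
tidy_df <- separate(df, description,
                    into = c("site", "patient", "date"),
                    sep = "_")
tidy_df
```

The separator is chosen so that it does not occur inside any of the values (the dates use hyphens, not underscores).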
The verb extract is like separate, but instead of using a separator,
it uses a regular expression to split strings. A crash course in
regular expressions:
abc matches the characters ‘abc’
- matches the character ‘-’
[abc] matches a or b or c
[a-z] matches any lower case letter
[0-9] matches any digit
. matches any single character
\\d matches any single digit (the regex \d, with the backslash doubled inside an R string)
+ matches one or more of the preceding character set
* matches zero or more of the preceding character set
{m,n} matches between m and n copies of the preceding character set
() indicates a capture group - the separated values desired
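As a sketch of how these pieces combine, the example below applies extract to a hypothetical description column whose values look like S01-P001-20170115 (the data and column names are assumptions made for illustration). Each pair of parentheses in the regular expression captures one of the new columns.

```r
library(tidyr)

# Hypothetical data: site, patient and date joined by hyphens
df <- data.frame(
  description = c("S01-P001-20170115", "S12-P034-20170116"),
  stringsAsFactors = FALSE
)

# Three capture groups -> three new columns.
# 'S' and 'P' are matched literally and are not captured.
extracted <- tidyr::extract(
  df, description,
  into  = c("site", "patient", "date"),
  regex = "S([0-9]+)-P([0-9]+)-(\\d+)"
)
extracted
```

The captured values are kept as character strings unless convert = TRUE is passed.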
Sometimes a single variable is spread out over multiple columns. For
example, we may wish to consider the Sepal.Length, Sepal.Width,
Petal.Length and Petal.Width as variants of a single variable
measure. The verb to use is gather which transforms a “wide”
data.frame into a “tall” one.
For example, suppose we want ggplot2 to plot each measurement in a
separate panel, colored by Species.
In [16]:
head(iris, n=3)
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
5.1           3.5          1.4           0.2          setosa
4.9           3.0          1.4           0.2          setosa
4.7           3.2          1.3           0.2          setosa
In [17]:
iris %>% gather(key = measure, value = value, 1:4) %>% head(n = 3)
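One way the plot described above might be drawn once the data is in tall form is sketched here. The choice of a boxplot and of free y-axis scales is an assumption for illustration; the text only specifies one panel per measurement, colored by Species.

```r
library(tidyr)
library(ggplot2)

# Tall version of iris: one row per (flower, measurement) pair
iris_tall <- gather(iris, key = measure, value = value, 1:4)

# One panel per measurement, colored by Species
p <- ggplot(iris_tall, aes(x = Species, y = value, color = Species)) +
  geom_boxplot() +
  facet_wrap(~ measure, scales = "free_y")
p
```

Because measure is now an ordinary column, facet_wrap can use it directly; with the wide data.frame each panel would have to be built separately.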
The verb spread is the reverse of gather - it makes a “tall”
data.frame into a “wide” one. However, it requires that each row
have a unique identifier to do so. If we just try to apply spread to
the tall version of iris, it will fail because there are 50 of each
Species. We therefore need to generate a new column to make each row
have a unique identifier.
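A sketch of the round trip, assuming the identifier is added as a hypothetical id column before gathering (other ways of constructing the identifier would work equally well):

```r
library(tidyr)
library(dplyr)

# Add a unique identifier per flower, then gather the four measurements
iris_tall <- iris %>%
  mutate(id = row_number()) %>%
  gather(key = measure, value = value, Sepal.Length:Petal.Width)

# With 'id' present, each (id, measure) pair is unique and spread succeeds
iris_wide <- iris_tall %>%
  spread(key = measure, value = value)

head(iris_wide, n = 3)
```

Without the id column, spread cannot tell which of the 50 rows of a given Species each value belongs to.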
When variables are stored in both rows and columns (advanced)
In the toy example below, we measure peak and trough levels of something
(blood glucose level, viral titers etc) recorded between visits. We
convert this into a tidy data frame in two steps:
1. Use tidyr::gather to create a new column variable visit
2. Use tidyr::spread to create peak and trough column
   variables from each row of measure
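The two steps can be sketched as follows. The toy data.frame is made up for illustration: the pid, visit1 and visit2 column names and all values are assumptions, not the original example.

```r
library(tidyr)
library(dplyr)

# Hypothetical toy data: 'measure' holds variable names in rows,
# while visit1/visit2 hold the visit variable in columns
df <- data.frame(
  pid     = c(1, 1, 2, 2),
  measure = c("peak", "trough", "peak", "trough"),
  visit1  = c(10, 2, 9, 1),
  visit2  = c(12, 3, 11, 2),
  stringsAsFactors = FALSE
)

tidy <- df %>%
  # Step 1: collapse the visit columns into a single 'visit' variable
  gather(key = visit, value = value, visit1, visit2) %>%
  # Step 2: turn the 'peak' and 'trough' rows into their own columns
  spread(key = measure, value = value)
tidy
```

The result has one row per patient and visit, with peak and trough as separate columns.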
Sometimes cells contain missing values. Here we show the simplest way to
deal with this common scenario. The approach shown is not always
appropriate - consult a statistician if in doubt.
Dropping every row that contains a missing value is known as complete
case analysis. It is appropriate when you have abundant observations and
the values are believed to be missing completely at random.
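A minimal sketch of dropping incomplete rows, using a made-up data.frame (the column names and values are assumptions for illustration):

```r
library(tidyr)

# Hypothetical data with missing values in two different columns
df <- data.frame(
  pid    = 1:4,
  peak   = c(10, NA, 9, 11),
  trough = c(2, 3, NA, 2)
)

# Keep only the rows with no missing values in any column
complete <- drop_na(df)
complete

# Base R alternatives: na.omit(df) or df[complete.cases(df), ]
```

Here two of the four rows are discarded, which shows how quickly this approach can shrink a small data set.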
Sometimes data is distributed over many files. In the simplest case,
each data set has exactly the same format as the others, and we just
need to append additional rows. At other times, each data set contains
different variables (e.g. merging clinical and assay data) and we
need to match rows according to some unique row identifier.
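Both cases can be sketched with dplyr. The data.frames below are built in memory to keep the example self-contained; in practice each would be read from its own file (for example with read.csv). All names and values are hypothetical.

```r
library(dplyr)

# Case 1: same format in every data set, so we append rows
site_a <- data.frame(pid = c(1, 2), age = c(34, 51))
site_b <- data.frame(pid = c(3, 4), age = c(29, 62))
clinical <- bind_rows(site_a, site_b)

# Case 2: different variables, so we match rows on the shared identifier
assay <- data.frame(pid = c(1, 2, 3), titer = c(120, 80, 200))
merged <- inner_join(clinical, assay, by = "pid")
merged
```

inner_join keeps only identifiers present in both data sets, so patient 4 is dropped here; left_join or full_join would keep unmatched rows and fill the missing columns with NA.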