Cleaning and preparing data

This lesson is on the use of the tidyr and dplyr packages that are part of a collection called the ‘tidyverse’. These packages are for data cleaning, preparation and visualization. Currently, data frames and their manipulation in R are in transition from older programming paradigms, syntax and data classes to some newer, (hopefully) better implementations.

Here’s what we want to do:

Objectives

  • How to structure your spreadsheet for analysis
  • How to read a spreadsheet or text file into a data frame (or tibble)
  • How to create a data frame (or tibble) from scratch
  • How to interpret a data frame (or tibble)
  • How to select rows and columns of a data frame (or tibble)
  • How to sort a data frame (or tibble)
  • What is “tidy” data?
  • How to tidy common forms of “messy” data with tidyr

Because many researchers may not have adopted the newer methods, we will often present both the ‘traditional’ R syntax and the tidyverse syntax. We encourage you to write new code using tidyr, dplyr and the rest of the tidyverse.

Main Additions and Modifications to Traditional R.

tibbles vs data frames

Base R (meaning the core R libraries and functions) employs a data class called a ‘data frame’. A data frame is a 2-dimensional array whose columns can be any type (numeric, vector, list, string, etc.). A tibble is also a data frame, with some refinements in the printing and subsetting features. Pretty much anywhere a data frame is used, a tibble can be used. Data frames can be converted to tibbles using as.tibbles and tibbles can be converted to data frames using as.data.frame. If you need to use an older package that doesn’t work with tibbles, you can just wrap your tibble with as.data.frame and everything should work. We’ll see some examples shortly.

Pipes

Pipes are new to R, but the concept of a pipe is not at all new. In fact, it has been around for decades in Unix scripting language. The pipe works like a literal pipe - it is a connector that carries things from one output to a new input. For example, the water that is output by the municipal system flows via a pipe as input to your house, that is then run through your faucet, whose output goes to the drain, and the water output from the drain is carried via a pipe to become input to the sewer system.

Software pipes carry the output of one program to be input to another. Their use makes code more readable to a human. Readability to humans is a key goal in today’s programming environments, because it makes modifying code easier, and it reduces the possibility of bugs. The symbol for the pipe connector in R is %>%. (In unix the pipe symbol is |, and you may use this when you carry out the bioinformatics pipeline.)

In [1]:
#Loading the tidyverse library

library(tidyverse)
library(stringr)

Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats

How to structure your spreadsheet for analysis

Many people in the biological sciences use excel spreadsheets to store data. When we import a spreadsheet into a programming environment like R, we need to be much more careful about formatting. Spreadsheets are formatted for people to look at. We want our data in a format that is readable for a machine.

The following are some common issues in converting spreadsheets to usable data sets.

Do this

Not that

  1. Table is not rectangular
  2. Fancy annotations lost when exported to CSV
  3. Single column contains too much information

How to read a spreadsheet or text file into a data frame

Reading in a data frame from a text or csv file

This is the most common case when data is exported from a spreadsheet to a CSV file.

In [162]:
url <- "http://vincentarelbundock.github.io/Rdatasets/csv/datasets/Titanic.csv"
download.file(url = url, destfile="titanic.csv")

Using read.csv

In [163]:
titanic <- read.csv("titanic.csv")
head(titanic, n=4)
XNamePClassAgeSexSurvivedSexCode
11 Allen, Miss Elisabeth Walton1st 29 female 1 1
22 Allison, Miss Helen Loraine1st 2 female 0 1
33 Allison, Mr Hudson Joshua Creighton1st 30 male 0 0
44 Allison, Mrs Hudson JC (Bessie Waldo Daniels)1st 25 female 0 1
In [2]:
# read_csv is different from read.csv

titanic_tibble <- read_csv("titanic.csv")
titanic_tibble
Warning message:
“Missing column names filled in: 'X1' [1]”Parsed with column specification:
cols(
  X1 = col_integer(),
  Name = col_character(),
  PClass = col_character(),
  Age = col_double(),
  Sex = col_character(),
  Survived = col_integer(),
  SexCode = col_integer()
)
X1NamePClassAgeSexSurvivedSexCode
11 Allen, Miss Elisabeth Walton1st 29 female 1 1
22 Allison, Miss Helen Loraine1st 2 female 0 1
33 Allison, Mr Hudson Joshua Creighton1st 30 male 0 0
44 Allison, Mrs Hudson JC (Bessie Waldo Daniels)1st 25 female 0 1
55 Allison, Master Hudson Trevor1st 0.92 male 1 0
66 Anderson, Mr Harry1st 47 male 1 0
77 Andrews, Miss Kornelia Theodosia1st 63 female 1 1
88 Andrews, Mr Thomas, jr1st 39 male 0 0
99 Appleton, Mrs Edward Dale (Charlotte Lamson)1st 58 female 1 1
1010 Artagaveytia, Mr Ramon1st 71 male 0 0
1111 Astor, Colonel John Jacob1st 47 male 0 0
1212 Astor, Mrs John Jacob (Madeleine Talmadge Force)1st 19 female 1 1
1313 Aubert, Mrs Leontine Pauline1st NA female 1 1
1414 Barkworth, Mr Algernon H1st NA male 1 0
1515 Baumann, Mr John D1st NA male 0 0
1616 Baxter, Mrs James (Helene DeLaudeniere Chaput)1st 50 female 1 1
1717 Baxter, Mr Quigg Edmond1st 24 male 0 0
1818 Beattie, Mr Thomson1st 36 male 0 0
1919 Beckwith, Mr Richard Leonard1st 37 male 1 0
2020 Beckwith, Mrs Richard Leonard (Sallie Monypeny)1st 47 female 1 1
2121 Behr, Mr Karl Howell1st 26 male 1 0
2222 Birnbaum, Mr Jakob1st 25 male 0 0
2323 Bishop, Mr Dickinson H1st 25 male 1 0
2424 Bishop, Mrs Dickinson H (Helen Walton)1st 19 female 1 1
2525 Bjornstrm-Steffansson, Mr Mauritz Hakan1st 28 male 1 0
2626 Blackwell, Mr Stephen Weart1st 45 male 0 0
2727 Blank, Mr Henry1st 39 male 1 0
2828 Bonnell, Miss Caroline1st 30 female 1 1
2929 Bonnell, Miss Elizabeth1st 58 female 1 1
3030 Borebank, Mr John James1st NA male 0 0
12841284 Vestrom, Miss Hulda Amanda Adolfina3rd 14 female 0 1
12851285 Vonk, Mr Jenko3rd 22 male 0 0
12861286 Ware, Mr Frederick3rd NA male 0 0
12871287 Warren, Mr Charles William3rd NA male 0 0
12881288 Wazli, Mr Yousif3rd NA male 0 0
12891289 Webber, Mr James3rd NA male 0 0
12901290 Wennerstrom, Mr August Edvard3rd NA male 1 0
12911291 Wenzel, Mr Linhart3rd NA male 0 0
12921292 Widegren, Mr Charles Peter3rd 51 male 0 0
12931293 Wiklund, Mr Jacob Alfred3rd 18 male 0 0
12941294 Wilkes, Mrs Ellen3rd 45 female 1 1
12951295 Willer, Mr Aaron3rd NA male 0 0
12961296 Willey, Mr Edward3rd NA male 0 0
12971297 Williams, Mr Howard Hugh3rd NA male 0 0
12981298 Williams, Mr Leslie3rd 28 male 0 0
12991299 Windelov, Mr Einar3rd 21 male 0 0
13001300 Wirz, Mr Albert3rd 27 male 0 0
13011301 Wiseman, Mr Phillippe3rd NA male 0 0
13021302 Wittevrongel, Mr Camiel3rd 36 male 0 0
13031303 Yalsevac, Mr Ivan3rd NA male 1 0
13041304 Yasbeck, Mr Antoni3rd 27 male 0 0
13051305 Yasbeck, Mrs Antoni3rd 15 female 1 1
13061306 Youssef, Mr Gerios3rd NA male 0 0
13071307 Zabour, Miss Hileni3rd NA female 0 1
13081308 Zabour, Miss Tamini3rd NA female 0 1
13091309 Zakarian, Mr Artun3rd 27 male 0 0
13101310 Zakarian, Mr Maprieder3rd 26 male 0 0
13111311 Zenni, Mr Philip3rd 22 male 0 0
13121312 Lievens, Mr Rene3rd 24 male 0 0
13131313 Zimmerman, Leo3rd 29 male 0 0
In [3]:
print(class(titanic))
print(class(titanic_tibble))
Error in print(class(titanic)): object 'titanic' not found
Traceback:

1. print(class(titanic))

read.csv is just a wrapper for read.table with some defaults given

In [166]:
titanic.1 <- read.table("titanic.csv", sep=",", header = T)
head(titanic.1, n=4)

#Same applies to read_csv - it is a wrapper for read_delim
XNamePClassAgeSexSurvivedSexCode
11 Allen, Miss Elisabeth Walton1st 29 female 1 1
22 Allison, Miss Helen Loraine1st 2 female 0 1
33 Allison, Mr Hudson Joshua Creighton1st 30 male 0 0
44 Allison, Mrs Hudson JC (Bessie Waldo Daniels)1st 25 female 0 1

Let’s get rid of the first column, as it isn’t useful.

In [167]:
head(titanic[, -1], n=4)
NamePClassAgeSexSurvivedSexCode
1Allen, Miss Elisabeth Walton1st 29 female 1 1
2Allison, Miss Helen Loraine1st 2 female 0 1
3Allison, Mr Hudson Joshua Creighton1st 30 male 0 0
4Allison, Mrs Hudson JC (Bessie Waldo Daniels)1st 25 female 0 1

Alternative ising dplyr

An alternative approach using dplyr that we encourage getting used to as it will provides a consistent grammar for data frame manipulations.

In [168]:
head(titanic_tibble %>% select(-1), n=4)
NamePClassAgeSexSurvivedSexCode
1Allen, Miss Elisabeth Walton1st 29 female 1 1
2Allison, Miss Helen Loraine1st 2 female 0 1
3Allison, Mr Hudson Joshua Creighton1st 30 male 0 0
4Allison, Mrs Hudson JC (Bessie Waldo Daniels)1st 25 female 0 1

2. Write a function called peek that takes as arguments df and n. It then returns n random rows ffrom the data frame df (without repeats). If n is larger than the number of rows, it should retrun the original data frame. Give n a default value of 6.

In [169]:
peek <- function(df, n=6) {
    nr <- nrow(df)
    if (n >= nr) {
        return(df)
    } else {
        idx <- sample(nr, n, replace=F)
        return(df[idx, ])
    }
}

Alternative using sample_n from dplyr

In [170]:
peek.1 <- function(df, n=6) {
    nr <- nrow(df)
    if (n >= nr) {
        return(df)
    } else {
        return(sample_n(df, n, replace=F))
    }
}

How to create a data frame from scratch

In [171]:
n <- 12
subjects <- sample(1:1000, n, replace=FALSE)
treatments <- sample(c("Placebo", "Drug"), size = n, replace = TRUE)
n1 <- sum(treatments == "Placebo")
n2 <- n - n1
placebo.vals <- rnorm(n=n1, mean = 10, sd=1)
drug.vals <- rnorm(n=n2, mean=15, sd=1)
gene1 <- ifelse(treatments=="Placebo", placebo.vals, drug.vals)
gene2 <- rnorm(n=n, mean=12, sd=1)
df <- data.frame(subjects=subjects, treatments=treatments, gene1=gene1, gene2=gene2)
df
subjectstreatmentsgene1gene2
1153 Drug 15.516308347886411.8019177313714
298 Placebo 11.389712779273510.4411774515486
3250 Drug 15.804307613815711.5690995283775
4545 Drug 15.446139325202111.240600432231
5442 Placebo 10.766739141802612.4025164044546
6832 Drug 16.536032739197310.7263837943643
7625 Drug 14.226996631866213.5853056018832
8293 Placebo 9.3836984172358112.1824448855273
932 Drug 13.671926122039 11.5003612782279
10469 Drug 15.804307613815712.8664592872279
11955 Placebo 10.379605058535613.6823416307802
12443 Placebo 11.389712779273511.9559040521596

How to interpret a data frame

In [172]:
class(df)
'data.frame'
In [173]:
rownames(df)
  1. '1'
  2. '2'
  3. '3'
  4. '4'
  5. '5'
  6. '6'
  7. '7'
  8. '8'
  9. '9'
  10. '10'
  11. '11'
  12. '12'
In [174]:
colnames(df)
  1. 'subjects'
  2. 'treatments'
  3. 'gene1'
  4. 'gene2'
In [175]:
dim(df)
  1. 12
  2. 4
In [176]:
str(df)
'data.frame':   12 obs. of  4 variables:
 $ subjects  : int  153 98 250 545 442 832 625 293 32 469 ...
 $ treatments: Factor w/ 2 levels "Drug","Placebo": 1 2 1 1 2 1 1 2 1 1 ...
 $ gene1     : num  15.5 11.4 15.8 15.4 10.8 ...
 $ gene2     : num  11.8 10.4 11.6 11.2 12.4 ...
In [177]:
summary(df)
subjects       treatments     gene1            gene2
 Min.   : 32.0   Drug   :7    Min.   : 9.384   Min.   :10.44
 1st Qu.:225.8   Placebo:5    1st Qu.:11.234   1st Qu.:11.44
 Median :442.5                Median :13.949   Median :11.88
 Mean   :428.1                Mean   :13.360   Mean   :12.00
 3rd Qu.:565.0                3rd Qu.:15.588   3rd Qu.:12.52
 Max.   :955.0                Max.   :16.536   Max.   :13.68
In [178]:
head(df)
subjectstreatmentsgene1gene2
1153 Drug 15.516308347886411.8019177313714
298 Placebo 11.389712779273510.4411774515486
3250 Drug 15.804307613815711.5690995283775
4545 Drug 15.446139325202111.240600432231
5442 Placebo 10.766739141802612.4025164044546
6832 Drug 16.536032739197310.7263837943643
In [179]:
tail(df)
subjectstreatmentsgene1gene2
7625 Drug 14.226996631866213.5853056018832
8293 Placebo 9.3836984172358112.1824448855273
932 Drug 13.671926122039 11.5003612782279
10469 Drug 15.804307613815712.8664592872279
11955 Placebo 10.379605058535613.6823416307802
12443 Placebo 11.389712779273511.9559040521596

How to select rows and columns of a data frame

In [180]:
df[[8, 3]]
9.38369841723581
In [181]:
df$treatments
  1. Drug
  2. Placebo
  3. Drug
  4. Drug
  5. Placebo
  6. Drug
  7. Drug
  8. Placebo
  9. Drug
  10. Drug
  11. Placebo
  12. Placebo
In [182]:
df[[2]]
  1. Drug
  2. Placebo
  3. Drug
  4. Drug
  5. Placebo
  6. Drug
  7. Drug
  8. Placebo
  9. Drug
  10. Drug
  11. Placebo
  12. Placebo
In [183]:
df[,2]
  1. Drug
  2. Placebo
  3. Drug
  4. Drug
  5. Placebo
  6. Drug
  7. Drug
  8. Placebo
  9. Drug
  10. Drug
  11. Placebo
  12. Placebo
In [184]:
df[2:4,]
subjectstreatmentsgene1gene2
298 Placebo 11.389712779273510.4411774515486
3250 Drug 15.804307613815711.5690995283775
4545 Drug 15.446139325202111.240600432231
In [185]:
df[treatments == "Placebo", ]
subjectstreatmentsgene1gene2
298 Placebo 11.389712779273510.4411774515486
5442 Placebo 10.766739141802612.4025164044546
8293 Placebo 9.3836984172358112.1824448855273
11955 Placebo 10.379605058535613.6823416307802
12443 Placebo 11.389712779273511.9559040521596
In [186]:
df[(gene1 < 12) & (gene2 > 12), ]
subjectstreatmentsgene1gene2
5442 Placebo 10.766739141802612.4025164044546
8293 Placebo 9.3836984172358112.1824448855273
11955 Placebo 10.379605058535613.6823416307802

How to sort a data frame

In [187]:
df[order(subjects),]
subjectstreatmentsgene1gene2
932 Drug 13.671926122039 11.5003612782279
298 Placebo 11.389712779273510.4411774515486
1153 Drug 15.516308347886411.8019177313714
3250 Drug 15.804307613815711.5690995283775
8293 Placebo 9.3836984172358112.1824448855273
5442 Placebo 10.766739141802612.4025164044546
12443 Placebo 11.389712779273511.9559040521596
10469 Drug 15.804307613815712.8664592872279
4545 Drug 15.446139325202111.240600432231
7625 Drug 14.226996631866213.5853056018832
6832 Drug 16.536032739197310.7263837943643
11955 Placebo 10.379605058535613.6823416307802
In [188]:
df[order(gene1),]
subjectstreatmentsgene1gene2
8293 Placebo 9.3836984172358112.1824448855273
11955 Placebo 10.379605058535613.6823416307802
5442 Placebo 10.766739141802612.4025164044546
298 Placebo 11.389712779273510.4411774515486
12443 Placebo 11.389712779273511.9559040521596
932 Drug 13.671926122039 11.5003612782279
7625 Drug 14.226996631866213.5853056018832
4545 Drug 15.446139325202111.240600432231
1153 Drug 15.516308347886411.8019177313714
3250 Drug 15.804307613815711.5690995283775
10469 Drug 15.804307613815712.8664592872279
6832 Drug 16.536032739197310.7263837943643
In [189]:
df[order(-df$gene1),] # Explicit naming of dataframe column
subjectstreatmentsgene1gene2
6832 Drug 16.536032739197310.7263837943643
3250 Drug 15.804307613815711.5690995283775
10469 Drug 15.804307613815712.8664592872279
1153 Drug 15.516308347886411.8019177313714
4545 Drug 15.446139325202111.240600432231
7625 Drug 14.226996631866213.5853056018832
932 Drug 13.671926122039 11.5003612782279
298 Placebo 11.389712779273510.4411774515486
12443 Placebo 11.389712779273511.9559040521596
5442 Placebo 10.766739141802612.4025164044546
11955 Placebo 10.379605058535613.6823416307802
8293 Placebo 9.3836984172358112.1824448855273

Examples

1. Find all rows where subjects were given placebo with gene1 values above 10 and gene2 values below 12.

In [190]:
df[(treatments=="Placebo") & (gene1 > 10) & (gene2 < 12), ]
subjectstreatmentsgene1gene2
298 Placebo 11.389712779273510.4411774515486
12443 Placebo 11.389712779273511.9559040521596

2. What is the log ratio of gene2 to gene1 for subjects with IDs in the range 100 to 500? Save the information for this subset in a new data frame and sort by the log ratio in descending order.

In [191]:
df.1 <- df[(subjects >= 100) & (subjects <= 500), ]
df.1$lr <- log(df.1$gene2) - log(df.1$gene1)
df.1[order(-df.1$lr), ]
subjectstreatmentsgene1gene2lr
8293 Placebo 9.38369841723581 12.1824448855273 0.261021998797103
5442 Placebo 10.7667391418026 12.4025164044546 0.141437714890785
12443 Placebo 11.3897127792735 11.9559040521596 0.0485146590903738
10469 Drug 15.8043076138157 12.8664592872279 -0.205658666561063
1153 Drug 15.5163083478864 11.8019177313714 -0.273629584403845
3250 Drug 15.8043076138157 11.5690995283775 -0.311944826627759

What is “tidy” data?

First we need some definitions:

  1. A value is a number or a string.
  2. A variable contains all values that measure the same underlying attribute (e.g. height, gene expression). A fixed variable is part of the experimental design (e.g. subject ID, treatment group) while a measured variable refers to data collected as part of the experiment (e.g. viral titer, gene expression level).
  3. An observation contains all values measured on the same unit (e.g. person, assay)

To simplify data analysis, we want to clean and prepare tidy data, where

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

With tidy data, a table is made up of rows and columns containing values, where each row represents an observation and each column represents a variable. Each table describes a type of observation - for example, in a medical experiment we may have three tables - demographic data about each person, gene expression data for each person on each visit, and proteomic data for each person on each visit.

See the tidy data paper for a philosophy of data.

There are many other aspects of cleaning up messy data not addressed here. From the paper above, here are some:

Apart from tidying, there are many other tasks involved in cleaning data: parsing dates and numbers, identifying missing values, correcting character encodings (for international data), matching similar but not identical values (created by typos), verifying experimental design, and filling in structural missing values, not to mention model-based data cleaning that identifies suspicious values.

Examples

It is simplest to grasp the concept of tidy data via concrete examples. Suppose we perform two different experiments on subjects Ann and Bob, and measure some physiological attribute in each experiment. We can record the data as a table in several ways:

Not tidy data (Table 1)

  Expt 1 Expt 2
Ann 175 45
Bob 183 72

Not tidy data (Table 2)

  Ann Bob
Expt 1 175 183
Expt 2 45 72

Tidy data (Table 3)

Subject Expt Value
Ann 1 175
Ann 2 45
Bob 1 183
Bob 2 72

Because of the uniform structure, tables in the tidy data format make it easy to answer questions about different variables in a consistent way, and also to merge different tables (e.g. subject demographic data with results of laboratory assays for each subject). It is conventional to order the columns so that fixed variables (e.g. subject) come before measured variables.

What is a variable?

Table 4

Subject Gene 1 Gene 2
Ann 175 45
Bob 183 72

Table 1 and 4 have almost the exact same formal structure - yet Table 1 is messy while Table 4 is tidy. This is because it is natural to consider the expression values for a particular gene across subjects as a variable, but the same is not true for a particular experiment. Pragmatically, whether something should be a variable or a variable depends on its role in a statistical model. From a modeling perspective, it is natural to consider outcome ~ gene1 * gene2 but outcome ~ expt1 * expt2 doesn’t really make any sense.

How to tidy common forms of “messy” data with tidyr

Here we give examples of how to convert common forms of messy data into tidy data. We use toy examples for illustration, but the tidyr and dplyr functions are highly efficient and work well with HTS data sets.

There are only a few important verbs (functions) to remember:

For working with single data frames

  • gather: takes multiple columns into a single column
  • spread: takes a single column into multiple columns
  • separate: splits a variable name based on a specified separator
  • extract: splits a variable name using regular expression capture groups

For combining multiple data frames

  • bind_rows: “combine data frames “vertically”; increases number of rows
  • *_join: combine data frames “horizontally” by matching on one or more common columns; increases number of columns

We illustrate these verbs with examples from common data tidying scenarios. There are a few common types of messy data that we need to handle:

  • Converting columns into rows (convert from “wide” to “tall” format)
  • Multiple variables in a single column (splitting a single value into two or more)
  • Variables stored in both rows and columns (move all variables to columns)
  • One type in multiple tables (combine several similar tables into one)
  • Multiple types in multiple tables (combine tables with different columns into one based on some identifier)

(1) Converting columns into rows

One common situation is that column headers are values rather than variables. For example, we often need to convert data like Table 1 into the form given by Table 3. This can be done in two steps:

  1. make a new column from the row names using tidy::add_rownames()
  2. reshape the data frame using tidyr::gather()

Create a data frame representing table 1

In [192]:
expt1 <- c(175, 183)
expt2 <- c(45, 72)
subject <- c("Ann", "Bob")
t1 <- data.frame(expt1, expt2, row.names=subject)
t1
expt1expt2
Ann175 45
Bob183 72

Add row names

In [193]:
t <- add_rownames(t1, "subject")
t
Warning message:
“Deprecated, use tibble::rownames_to_column() instead.”
subjectexpt1expt2
1Ann17545
2Bob18372

Reshape from “wide” to “tall”

The tidyr::gather function takes the following arguments - data, key, value, list of columns to gather. The list of columns can be specified as a comma separated list, as a slice with : notation, or by exclusion of variables with -.

In [196]:
gather(t, key=expt, value=value, expt1, expt2)
subjectexptvalue
1Ann expt1175
2Bob expt1183
3Ann expt245
4Bob expt272
In [197]:
gather(t, key=expt, value=value, expt1:expt2)
subjectexptvalue
1Ann expt1175
2Bob expt1183
3Ann expt245
4Bob expt272
In [198]:
gather(t, expt, value, -subject)
subjectexptvalue
1Ann expt1175
2Bob expt1183
3Ann expt245
4Bob expt272

Using piping

When using the tidyr and dplyr packages, a common idiom is to “pipe” a chain of transformations using the %>% pipe operator. In this idiom, the data frame is implicitly passed on down the chain, and does not need to be specified in the transformation function call.

In [199]:
t %>% gather(expt, value, -subject)
subjectexptvalue
1Ann expt1175
2Bob expt1183
3Ann expt245
4Bob expt272

(2) Multiple variables in a single column

Sometimes data in a column is concatenated from multiple variables. To convert to tidy data, we need to split into separate columns for each variable. The next example shows a column where the variable is a concatenation of the chromosome number and offset position. We can use tidyr::separate to do this easily if there is an obvious delimiter (that can be a regular expression).

In [200]:
chr.offset <- c("Chr1:345", "Chr2:712")
subject <- c("Ann", "Bob")
df <- data.frame(subject, chr.offset)
df
subjectchr.offset
1Ann Chr1:345
2Bob Chr2:712
In [201]:
df %>% separate(chr.offset, into=c("chromosome", "offset"), sep=":" )
subjectchromosomeoffset
1Ann Chr1345
2Bob Chr2712

When there is no obvious separator

Sometimes there is no obvious separator to use to split the variables. In many cases, we can use extract to capture the separate parts based on matching to regular expression patterns.

In [202]:
pid.name.visit <- c("00345Ann1", "02122John13")
df <- data.frame(pid.name.visit)
df
pid.name.visit
100345Ann1
202122John13
In [203]:
df %>% extract(pid.name.visit, c("pid", "name", "visit"), "([0-9]+)([a-zA-Z_-]+)([0-9]+)")
pidnamevisit
100345Ann 1
202122John 13
In [204]:
df %>% extract(pid.name.visit, c("pid", "name", "visit"), "(\\d+)(\\w+)(\\d+)")
pidnamevisit
100345Ann 1
202122John13

(3) Variables stored in both rows and columns

In the toy example below, we measure peak and trough levels of something (blood glucose level, viral titers etc) recorded between visits. We convert this into a tidy data frame in two steps:

  1. Use tidyr::gather to create a new column variable visit
  2. Use tidyr::spread to create peak and trough column variables from each row of measure

Note that gather and spread are inverse operations.

In [205]:
subject <- c("Ann", "Ann", "Bob", "Bob", "Charlie", "Charlie")
measure <- c("peak", "trough", "peak", "trough", "peak", "trough")
visit1 <- c(250, 125, 1000, 750, 1500, 250)
visit2 <- c(500, 125, 900, 650, 1550, 500)
df <- data.frame(subject, measure, visit1, visit2)
df
subjectmeasurevisit1visit2
1Ann peak250 500
2Ann trough125 125
3Bob peak1000900
4Bob trough750 650
5Charliepeak 1500 1550
6Charlietrough 250 500
In [206]:
df1 <- df %>% gather(key=visit, value=titer, visit1:visit2)
df1
subjectmeasurevisittiter
1Ann peak visit1250
2Ann troughvisit1125
3Bob peak visit11000
4Bob troughvisit1750
5Charliepeak visit1 1500
6Charlietrough visit1 250
7Ann peak visit2500
8Ann troughvisit2125
9Bob peak visit2900
10Bob troughvisit2650
11Charliepeak visit2 1550
12Charlietrough visit2 500
In [207]:
df1 %>% spread(key=measure, value=titer)
subjectvisitpeaktrough
1Ann visit1250 125
2Ann visit2500 125
3Bob visit11000 750
4Bob visit2900 650
5Charlievisit1 1500 250
6Charlievisit2 1550 500

We can use the pipe operator to combine into a single operation

In [208]:
df %>% gather(visit, titer, visit1:visit2) %>% spread(measure, titer)
subjectvisitpeaktrough
1Ann visit1250 125
2Ann visit2500 125
3Bob visit11000 750
4Bob visit2900 650
5Charlievisit1 1500 250
6Charlievisit2 1550 500

(4) One type in multiple tables

Data may be collected in batches and stored in separate files per batch. To perform analysis, we have to combine multiple tables into one. This can be easily done using dplyr:bind_rows.

In [209]:
jan <- data.frame(subject=c("Ann", "Bob"), value=rnorm(2))
feb <- data.frame(subject=c("Ann", "Bob"), value=rnorm(2))
mar <- data.frame(subject=c("Ann", "Bob"), value=rnorm(2))
In [210]:
jan
subjectvalue
1Ann 0.0473575253372743
2Bob -1.16544371780111
In [211]:
feb
subjectvalue
1Ann -0.58470873143165
2Bob 1.73897748176552
In [212]:
mar
subjectvalue
1Ann -0.67510165070467
2Bob -1.83655926295541
In [213]:
bind_rows(jan, feb, mar)
subjectvalue
1Ann 0.0473575253372743
2Bob -1.16544371780111
3Ann -0.58470873143165
4Bob 1.73897748176552
5Ann -0.67510165070467
6Bob -1.83655926295541

We have the option of creating a new column to indicate the original data frame source

In [214]:
bind_rows("jan"=jan, "feb"=feb, "mar"=mar, .id="month")
monthsubjectvalue
1jan Ann 0.0473575253372743
2jan Bob -1.16544371780111
3feb Ann -0.58470873143165
4feb Bob 1.73897748176552
5mar Ann -0.67510165070467
6mar Bob -1.83655926295541

(5) Multiple types in multiple tables

The analysis tools in R typically work on a single data frame. Hence there may be a need to combine different tables (e..g demographic and assay data) in order to manipulate, analyze or visualize the complete data.

In [215]:
subject <- c("Ann", "Bob", "Charlie")
age <- c(34,45,56)
outcome <- c("CR", "PR", "CR")
df1 <- data.frame(subject, age, outcome)
head(df1)
subjectageoutcome
1Ann34 CR
2Bob45 PR
3Charlie56 CR
In [216]:
subject <- c("Ann", "Ann", "Bob", "Bob", "Charlie", "Charlie")
measure <- c("peak", "trough", "peak", "trough", "peak", "trough")
visit1 <- c(250, 125, 1000, 750, 1500, 250)
visit2 <- c(500, 125, 900, 650, 1550, 500)
df <- data.frame(subject, measure, visit1, visit2)
df2 <- df %>% gather(visit, titer, visit1:visit2) %>% spread(measure, titer)
head(df2)
subjectvisitpeaktrough
1Ann visit1250 125
2Ann visit2500 125
3Bob visit11000 750
4Bob visit2900 650
5Charlievisit1 1500 250
6Charlievisit2 1550 500
In [217]:
inner_join(df1, df2)
Joining, by = "subject"
subjectageoutcomevisitpeaktrough
1Ann 34 CR visit1250 125
2Ann 34 CR visit2500 125
3Bob 45 PR visit11000 750
4Bob 45 PR visit2900 650
5Charlie56 CR visit1 1500 250
6Charlie56 CR visit2 1550 500

When the column to join by has different names in each table

In [218]:
df3 <- df1
colnames(df3)[1] = "name"
df3
nameageoutcome
1Ann34 CR
2Bob45 PR
3Charlie56 CR
In [219]:
inner_join(df2, df3, by=c("subject"="name"))
subjectvisitpeaktroughageoutcome
1Ann visit1250 125 34 CR
2Ann visit2500 125 34 CR
3Bob visit11000 750 45 PR
4Bob visit2900 650 45 PR
5Charlievisit1 1500 250 56 CR
6Charlievisit2 1550 500 56 CR

Exercises

beaver1 and beaver2 are time series data collected on the body temperatures of two beavers (You can read more here)

In [220]:
head(beaver1)
daytimetempactiv
1346.00840.00 36.33 0.00
2346.00850.00 36.34 0.00
3346.00900.00 36.35 0.00
4346.00910.00 36.42 0.00
5346.00920.00 36.55 0.00
6346.00930.00 36.69 0.00
In [221]:
str(beaver1)
'data.frame':   114 obs. of  4 variables:
 $ day  : num  346 346 346 346 346 346 346 346 346 346 ...
 $ time : num  840 850 900 910 920 930 940 950 1000 1010 ...
 $ temp : num  36.3 36.3 36.4 36.4 36.5 ...
 $ activ: num  0 0 0 0 0 0 0 0 0 0 ...
In [222]:
head(beaver2)
daytimetempactiv
1307.00930.00 36.58 0.00
2307.00940.00 36.73 0.00
3307.00950.00 36.93 0.00
4 307.001000.00 37.15 0.00
5 307.001010.00 37.23 0.00
6 307.001020.00 37.24 0.00
In [223]:
str(beaver2)
'data.frame':   100 obs. of  4 variables:
 $ day  : num  307 307 307 307 307 307 307 307 307 307 ...
 $ time : num  930 940 950 1000 1010 1020 1030 1040 1050 1100 ...
 $ temp : num  36.6 36.7 36.9 37.1 37.2 ...
 $ activ: num  0 0 0 0 0 0 0 0 0 0 ...

Exercise 1: Combine the beaver1 and beaver2 data frames, adding a new column beaver with entries that take value 1 or 2 depending on which data frame the data came from.

In [224]:
head(bind_rows(beaver1, beaver2, .id="beaver"), n=4)
beaverdaytimetempactiv
11 346 840 36.330
21 346 850 36.340
31 346 900 36.350
41 346 910 36.420

Exercise 2: The data frame given below has multiple variables in a singel column. Convert this to a tidy data frame with the following variables = pid, age, date and buffer- the fields are separated by a comma and a space.

In [225]:
df <- data.frame(desc = c(
"541-0194-1-0, 5wk, 5/Jan/09,  PLA",
"541-0215-1-0, 5wk, 1/Apr/09,  PLA",
"541-0231-6-0, 5wk, 31/Mar/09,  PLA",
"541-0235-4-0, 5wk, 22/Jan/09,  PLA",
"541-0237-8-0, 5wk, 21/Jan/09,  PLA",
"541-0248-0-0, 5wk, 1/Apr/09,  PLA",
"541-0254-6-0, 5wk, 24/Mar/09,  PLA",
"541-0258-4-0, 5wk, 14/May/09,  PLA"), values=rnorm(8))
In [226]:
df
descvalues
1541-0194-1-0, 5wk, 5/Jan/09, PLA1.68401691250244
2541-0215-1-0, 5wk, 1/Apr/09, PLA-1.82541741635734
3541-0231-6-0, 5wk, 31/Mar/09, PLA0.322267715332376
4541-0235-4-0, 5wk, 22/Jan/09, PLA1.20301573073413
5541-0237-8-0, 5wk, 21/Jan/09, PLA-0.527461157819855
6541-0248-0-0, 5wk, 1/Apr/09, PLA-0.298129442756429
7541-0254-6-0, 5wk, 24/Mar/09, PLA-0.0780923846070984
8541-0258-4-0, 5wk, 14/May/09, PLA0.990178319959082
In [227]:
df %>% separate(desc, sep = ", ", into = c("pid", "age", "date", "buffer"))
pidagedatebuffervalues
1541-0194-1-0 5wk 5/Jan/09 PLA 1.68401691250244
2541-0215-1-0 5wk 1/Apr/09 PLA -1.82541741635734
3541-0231-6-0 5wk 31/Mar/09 PLA 0.322267715332376
4541-0235-4-0 5wk 22/Jan/09 PLA 1.20301573073413
5541-0237-8-0 5wk 21/Jan/09 PLA -0.527461157819855
6541-0248-0-0 5wk 1/Apr/09 PLA -0.298129442756429
7541-0254-6-0 5wk 24/Mar/09 PLA -0.0780923846070984
8541-0258-4-0 5wk 14/May/09 PLA 0.990178319959082

Manipulating Data

We illustrate five common data manipulation techniques on tidy data. When combined, these techniques provide power tools for slicing and dicing your data:

  • filter: slice
  • transform: mutate, transmute, mutate_each, transmute_each
  • aggregate: group_by, count, summarize, summarize_each
  • sort: arrange
  • sample: sample_n, sample_frac
  • column operations: select, rename

String functions used:

  • str_detect
  • substring

References

Start with a tidy data frame

In [228]:
df = iris
head(df)
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
15.1 3.5 1.4 0.2 setosa
24.9 3 1.4 0.2 setosa
34.7 3.2 1.3 0.2 setosa
44.6 3.1 1.5 0.2 setosa
55 3.6 1.4 0.2 setosa
65.4 3.9 1.7 0.4 setosa
In [229]:
str(df)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Filter

subsetting or removing observations based on some condition

Filter on rows

In [230]:
df[11:20,]
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
115.4 3.7 1.5 0.2 setosa
124.8 3.4 1.6 0.2 setosa
134.8 3 1.4 0.1 setosa
144.3 3 1.1 0.1 setosa
155.8 4 1.2 0.2 setosa
165.7 4.4 1.5 0.4 setosa
175.4 3.9 1.3 0.4 setosa
185.1 3.5 1.4 0.3 setosa
195.7 3.8 1.7 0.3 setosa
205.1 3.8 1.5 0.3 setosa
In [231]:
df %>% slice(11:20)
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
15.4 3.7 1.5 0.2 setosa
24.8 3.4 1.6 0.2 setosa
34.8 3 1.4 0.1 setosa
44.3 3 1.1 0.1 setosa
55.8 4 1.2 0.2 setosa
65.7 4.4 1.5 0.4 setosa
75.4 3.9 1.3 0.4 setosa
85.1 3.5 1.4 0.3 setosa
95.7 3.8 1.7 0.3 setosa
105.1 3.8 1.5 0.3 setosa
In [246]:
head(df[df$Species == "versicolor",])

df %>% filter (df$Species == "versicolor") %>% head
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
517 3.2 4.7 1.4 versicolor
526.4 3.2 4.5 1.5 versicolor
536.9 3.1 4.9 1.5 versicolor
545.5 2.3 4 1.3 versicolor
556.5 2.8 4.6 1.5 versicolor
565.7 2.8 4.5 1.3 versicolor
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
17 3.2 4.7 1.4 versicolor
26.4 3.2 4.5 1.5 versicolor
36.9 3.1 4.9 1.5 versicolor
45.5 2.3 4 1.3 versicolor
56.5 2.8 4.6 1.5 versicolor
65.7 2.8 4.5 1.3 versicolor
In [239]:
head(df[df$Petal.Width > mean(df$Petal.Width),])

df %>% filter( df$Petal.Width > mean(df$Petal.Width)) %>% head
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
517 3.2 4.7 1.4 versicolor
526.4 3.2 4.5 1.5 versicolor
536.9 3.1 4.9 1.5 versicolor
545.5 2.3 4 1.3 versicolor
556.5 2.8 4.6 1.5 versicolor
565.7 2.8 4.5 1.3 versicolor
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
17 3.2 4.7 1.4 versicolor
26.4 3.2 4.5 1.5 versicolor
36.9 3.1 4.9 1.5 versicolor
45.5 2.3 4 1.3 versicolor
56.5 2.8 4.6 1.5 versicolor
65.7 2.8 4.5 1.3 versicolor
In [238]:
head(df[str_detect(df$Species, "virgin"),])

df %>% filter(str_detect(df$Species,"virgin")) %>% head
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
1016.3 3.3 6 2.5 virginica
1025.8 2.7 5.1 1.9 virginica
1037.1 3 5.9 2.1 virginica
1046.3 2.9 5.6 1.8 virginica
1056.5 3 5.8 2.2 virginica
1067.6 3 6.6 2.1 virginica
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
16.3 3.3 6 2.5 virginica
25.8 2.7 5.1 1.9 virginica
37.1 3 5.9 2.1 virginica
46.3 2.9 5.6 1.8 virginica
56.5 3 5.8 2.2 virginica
67.6 3 6.6 2.1 virginica

Example

  1. Display only rows where the species is not versicolor and the petal width is greater than the mean petal width.
In [18]:
head(iris[(iris$Petal.Width > mean(iris$Petal.Width)) & (iris$Species != "versicolor"),])
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
1016.3 3.3 6 2.5 virginica
1025.8 2.7 5.1 1.9 virginica
1037.1 3 5.9 2.1 virginica
1046.3 2.9 5.6 1.8 virginica
1056.5 3 5.8 2.2 virginica
1067.6 3 6.6 2.1 virginica

Transform

adding or modifying variables. These modifications can involve either a single variable (e.g., log-transformation), or multiple variables (e.g., computing density from weight and volume).
In [251]:
df %>%
    mutate(sum.length = Sepal.Length + Petal.Length,
            sum.width = Sepal.Width + Petal.Width) %>%
    head
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpeciessum.lengthsum.width
15.1 3.5 1.4 0.2 setosa6.5 3.7
24.9 3 1.4 0.2 setosa6.3 3.2
34.7 3.2 1.3 0.2 setosa6 3.4
44.6 3.1 1.5 0.2 setosa6.1 3.3
55 3.6 1.4 0.2 setosa6.4 3.8
65.4 3.9 1.7 0.4 setosa7.1 4.3
In [249]:
df %>%
  transmute(sum.length = Sepal.Length + Petal.Length, sum.width =
            Sepal.Width + Petal.Width) %>%
                       head
sum.lengthsum.width
16.53.7
26.33.2
36.03.4
46.13.3
56.43.8
67.14.3
In [247]:
f <- function(x) {2 * x + 1}
In [248]:
df %>% transmute(Fat.Petal.Width = f(Petal.Width)) %>%head
Fat.Petal.Width
11.4
21.4
31.4
41.4
51.4
61.8
In [252]:
#Multiple columns

df %>% mutate_each(funs(scale(., center = T, scale = T)), -Species) %>% head()
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
1-0.8976738791967661.01560199071363 -1.33575163424152 -1.31105214820513 setosa
2-1.13920048346495-0.13153881205026-1.33575163424152-1.31105214820513setosa
3-1.380727087733140.327317509055298-1.39239928624498-1.31105214820513setosa
4-1.50149038986724 0.0978893485025193-1.27910398223806 -1.31105214820513 setosa
5-1.018437181330861.24503015126641 -1.33575163424152-1.31105214820513setosa
6-0.5353839727944831.93331463292475 -1.16580867823115 -1.0486667949953 setosa

Example

2. The iris data set measurements are in centimeters. Convert them all to inches rounding to 2 decimal places and using the same column names.

Note: 1 inch = 2.54 cm.

In [24]:
cm.to.inch <- function(x) round(x/2.54, 2)

head(iris %>% mutate_each(funs(cm.to.inch), -Species))
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
12.01 1.38 0.55 0.08 setosa
21.93 1.18 0.55 0.08 setosa
31.85 1.26 0.51 0.08 setosa
41.81 1.22 0.59 0.08 setosa
51.97 1.42 0.55 0.08 setosa
62.13 1.54 0.67 0.16 setosa

Aggregate

collapsing multiple values into a single value (e.g., by summing or taking means).
In [25]:
df %>% count()
n
1150
In [26]:
df %>% count(Species)
Speciesn
1setosa50
2versicolor50
3virginica50
In [27]:
df %>% summarize(mean(Sepal.Length))
mean(Sepal.Length)
15.843333
In [28]:
df %>% summarize_each(c("mean", "sd"), c(-Species))
Sepal.Length_meanSepal.Width_meanPetal.Length_meanPetal.Width_meanSepal.Length_sdSepal.Width_sdPetal.Length_sdPetal.Width_sd
15.84333333.05733333.75800001.19933330.82806610.43586631.76529820.7622377
In [29]:
df %>% group_by(Species) %>% summarize_each(c("mean"))
SpeciesSepal.LengthSepal.WidthPetal.LengthPetal.Width
1setosa5.006 3.428 1.462 0.246
2versicolor5.936 2.77 4.26 1.326
3virginica6.588 2.974 5.552 2.026

Example

3. Find the min and max values for each group in the PlantGrowth data set. Name the columns smallest and largest respectively.

In [30]:
str(PlantGrowth)
'data.frame':   30 obs. of  2 variables:
 $ weight: num  4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
 $ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...
In [31]:
PlantGrowth %>%
group_by(group) %>%
summarise_each(funs(smallest=min, largest=max))
groupsmallestlargest
1ctrl4.176.11
2trt13.596.03
3trt24.926.31

Sort

changing the order of observations
In [ ]:
head(df %>% arrange(Sepal.Length))
In [ ]:
head(df %>% arrange(desc(Sepal.Length)))
In [ ]:
head(df %>% arrange(Petal.Width, desc(Sepal.Length)))

Sample

select random rows from a table

Sample 10 rows without replacement

In [ ]:
df %>% sample_n(size=10)
In [ ]:
### Sample 10 rows with replacement

df %>% sample_n(size=10, replace=T)
In [ ]:
### Sample 5% of the data

df %>% sample_frac(size=0.05)

Operations on columns

Sometimes we want to “filter” on columns rather than rows. We can use standard indexing or the dplyr functions

  • select
  • rename
In [ ]:
head(df[,2:3])
In [ ]:
head(df %>% select(sl = Sepal.Length, sw=Sepal.Width))
In [ ]:
head(df %>% select(petal = starts_with("Petal")))
In [ ]:
head(df %>% select(width = contains("Width")))

Exercises

Exercise 1: Recall that we converted the following data set into a tidy frame. Now we are informed that the pid field is actually a composite of expt.id, pid, sample.id and replicate.id separated by -, while the date field needs to be further broken down into day, month and year columns. In other words, we want a final data set which has the variables expt.id, pid, sample.id, replicate.id, age,day, month, year, buffer and value. Discard the initial semi-colon.

In [ ]:
df <- data.frame(desc = c(
";541-0194-1-0, 5wk, 5/Jan/09,  PLA",
";541-0215-1-0, 5wk, 1/Apr/09,  PLA",
";541-0231-6-0, 5wk, 31/Mar/09,  PLA",
";541-0235-4-0, 5wk, 22/Jan/09,  PLA",
";541-0237-8-0, 5wk, 21/Jan/09,  PLA",
";541-0248-0-0, 5wk, 1/Apr/09,  PLA",
";541-0254-6-0, 5wk, 24/Mar/09,  PLA",
";541-0258-4-0, 5wk, 14/May/09,  PLA"), values=rnorm(8))

head(df)
In [ ]:
df1 <- df %>%
separate(desc, c("tmp", "age", "date", "buffer"), sep=", ") %>%
separate(tmp, c("expt.id", "pid", "sample.id", "replicate.id"), sep="-") %>%
separate(date, c("day", "month", "year"), sep="/") %>%
mutate(expt.id=substring(expt.id, 2))
df1

Exercise 2: You are now informed that the values were in the wrong units. You need to apply the formula y = 39 + 3*x to recover the original values. Round to 2 decimal points.

In [ ]:
df2 <- df1 %>% mutate(values = round(39 + 3*values, 2))
df2

Exercise 3: Find the mean, min and max values when grouped by month.

In [ ]:
df2 %>% group_by(month) %>% summarize_each(c("mean", "min", "max"), values)

Exercise 4: You are now informed that the assay machine was not working properly in March. Create a new data set without any March values.

In [ ]:
df2[df2$month != "Mar",]
In [ ]: