Cleaning and preparing data¶

This lesson is on the use of the tidyr and dplyr packages that are part of a collection called the ‘tidyverse’. These packages are for data cleaning, preparation and visualization. Currently, data frames and their manipulation in R are in transition from older programming paradigms, syntax and data classes to some newer, (hopefully) better implementations.

Here’s what we want to do:

Objectives¶

How to structure your spreadsheet for analysis
How to read a spreadsheet or text file into a data frame (or tibble)
How to create a data frame (or tibble) from scratch
How to interpret a data frame (or tibble)
How to select rows and columns of a data frame (or tibble)
How to sort a data frame (or tibble)
What is “tidy” data?
How to tidy common forms of “messy” data with tidyr

Because many researchers may not have adopted the newer methods, we will often present both the ‘traditional’ R syntax and the tidyverse syntax. We encourage you to write new code using tidyr, dplyr and the rest of the tidyverse.

Main Additions and Modifications to Traditional R.¶

tibbles vs data frames¶

Base R (meaning the core R libraries and functions) employs a data class called a ‘data frame’. A data frame is a 2-dimensional array whose columns can be any type (numeric, vector, list, string, etc.). A tibble is also a data frame, with some refinements in the printing and subsetting features. Pretty much anywhere a data frame is used, a tibble can be used. Data frames can be converted to tibbles using as.tibbles and tibbles can be converted to data frames using as.data.frame. If you need to use an older package that doesn’t work with tibbles, you can just wrap your tibble with as.data.frame and everything should work. We’ll see some examples shortly.

Pipes¶

Pipes are new to R, but the concept of a pipe is not at all new. In fact, it has been around for decades in Unix scripting language. The pipe works like a literal pipe - it is a connector that carries things from one output to a new input. For example, the water that is output by the municipal system flows via a pipe as input to your house, that is then run through your faucet, whose output goes to the drain, and the water output from the drain is carried via a pipe to become input to the sewer system.

Software pipes carry the output of one program to be input to another. Their use makes code more readable to a human. Readability to humans is a key goal in today’s programming environments, because it makes modifying code easier, and it reduces the possibility of bugs. The symbol for the pipe connector in R is %>%. (In unix the pipe symbol is |, and you may use this when you carry out the bioinformatics pipeline.)

In [1]:

#Loading the tidyverse library

library(tidyverse)
library(stringr)

Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats

How to structure your spreadsheet for analysis¶

Many people in the biological sciences use excel spreadsheets to store data. When we import a spreadsheet into a programming environment like R, we need to be much more careful about formatting. Spreadsheets are formatted for people to look at. We want our data in a format that is readable for a machine.

The following are some common issues in converting spreadsheets to usable data sets.

Do this¶

Not that¶

Table is not rectangular
Fancy annotations lost when exported to CSV
Single column contains too much information

How to read a spreadsheet or text file into a data frame¶

Reading in a data frame from a text or csv file¶

This is the most common case when data is exported from a spreadsheet to a CSV file.

In [162]:

url <- "http://vincentarelbundock.github.io/Rdatasets/csv/datasets/Titanic.csv"
download.file(url = url, destfile="titanic.csv")

Using `read.csv`¶

In [163]:

titanic <- read.csv("titanic.csv")
head(titanic, n=4)

	X	Name	PClass	Age	Sex	Survived	SexCode
1	1	Allen, Miss Elisabeth Walton	1st	29	female	1	1
2	2	Allison, Miss Helen Loraine	1st	2	female	0	1
3	3	Allison, Mr Hudson Joshua Creighton	1st	30	male	0	0
4	4	Allison, Mrs Hudson JC (Bessie Waldo Daniels)	1st	25	female	0	1

In [2]:

# read_csv is different from read.csv

titanic_tibble <- read_csv("titanic.csv")
titanic_tibble

Warning message:
“Missing column names filled in: 'X1' [1]”Parsed with column specification:
cols(
  X1 = col_integer(),
  Name = col_character(),
  PClass = col_character(),
  Age = col_double(),
  Sex = col_character(),
  Survived = col_integer(),
  SexCode = col_integer()
)

	X1	Name	PClass	Age	Sex	Survived	SexCode
1	1	Allen, Miss Elisabeth Walton	1st	29	female	1	1
2	2	Allison, Miss Helen Loraine	1st	2	female	0	1
3	3	Allison, Mr Hudson Joshua Creighton	1st	30	male	0	0
4	4	Allison, Mrs Hudson JC (Bessie Waldo Daniels)	1st	25	female	0	1
5	5	Allison, Master Hudson Trevor	1st	0.92	male	1	0
6	6	Anderson, Mr Harry	1st	47	male	1	0
7	7	Andrews, Miss Kornelia Theodosia	1st	63	female	1	1
8	8	Andrews, Mr Thomas, jr	1st	39	male	0	0
9	9	Appleton, Mrs Edward Dale (Charlotte Lamson)	1st	58	female	1	1
10	10	Artagaveytia, Mr Ramon	1st	71	male	0	0
11	11	Astor, Colonel John Jacob	1st	47	male	0	0
12	12	Astor, Mrs John Jacob (Madeleine Talmadge Force)	1st	19	female	1	1
13	13	Aubert, Mrs Leontine Pauline	1st	NA	female	1	1
14	14	Barkworth, Mr Algernon H	1st	NA	male	1	0
15	15	Baumann, Mr John D	1st	NA	male	0	0
16	16	Baxter, Mrs James (Helene DeLaudeniere Chaput)	1st	50	female	1	1
17	17	Baxter, Mr Quigg Edmond	1st	24	male	0	0
18	18	Beattie, Mr Thomson	1st	36	male	0	0
19	19	Beckwith, Mr Richard Leonard	1st	37	male	1	0
20	20	Beckwith, Mrs Richard Leonard (Sallie Monypeny)	1st	47	female	1	1
21	21	Behr, Mr Karl Howell	1st	26	male	1	0
22	22	Birnbaum, Mr Jakob	1st	25	male	0	0
23	23	Bishop, Mr Dickinson H	1st	25	male	1	0
24	24	Bishop, Mrs Dickinson H (Helen Walton)	1st	19	female	1	1
25	25	Bjornstrm-Steffansson, Mr Mauritz Hakan	1st	28	male	1	0
26	26	Blackwell, Mr Stephen Weart	1st	45	male	0	0
27	27	Blank, Mr Henry	1st	39	male	1	0
28	28	Bonnell, Miss Caroline	1st	30	female	1	1
29	29	Bonnell, Miss Elizabeth	1st	58	female	1	1
30	30	Borebank, Mr John James	1st	NA	male	0	0
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
1284	1284	Vestrom, Miss Hulda Amanda Adolfina	3rd	14	female	0	1
1285	1285	Vonk, Mr Jenko	3rd	22	male	0	0
1286	1286	Ware, Mr Frederick	3rd	NA	male	0	0
1287	1287	Warren, Mr Charles William	3rd	NA	male	0	0
1288	1288	Wazli, Mr Yousif	3rd	NA	male	0	0
1289	1289	Webber, Mr James	3rd	NA	male	0	0
1290	1290	Wennerstrom, Mr August Edvard	3rd	NA	male	1	0
1291	1291	Wenzel, Mr Linhart	3rd	NA	male	0	0
1292	1292	Widegren, Mr Charles Peter	3rd	51	male	0	0
1293	1293	Wiklund, Mr Jacob Alfred	3rd	18	male	0	0
1294	1294	Wilkes, Mrs Ellen	3rd	45	female	1	1
1295	1295	Willer, Mr Aaron	3rd	NA	male	0	0
1296	1296	Willey, Mr Edward	3rd	NA	male	0	0
1297	1297	Williams, Mr Howard Hugh	3rd	NA	male	0	0
1298	1298	Williams, Mr Leslie	3rd	28	male	0	0
1299	1299	Windelov, Mr Einar	3rd	21	male	0	0
1300	1300	Wirz, Mr Albert	3rd	27	male	0	0
1301	1301	Wiseman, Mr Phillippe	3rd	NA	male	0	0
1302	1302	Wittevrongel, Mr Camiel	3rd	36	male	0	0
1303	1303	Yalsevac, Mr Ivan	3rd	NA	male	1	0
1304	1304	Yasbeck, Mr Antoni	3rd	27	male	0	0
1305	1305	Yasbeck, Mrs Antoni	3rd	15	female	1	1
1306	1306	Youssef, Mr Gerios	3rd	NA	male	0	0
1307	1307	Zabour, Miss Hileni	3rd	NA	female	0	1
1308	1308	Zabour, Miss Tamini	3rd	NA	female	0	1
1309	1309	Zakarian, Mr Artun	3rd	27	male	0	0
1310	1310	Zakarian, Mr Maprieder	3rd	26	male	0	0
1311	1311	Zenni, Mr Philip	3rd	22	male	0	0
1312	1312	Lievens, Mr Rene	3rd	24	male	0	0
1313	1313	Zimmerman, Leo	3rd	29	male	0	0

In [3]:

print(class(titanic))
print(class(titanic_tibble))

Error in print(class(titanic)): object 'titanic' not found
Traceback:

1. print(class(titanic))

`read.csv` is just a wrapper for `read.table` with some defaults given¶

In [166]:

titanic.1 <- read.table("titanic.csv", sep=",", header = T)
head(titanic.1, n=4)

#Same applies to read_csv - it is a wrapper for read_delim

	X	Name	PClass	Age	Sex	Survived	SexCode
1	1	Allen, Miss Elisabeth Walton	1st	29	female	1	1
2	2	Allison, Miss Helen Loraine	1st	2	female	0	1
3	3	Allison, Mr Hudson Joshua Creighton	1st	30	male	0	0
4	4	Allison, Mrs Hudson JC (Bessie Waldo Daniels)	1st	25	female	0	1

Let’s get rid of the first column, as it isn’t useful.

In [167]:

head(titanic[, -1], n=4)

	Name	PClass	Age	Sex	Survived	SexCode
1	Allen, Miss Elisabeth Walton	1st	29	female	1	1
2	Allison, Miss Helen Loraine	1st	2	female	0	1
3	Allison, Mr Hudson Joshua Creighton	1st	30	male	0	0
4	Allison, Mrs Hudson JC (Bessie Waldo Daniels)	1st	25	female	0	1

Alternative ising `dplyr`¶

An alternative approach using dplyr that we encourage getting used to as it will provides a consistent grammar for data frame manipulations.

In [168]:

head(titanic_tibble %>% select(-1), n=4)

	Name	PClass	Age	Sex	Survived	SexCode
1	Allen, Miss Elisabeth Walton	1st	29	female	1	1
2	Allison, Miss Helen Loraine	1st	2	female	0	1
3	Allison, Mr Hudson Joshua Creighton	1st	30	male	0	0
4	Allison, Mrs Hudson JC (Bessie Waldo Daniels)	1st	25	female	0	1

2. Write a function called peek that takes as arguments df and n. It then returns n random rows ffrom the data frame df (without repeats). If n is larger than the number of rows, it should retrun the original data frame. Give n a default value of 6.

In [169]:

peek <- function(df, n=6) {
    nr <- nrow(df)
    if (n >= nr) {
        return(df)
    } else {
        idx <- sample(nr, n, replace=F)
        return(df[idx, ])
    }
}

Alternative using `sample_n` from `dplyr`¶

In [170]:

peek.1 <- function(df, n=6) {
    nr <- nrow(df)
    if (n >= nr) {
        return(df)
    } else {
        return(sample_n(df, n, replace=F))
    }
}

How to create a data frame from scratch¶

In [171]:

n <- 12
subjects <- sample(1:1000, n, replace=FALSE)
treatments <- sample(c("Placebo", "Drug"), size = n, replace = TRUE)
n1 <- sum(treatments == "Placebo")
n2 <- n - n1
placebo.vals <- rnorm(n=n1, mean = 10, sd=1)
drug.vals <- rnorm(n=n2, mean=15, sd=1)
gene1 <- ifelse(treatments=="Placebo", placebo.vals, drug.vals)
gene2 <- rnorm(n=n, mean=12, sd=1)
df <- data.frame(subjects=subjects, treatments=treatments, gene1=gene1, gene2=gene2)
df

	subjects	treatments	gene1	gene2
1	153	Drug	15.5163083478864	11.8019177313714
2	98	Placebo	11.3897127792735	10.4411774515486
3	250	Drug	15.8043076138157	11.5690995283775
4	545	Drug	15.4461393252021	11.240600432231
5	442	Placebo	10.7667391418026	12.4025164044546
6	832	Drug	16.5360327391973	10.7263837943643
7	625	Drug	14.2269966318662	13.5853056018832
8	293	Placebo	9.38369841723581	12.1824448855273
9	32	Drug	13.671926122039	11.5003612782279
10	469	Drug	15.8043076138157	12.8664592872279
11	955	Placebo	10.3796050585356	13.6823416307802
12	443	Placebo	11.3897127792735	11.9559040521596

How to interpret a data frame¶

In [172]:

class(df)

'data.frame'

In [173]:

rownames(df)

'1'
'2'
'3'
'4'
'5'
'6'
'7'
'8'
'9'
'10'
'11'
'12'

In [174]:

colnames(df)

'subjects'
'treatments'
'gene1'
'gene2'

In [175]:

dim(df)

12
4

In [176]:

str(df)

'data.frame':   12 obs. of  4 variables:
 $ subjects  : int  153 98 250 545 442 832 625 293 32 469 ...
 $ treatments: Factor w/ 2 levels "Drug","Placebo": 1 2 1 1 2 1 1 2 1 1 ...
 $ gene1     : num  15.5 11.4 15.8 15.4 10.8 ...
 $ gene2     : num  11.8 10.4 11.6 11.2 12.4 ...

In [177]:

summary(df)

subjects       treatments     gene1            gene2
 Min.   : 32.0   Drug   :7    Min.   : 9.384   Min.   :10.44
 1st Qu.:225.8   Placebo:5    1st Qu.:11.234   1st Qu.:11.44
 Median :442.5                Median :13.949   Median :11.88
 Mean   :428.1                Mean   :13.360   Mean   :12.00
 3rd Qu.:565.0                3rd Qu.:15.588   3rd Qu.:12.52
 Max.   :955.0                Max.   :16.536   Max.   :13.68

In [178]:

head(df)

	subjects	treatments	gene1	gene2
1	153	Drug	15.5163083478864	11.8019177313714
2	98	Placebo	11.3897127792735	10.4411774515486
3	250	Drug	15.8043076138157	11.5690995283775
4	545	Drug	15.4461393252021	11.240600432231
5	442	Placebo	10.7667391418026	12.4025164044546
6	832	Drug	16.5360327391973	10.7263837943643

In [179]:

tail(df)

	subjects	treatments	gene1	gene2
7	625	Drug	14.2269966318662	13.5853056018832
8	293	Placebo	9.38369841723581	12.1824448855273
9	32	Drug	13.671926122039	11.5003612782279
10	469	Drug	15.8043076138157	12.8664592872279
11	955	Placebo	10.3796050585356	13.6823416307802
12	443	Placebo	11.3897127792735	11.9559040521596

How to select rows and columns of a data frame¶

In [180]:

df[[8, 3]]

9.38369841723581

In [181]:

df$treatments

Drug
Placebo
Drug
Drug
Placebo
Drug
Drug
Placebo
Drug
Drug
Placebo
Placebo

In [182]:

df[[2]]

Drug
Placebo
Drug
Drug
Placebo
Drug
Drug
Placebo
Drug
Drug
Placebo
Placebo

In [183]:

df[,2]

Drug
Placebo
Drug
Drug
Placebo
Drug
Drug
Placebo
Drug
Drug
Placebo
Placebo

In [184]:

df[2:4,]

	subjects	treatments	gene1	gene2
2	98	Placebo	11.3897127792735	10.4411774515486
3	250	Drug	15.8043076138157	11.5690995283775
4	545	Drug	15.4461393252021	11.240600432231

In [185]:

df[treatments == "Placebo", ]

	subjects	treatments	gene1	gene2
2	98	Placebo	11.3897127792735	10.4411774515486
5	442	Placebo	10.7667391418026	12.4025164044546
8	293	Placebo	9.38369841723581	12.1824448855273
11	955	Placebo	10.3796050585356	13.6823416307802
12	443	Placebo	11.3897127792735	11.9559040521596

In [186]:

df[(gene1 < 12) & (gene2 > 12), ]

	subjects	treatments	gene1	gene2
5	442	Placebo	10.7667391418026	12.4025164044546
8	293	Placebo	9.38369841723581	12.1824448855273
11	955	Placebo	10.3796050585356	13.6823416307802

How to sort a data frame¶

In [187]:

df[order(subjects),]

	subjects	treatments	gene1	gene2
9	32	Drug	13.671926122039	11.5003612782279
2	98	Placebo	11.3897127792735	10.4411774515486
1	153	Drug	15.5163083478864	11.8019177313714
3	250	Drug	15.8043076138157	11.5690995283775
8	293	Placebo	9.38369841723581	12.1824448855273
5	442	Placebo	10.7667391418026	12.4025164044546
12	443	Placebo	11.3897127792735	11.9559040521596
10	469	Drug	15.8043076138157	12.8664592872279
4	545	Drug	15.4461393252021	11.240600432231
7	625	Drug	14.2269966318662	13.5853056018832
6	832	Drug	16.5360327391973	10.7263837943643
11	955	Placebo	10.3796050585356	13.6823416307802

In [188]:

df[order(gene1),]

	subjects	treatments	gene1	gene2
8	293	Placebo	9.38369841723581	12.1824448855273
11	955	Placebo	10.3796050585356	13.6823416307802
5	442	Placebo	10.7667391418026	12.4025164044546
2	98	Placebo	11.3897127792735	10.4411774515486
12	443	Placebo	11.3897127792735	11.9559040521596
9	32	Drug	13.671926122039	11.5003612782279
7	625	Drug	14.2269966318662	13.5853056018832
4	545	Drug	15.4461393252021	11.240600432231
1	153	Drug	15.5163083478864	11.8019177313714
3	250	Drug	15.8043076138157	11.5690995283775
10	469	Drug	15.8043076138157	12.8664592872279
6	832	Drug	16.5360327391973	10.7263837943643

In [189]:

df[order(-df$gene1),] # Explicit naming of dataframe column

	subjects	treatments	gene1	gene2
6	832	Drug	16.5360327391973	10.7263837943643
3	250	Drug	15.8043076138157	11.5690995283775
10	469	Drug	15.8043076138157	12.8664592872279
1	153	Drug	15.5163083478864	11.8019177313714
4	545	Drug	15.4461393252021	11.240600432231
7	625	Drug	14.2269966318662	13.5853056018832
9	32	Drug	13.671926122039	11.5003612782279
2	98	Placebo	11.3897127792735	10.4411774515486
12	443	Placebo	11.3897127792735	11.9559040521596
5	442	Placebo	10.7667391418026	12.4025164044546
11	955	Placebo	10.3796050585356	13.6823416307802
8	293	Placebo	9.38369841723581	12.1824448855273

Examples¶

1. Find all rows where subjects were given placebo with gene1 values above 10 and gene2 values below 12.

In [190]:

df[(treatments=="Placebo") & (gene1 > 10) & (gene2 < 12), ]

	subjects	treatments	gene1	gene2
2	98	Placebo	11.3897127792735	10.4411774515486
12	443	Placebo	11.3897127792735	11.9559040521596

2. What is the log ratio of gene2 to gene1 for subjects with IDs in the range 100 to 500? Save the information for this subset in a new data frame and sort by the log ratio in descending order.

In [191]:

df.1 <- df[(subjects >= 100) & (subjects <= 500), ]
df.1$lr <- log(df.1$gene2) - log(df.1$gene1)
df.1[order(-df.1$lr), ]

	subjects	treatments	gene1	gene2	lr
8	293	Placebo	9.38369841723581	12.1824448855273	0.261021998797103
5	442	Placebo	10.7667391418026	12.4025164044546	0.141437714890785
12	443	Placebo	11.3897127792735	11.9559040521596	0.0485146590903738
10	469	Drug	15.8043076138157	12.8664592872279	-0.205658666561063
1	153	Drug	15.5163083478864	11.8019177313714	-0.273629584403845
3	250	Drug	15.8043076138157	11.5690995283775	-0.311944826627759

What is “tidy” data?¶

First we need some definitions:

A value is a number or a string.
A variable contains all values that measure the same underlying attribute (e.g. height, gene expression). A fixed variable is part of the experimental design (e.g. subject ID, treatment group) while a measured variable refers to data collected as part of the experiment (e.g. viral titer, gene expression level).
An observation contains all values measured on the same unit (e.g. person, assay)

To simplify data analysis, we want to clean and prepare tidy data, where

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

With tidy data, a table is made up of rows and columns containing values, where each row represents an observation and each column represents a variable. Each table describes a type of observation - for example, in a medical experiment we may have three tables - demographic data about each person, gene expression data for each person on each visit, and proteomic data for each person on each visit.

See the tidy data paper for a philosophy of data.

There are many other aspects of cleaning up messy data not addressed here. From the paper above, here are some:

Apart from tidying, there are many other tasks involved in cleaning data: parsing dates and numbers, identifying missing values, correcting character encodings (for international data), matching similar but not identical values (created by typos), verifying experimental design, and filling in structural missing values, not to mention model-based data cleaning that identifies suspicious values.

Examples¶

It is simplest to grasp the concept of tidy data via concrete examples. Suppose we perform two different experiments on subjects Ann and Bob, and measure some physiological attribute in each experiment. We can record the data as a table in several ways:

Not tidy data (Table 1)¶

	Expt 1	Expt 2
Ann	175	45
Bob	183	72

Not tidy data (Table 2)¶

	Ann	Bob
Expt 1	175	183
Expt 2	45	72

Tidy data (Table 3)¶

Subject	Expt	Value
Ann	1	175
Ann	2	45
Bob	1	183
Bob	2	72

Because of the uniform structure, tables in the tidy data format make it easy to answer questions about different variables in a consistent way, and also to merge different tables (e.g. subject demographic data with results of laboratory assays for each subject). It is conventional to order the columns so that fixed variables (e.g. subject) come before measured variables.

What is a variable?¶

Table 4¶

Subject	Gene 1	Gene 2
Ann	175	45
Bob	183	72

Table 1 and 4 have almost the exact same formal structure - yet Table 1 is messy while Table 4 is tidy. This is because it is natural to consider the expression values for a particular gene across subjects as a variable, but the same is not true for a particular experiment. Pragmatically, whether something should be a variable or a variable depends on its role in a statistical model. From a modeling perspective, it is natural to consider outcome ~ gene1 * gene2 but outcome ~ expt1 * expt2 doesn’t really make any sense.

How to tidy common forms of “messy” data with `tidyr`¶

Here we give examples of how to convert common forms of messy data into tidy data. We use toy examples for illustration, but the tidyr and dplyr functions are highly efficient and work well with HTS data sets.

There are only a few important verbs (functions) to remember:

For working with single data frames¶

gather: takes multiple columns into a single column
spread: takes a single column into multiple columns
separate: splits a variable name based on a specified separator
extract: splits a variable name using regular expression capture groups

For combining multiple data frames¶

bind_rows: “combine data frames “vertically”; increases number of rows
*_join: combine data frames “horizontally” by matching on one or more common columns; increases number of columns

We illustrate these verbs with examples from common data tidying scenarios. There are a few common types of messy data that we need to handle:

Converting columns into rows (convert from “wide” to “tall” format)
Multiple variables in a single column (splitting a single value into two or more)
Variables stored in both rows and columns (move all variables to columns)
One type in multiple tables (combine several similar tables into one)
Multiple types in multiple tables (combine tables with different columns into one based on some identifier)

(1) Converting columns into rows¶

One common situation is that column headers are values rather than variables. For example, we often need to convert data like Table 1 into the form given by Table 3. This can be done in two steps:

make a new column from the row names using tidy::add_rownames()
reshape the data frame using tidyr::gather()

Create a data frame representing table 1¶

In [192]:

expt1 <- c(175, 183)
expt2 <- c(45, 72)
subject <- c("Ann", "Bob")
t1 <- data.frame(expt1, expt2, row.names=subject)
t1

	expt1	expt2
Ann	175	45
Bob	183	72

Add row names¶

In [193]:

t <- add_rownames(t1, "subject")
t

Warning message:
“Deprecated, use tibble::rownames_to_column() instead.”

	subject	expt1	expt2
1	Ann	175	45
2	Bob	183	72

Reshape from “wide” to “tall”¶

The tidyr::gather function takes the following arguments - data, key, value, list of columns to gather. The list of columns can be specified as a comma separated list, as a slice with : notation, or by exclusion of variables with -.

In [196]:

gather(t, key=expt, value=value, expt1, expt2)

	subject	expt	value
1	Ann	expt1	175
2	Bob	expt1	183
3	Ann	expt2	45
4	Bob	expt2	72

In [197]:

gather(t, key=expt, value=value, expt1:expt2)

	subject	expt	value
1	Ann	expt1	175
2	Bob	expt1	183
3	Ann	expt2	45
4	Bob	expt2	72

In [198]:

gather(t, expt, value, -subject)

	subject	expt	value
1	Ann	expt1	175
2	Bob	expt1	183
3	Ann	expt2	45
4	Bob	expt2	72

Using `piping`¶

When using the tidyr and dplyr packages, a common idiom is to “pipe” a chain of transformations using the %>% pipe operator. In this idiom, the data frame is implicitly passed on down the chain, and does not need to be specified in the transformation function call.

In [199]:

t %>% gather(expt, value, -subject)

	subject	expt	value
1	Ann	expt1	175
2	Bob	expt1	183
3	Ann	expt2	45
4	Bob	expt2	72

(2) Multiple variables in a single column¶

Sometimes data in a column is concatenated from multiple variables. To convert to tidy data, we need to split into separate columns for each variable. The next example shows a column where the variable is a concatenation of the chromosome number and offset position. We can use tidyr::separate to do this easily if there is an obvious delimiter (that can be a regular expression).

In [200]:

chr.offset <- c("Chr1:345", "Chr2:712")
subject <- c("Ann", "Bob")
df <- data.frame(subject, chr.offset)
df

	subject	chr.offset
1	Ann	Chr1:345
2	Bob	Chr2:712

In [201]:

df %>% separate(chr.offset, into=c("chromosome", "offset"), sep=":" )

	subject	chromosome	offset
1	Ann	Chr1	345
2	Bob	Chr2	712

When there is no obvious separator¶

Sometimes there is no obvious separator to use to split the variables. In many cases, we can use extract to capture the separate parts based on matching to regular expression patterns.

In [202]:

pid.name.visit <- c("00345Ann1", "02122John13")
df <- data.frame(pid.name.visit)
df

	pid.name.visit
1	00345Ann1
2	02122John13

In [203]:

df %>% extract(pid.name.visit, c("pid", "name", "visit"), "([0-9]+)([a-zA-Z_-]+)([0-9]+)")

	pid	name	visit
1	00345	Ann	1
2	02122	John	13

In [204]:

df %>% extract(pid.name.visit, c("pid", "name", "visit"), "(\\d+)(\\w+)(\\d+)")

	pid	name	visit
1	00345	Ann	1
2	02122	John1	3

(3) Variables stored in both rows and columns¶

In the toy example below, we measure peak and trough levels of something (blood glucose level, viral titers etc) recorded between visits. We convert this into a tidy data frame in two steps:

Use tidyr::gather to create a new column variable visit
Use tidyr::spread to create peak and trough column variables from each row of measure

Note that gather and spread are inverse operations.

In [205]:

subject <- c("Ann", "Ann", "Bob", "Bob", "Charlie", "Charlie")
measure <- c("peak", "trough", "peak", "trough", "peak", "trough")
visit1 <- c(250, 125, 1000, 750, 1500, 250)
visit2 <- c(500, 125, 900, 650, 1550, 500)
df <- data.frame(subject, measure, visit1, visit2)
df

	subject	measure	visit1	visit2
1	Ann	peak	250	500
2	Ann	trough	125	125
3	Bob	peak	1000	900
4	Bob	trough	750	650
5	Charlie	peak	1500	1550
6	Charlie	trough	250	500

In [206]:

df1 <- df %>% gather(key=visit, value=titer, visit1:visit2)
df1

	subject	measure	visit	titer
1	Ann	peak	visit1	250
2	Ann	trough	visit1	125
3	Bob	peak	visit1	1000
4	Bob	trough	visit1	750
5	Charlie	peak	visit1	1500
6	Charlie	trough	visit1	250
7	Ann	peak	visit2	500
8	Ann	trough	visit2	125
9	Bob	peak	visit2	900
10	Bob	trough	visit2	650
11	Charlie	peak	visit2	1550
12	Charlie	trough	visit2	500

In [207]:

df1 %>% spread(key=measure, value=titer)

	subject	visit	peak	trough
1	Ann	visit1	250	125
2	Ann	visit2	500	125
3	Bob	visit1	1000	750
4	Bob	visit2	900	650
5	Charlie	visit1	1500	250
6	Charlie	visit2	1550	500

We can use the pipe operator to combine into a single operation¶

In [208]:

df %>% gather(visit, titer, visit1:visit2) %>% spread(measure, titer)

	subject	visit	peak	trough
1	Ann	visit1	250	125
2	Ann	visit2	500	125
3	Bob	visit1	1000	750
4	Bob	visit2	900	650
5	Charlie	visit1	1500	250
6	Charlie	visit2	1550	500

(4) One type in multiple tables¶

Data may be collected in batches and stored in separate files per batch. To perform analysis, we have to combine multiple tables into one. This can be easily done using dplyr:bind_rows.

In [209]:

jan <- data.frame(subject=c("Ann", "Bob"), value=rnorm(2))
feb <- data.frame(subject=c("Ann", "Bob"), value=rnorm(2))
mar <- data.frame(subject=c("Ann", "Bob"), value=rnorm(2))

In [210]:

jan

	subject	value
1	Ann	0.0473575253372743
2	Bob	-1.16544371780111

In [211]:

feb

	subject	value
1	Ann	-0.58470873143165
2	Bob	1.73897748176552

In [212]:

mar

	subject	value
1	Ann	-0.67510165070467
2	Bob	-1.83655926295541

In [213]:

bind_rows(jan, feb, mar)

	subject	value
1	Ann	0.0473575253372743
2	Bob	-1.16544371780111
3	Ann	-0.58470873143165
4	Bob	1.73897748176552
5	Ann	-0.67510165070467
6	Bob	-1.83655926295541

We have the option of creating a new column to indicate the original data frame source¶

In [214]:

bind_rows("jan"=jan, "feb"=feb, "mar"=mar, .id="month")

	month	subject	value
1	jan	Ann	0.0473575253372743
2	jan	Bob	-1.16544371780111
3	feb	Ann	-0.58470873143165
4	feb	Bob	1.73897748176552
5	mar	Ann	-0.67510165070467
6	mar	Bob	-1.83655926295541

(5) Multiple types in multiple tables¶

The analysis tools in R typically work on a single data frame. Hence there may be a need to combine different tables (e..g demographic and assay data) in order to manipulate, analyze or visualize the complete data.

In [215]:

subject <- c("Ann", "Bob", "Charlie")
age <- c(34,45,56)
outcome <- c("CR", "PR", "CR")
df1 <- data.frame(subject, age, outcome)
head(df1)

	subject	age	outcome
1	Ann	34	CR
2	Bob	45	PR
3	Charlie	56	CR

In [216]:

subject <- c("Ann", "Ann", "Bob", "Bob", "Charlie", "Charlie")
measure <- c("peak", "trough", "peak", "trough", "peak", "trough")
visit1 <- c(250, 125, 1000, 750, 1500, 250)
visit2 <- c(500, 125, 900, 650, 1550, 500)
df <- data.frame(subject, measure, visit1, visit2)
df2 <- df %>% gather(visit, titer, visit1:visit2) %>% spread(measure, titer)
head(df2)

	subject	visit	peak	trough
1	Ann	visit1	250	125
2	Ann	visit2	500	125
3	Bob	visit1	1000	750
4	Bob	visit2	900	650
5	Charlie	visit1	1500	250
6	Charlie	visit2	1550	500

In [217]:

inner_join(df1, df2)

Joining, by = "subject"

	subject	age	outcome	visit	peak	trough
1	Ann	34	CR	visit1	250	125
2	Ann	34	CR	visit2	500	125
3	Bob	45	PR	visit1	1000	750
4	Bob	45	PR	visit2	900	650
5	Charlie	56	CR	visit1	1500	250
6	Charlie	56	CR	visit2	1550	500

When the column to join by has different names in each table¶

In [218]:

df3 <- df1
colnames(df3)[1] = "name"
df3

	name	age	outcome
1	Ann	34	CR
2	Bob	45	PR
3	Charlie	56	CR

In [219]:

inner_join(df2, df3, by=c("subject"="name"))

	subject	visit	peak	trough	age	outcome
1	Ann	visit1	250	125	34	CR
2	Ann	visit2	500	125	34	CR
3	Bob	visit1	1000	750	45	PR
4	Bob	visit2	900	650	45	PR
5	Charlie	visit1	1500	250	56	CR
6	Charlie	visit2	1550	500	56	CR

Exercises¶

beaver1 and beaver2 are time series data collected on the body temperatures of two beavers (You can read more here)

In [220]:

head(beaver1)

	day	time	temp
1	346.00	840.00	36.33
2	346.00	850.00	36.34
3	346.00	900.00	36.35
4	346.00	910.00	36.42
5	346.00	920.00	36.55
6	346.00	930.00	36.69

In [221]:

str(beaver1)

'data.frame':   114 obs. of  4 variables:
 $ day  : num  346 346 346 346 346 346 346 346 346 346 ...
 $ time : num  840 850 900 910 920 930 940 950 1000 1010 ...
 $ temp : num  36.3 36.3 36.4 36.4 36.5 ...
 $ activ: num  0 0 0 0 0 0 0 0 0 0 ...

In [222]:

head(beaver2)

	day	time	temp
1	307.00	930.00	36.58
2	307.00	940.00	36.73
3	307.00	950.00	36.93
4	307.00	1000.00	37.15
5	307.00	1010.00	37.23
6	307.00	1020.00	37.24

In [223]:

str(beaver2)

'data.frame':   100 obs. of  4 variables:
 $ day  : num  307 307 307 307 307 307 307 307 307 307 ...
 $ time : num  930 940 950 1000 1010 1020 1030 1040 1050 1100 ...
 $ temp : num  36.6 36.7 36.9 37.1 37.2 ...
 $ activ: num  0 0 0 0 0 0 0 0 0 0 ...

Exercise 1: Combine the beaver1 and beaver2 data frames, adding a new column beaver with entries that take value 1 or 2 depending on which data frame the data came from.

In [224]:

head(bind_rows(beaver1, beaver2, .id="beaver"), n=4)

	beaver	day	time	temp
1	1	346	840	36.33
2	1	346	850	36.34
3	1	346	900	36.35
4	1	346	910	36.42

Exercise 2: The data frame given below has multiple variables in a singel column. Convert this to a tidy data frame with the following variables = pid, age, date and buffer- the fields are separated by a comma and a space.

In [225]:

df <- data.frame(desc = c(
"541-0194-1-0, 5wk, 5/Jan/09,  PLA",
"541-0215-1-0, 5wk, 1/Apr/09,  PLA",
"541-0231-6-0, 5wk, 31/Mar/09,  PLA",
"541-0235-4-0, 5wk, 22/Jan/09,  PLA",
"541-0237-8-0, 5wk, 21/Jan/09,  PLA",
"541-0248-0-0, 5wk, 1/Apr/09,  PLA",
"541-0254-6-0, 5wk, 24/Mar/09,  PLA",
"541-0258-4-0, 5wk, 14/May/09,  PLA"), values=rnorm(8))

In [226]:

df

	desc	values
1	541-0194-1-0, 5wk, 5/Jan/09, PLA	1.68401691250244
2	541-0215-1-0, 5wk, 1/Apr/09, PLA	-1.82541741635734
3	541-0231-6-0, 5wk, 31/Mar/09, PLA	0.322267715332376
4	541-0235-4-0, 5wk, 22/Jan/09, PLA	1.20301573073413
5	541-0237-8-0, 5wk, 21/Jan/09, PLA	-0.527461157819855
6	541-0248-0-0, 5wk, 1/Apr/09, PLA	-0.298129442756429
7	541-0254-6-0, 5wk, 24/Mar/09, PLA	-0.0780923846070984
8	541-0258-4-0, 5wk, 14/May/09, PLA	0.990178319959082

In [227]:

df %>% separate(desc, sep = ", ", into = c("pid", "age", "date", "buffer"))

	pid	age	date	buffer	values
1	541-0194-1-0	5wk	5/Jan/09	PLA	1.68401691250244
2	541-0215-1-0	5wk	1/Apr/09	PLA	-1.82541741635734
3	541-0231-6-0	5wk	31/Mar/09	PLA	0.322267715332376
4	541-0235-4-0	5wk	22/Jan/09	PLA	1.20301573073413
5	541-0237-8-0	5wk	21/Jan/09	PLA	-0.527461157819855
6	541-0248-0-0	5wk	1/Apr/09	PLA	-0.298129442756429
7	541-0254-6-0	5wk	24/Mar/09	PLA	-0.0780923846070984
8	541-0258-4-0	5wk	14/May/09	PLA	0.990178319959082

Manipulating Data¶

We illustrate five common data manipulation techniques on tidy data. When combined, these techniques provide power tools for slicing and dicing your data:

filter: slice
transform: mutate, transmute, mutate_each, transmute_each
aggregate: group_by, count, summarize, summarize_each
sort: arrange
sample: sample_n, sample_frac
column operations: select, rename

String functions used:

str_detect
substring

References¶

Links to functions in packages¶

Start with a tidy data frame¶

In [228]:

df = iris
head(df)

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
1	5.1	3.5	1.4	0.2	setosa
2	4.9	3	1.4	0.2	setosa
3	4.7	3.2	1.3	0.2	setosa
4	4.6	3.1	1.5	0.2	setosa
5	5	3.6	1.4	0.2	setosa
6	5.4	3.9	1.7	0.4	setosa

In [229]:

str(df)

'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Filter¶

subsetting or removing observations based on some condition

Filter on rows¶

In [230]:

df[11:20,]

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
11	5.4	3.7	1.5	0.2	setosa
12	4.8	3.4	1.6	0.2	setosa
13	4.8	3	1.4	0.1	setosa
14	4.3	3	1.1	0.1	setosa
15	5.8	4	1.2	0.2	setosa
16	5.7	4.4	1.5	0.4	setosa
17	5.4	3.9	1.3	0.4	setosa
18	5.1	3.5	1.4	0.3	setosa
19	5.7	3.8	1.7	0.3	setosa
20	5.1	3.8	1.5	0.3	setosa

In [231]:

df %>% slice(11:20)

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
1	5.4	3.7	1.5	0.2	setosa
2	4.8	3.4	1.6	0.2	setosa
3	4.8	3	1.4	0.1	setosa
4	4.3	3	1.1	0.1	setosa
5	5.8	4	1.2	0.2	setosa
6	5.7	4.4	1.5	0.4	setosa
7	5.4	3.9	1.3	0.4	setosa
8	5.1	3.5	1.4	0.3	setosa
9	5.7	3.8	1.7	0.3	setosa
10	5.1	3.8	1.5	0.3	setosa

In [246]:

head(df[df$Species == "versicolor",])

df %>% filter (df$Species == "versicolor") %>% head

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
51	7	3.2	4.7	1.4	versicolor
52	6.4	3.2	4.5	1.5	versicolor
53	6.9	3.1	4.9	1.5	versicolor
54	5.5	2.3	4	1.3	versicolor
55	6.5	2.8	4.6	1.5	versicolor
56	5.7	2.8	4.5	1.3	versicolor

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
1	7	3.2	4.7	1.4	versicolor
2	6.4	3.2	4.5	1.5	versicolor
3	6.9	3.1	4.9	1.5	versicolor
4	5.5	2.3	4	1.3	versicolor
5	6.5	2.8	4.6	1.5	versicolor
6	5.7	2.8	4.5	1.3	versicolor

In [239]:

head(df[df$Petal.Width > mean(df$Petal.Width),])

df %>% filter( df$Petal.Width > mean(df$Petal.Width)) %>% head

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
51	7	3.2	4.7	1.4	versicolor
52	6.4	3.2	4.5	1.5	versicolor
53	6.9	3.1	4.9	1.5	versicolor
54	5.5	2.3	4	1.3	versicolor
55	6.5	2.8	4.6	1.5	versicolor
56	5.7	2.8	4.5	1.3	versicolor

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
1	7	3.2	4.7	1.4	versicolor
2	6.4	3.2	4.5	1.5	versicolor
3	6.9	3.1	4.9	1.5	versicolor
4	5.5	2.3	4	1.3	versicolor
5	6.5	2.8	4.6	1.5	versicolor
6	5.7	2.8	4.5	1.3	versicolor

In [238]:

head(df[str_detect(df$Species, "virgin"),])

df %>% filter(str_detect(df$Species,"virgin")) %>% head

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
101	6.3	3.3	6	2.5	virginica
102	5.8	2.7	5.1	1.9	virginica
103	7.1	3	5.9	2.1	virginica
104	6.3	2.9	5.6	1.8	virginica
105	6.5	3	5.8	2.2	virginica
106	7.6	3	6.6	2.1	virginica

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
1	6.3	3.3	6	2.5	virginica
2	5.8	2.7	5.1	1.9	virginica
3	7.1	3	5.9	2.1	virginica
4	6.3	2.9	5.6	1.8	virginica
5	6.5	3	5.8	2.2	virginica
6	7.6	3	6.6	2.1	virginica

Example¶

Display only rows where the species is not versicolor and the petal width is greater than the mean petal width.

In [18]:

head(iris[(iris$Petal.Width > mean(iris$Petal.Width)) & (iris$Species != "versicolor"),])

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
101	6.3	3.3	6	2.5	virginica
102	5.8	2.7	5.1	1.9	virginica
103	7.1	3	5.9	2.1	virginica
104	6.3	2.9	5.6	1.8	virginica
105	6.5	3	5.8	2.2	virginica
106	7.6	3	6.6	2.1	virginica

Transform¶

adding or modifying variables. These modifications can involve either a single variable (e.g., log-transformation), or multiple variables (e.g., computing density from weight and volume).

In [251]:

df %>%
    mutate(sum.length = Sepal.Length + Petal.Length,
            sum.width = Sepal.Width + Petal.Width) %>%
    head

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species	sum.length	sum.width
1	5.1	3.5	1.4	0.2	setosa	6.5	3.7
2	4.9	3	1.4	0.2	setosa	6.3	3.2
3	4.7	3.2	1.3	0.2	setosa	6	3.4
4	4.6	3.1	1.5	0.2	setosa	6.1	3.3
5	5	3.6	1.4	0.2	setosa	6.4	3.8
6	5.4	3.9	1.7	0.4	setosa	7.1	4.3

In [249]:

df %>%
  transmute(sum.length = Sepal.Length + Petal.Length, sum.width =
            Sepal.Width + Petal.Width) %>%
                       head

	sum.length	sum.width
1	6.5	3.7
2	6.3	3.2
3	6.0	3.4
4	6.1	3.3
5	6.4	3.8
6	7.1	4.3

In [247]:

f <- function(x) {2 * x + 1}

In [248]:

df %>% transmute(Fat.Petal.Width = f(Petal.Width)) %>%head

	Fat.Petal.Width
1	1.4
2	1.4
3	1.4
4	1.4
5	1.4
6	1.8

In [252]:

#Multiple columns

df %>% mutate_each(funs(scale(., center = T, scale = T)), -Species) %>% head()

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
1	-0.897673879196766	1.01560199071363	-1.33575163424152	-1.31105214820513	setosa
2	-1.13920048346495	-0.13153881205026	-1.33575163424152	-1.31105214820513	setosa
3	-1.38072708773314	0.327317509055298	-1.39239928624498	-1.31105214820513	setosa
4	-1.50149038986724	0.0978893485025193	-1.27910398223806	-1.31105214820513	setosa
5	-1.01843718133086	1.24503015126641	-1.33575163424152	-1.31105214820513	setosa
6	-0.535383972794483	1.93331463292475	-1.16580867823115	-1.0486667949953	setosa

Example¶

2. The iris data set measurements are in centimeters. Convert them all to inches rounding to 2 decimal places and using the same column names.

Note: 1 inch = 2.54 cm.

In [24]:

cm.to.inch <- function(x) round(x/2.54, 2)

head(iris %>% mutate_each(funs(cm.to.inch), -Species))

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
1	2.01	1.38	0.55	0.08	setosa
2	1.93	1.18	0.55	0.08	setosa
3	1.85	1.26	0.51	0.08	setosa
4	1.81	1.22	0.59	0.08	setosa
5	1.97	1.42	0.55	0.08	setosa
6	2.13	1.54	0.67	0.16	setosa

Aggregate¶

collapsing multiple values into a single value (e.g., by summing or taking means).

In [25]:

df %>% count()

	n
1	150

In [26]:

df %>% count(Species)

	Species	n
1	setosa	50
2	versicolor	50
3	virginica	50

In [27]:

df %>% summarize(mean(Sepal.Length))

	mean(Sepal.Length)
1	5.843333

In [28]:

df %>% summarize_each(c("mean", "sd"), c(-Species))

	Sepal.Length_mean	Sepal.Width_mean	Petal.Length_mean	Petal.Width_mean	Sepal.Length_sd	Sepal.Width_sd	Petal.Length_sd	Petal.Width_sd
1	5.8433333	3.0573333	3.7580000	1.1993333	0.8280661	0.4358663	1.7652982	0.7622377

In [29]:

df %>% group_by(Species) %>% summarize_each(c("mean"))

	Species	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width
1	setosa	5.006	3.428	1.462	0.246
2	versicolor	5.936	2.77	4.26	1.326
3	virginica	6.588	2.974	5.552	2.026

Example¶

3. Find the min and max values for each group in the PlantGrowth data set. Name the columns smallest and largest respectively.

In [30]:

str(PlantGrowth)

'data.frame':   30 obs. of  2 variables:
 $ weight: num  4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
 $ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...

In [31]:

PlantGrowth %>%
group_by(group) %>%
summarise_each(funs(smallest=min, largest=max))

	group	smallest	largest
1	ctrl	4.17	6.11
2	trt1	3.59	6.03
3	trt2	4.92	6.31

Sort¶

changing the order of observations

In [ ]:

head(df %>% arrange(Sepal.Length))

In [ ]:

head(df %>% arrange(desc(Sepal.Length)))

In [ ]:

head(df %>% arrange(Petal.Width, desc(Sepal.Length)))

Sample¶

select random rows from a table

Sample 10 rows without replacement¶

In [ ]:

df %>% sample_n(size=10)

In [ ]:

### Sample 10 rows with replacement

df %>% sample_n(size=10, replace=T)

In [ ]:

### Sample 5% of the data

df %>% sample_frac(size=0.05)

Operations on columns¶

Sometimes we want to “filter” on columns rather than rows. We can use standard indexing or the dplyr functions

select
rename

In [ ]:

head(df[,2:3])

In [ ]:

head(df %>% select(sl = Sepal.Length, sw=Sepal.Width))

In [ ]:

head(df %>% select(petal = starts_with("Petal")))

In [ ]:

head(df %>% select(width = contains("Width")))

Exercises¶

Exercise 1: Recall that we converted the following data set into a tidy frame. Now we are informed that the pid field is actually a composite of expt.id, pid, sample.id and replicate.id separated by -, while the date field needs to be further broken down into day, month and year columns. In other words, we want a final data set which has the variables expt.id, pid, sample.id, replicate.id, age,day, month, year, buffer and value. Discard the initial semi-colon.

In [ ]:

df <- data.frame(desc = c(
";541-0194-1-0, 5wk, 5/Jan/09,  PLA",
";541-0215-1-0, 5wk, 1/Apr/09,  PLA",
";541-0231-6-0, 5wk, 31/Mar/09,  PLA",
";541-0235-4-0, 5wk, 22/Jan/09,  PLA",
";541-0237-8-0, 5wk, 21/Jan/09,  PLA",
";541-0248-0-0, 5wk, 1/Apr/09,  PLA",
";541-0254-6-0, 5wk, 24/Mar/09,  PLA",
";541-0258-4-0, 5wk, 14/May/09,  PLA"), values=rnorm(8))

head(df)

In [ ]:

df1 <- df %>%
separate(desc, c("tmp", "age", "date", "buffer"), sep=", ") %>%
separate(tmp, c("expt.id", "pid", "sample.id", "replicate.id"), sep="-") %>%
separate(date, c("day", "month", "year"), sep="/") %>%
mutate(expt.id=substring(expt.id, 2))
df1

Exercise 2: You are now informed that the values were in the wrong units. You need to apply the formula y = 39 + 3*x to recover the original values. Round to 2 decimal points.

In [ ]:

df2 <- df1 %>% mutate(values = round(39 + 3*values, 2))
df2

Exercise 3: Find the mean, min and max values when grouped by month.

In [ ]:

df2 %>% group_by(month) %>% summarize_each(c("mean", "min", "max"), values)

Exercise 4: You are now informed that the assay machine was not working properly in March. Create a new data set without any March values.

In [ ]:

df2[df2$month != "Mar",]

In [ ]:

Cleaning and preparing data¶

Objectives¶

Main Additions and Modifications to Traditional R.¶

tibbles vs data frames¶

Pipes¶

How to structure your spreadsheet for analysis¶

Do this¶

Not that¶

How to read a spreadsheet or text file into a data frame¶

Reading in a data frame from a text or csv file¶

Using read.csv¶

read.csv is just a wrapper for read.table with some defaults given¶

Alternative ising dplyr¶

Alternative using sample_n from dplyr¶

How to create a data frame from scratch¶

How to interpret a data frame¶

How to select rows and columns of a data frame¶

How to sort a data frame¶

Examples¶

What is “tidy” data?¶

Examples¶

Not tidy data (Table 1)¶

Not tidy data (Table 2)¶

Tidy data (Table 3)¶

What is a variable?¶

Table 4¶

How to tidy common forms of “messy” data with tidyr¶

For working with single data frames¶

For combining multiple data frames¶

(1) Converting columns into rows¶

Create a data frame representing table 1¶

Add row names¶

Reshape from “wide” to “tall”¶

Using piping¶

(2) Multiple variables in a single column¶

When there is no obvious separator¶

(3) Variables stored in both rows and columns¶

We can use the pipe operator to combine into a single operation¶

(4) One type in multiple tables¶

We have the option of creating a new column to indicate the original data frame source¶

(5) Multiple types in multiple tables¶

When the column to join by has different names in each table¶

Exercises¶

Manipulating Data¶

References¶

Links to functions in packages¶

Start with a tidy data frame¶

Filter¶

Filter on rows¶

Example¶

Transform¶

Example¶

Aggregate¶

Example¶

Sort¶

Sample¶

Sample 10 rows without replacement¶

Operations on columns¶

Exercises¶

Using `read.csv`¶

`read.csv` is just a wrapper for `read.table` with some defaults given¶

Alternative ising `dplyr`¶

Alternative using `sample_n` from `dplyr`¶

How to tidy common forms of “messy” data with `tidyr`¶

Using `piping`¶