Day 2 AM: Analyzing data with dplyr

In [24]:
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(stringr))

Chaining data transformations with pipe (%>%)

We will operate on data incrementally, step by step. At each step, we take a data.frame, apply a function to it, and generate a different data.frame. This data.frame itself can be modified by another function, leading to a chain of operations that all take a data.frame as input and return a data.frame as output. A convenient idiom (borrowed from the Unix shell) is to connect adjacent functions in the chain by a pipe which takes the output of a function and feeds it as input to the next function. The pipe operator in R is denoted by %>%.

A simple piping example

Here we use piping to show rows 6-10 of the iris data.frame

In [4]:
head(iris, n=10)
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5.0 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
In [5]:
iris %>% head(n=10) %>% tail(n=5)
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
65.4 3.9 1.7 0.4 setosa
74.6 3.4 1.4 0.3 setosa
85.0 3.4 1.5 0.2 setosa
94.4 2.9 1.4 0.2 setosa
104.9 3.1 1.5 0.1 setosa

Filtering rows with filter

In [12]:
iris %>% filter(Species == "versicolor") %>% head(3)
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
7.0 3.2 4.7 1.4 versicolor
6.4 3.2 4.5 1.5 versicolor
6.9 3.1 4.9 1.5 versicolor
In [14]:
iris %>% filter(Sepal.Length > 6) %>% head(3)
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
7.0 3.2 4.7 1.4 versicolor
6.4 3.2 4.5 1.5 versicolor
6.9 3.1 4.9 1.5 versicolor
In [15]:
iris %>% filter((Sepal.Length > 6) & (Species == "virginica")) %>% head(3)
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
6.3 3.3 6.0 2.5 virginica
7.1 3.0 5.9 2.1 virginica
6.3 2.9 5.6 1.8 virginica
In [17]:
iris %>% filter(Sepal.Length > mean(Sepal.Length)) %>% head(3)
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
7.0 3.2 4.7 1.4 versicolor
6.4 3.2 4.5 1.5 versicolor
6.9 3.1 4.9 1.5 versicolor
In [28]:
iris %>% filter(str_detect(Species, "virgin")) %>% head(3)
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
6.3 3.3 6.0 2.5 virginica
5.8 2.7 5.1 1.9 virginica
7.1 3.0 5.9 2.1 virginica

We can do this via indexing, but using slice can be helpful for chaining of fluent commands.

In [9]:
iris %>% slice(c(2:4, 6:8))
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5.0 3.4 1.5 0.2 setosa

Selecting columns with select

In [18]:
iris %>% select(c(Petal.Length, Petal.Width, Sepal.Length, Sepal.Width)) %>% head(3)
Petal.LengthPetal.WidthSepal.LengthSepal.Width
1.40.25.13.5
1.40.24.93.0
1.30.24.73.2
In [19]:
iris %>% select(c(3,4,1,2)) %>% head(3)
Petal.LengthPetal.WidthSepal.LengthSepal.Width
1.40.25.13.5
1.40.24.93.0
1.30.24.73.2
In [21]:
iris %>% select(-Species) %>% head(3)
Sepal.LengthSepal.WidthPetal.LengthPetal.Width
5.13.51.40.2
4.93.01.40.2
4.73.21.30.2
In [29]:
iris %>% select(contains("Length")) %>% head(3)
Sepal.LengthPetal.Length
5.11.4
4.91.4
4.71.3
In [31]:
iris %>% select(starts_with("S")) %>% head(3)
Sepal.LengthSepal.WidthSpecies
5.1 3.5 setosa
4.9 3.0 setosa
4.7 3.2 setosa
In [32]:
iris %>% rename(Type = Species) %>% head(3)
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthType
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
In [44]:
iris %>% rename(SL=Sepal.Length, SW=Sepal.Width, PW=Petal.Width, PL=Petal.Length) %>% head(3)
SLSWPLPWSpecies
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa

Sorting data with arrange

In [46]:
iris %>% arrange(Sepal.Length) %>% head
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
4.3 3.0 1.1 0.1 setosa
4.4 2.9 1.4 0.2 setosa
4.4 3.0 1.3 0.2 setosa
4.4 3.2 1.3 0.2 setosa
4.5 2.3 1.3 0.3 setosa
4.6 3.1 1.5 0.2 setosa
In [47]:
iris %>% arrange(Sepal.Length, desc(Sepal.Width)) %>% head
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
4.3 3.0 1.1 0.1 setosa
4.4 3.2 1.3 0.2 setosa
4.4 3.0 1.3 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.5 2.3 1.3 0.3 setosa
4.6 3.6 1.0 0.2 setosa

Creating new columns with mutate and transmute

In [48]:
iris %>% mutate(Comb.Length=Sepal.Length + Petal.Length,
                Comb.Width = Sepal.Width + Petal.Width) %>% head
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpeciesComb.LengthComb.Width
5.1 3.5 1.4 0.2 setosa6.5 3.7
4.9 3.0 1.4 0.2 setosa6.3 3.2
4.7 3.2 1.3 0.2 setosa6.0 3.4
4.6 3.1 1.5 0.2 setosa6.1 3.3
5.0 3.6 1.4 0.2 setosa6.4 3.8
5.4 3.9 1.7 0.4 setosa7.1 4.3

Mutate only columns where condition is TRUE

In [50]:
iris %>% mutate_if(is.numeric, log) %>% head
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
1.629241 1.252763 0.3364722 -1.6094379setosa
1.589235 1.098612 0.3364722 -1.6094379setosa
1.547563 1.163151 0.2623643 -1.6094379setosa
1.526056 1.131402 0.4054651 -1.6094379setosa
1.609438 1.280934 0.3364722 -1.6094379setosa
1.686399 1.360977 0.5306283 -0.9162907setosa

Mutate columns that meet string criteria

In [59]:
iris %>% mutate_at(c("Sepal.Length", "Petal.Length"), log) %>% head
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
1.629241 3.5 0.33647220.2 setosa
1.589235 3.0 0.33647220.2 setosa
1.547563 3.2 0.26236430.2 setosa
1.526056 3.1 0.40546510.2 setosa
1.609438 3.6 0.33647220.2 setosa
1.686399 3.9 0.53062830.4 setosa

Only keep mutated columns

In [49]:
iris %>% transmute(Comb.Length=Sepal.Length + Petal.Length,
                   Comb.Width = Sepal.Width + Petal.Width) %>% head
Comb.LengthComb.Width
6.53.7
6.33.2
6.03.4
6.13.3
6.43.8
7.14.3

Multiple transformations

In [62]:
iris %>% transmute_if(is.numeric, funs(log, sqrt)) %>% head
Sepal.Length_logSepal.Width_logPetal.Length_logPetal.Width_logSepal.Length_sqrtSepal.Width_sqrtPetal.Length_sqrtPetal.Width_sqrt
1.629241 1.252763 0.3364722 -1.60943792.258318 1.870829 1.183216 0.4472136
1.589235 1.098612 0.3364722 -1.60943792.213594 1.732051 1.183216 0.4472136
1.547563 1.163151 0.2623643 -1.60943792.167948 1.788854 1.140175 0.4472136
1.526056 1.131402 0.4054651 -1.60943792.144761 1.760682 1.224745 0.4472136
1.609438 1.280934 0.3364722 -1.60943792.236068 1.897367 1.183216 0.4472136
1.686399 1.360977 0.5306283 -0.91629072.323790 1.974842 1.303840 0.6324555

Split-apply-combine with group_by and summarize

In [64]:
iris %>% summarise(mean = mean(Sepal.Length)) %>% head
mean
5.843333
In [66]:
iris %>% summarise_if(is.numeric, sum) %>% head
Sepal.LengthSepal.WidthPetal.LengthPetal.Width
876.5458.6563.7179.9

Split-apply-combine

In [80]:
iris %>%
group_by(Species) %>%
summarise(count = n())
Speciescount
setosa 50
versicolor50
virginica 50
In [82]:
iris %>%
group_by(Species) %>%
summarise(SW.mean = mean(Sepal.Width),
          SW.cv = mean(Sepal.Width)/sd(Sepal.Width))
SpeciesSW.meanSW.cv
setosa 3.428 9.043319
versicolor2.770 8.827326
virginica 2.974 9.221802
In [84]:
iris %>%
group_by(Species) %>%
summarise_at("Sepal.Length", funs(min, max, mean, median))
Speciesminmaxmeanmedian
setosa 4.3 5.8 5.006 5.0
versicolor4.9 7.0 5.936 5.9
virginica 4.9 7.9 6.588 6.5
In [75]:
iris %>%
group_by(Species) %>%
summarise_all(funs(min, max))
SpeciesSepal.Length_minSepal.Width_minPetal.Length_minPetal.Width_minSepal.Length_maxSepal.Width_maxPetal.Length_maxPetal.Width_max
setosa 4.3 2.3 1.0 0.1 5.8 4.4 1.9 0.6
versicolor4.9 2.0 3.0 1.0 7.0 3.4 5.1 1.8
virginica 4.9 2.2 4.5 1.4 7.9 3.8 6.9 2.5
In [ ]: