Solutions to Warm-up exercises¶

These problem sets will introduce you to data frames and the basics of data frame manipulation with the dplyr package.

In [6]:

suppressPackageStartupMessages(library(tidyverse))

Warning message:
“package ‘dplyr’ was built under R version 3.4.1”

Finding information about a data set¶

1. We will work with the Puromycin data set in this exercise.

Use help to find out more about of the Puromycin data set
Use class to find out the class of the data set
How many rows and columns are there?
What is the type of each column?
Show all unique values for the state column
Show the first 5 rows
Show the last 5 rows

In [16]:

#1
help(Puromycin)

In [17]:

# 2
class(Puromycin)

'data.frame'

In [18]:

#3
dim(Puromycin)

23
3

In [29]:

#4
sapply(Puromycin, class)

conc: 'numeric'
rate: 'numeric'
state: 'factor'

In [19]:

#4
str(Puromycin)

'data.frame':   23 obs. of  3 variables:
 $ conc : num  0.02 0.02 0.06 0.06 0.11 0.11 0.22 0.22 0.56 0.56 ...
 $ rate : num  76 47 97 107 123 139 159 152 191 201 ...
 $ state: Factor w/ 2 levels "treated","untreated": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "reference")= chr "A1.3, p. 269"

In [30]:

#5
unique(Puromycin$state)

treated
untreated

In [31]:

#6
head(Puromycin, n=5)

conc	rate	state
0.02	76	treated
0.02	47	treated
0.06	97	treated
0.06	107	treated
0.11	123	treated

In [33]:

#7
tail(Puromycin, n=5)

	conc	rate	state
19	0.22	131	untreated
20	0.22	124	untreated
21	0.56	144	untreated
22	0.56	158	untreated
23	1.10	160	untreated

Use of piping (%>%)¶

2. Using the Puromycin data set,

Show the first 20 rows using piping
Show the last 10 rows using piping
Show rows 11 to 20 using piping

In [34]:

#1
Puromycin %>% head(n=20)

conc	rate	state
0.02	76	treated
0.02	47	treated
0.06	97	treated
0.06	107	treated
0.11	123	treated
0.11	139	treated
0.22	159	treated
0.22	152	treated
0.56	191	treated
0.56	201	treated
1.10	207	treated
1.10	200	treated
0.02	67	untreated
0.02	51	untreated
0.06	84	untreated
0.06	86	untreated
0.11	98	untreated
0.11	115	untreated
0.22	131	untreated
0.22	124	untreated

In [35]:

#2
Puromycin %>% tail(n=10)

	conc	rate	state
14	0.02	51	untreated
15	0.06	84	untreated
16	0.06	86	untreated
17	0.11	98	untreated
18	0.11	115	untreated
19	0.22	131	untreated
20	0.22	124	untreated
21	0.56	144	untreated
22	0.56	158	untreated
23	1.10	160	untreated

In [36]:

#3
Puromycin %>% head(n=20) %>% tail(n=10)

	conc	rate	state
11	1.10	207	treated
12	1.10	200	treated
13	0.02	67	untreated
14	0.02	51	untreated
15	0.06	84	untreated
16	0.06	86	untreated
17	0.11	98	untreated
18	0.11	115	untreated
19	0.22	131	untreated
20	0.22	124	untreated

Use of filter¶

3. Using the Puromycin data set,togehter with piping and filter

Show only rows where the state is untreated
Show only rows where the conc is 0.11
Show only rows where the conc is less than 0.1
Show only rows where the state is treated and the rate is more than 100
Show only rows where the conc is less than 0.1 or the rate is more than 200

In [39]:

#1
Puromycin %>% filter(state=='untreated')

conc	rate	state
0.02	67	untreated
0.02	51	untreated
0.06	84	untreated
0.06	86	untreated
0.11	98	untreated
0.11	115	untreated
0.22	131	untreated
0.22	124	untreated
0.56	144	untreated
0.56	158	untreated
1.10	160	untreated

In [41]:

#2
Puromycin %>% filter(conc==0.11)

conc	rate	state
0.11	123	treated
0.11	139	treated
0.11	98	untreated
0.11	115	untreated

In [43]:

#3
Puromycin %>% filter(conc < 0.1)

conc	rate	state
0.02	76	treated
0.02	47	treated
0.06	97	treated
0.06	107	treated
0.02	67	untreated
0.02	51	untreated
0.06	84	untreated
0.06	86	untreated

In [45]:

#4
Puromycin %>% filter(state=='treated' & rate > 100)

conc	rate	state
0.06	107	treated
0.11	123	treated
0.11	139	treated
0.22	159	treated
0.22	152	treated
0.56	191	treated
0.56	201	treated
1.10	207	treated
1.10	200	treated

In [49]:

#5
Puromycin %>% filter(conc < 0.1 | rate > 200)

conc	rate	state
0.02	76	treated
0.02	47	treated
0.06	97	treated
0.06	107	treated
0.56	201	treated
1.10	207	treated
0.02	67	untreated
0.02	51	untreated
0.06	84	untreated
0.06	86	untreated

Use of select¶

4. Using the Puromycin data set, together with piping, head and select, select_if and select_all

Show only the conc and rate columns
Show only the columns whose type is numeric
Show only the columns whose names end with the letter e
Convert all column names to UPPERCASE
Rearrange the columns in the order state, conc, rate
Drop the state column

Limit to only the first 3 rows in each case.

In [60]:

#1
Puromycin %>% select(conc, rate) %>% head(n=3)

conc	rate
0.02	76
0.02	47
0.06	97

In [61]:

#2
Puromycin %>% select_if(is.numeric) %>% head(n=3)

conc	rate
0.02	76
0.02	47
0.06	97

In [62]:

#3
Puromycin %>% select(ends_with('e')) %>%  head(n=3)

rate	state
76	treated
47	treated
97	treated

In [63]:

#4
Puromycin %>% select_all(toupper) %>% head(n=3)

CONC	RATE	STATE
0.02	76	treated
0.02	47	treated
0.06	97	treated

In [82]:

#5
Puromycin %>% select(state, conc, rate) %>% head(n=3)

state	conc	rate
treated	0.02	76
treated	0.02	47
treated	0.06	97

In [83]:

#6
Puromycin %>% select(-state) %>% head(n=3)

conc	rate
0.02	76
0.02	47
0.06	97

Use of mutate and transmute¶

5. Using the Puromycin data set, together with mutate or transmutate and any other operation necessary

Create a new column rate2 that is the square of rate
Create a new data frame that only has the 3 columns with conc, conc^2 and conc^3 values. Name them conc, conc2 and conc3
Replace each value of all numeric columns with the square root of the value

Show only the first 5 rows in each case

In [78]:

#1
Puromycin %>% mutate(rate2=rate^2) %>% head(n=5)

conc	rate	state	rate2
0.02	76	treated	5776
0.02	47	treated	2209
0.06	97	treated	9409
0.06	107	treated	11449
0.11	123	treated	15129

In [80]:

#2
Puromycin %>% transmute(conc, conc2=conc^2, conc3=conc^3) %>% head(n=5)

conc	conc2	conc3
0.02	0.0004	0.000008
0.02	0.0004	0.000008
0.06	0.0036	0.000216
0.06	0.0036	0.000216
0.11	0.0121	0.001331

In [84]:

#3
Puromycin %>% mutate_if(is.numeric, sqrt) %>% head(n=5)

conc	rate	state
0.1414214	8.717798	treated
0.1414214	6.855655	treated
0.2449490	9.848858	treated
0.2449490	10.344080	treated
0.3316625	11.090537	treated

Use of arrange¶

6. Using the Puromycin data set, together with arrange and any other operation necessary

Sort in ascending rate order
Sort in descending rate order
Sort first on conc i ascending order, then rate in ascending order
Sort in ascending order of the number of characters in the state column

In each case show only the first 5 rows.

In [64]:

#1
Puromycin %>% arrange(rate) %>% head(n=5)

conc	rate	state
0.02	47	treated
0.02	51	untreated
0.02	67	untreated
0.02	76	treated
0.06	84	untreated

In [66]:

#2
Puromycin %>% arrange(-rate) %>% head(n=5)

conc	rate	state
1.10	207	treated
0.56	201	treated
1.10	200	treated
0.56	191	treated
1.10	160	untreated

In [70]:

#2
Puromycin %>% arrange(desc(rate)) %>% head(n=5)

conc	rate	state
1.10	207	treated
0.56	201	treated
1.10	200	treated
0.56	191	treated
1.10	160	untreated

In [67]:

#3
Puromycin %>% arrange(conc, rate) %>% head(n=5)

conc	rate	state
0.02	47	treated
0.02	51	untreated
0.02	67	untreated
0.02	76	treated
0.06	84	untreated

In [95]:

length("foo")

1

In [97]:

#4
Puromycin %>%
mutate(len=nchar(as.character(state))) %>%
arrange(len) %>%
select(-len) %>%
head(n=5)

conc	rate	state
0.02	76	treated
0.02	47	treated
0.06	97	treated
0.06	107	treated
0.11	123	treated

Use of summarize¶

7. Using the Puromycin data set, together with summarize and any other operation necessary

Find the mean value of numeric columns
Find the mean length of the state column
Find the min, median and max of the rate column

In [90]:

#1
Puromycin %>% summarise_if(is.numeric, mean)

conc	rate
0.3121739	126.8261

In [111]:

#2
Puromycin %>% transmute(len=nchar(as.character(state))) %>% summarize_all(mean)

len
7.956522

In [126]:

#3
Puromycin %>% summarise_at('rate', c(rate.min=min, rate.median=median, rate.max=max))

rate.min	rate.median	rate.max
47	124	207

Use of group_by¶

8. Using the Puromycin data set, together with group_by and any other operation necessary

Find the average rate for each state
Find the number of treated and untreated states in a new column count
Find the number of rows with the same conc and state in a new column count and only show rows where the count is an even number.
Find the mean and standard deviation of rate for each state and conc. Remove any rows with an NA value for the rate standard deviation.

Hint: group_by is often combined with summarize, and n() returns the count.

In [133]:

#1
Puromycin %>% group_by(state) %>% summarise_all(mean)

state	conc	rate
treated	0.3450000	141.5833
untreated	0.2763636	110.7273

In [145]:

#2
Puromycin %>% group_by(state) %>% summarise(count=n())

state	count
treated	12
untreated	11

In [152]:

#3
Puromycin %>%
group_by(conc, state) %>%
summarise(count=n()) %>%
filter(count %% 2 == 0)

conc	state	count
0.02	treated	2
0.02	untreated	2
0.06	treated	2
0.06	untreated	2
0.11	treated	2
0.11	untreated	2
0.22	treated	2
0.22	untreated	2
0.56	treated	2
0.56	untreated	2
1.10	treated	2

In [147]:

#4
Puromycin %>% group_by(state, conc) %>%
summarise_all(c(rate.mean=mean, rate.sd=sd)) %>%
filter(!is.na(rate.sd))

state	conc	rate.mean	rate.sd
treated	0.02	61.5	20.506097
treated	0.06	102.0	7.071068
treated	0.11	131.0	11.313708
treated	0.22	155.5	4.949747
treated	0.56	196.0	7.071068
treated	1.10	203.5	4.949747
untreated	0.02	59.0	11.313708
untreated	0.06	85.0	1.414214
untreated	0.11	106.5	12.020815
untreated	0.22	127.5	4.949747
untreated	0.56	151.0	9.899495

Use of gather¶

9. Using the iris data set, together with spread and any other operation necessary

Create a new data frame df that has only 3 columns (Species, Measure, Value) where Measure takes on the values Sepal.Length, Sepal.Width, Petal.Length or Petal.Width. Show the first 5 rows.
Show the mean value and counts for each Species and Measure of df

In [182]:

#1
df <- iris %>%
gather(-Species, key=Measure, value=Value)
head(df, n=5)

Species	Measure	Value
setosa	Sepal.Length	5.1
setosa	Sepal.Length	4.9
setosa	Sepal.Length	4.7
setosa	Sepal.Length	4.6
setosa	Sepal.Length	5.0

In [186]:

#2
df %>%
group_by(Species, Measure) %>%
summarize(mean=mean(Value), count=n())

Species	Measure	mean	count
setosa	Petal.Length	1.462	50
setosa	Petal.Width	0.246	50
setosa	Sepal.Length	5.006	50
setosa	Sepal.Width	3.428	50
versicolor	Petal.Length	4.260	50
versicolor	Petal.Width	1.326	50
versicolor	Sepal.Length	5.936	50
versicolor	Sepal.Width	2.770	50
virginica	Petal.Length	5.552	50
virginica	Petal.Width	2.026	50
virginica	Sepal.Length	6.588	50
virginica	Sepal.Width	2.974	50

Use of spread¶

This is the opposite of gather - it takes a key and value column, and makes new columns out of the keys.

In [28]:

df <- data.frame(subject=rep(1:4,3),
                 treatment = rep(c("A", "B", "C"), each=4),
                 value = rnorm(12))

10. Using the df data set, apply spread to

give each different treatement its own column.

In [29]:

df

subject	treatment	value
1	A	0.1765537
2	A	2.0216923
3	A	0.1613333
4	A	1.0256221
1	B	0.8115127
2	B	-1.0966100
3	B	-2.0626682
4	B	0.2882717
1	C	1.7624858
2	C	0.5935407
3	C	-0.5302786
4	C	0.5366085

In [30]:

df %>% spread(key=treatment, value=value)

subject	A	B	C
1	0.1765537	0.8115127	1.7624858
2	2.0216923	-1.0966100	0.5935407
3	0.1613333	-2.0626682	-0.5302786
4	1.0256221	0.2882717	0.5366085

Use of separate¶

In [217]:

pid <- rep(1:4, 3)
treat <- rep(c('A','B','C'), each = 4)
bp <- rnorm(12, 120, 25)
expt <- data.frame(name=paste(pid, treat, sep='-'), bp=bp)
rm(pid)
rm(treat)
expt

name	bp
1-A	102.33299
2-A	92.67872
3-A	148.08022
4-A	156.18250
1-B	138.82081
2-B	113.52586
3-B	152.55624
4-B	123.05604
1-C	129.22529
2-C	131.28523
3-C	74.97073
4-C	159.81344

11. Using the expt data set, together with separate and any other operation necessary

Find the average blood pressure for each treatment group (A, B or C).

Note: You are assumed not to have access to the pid and treat values separate.y.

In [211]:

#1
expt %>%
separate(name, into=c("pid", "treat"))%>%
group_by(treat) %>%
summarise(mean(bp))

treat	mean(bp)
A	123.3292
B	111.1280
C	138.1923

In [ ]: