Solutions to Warm-up exercises

These problem sets will introduce you to data frames and the basics of data frame manipulation with the dplyr package.

In [6]:
suppressPackageStartupMessages(library(tidyverse))
Warning message:
“package ‘dplyr’ was built under R version 3.4.1”

Finding information about a data set

1. We will work with the Puromycin data set in this exercise.

  1. Use help to find out more about the Puromycin data set
  2. Use class to find out the class of the data set
  3. How many rows and columns are there?
  4. What is the type of each column?
  5. Show all unique values for the state column
  6. Show the first 5 rows
  7. Show the last 5 rows
In [16]:
#1
help(Puromycin)
In [17]:
# 2
class(Puromycin)
'data.frame'
In [18]:
#3
dim(Puromycin)
  1. 23
  2. 3
In [29]:
#4
sapply(Puromycin, class)
conc   'numeric'
rate   'numeric'
state  'factor'
In [19]:
#4
str(Puromycin)
'data.frame':   23 obs. of  3 variables:
 $ conc : num  0.02 0.02 0.06 0.06 0.11 0.11 0.22 0.22 0.56 0.56 ...
 $ rate : num  76 47 97 107 123 139 159 152 191 201 ...
 $ state: Factor w/ 2 levels "treated","untreated": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "reference")= chr "A1.3, p. 269"
In [30]:
#5
unique(Puromycin$state)
  1. treated
  2. untreated
In [31]:
#6
head(Puromycin, n=5)
conc  rate  state
0.02    76  treated
0.02    47  treated
0.06    97  treated
0.06   107  treated
0.11   123  treated
In [33]:
#7
tail(Puromycin, n=5)
    conc  rate  state
19  0.22   131  untreated
20  0.22   124  untreated
21  0.56   144  untreated
22  0.56   158  untreated
23  1.10   160  untreated
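
As a small supplement to the solutions above, the row count, column count and factor levels can also be read off separately. This is only a sketch with base R functions (nrow, ncol, levels) that the exercise itself does not ask for; the variable names are illustrative.

```r
# Puromycin ships with base R (package 'datasets'), so no extra packages are needed.
n_rows <- nrow(Puromycin)                 # number of rows
n_cols <- ncol(Puromycin)                 # number of columns
state_levels <- levels(Puromycin$state)   # factor levels as character strings

n_rows
n_cols
state_levels
```

Unlike unique(), levels() reports every level the factor defines, even one that no longer occurs in the data.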

Use of piping (%>%)

2. Using the Puromycin data set,

  1. Show the first 20 rows using piping
  2. Show the last 10 rows using piping
  3. Show rows 11 to 20 using piping
In [34]:
#1
Puromycin %>% head(n=20)
conc  rate  state
0.02    76  treated
0.02    47  treated
0.06    97  treated
0.06   107  treated
0.11   123  treated
0.11   139  treated
0.22   159  treated
0.22   152  treated
0.56   191  treated
0.56   201  treated
1.10   207  treated
1.10   200  treated
0.02    67  untreated
0.02    51  untreated
0.06    84  untreated
0.06    86  untreated
0.11    98  untreated
0.11   115  untreated
0.22   131  untreated
0.22   124  untreated
In [35]:
#2
Puromycin %>% tail(n=10)
    conc  rate  state
14  0.02    51  untreated
15  0.06    84  untreated
16  0.06    86  untreated
17  0.11    98  untreated
18  0.11   115  untreated
19  0.22   131  untreated
20  0.22   124  untreated
21  0.56   144  untreated
22  0.56   158  untreated
23  1.10   160  untreated
In [36]:
#3
Puromycin %>% head(n=20) %>% tail(n=10)
    conc  rate  state
11  1.10   207  treated
12  1.10   200  treated
13  0.02    67  untreated
14  0.02    51  untreated
15  0.06    84  untreated
16  0.06    86  untreated
17  0.11    98  untreated
18  0.11   115  untreated
19  0.22   131  untreated
20  0.22   124  untreated
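
Rows 11 to 20 can also be picked directly by position. The sketch below uses dplyr's slice(), which the exercise does not mention, and compares it with the head/tail approach from the solution; the variable names are illustrative.

```r
suppressPackageStartupMessages(library(dplyr))

# Rows 11 to 20 selected directly by position
by_slice <- Puromycin %>% slice(11:20)

# The head/tail approach used in the solution, for comparison
by_head_tail <- Puromycin %>% head(n = 20) %>% tail(n = 10)

# Both hold the same values; only the row names may differ
all(by_slice$rate == by_head_tail$rate)
```

slice() states the intended range in one step, whereas the head/tail chain needs the reader to work out which rows survive.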

Use of filter

3. Using the Puromycin data set, together with piping and filter

  1. Show only rows where the state is untreated
  2. Show only rows where the conc is 0.11
  3. Show only rows where the conc is less than 0.1
  4. Show only rows where the state is treated and the rate is more than 100
  5. Show only rows where the conc is less than 0.1 or the rate is more than 200
In [39]:
#1
Puromycin %>% filter(state=='untreated')
conc  rate  state
0.02    67  untreated
0.02    51  untreated
0.06    84  untreated
0.06    86  untreated
0.11    98  untreated
0.11   115  untreated
0.22   131  untreated
0.22   124  untreated
0.56   144  untreated
0.56   158  untreated
1.10   160  untreated
In [41]:
#2
Puromycin %>% filter(conc==0.11)
conc  rate  state
0.11   123  treated
0.11   139  treated
0.11    98  untreated
0.11   115  untreated
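
The filter conc == 0.11 compares floating-point numbers exactly. That works here because the stored values were entered as the same literals, but values that come out of arithmetic may not match exactly. A hedged alternative, not part of the original solution, is dplyr's near(), which compares within a small tolerance:

```r
suppressPackageStartupMessages(library(dplyr))

# Exact comparison can fail for values produced by arithmetic
exact_match <- (0.1 + 0.2 == 0.3)
tolerant_match <- near(0.1 + 0.2, 0.3)

# The same filter as in the solution, written with a tolerance
res_near <- Puromycin %>% filter(near(conc, 0.11))
res_near
```

For this data set both forms of the filter return the same four rows.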
In [43]:
#3
Puromycin %>% filter(conc < 0.1)
conc  rate  state
0.02    76  treated
0.02    47  treated
0.06    97  treated
0.06   107  treated
0.02    67  untreated
0.02    51  untreated
0.06    84  untreated
0.06    86  untreated
In [45]:
#4
Puromycin %>% filter(state=='treated' & rate > 100)
conc  rate  state
0.06   107  treated
0.11   123  treated
0.11   139  treated
0.22   159  treated
0.22   152  treated
0.56   191  treated
0.56   201  treated
1.10   207  treated
1.10   200  treated
In [49]:
#5
Puromycin %>% filter(conc < 0.1 | rate > 200)
conc  rate  state
0.02    76  treated
0.02    47  treated
0.06    97  treated
0.06   107  treated
0.56   201  treated
1.10   207  treated
0.02    67  untreated
0.02    51  untreated
0.06    84  untreated
0.06    86  untreated

Use of select

4. Using the Puromycin data set, together with piping, head and select, select_if and select_all

  1. Show only the conc and rate columns
  2. Show only the columns whose type is numeric
  3. Show only the columns whose names end with the letter e
  4. Convert all column names to UPPERCASE
  5. Rearrange the columns in the order state, conc, rate
  6. Drop the state column

Limit to only the first 3 rows in each case.

In [60]:
#1
Puromycin %>% select(conc, rate) %>% head(n=3)
conc  rate
0.02    76
0.02    47
0.06    97
In [61]:
#2
Puromycin %>% select_if(is.numeric) %>% head(n=3)
conc  rate
0.02    76
0.02    47
0.06    97
In [62]:
#3
Puromycin %>% select(ends_with('e')) %>%  head(n=3)
rate  state
  76  treated
  47  treated
  97  treated
In [63]:
#4
Puromycin %>% select_all(toupper) %>% head(n=3)
CONC  RATE  STATE
0.02    76  treated
0.02    47  treated
0.06    97  treated
In [82]:
#5
Puromycin %>% select(state, conc, rate) %>% head(n=3)
state    conc  rate
treated  0.02    76
treated  0.02    47
treated  0.06    97
In [83]:
#6
Puromycin %>% select(-state) %>% head(n=3)
conc  rate
0.02    76
0.02    47
0.06    97
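
The scoped helpers select_if and select_all used above have newer counterparts. The sketch below assumes dplyr >= 1.0.0 (much newer than the version that produced these outputs) and shows what is, to my understanding, the equivalent spelling with where() and rename_with(); on older installations the solutions above remain the way to go.

```r
suppressPackageStartupMessages(library(dplyr))

# Assumes dplyr >= 1.0.0; older versions need select_if() / select_all()
numeric_cols <- Puromycin %>% select(where(is.numeric)) %>% head(n = 3)
upper_names  <- Puromycin %>% rename_with(toupper) %>% head(n = 3)

names(numeric_cols)
names(upper_names)
```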

Use of mutate and transmute

5. Using the Puromycin data set, together with mutate or transmute and any other operation necessary

  1. Create a new column rate2 that is the square of rate
  2. Create a new data frame that only has the 3 columns with conc, conc^2 and conc^3 values. Name them conc, conc2 and conc3
  3. Replace each value of all numeric columns with the square root of the value

Show only the first 5 rows in each case

In [78]:
#1
Puromycin %>% mutate(rate2=rate^2) %>% head(n=5)
conc  rate  state    rate2
0.02    76  treated   5776
0.02    47  treated   2209
0.06    97  treated   9409
0.06   107  treated  11449
0.11   123  treated  15129
In [80]:
#2
Puromycin %>% transmute(conc, conc2=conc^2, conc3=conc^3) %>% head(n=5)
conc   conc2     conc3
0.02  0.0004  0.000008
0.02  0.0004  0.000008
0.06  0.0036  0.000216
0.06  0.0036  0.000216
0.11  0.0121  0.001331
In [84]:
#3
Puromycin %>% mutate_if(is.numeric, sqrt) %>% head(n=5)
     conc       rate  state
0.1414214   8.717798  treated
0.1414214   6.855655  treated
0.2449490   9.848858  treated
0.2449490  10.344080  treated
0.3316625  11.090537  treated

Use of arrange

6. Using the Puromycin data set, together with arrange and any other operation necessary

  1. Sort in ascending rate order
  2. Sort in descending rate order
  3. Sort first on conc in ascending order, then on rate in ascending order
  4. Sort in ascending order of the number of characters in the state column

In each case show only the first 5 rows.

In [64]:
#1
Puromycin %>% arrange(rate) %>% head(n=5)
conc  rate  state
0.02    47  treated
0.02    51  untreated
0.02    67  untreated
0.02    76  treated
0.06    84  untreated
In [66]:
#2
Puromycin %>% arrange(-rate) %>% head(n=5)
conc  rate  state
1.10   207  treated
0.56   201  treated
1.10   200  treated
0.56   191  treated
1.10   160  untreated
In [70]:
#2
Puromycin %>% arrange(desc(rate)) %>% head(n=5)
conc  rate  state
1.10   207  treated
0.56   201  treated
1.10   200  treated
0.56   191  treated
1.10   160  untreated
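
The two descending-order solutions give the same result for a numeric column. For a factor column such as state, negation with - is not meaningful, while desc() still works. A minimal sketch of that case, which goes beyond what the exercise asks (by_state_desc is an illustrative name):

```r
suppressPackageStartupMessages(library(dplyr))

# desc() also handles factor columns such as state
by_state_desc <- Puromycin %>%
  arrange(desc(state), desc(rate)) %>%
  head(n = 5)
by_state_desc
```

Sorting state in descending order puts the untreated rows first, and within them the highest rate comes first.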
In [67]:
#3
Puromycin %>% arrange(conc, rate) %>% head(n=5)
concratestate
0.02 47 treated
0.02 51 untreated
0.02 67 untreated
0.02 76 treated
0.06 84 untreated
In [95]:
length("foo")
1
In [97]:
#4
Puromycin %>%
  mutate(len=nchar(as.character(state))) %>%
  arrange(len) %>%
  select(-len) %>%
  head(n=5)
conc  rate  state
0.02    76  treated
0.02    47  treated
0.06    97  treated
0.06   107  treated
0.11   123  treated
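
The scratch cell with length("foo") hints at why the solution uses nchar: length() counts the elements of a vector, not the characters in a string. The sketch below spells this out; it also converts state with as.character() first, as the solution does, because nchar() expects a character vector rather than a factor. The variable names are illustrative.

```r
# length() counts elements, nchar() counts characters
n_elements <- length("foo")
n_chars <- nchar("foo")

# state is a factor, so convert it to character before counting characters
state_lengths <- nchar(as.character(Puromycin$state))
table(state_lengths)
```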

Use of summarize

7. Using the Puromycin data set, together with summarize and any other operation necessary

  1. Find the mean value of numeric columns
  2. Find the mean length of the state column
  3. Find the min, median and max of the rate column
In [90]:
#1
Puromycin %>% summarise_if(is.numeric, mean)
     conc      rate
0.3121739  126.8261
In [111]:
#2
Puromycin %>% transmute(len=nchar(as.character(state))) %>% summarize_all(mean)
len
7.956522
In [126]:
#3
Puromycin %>% summarise_at('rate', c(rate.min=min, rate.median=median, rate.max=max))
rate.min  rate.median  rate.max
      47          124       207
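
The summarise_at call above passes a named vector of functions. The same three statistics can be written out explicitly with plain summarise, which some readers may find easier to follow. A minimal sketch (rate_summary is an illustrative name):

```r
suppressPackageStartupMessages(library(dplyr))

# One named expression per summary statistic
rate_summary <- Puromycin %>%
  summarise(rate.min = min(rate),
            rate.median = median(rate),
            rate.max = max(rate))
rate_summary
```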

Use of group_by

8. Using the Puromycin data set, together with group_by and any other operation necessary

  1. Find the average rate for each state
  2. Find the number of treated and untreated states in a new column count
  3. Find the number of rows with the same conc and state in a new column count and only show rows where the count is an even number.
  4. Find the mean and standard deviation of rate for each state and conc. Remove any rows with an NA value for the rate standard deviation.

Hint: group_by is often combined with summarize, and n() returns the count.

In [133]:
#1
Puromycin %>% group_by(state) %>% summarise_all(mean)
state           conc      rate
treated    0.3450000  141.5833
untreated  0.2763636  110.7273
In [145]:
#2
Puromycin %>% group_by(state) %>% summarise(count=n())
state      count
treated       12
untreated     11
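
Counting rows per group is common enough that dplyr offers count() as a shortcut for group_by followed by summarise with n(). The sketch below is an alternative to the solution, not the method the exercise asks for; the rename step only reproduces the column name count.

```r
suppressPackageStartupMessages(library(dplyr))

# count() groups by state and tallies rows; the result column is called n by default
state_counts <- Puromycin %>%
  count(state) %>%
  rename(count = n)
state_counts
```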
In [152]:
#3
Puromycin %>%
  group_by(conc, state) %>%
  summarise(count=n()) %>%
  filter(count %% 2 == 0)
conc  state      count
0.02  treated        2
0.02  untreated      2
0.06  treated        2
0.06  untreated      2
0.11  treated        2
0.11  untreated      2
0.22  treated        2
0.22  untreated      2
0.56  treated        2
0.56  untreated      2
1.10  treated        2
In [147]:
#4
Puromycin %>%
  group_by(state, conc) %>%
  summarise_all(c(rate.mean=mean, rate.sd=sd)) %>%
  filter(!is.na(rate.sd))
state      conc  rate.mean    rate.sd
treated    0.02       61.5  20.506097
treated    0.06      102.0   7.071068
treated    0.11      131.0  11.313708
treated    0.22      155.5   4.949747
treated    0.56      196.0   7.071068
treated    1.10      203.5   4.949747
untreated  0.02       59.0  11.313708
untreated  0.06       85.0   1.414214
untreated  0.11      106.5  12.020815
untreated  0.22      127.5   4.949747
untreated  0.56      151.0   9.899495
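
The NA filter is needed because the sample standard deviation of a single observation is undefined, so sd() returns NA for any group with only one row. The sketch below, which is a supplement rather than part of the solution, shows this and lists the single-row groups; the variable names are illustrative.

```r
suppressPackageStartupMessages(library(dplyr))

# The standard deviation of a single value is undefined, so sd() returns NA
single_sd <- sd(160)

# Groups with exactly one row are the ones dropped by the NA filter
single_row_groups <- Puromycin %>%
  group_by(state, conc) %>%
  summarise(count = n()) %>%
  filter(count == 1)
single_row_groups
```

In this data set only the untreated group at conc 1.10 has a single row, which is why it is missing from the output above.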

Use of gather

9. Using the iris data set, together with gather and any other operation necessary

  1. Create a new data frame df that has only 3 columns (Species, Measure, Value) where Measure takes on the values Sepal.Length, Sepal.Width, Petal.Length or Petal.Width. Show the first 5 rows.
  2. Show the mean value and counts for each Species and Measure of df
In [182]:
#1
df <- iris %>%
  gather(-Species, key=Measure, value=Value)
head(df, n=5)
Species  Measure       Value
setosa   Sepal.Length    5.1
setosa   Sepal.Length    4.9
setosa   Sepal.Length    4.7
setosa   Sepal.Length    4.6
setosa   Sepal.Length    5.0
In [186]:
#2
df %>%
  group_by(Species, Measure) %>%
  summarize(mean=mean(Value), count=n())
Species     Measure        mean  count
setosa      Petal.Length  1.462     50
setosa      Petal.Width   0.246     50
setosa      Sepal.Length  5.006     50
setosa      Sepal.Width   3.428     50
versicolor  Petal.Length  4.260     50
versicolor  Petal.Width   1.326     50
versicolor  Sepal.Length  5.936     50
versicolor  Sepal.Width   2.770     50
virginica   Petal.Length  5.552     50
virginica   Petal.Width   2.026     50
virginica   Sepal.Length  6.588     50
virginica   Sepal.Width   2.974     50
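
In newer tidyr releases gather() is superseded by pivot_longer(). The sketch below assumes tidyr >= 1.0.0, which is newer than the setup that produced these outputs, and reshapes iris the same way; df_long is an illustrative name.

```r
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyr))

# Assumes tidyr >= 1.0.0, where pivot_longer() supersedes gather()
df_long <- iris %>%
  pivot_longer(-Species, names_to = "Measure", values_to = "Value")

# Same columns and row count as the gather() result, although the row order differs
dim(df_long)
```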

Use of spread

This is the opposite of gather - it takes a key and value column, and makes new columns out of the keys.

In [28]:
df <- data.frame(subject=rep(1:4,3),
                 treatment = rep(c("A", "B", "C"), each=4),
                 value = rnorm(12))

10. Using the df data set, apply spread to give each different treatment its own column.

In [29]:
df
subject  treatment       value
      1  A           0.1765537
      2  A           2.0216923
      3  A           0.1613333
      4  A           1.0256221
      1  B           0.8115127
      2  B          -1.0966100
      3  B          -2.0626682
      4  B           0.2882717
      1  C           1.7624858
      2  C           0.5935407
      3  C          -0.5302786
      4  C           0.5366085
In [30]:
df %>% spread(key=treatment, value=value)
subject          A           B           C
      1  0.1765537   0.8115127   1.7624858
      2  2.0216923  -1.0966100   0.5935407
      3  0.1613333  -2.0626682  -0.5302786
      4  1.0256221   0.2882717   0.5366085
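
Since spread is described as the opposite of gather, a round trip should recover the original values. The sketch below checks this on a data frame built like df; the seed and the names long, wide and long_again are assumptions made only so the example is reproducible and does not overwrite df.

```r
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyr))

set.seed(1)  # arbitrary seed, chosen only to make the sketch reproducible
long <- data.frame(subject = rep(1:4, 3),
                   treatment = rep(c("A", "B", "C"), each = 4),
                   value = rnorm(12),
                   stringsAsFactors = FALSE)

# One column per treatment
wide <- long %>% spread(key = treatment, value = value)

# gather() the treatment columns back into key/value form
long_again <- wide %>% gather(-subject, key = treatment, value = value)

all.equal(long$value, long_again$value)
```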

Use of separate

In [217]:
pid <- rep(1:4, 3)
treat <- rep(c('A','B','C'), each = 4)
bp <- rnorm(12, 120, 25)
expt <- data.frame(name=paste(pid, treat, sep='-'), bp=bp)
rm(pid)
rm(treat)
expt
name         bp
1-A   102.33299
2-A    92.67872
3-A   148.08022
4-A   156.18250
1-B   138.82081
2-B   113.52586
3-B   152.55624
4-B   123.05604
1-C   129.22529
2-C   131.28523
3-C    74.97073
4-C   159.81344

11. Using the expt data set, together with separate and any other operation necessary

  1. Find the average blood pressure for each treatment group (A, B or C).

Note: You are assumed not to have access to the pid and treat values separately.

In [211]:
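#1
# bp is drawn with rnorm() without a seed, so the means change from run to run.
# The output below comes from an earlier run (In [211]) than the expt printed above (In [217]).
expt %>%
  separate(name, into=c("pid", "treat")) %>%
  group_by(treat) %>%
  summarise(mean(bp))
treat  mean(bp)
A      123.3292
B      111.1280
C      138.1923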
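
Because expt is random, the result above cannot be checked by hand. The sketch below repeats the same steps on a small data frame with fixed, made-up values (toy and its numbers are hypothetical, not from the exercise). It also names the separator explicitly with sep and uses convert = TRUE so that pid becomes an integer; both are optional arguments of separate.

```r
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyr))

# Hypothetical fixed values, used instead of rnorm() so the result is reproducible
toy <- data.frame(name = c("1-A", "2-A", "1-B", "2-B"),
                  bp = c(100, 110, 120, 140),
                  stringsAsFactors = FALSE)

toy_means <- toy %>%
  separate(name, into = c("pid", "treat"), sep = "-", convert = TRUE) %>%
  group_by(treat) %>%
  summarise(mean_bp = mean(bp))
toy_means
```

Naming the summary column (mean_bp here) avoids the awkward default column name mean(bp).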