Warm-up exercises
These problem sets will introduce you to data frames and the basics of
data frame manipulation with the dplyr package.
In [ ]:
suppressPackageStartupMessages(library(tidyverse))
Finding information about a data set
1. We will work with the Puromycin data set in this exercise.
- Use help to find out more about the Puromycin data set
- Use class to find out the class of the data set
- How many rows and columns are there?
- What is the type of each column?
- Show all unique values for the state column
- Show the first 5 rows
- Show the last 5 rows
In [ ]:
#1
In [ ]:
# 2
In [ ]:
#3
In [ ]:
#4
In [ ]:
#5
In [ ]:
#6
In [ ]:
#7
Use of piping (%>%)
2. Using the Puromycin data set,
- Show the first 20 rows using piping
- Show the last 10 rows using piping
- Show rows 11 to 20 using piping
In [ ]:
#1
In [ ]:
#2
In [ ]:
#3
Use of filter
3. Using the Puromycin data set, together with piping and filter,
- Show only rows where the state is untreated
- Show only rows where the conc is 0.11
- Show only rows where the conc is less than 0.1
- Show only rows where the state is treated and the rate is more than 100
- Show only rows where the conc is less than 0.1 or the rate is more than 200
In [ ]:
#1
In [ ]:
#2
In [ ]:
#3
In [ ]:
#4
In [ ]:
#5
Use of select
4. Using the Puromycin data set, together with piping, head and select, select_if and select_all,
- Show only the conc and rate columns
- Show only the columns whose type is numeric
- Show only the columns whose names end with the letter e
- Convert all column names to UPPERCASE
- Rearrange the columns in the order state, conc, rate
- Drop the state column
Limit to only the first 3 rows in each case.
In [ ]:
#1
In [ ]:
#2
In [ ]:
#3
In [ ]:
#4
In [ ]:
#5
In [ ]:
#6
Use of mutate and transmute
5. Using the Puromycin data set, together with mutate or transmute and any other operation necessary,
- Create a new column rate2 that is the square of rate
- Create a new data frame that has only 3 columns, holding the conc, conc^2 and conc^3 values. Name them conc, conc2 and conc3
- Replace each value of all numeric columns with the square root of the value
Show only the first 5 rows in each case.
In [ ]:
#1
In [ ]:
#2
In [ ]:
#3
Use of arrange
6. Using the Puromycin data set, together with arrange and any other operation necessary,
- Sort in ascending rate order
- Sort in descending rate order
- Sort first on conc in ascending order, then rate in ascending order
- Sort in ascending order of the number of characters in the state column
In each case show only the first 5 rows.
In [ ]:
#1
In [ ]:
#2
In [ ]:
#3
In [ ]:
#4
Use of summarize
7. Using the Puromycin data set, together with summarize and any other operation necessary,
- Find the mean value of numeric columns
- Find the mean length of the state column
- Find the min, median and max of the rate column
In [ ]:
#1
In [ ]:
#2
In [ ]:
#3
Use of group_by
8. Using the Puromycin data set, together with group_by and any other operation necessary,
- Find the average rate for each state
- Find the number of rows for each state (treated and untreated) in a new column count
- Find the number of rows with the same conc and state in a new column count, and only show rows where the count is an even number
- Find the mean and standard deviation of rate for each state and conc. Remove any rows with an NA value for the rate standard deviation.
Hint: group_by is often combined with summarize, and n() returns the count.
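The hint can be illustrated with a minimal sketch on a made-up data frame. The names toy, group, score and toy_summary are hypothetical and are not part of the exercises; the same pattern should carry over to Puromycin.

```r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical toy data, used only to illustrate the hint
toy <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  score = c(1, 3, 2, 4, 6)
)

# One row per group: the mean score and the number of rows in that group
toy_summary <- toy %>%
  group_by(group) %>%
  summarize(mean_score = mean(score), count = n())

toy_summary
```

Here n() takes no arguments and counts the rows of the current group, so it only makes sense inside verbs such as summarize, mutate or filter.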
In [ ]:
#1
In [ ]:
#2
In [ ]:
#3
In [ ]:
#4
Use of gather
9. Using the iris data set, together with gather and any other operation necessary,
- Create a new data frame df that has only 3 columns (Species, Measure, Value), where Measure takes on the values Sepal.Length, Sepal.Width, Petal.Length or Petal.Width. Show the first 5 rows.
- Show the mean value and counts for each Species and Measure of df
In [ ]:
#1
In [ ]:
#2
Use of spread
This is the opposite of gather - it takes a key and value column, and makes new columns out of the keys.
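As a rough sketch of that round trip, the example below spreads a small made-up long-format table and then gathers it back. The names long, wide, long_again, student, subject and mark are hypothetical and chosen only for illustration.

```r
suppressPackageStartupMessages({
  library(dplyr)
  library(tidyr)
})

# Hypothetical long-format data: one row per (student, subject) pair
long <- data.frame(
  student = c("x", "x", "y", "y"),
  subject = c("math", "art", "math", "art"),
  mark    = c(90, 70, 60, 80)
)

# spread() turns each distinct value of the key column (subject)
# into its own column, filled from the value column (mark)
wide <- long %>% spread(key = subject, value = mark)

# gather() reverses the operation, collecting the art and math columns
# back into key/value pairs
long_again <- wide %>% gather(key = "subject", value = "mark", art, math)

wide
```

Note that recent tidyr releases mark spread and gather as superseded by pivot_wider and pivot_longer; both older functions still work and are the ones these exercises ask for.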
In [1]:
df <- data.frame(subject = rep(1:4, 3),
                 treatment = rep(c("A", "B", "C"), each = 4),
                 value = rnorm(12))
10. Using the df data set, apply spread to give each different treatment its own column.
In [ ]:
#1
Use of separate
In [ ]:
pid <- rep(1:4, 3)
treat <- rep(c('A','B','C'), each = 4)
bp <- rnorm(12, 120, 25)
expt <- data.frame(name=paste(pid, treat, sep='-'), bp=bp)
rm(pid)
rm(treat)
expt
11. Using the expt data set, together with separate and any other operation necessary,
- Find the average blood pressure for each treatment group (A, B or C).
Note: You are assumed not to have access to the pid and treat values separately.
In [ ]:
#1