Warm-up exercises

These problem sets will introduce you to data frames and the basics of data frame manipulation with the dplyr package.

In [ ]:

suppressPackageStartupMessages(library(tidyverse))

Finding information about a data set

1. We will work with the Puromycin data set in this exercise.

Use help to find out more about the Puromycin data set
Use class to find out the class of the data set
How many rows and columns are there?
What is the type of each column?
Show all unique values for the state column
Show the first 5 rows
Show the last 5 rows
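
As a reference for the kind of calls involved, here is a minimal sketch of the same inspection steps applied to a different built-in data set, ToothGrowth (chosen here only as a stand-in so that the Puromycin exercise is left for you). The variable names are illustrative.

```r
# ToothGrowth ships with R's datasets package; it stands in for Puromycin here.
# help("ToothGrowth")   # uncomment in an interactive session to read the documentation

tg_class   <- class(ToothGrowth)           # "data.frame"
tg_dim     <- dim(ToothGrowth)             # number of rows, then number of columns
tg_types   <- sapply(ToothGrowth, class)   # class of each column
tg_levels  <- unique(ToothGrowth$supp)     # distinct values of one column
tg_first   <- head(ToothGrowth, 5)         # first 5 rows
tg_last    <- tail(ToothGrowth, 5)         # last 5 rows

print(tg_class)
print(tg_dim)
print(tg_types)
print(tg_levels)
print(tg_first)
print(tg_last)
```

nrow and ncol return the two components of dim separately, and str gives a compact overview of the column types in one call.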

In [ ]:

#1

In [ ]:

#2

In [ ]:

#3

In [ ]:

#4

In [ ]:

#5

In [ ]:

#6

In [ ]:

#7

Use of piping (%>%)

2. Using the Puromycin data set,

Show the first 20 rows using piping
Show the last 10 rows using piping
Show rows 11 to 20 using piping
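
The pipe passes the value on its left as the first argument of the function on its right. A minimal sketch on the stand-in data set ToothGrowth (an assumption for illustration, with different row counts from the exercise):

```r
suppressPackageStartupMessages(library(dplyr))

# x %>% f(y) is equivalent to f(x, y)
first_rows  <- ToothGrowth %>% head(8)        # first 8 rows
last_rows   <- ToothGrowth %>% tail(4)        # last 4 rows
middle_rows <- ToothGrowth %>% slice(21:25)   # rows 21 to 25 by position

print(first_rows)
print(last_rows)
print(middle_rows)
```

slice selects rows by position, so it is one way to pick a range from the middle of a data frame.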

In [ ]:

#1

In [ ]:

#2

In [ ]:

#3

Use of filter

3. Using the Puromycin data set, together with piping and filter,

Show only rows where the state is untreated
Show only rows where the conc is 0.11
Show only rows where the conc is less than 0.1
Show only rows where the state is treated and the rate is more than 100
Show only rows where the conc is less than 0.1 or the rate is more than 200
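
filter keeps the rows for which a logical condition is TRUE. The sketch below shows the patterns needed (equality, comparison, "and", "or") on the stand-in data set ToothGrowth, which is an assumption for illustration only.

```r
suppressPackageStartupMessages(library(dplyr))

vc_only  <- ToothGrowth %>% filter(supp == "VC")           # equality on a factor column
dose_one <- ToothGrowth %>% filter(dose == 1)              # equality on a numeric column
both     <- ToothGrowth %>% filter(supp == "OJ", len > 25) # comma means "and"
either   <- ToothGrowth %>% filter(dose < 1 | len > 30)    # | means "or"

print(nrow(vc_only))
print(nrow(dose_one))
print(both)
print(either)
```

Comparing doubles with == works when the stored value is exactly the literal you type; for computed values, dplyr's near() compares with a tolerance instead.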

In [ ]:

#1

In [ ]:

#2

In [ ]:

#3

In [ ]:

#4

In [ ]:

#5

Use of select

4. Using the Puromycin data set, together with piping, head, select, select_if and select_all,

Show only the conc and rate columns
Show only the columns whose type is numeric
Show only the columns whose names end with the letter e
Convert all column names to UPPERCASE
Rearrange the columns in the order state, conc, rate
Drop the state column

Limit to only the first 3 rows in each case.
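
For orientation, here is a sketch of the column-selection helpers on the stand-in data set ToothGrowth (columns len, supp and dose; the choice of data set is an assumption for illustration).

```r
suppressPackageStartupMessages(library(dplyr))

by_name   <- ToothGrowth %>% select(len, dose) %>% head(3)        # pick columns by name
numeric   <- ToothGrowth %>% select_if(is.numeric) %>% head(3)    # pick columns by a predicate
ends_in_e <- ToothGrowth %>% select(ends_with("e")) %>% head(3)   # pick columns by name pattern
upper     <- ToothGrowth %>% select_all(toupper) %>% head(3)      # rename every column
reordered <- ToothGrowth %>% select(supp, dose, len) %>% head(3)  # order follows the arguments
dropped   <- ToothGrowth %>% select(-supp) %>% head(3)            # minus sign drops a column

print(by_name)
print(numeric)
print(ends_in_e)
print(upper)
print(reordered)
print(dropped)
```

In recent dplyr releases select_if and select_all are marked as superseded; select(where(is.numeric)) and rename_with(toupper) are the newer spellings, but the older functions still work.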

In [ ]:

#1

In [ ]:

#2

In [ ]:

#3

In [ ]:

#4

In [ ]:

#5

In [ ]:

#6

Use of mutate and transmute

5. Using the Puromycin data set, together with mutate or transmute and any other operation necessary

Create a new column rate2 that is the square of rate
Create a new data frame that only has the 3 columns with conc, conc^2 and conc^3 values. Name them conc, conc2 and conc3
Replace each value of all numeric columns with the square root of the value

Show only the first 5 rows in each case
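
The difference between the two verbs is that mutate keeps the existing columns and adds to them, while transmute keeps only the columns you list. A minimal sketch on the stand-in data set ToothGrowth (an assumption for illustration):

```r
suppressPackageStartupMessages(library(dplyr))

# mutate: all original columns plus the new one
with_square <- ToothGrowth %>% mutate(len2 = len^2) %>% head(5)

# transmute: only the columns named in the call
powers <- ToothGrowth %>%
  transmute(dose, dose2 = dose^2, dose3 = dose^3) %>%
  head(5)

# mutate_if: apply a function to every column that satisfies a predicate
rooted <- ToothGrowth %>% mutate_if(is.numeric, sqrt) %>% head(5)

print(with_square)
print(powers)
print(rooted)
```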

In [ ]:

#1

In [ ]:

#2

In [ ]:

#3

Use of arrange

6. Using the Puromycin data set, together with arrange and any other operation necessary

Sort in ascending rate order
Sort in descending rate order
Sort first on conc in ascending order, then rate in ascending order
Sort in ascending order of the number of characters in the state column

In each case show only the first 5 rows.
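
A sketch of the sorting patterns on stand-in data (ToothGrowth and iris, both chosen here only for illustration). Note that nchar does not accept a factor, so a factor column has to be converted with as.character first.

```r
suppressPackageStartupMessages(library(dplyr))

by_len      <- ToothGrowth %>% arrange(len)         # ascending is the default
by_len_desc <- ToothGrowth %>% arrange(desc(len))   # desc() reverses the order
by_dose_len <- ToothGrowth %>% arrange(dose, len)   # later columns break ties in earlier ones

# Sort on a computed value: number of characters in a factor column
by_name_length <- iris %>% arrange(nchar(as.character(Species)))

print(head(by_len, 5))
print(head(by_len_desc, 5))
print(head(by_dose_len, 5))
print(head(by_name_length, 5))
```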

In [ ]:

#1

In [ ]:

#2

In [ ]:

#3

In [ ]:

#4

Use of summarize

7. Using the Puromycin data set, together with summarize and any other operation necessary

Find the mean value of numeric columns
Find the mean length of the state column
Find the min, median and max of the rate column
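
summarize collapses a data frame to a single row of summary values. The sketch below uses the stand-in data set ToothGrowth (an assumption for illustration); the output column names are arbitrary.

```r
suppressPackageStartupMessages(library(dplyr))

# One summary function applied to every numeric column
means <- ToothGrowth %>% summarize_if(is.numeric, mean)

# Summary of a value computed from a factor column (convert to character first)
label_length <- ToothGrowth %>%
  summarize(mean_chars = mean(nchar(as.character(supp))))

# Several named summaries of one column
len_stats <- ToothGrowth %>%
  summarize(min_len = min(len), median_len = median(len), max_len = max(len))

print(means)
print(label_length)
print(len_stats)
```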

In [ ]:

#1

In [ ]:

#2

In [ ]:

#3

Use of group_by

8. Using the Puromycin data set, together with group_by and any other operation necessary

Find the average rate for each state
Find the number of treated and untreated rows, in a new column named count
Find the number of rows with the same conc and state in a new column count and only show rows where the count is an even number.
Find the mean and standard deviation of rate for each state and conc. Remove any rows with an NA value for the rate standard deviation.

Hint: group_by is often combined with summarize, and n() returns the count.
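
To illustrate the hint, here is a sketch of grouped summaries on the stand-in data set ToothGrowth (an assumption for illustration). The .groups = "drop" argument assumes dplyr 1.0.0 or later; it simply returns an ungrouped result.

```r
suppressPackageStartupMessages(library(dplyr))

# One summary row per group
mean_by_supp <- ToothGrowth %>%
  group_by(supp) %>%
  summarize(mean_len = mean(len))

# n() counts the rows in the current group
count_by_supp <- ToothGrowth %>%
  group_by(supp) %>%
  summarize(count = n())

# Group on two columns, then filter on the summary
even_counts <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarize(count = n(), .groups = "drop") %>%
  filter(count %% 2 == 0)

# sd() is NA for a group with a single row, so such groups can be filtered out
mean_sd <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarize(mean_len = mean(len), sd_len = sd(len), .groups = "drop") %>%
  filter(!is.na(sd_len))

print(mean_by_supp)
print(count_by_supp)
print(even_counts)
print(mean_sd)
```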

In [ ]:

#1

In [ ]:

#2

In [ ]:

#3

In [ ]:

#4

Use of gather

9. Using the iris data set, together with gather and any other operation necessary

Create a new data frame df that has only 3 columns (Species, Measure, Value) where Measure takes on the values Sepal.Length, Sepal.Width, Petal.Length or Petal.Width. Show the first 5 rows.
Show the mean value and counts for each Species and Measure of df
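
gather stacks several columns into a key column (the old column names) and a value column (the old cell values). A minimal sketch on a small made-up data frame; the names wide, long, measure and value are hypothetical and chosen only for illustration.

```r
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyr))

# Hypothetical wide table: one row per subject, one column per measurement
wide <- data.frame(id = 1:3,
                   height = c(150, 160, 170),
                   weight = c(50, 60, 70))

# Stack height and weight into (measure, value) pairs; id is kept as is
long <- wide %>% gather(key = "measure", value = "value", height, weight)

# The long form is convenient for grouped summaries
long_summary <- long %>%
  group_by(measure) %>%
  summarize(mean_value = mean(value), count = n())

print(long)
print(long_summary)
```

Newer tidyr releases offer pivot_longer as the recommended replacement for gather, though gather remains available.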

In [ ]:

#1

In [ ]:

#2

Use of spread

This is the opposite of gather - it takes a key and value column, and makes new columns out of the keys.
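
As a small illustration, the sketch below spreads a made-up long table; the names long_tbl, wide_tbl, measure and value are hypothetical and not part of the exercise.

```r
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyr))

# Hypothetical long table: one row per (id, measure) pair
long_tbl <- data.frame(id = rep(1:2, 2),
                       measure = rep(c("height", "weight"), each = 2),
                       value = c(150, 160, 50, 60))

# Each distinct value of the key column becomes a new column
wide_tbl <- long_tbl %>% spread(key = measure, value = value)

print(wide_tbl)
```

Newer tidyr releases offer pivot_wider as the recommended replacement for spread, though spread remains available.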

In [1]:

df <- data.frame(subject=rep(1:4,3),
                 treatment = rep(c("A", "B", "C"), each=4),
                 value = rnorm(12))

10. Using the df data set, apply spread to give each treatment its own column.

In [ ]:

#1

Use of separate

In [ ]:

pid <- rep(1:4, 3)
treat <- rep(c('A','B','C'), each = 4)
bp <- rnorm(12, 120, 25)
expt <- data.frame(name=paste(pid, treat, sep='-'), bp=bp)
rm(pid)
rm(treat)
expt

11. Using the expt data set, together with separate and any other operation necessary

Find the average blood pressure for each treatment group (A, B or C).

Note: You are assumed not to have access to the pid and treat values separately.
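
separate splits one character column into several at a separator. A minimal sketch on a made-up data frame; the names visits, code, id, group and score are hypothetical and chosen only for illustration.

```r
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyr))

# Hypothetical data: an identifier and a group label fused into one column
visits <- data.frame(code = c("1-A", "2-A", "1-B", "2-B"),
                     score = c(10, 20, 30, 50))

# Split code at the hyphen, then summarize by the recovered group label
group_means <- visits %>%
  separate(code, into = c("id", "group"), sep = "-") %>%
  group_by(group) %>%
  summarize(mean_score = mean(score))

print(group_means)
```

The new columns are character vectors by default; passing convert = TRUE to separate asks it to guess more specific types, such as integer for id.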

In [ ]:

#1