Solutions to Warm-up exercises
These problem sets will introduce you to data frames and the basics of
data frame manipulation with the dplyr
package.
Warning message:
“package ‘dplyr’ was built under R version 3.4.1”
Use of piping (%>%)
2. Using the Puromycin
data set,
- Show the first 20 rows using piping
- Show the last 10 rows using piping
- Show rows 11 to 20 using piping
conc | rate | state |
0.02 | 76 | treated |
0.02 | 47 | treated |
0.06 | 97 | treated |
0.06 | 107 | treated |
0.11 | 123 | treated |
0.11 | 139 | treated |
0.22 | 159 | treated |
0.22 | 152 | treated |
0.56 | 191 | treated |
0.56 | 201 | treated |
1.10 | 207 | treated |
1.10 | 200 | treated |
0.02 | 67 | untreated |
0.02 | 51 | untreated |
0.06 | 84 | untreated |
0.06 | 86 | untreated |
0.11 | 98 | untreated |
0.11 | 115 | untreated |
0.22 | 131 | untreated |
0.22 | 124 | untreated |
| conc | rate | state |
14 | 0.02 | 51 | untreated |
15 | 0.06 | 84 | untreated |
16 | 0.06 | 86 | untreated |
17 | 0.11 | 98 | untreated |
18 | 0.11 | 115 | untreated |
19 | 0.22 | 131 | untreated |
20 | 0.22 | 124 | untreated |
21 | 0.56 | 144 | untreated |
22 | 0.56 | 158 | untreated |
23 | 1.10 | 160 | untreated |
| conc | rate | state |
11 | 1.10 | 207 | treated |
12 | 1.10 | 200 | treated |
13 | 0.02 | 67 | untreated |
14 | 0.02 | 51 | untreated |
15 | 0.06 | 84 | untreated |
16 | 0.06 | 86 | untreated |
17 | 0.11 | 98 | untreated |
18 | 0.11 | 115 | untreated |
19 | 0.22 | 131 | untreated |
20 | 0.22 | 124 | untreated |
Use of filter
3. Using the Puromycin
data set,togehter with piping and
filter
- Show only rows where the
state
is untreated
- Show only rows where the
conc
is 0.11
- Show only rows where the
conc
is less than 0.1
- Show only rows where the
state
is treated
and the rate is
more than 100
- Show only rows where the
conc
is less than 0.1 or the rate is
more than 200
conc | rate | state |
0.02 | 67 | untreated |
0.02 | 51 | untreated |
0.06 | 84 | untreated |
0.06 | 86 | untreated |
0.11 | 98 | untreated |
0.11 | 115 | untreated |
0.22 | 131 | untreated |
0.22 | 124 | untreated |
0.56 | 144 | untreated |
0.56 | 158 | untreated |
1.10 | 160 | untreated |
conc | rate | state |
0.11 | 123 | treated |
0.11 | 139 | treated |
0.11 | 98 | untreated |
0.11 | 115 | untreated |
conc | rate | state |
0.02 | 76 | treated |
0.02 | 47 | treated |
0.06 | 97 | treated |
0.06 | 107 | treated |
0.02 | 67 | untreated |
0.02 | 51 | untreated |
0.06 | 84 | untreated |
0.06 | 86 | untreated |
conc | rate | state |
0.06 | 107 | treated |
0.11 | 123 | treated |
0.11 | 139 | treated |
0.22 | 159 | treated |
0.22 | 152 | treated |
0.56 | 191 | treated |
0.56 | 201 | treated |
1.10 | 207 | treated |
1.10 | 200 | treated |
conc | rate | state |
0.02 | 76 | treated |
0.02 | 47 | treated |
0.06 | 97 | treated |
0.06 | 107 | treated |
0.56 | 201 | treated |
1.10 | 207 | treated |
0.02 | 67 | untreated |
0.02 | 51 | untreated |
0.06 | 84 | untreated |
0.06 | 86 | untreated |
Use of select
4. Using the Puromycin
data set, together with piping,
head and select, select_if and select_all
- Show only the
conc
and rate
columns
- Show only the columns whose type is numeric
- Show only the columns whose names end with the letter
e
- Convert all column names to UPPERCASE
- Rearrange the columns in the order
state
, conc
, rate
- Drop the
state
column
Limit to only the first 3 rows in each case.
conc | rate |
0.02 | 76 |
0.02 | 47 |
0.06 | 97 |
conc | rate |
0.02 | 76 |
0.02 | 47 |
0.06 | 97 |
rate | state |
76 | treated |
47 | treated |
97 | treated |
CONC | RATE | STATE |
0.02 | 76 | treated |
0.02 | 47 | treated |
0.06 | 97 | treated |
state | conc | rate |
treated | 0.02 | 76 |
treated | 0.02 | 47 |
treated | 0.06 | 97 |
conc | rate |
0.02 | 76 |
0.02 | 47 |
0.06 | 97 |
Use of mutate and transmute
5. Using the Puromycin
data set, together with mutate or
transmutate and any other operation necessary
- Create a new column
rate2
that is the square of rate
- Create a new data frame that only has the 3 columns with
conc
,
conc^2
and conc^3
values. Name them conc
, conc2
and
conc3
- Replace each value of all numeric columns with the square root of the
value
Show only the first 5 rows in each case
conc | rate | state | rate2 |
0.02 | 76 | treated | 5776 |
0.02 | 47 | treated | 2209 |
0.06 | 97 | treated | 9409 |
0.06 | 107 | treated | 11449 |
0.11 | 123 | treated | 15129 |
conc | conc2 | conc3 |
0.02 | 0.0004 | 0.000008 |
0.02 | 0.0004 | 0.000008 |
0.06 | 0.0036 | 0.000216 |
0.06 | 0.0036 | 0.000216 |
0.11 | 0.0121 | 0.001331 |
conc | rate | state |
0.1414214 | 8.717798 | treated |
0.1414214 | 6.855655 | treated |
0.2449490 | 9.848858 | treated |
0.2449490 | 10.344080 | treated |
0.3316625 | 11.090537 | treated |
Use of arrange
6. Using the Puromycin
data set, together with arrange and
any other operation necessary
- Sort in ascending
rate
order
- Sort in descending
rate
order
- Sort first on
conc
i ascending order, then rate
in ascending
order
- Sort in ascending order of the number of characters in the
state
column
In each case show only the first 5 rows.
conc | rate | state |
0.02 | 47 | treated |
0.02 | 51 | untreated |
0.02 | 67 | untreated |
0.02 | 76 | treated |
0.06 | 84 | untreated |
conc | rate | state |
1.10 | 207 | treated |
0.56 | 201 | treated |
1.10 | 200 | treated |
0.56 | 191 | treated |
1.10 | 160 | untreated |
conc | rate | state |
1.10 | 207 | treated |
0.56 | 201 | treated |
1.10 | 200 | treated |
0.56 | 191 | treated |
1.10 | 160 | untreated |
conc | rate | state |
0.02 | 47 | treated |
0.02 | 51 | untreated |
0.02 | 67 | untreated |
0.02 | 76 | treated |
0.06 | 84 | untreated |
conc | rate | state |
0.02 | 76 | treated |
0.02 | 47 | treated |
0.06 | 97 | treated |
0.06 | 107 | treated |
0.11 | 123 | treated |
Use of summarize
7. Using the Puromycin
data set, together with summarize and
any other operation necessary
- Find the mean value of numeric columns
- Find the mean length of the
state
column
- Find the min, median and max of the
rate
column
conc | rate |
0.3121739 | 126.8261 |
rate.min | rate.median | rate.max |
47 | 124 | 207 |
Use of group_by
8. Using the Puromycin
data set, together with group_by and
any other operation necessary
- Find the average rate for each
state
- Find the number of treated and untreated states in a new column
count
- Find the number of rows with the same
conc
and state
in a new
column count
and only show rows where the count is an even
number.
- Find the mean and standard deviation of rate for each
state
and
conc
. Remove any rows with an NA value for the rate standard
deviation.
Hint: group_by
is often combined with summarize
, and n()
returns the count.
state | conc | rate |
treated | 0.3450000 | 141.5833 |
untreated | 0.2763636 | 110.7273 |
state | count |
treated | 12 |
untreated | 11 |
conc | state | count |
0.02 | treated | 2 |
0.02 | untreated | 2 |
0.06 | treated | 2 |
0.06 | untreated | 2 |
0.11 | treated | 2 |
0.11 | untreated | 2 |
0.22 | treated | 2 |
0.22 | untreated | 2 |
0.56 | treated | 2 |
0.56 | untreated | 2 |
1.10 | treated | 2 |
state | conc | rate.mean | rate.sd |
treated | 0.02 | 61.5 | 20.506097 |
treated | 0.06 | 102.0 | 7.071068 |
treated | 0.11 | 131.0 | 11.313708 |
treated | 0.22 | 155.5 | 4.949747 |
treated | 0.56 | 196.0 | 7.071068 |
treated | 1.10 | 203.5 | 4.949747 |
untreated | 0.02 | 59.0 | 11.313708 |
untreated | 0.06 | 85.0 | 1.414214 |
untreated | 0.11 | 106.5 | 12.020815 |
untreated | 0.22 | 127.5 | 4.949747 |
untreated | 0.56 | 151.0 | 9.899495 |
Use of gather
9. Using the iris
data set, together with spread and any
other operation necessary
- Create a new data frame
df
that has only 3 columns (Species
,
Measure
, Value
) where Measure
takes on the values
Sepal.Length
, Sepal.Width
, Petal.Length
or
Petal.Width
. Show the first 5 rows.
- Show the mean value and counts for each Species and Measure of
df
Species | Measure | Value |
setosa | Sepal.Length | 5.1 |
setosa | Sepal.Length | 4.9 |
setosa | Sepal.Length | 4.7 |
setosa | Sepal.Length | 4.6 |
setosa | Sepal.Length | 5.0 |
Species | Measure | mean | count |
setosa | Petal.Length | 1.462 | 50 |
setosa | Petal.Width | 0.246 | 50 |
setosa | Sepal.Length | 5.006 | 50 |
setosa | Sepal.Width | 3.428 | 50 |
versicolor | Petal.Length | 4.260 | 50 |
versicolor | Petal.Width | 1.326 | 50 |
versicolor | Sepal.Length | 5.936 | 50 |
versicolor | Sepal.Width | 2.770 | 50 |
virginica | Petal.Length | 5.552 | 50 |
virginica | Petal.Width | 2.026 | 50 |
virginica | Sepal.Length | 6.588 | 50 |
virginica | Sepal.Width | 2.974 | 50 |
Use of spread
This is the opposite of gather - it takes a key and value column, and
makes new columns out of the keys.
10. Using the df
data set, apply spread
to
give each different treatement
its own column.
subject | treatment | value |
1 | A | 0.1765537 |
2 | A | 2.0216923 |
3 | A | 0.1613333 |
4 | A | 1.0256221 |
1 | B | 0.8115127 |
2 | B | -1.0966100 |
3 | B | -2.0626682 |
4 | B | 0.2882717 |
1 | C | 1.7624858 |
2 | C | 0.5935407 |
3 | C | -0.5302786 |
4 | C | 0.5366085 |
subject | A | B | C |
1 | 0.1765537 | 0.8115127 | 1.7624858 |
2 | 2.0216923 | -1.0966100 | 0.5935407 |
3 | 0.1613333 | -2.0626682 | -0.5302786 |
4 | 1.0256221 | 0.2882717 | 0.5366085 |
Use of separate
name | bp |
1-A | 102.33299 |
2-A | 92.67872 |
3-A | 148.08022 |
4-A | 156.18250 |
1-B | 138.82081 |
2-B | 113.52586 |
3-B | 152.55624 |
4-B | 123.05604 |
1-C | 129.22529 |
2-C | 131.28523 |
3-C | 74.97073 |
4-C | 159.81344 |
11. Using the expt
data set, together with separate and any
other operation necessary
- Find the average blood pressure for each treatment group (A, B or C).
Note: You are assumed not to have access to the pid
and treat
values separate.y.
treat | mean(bp) |
A | 123.3292 |
B | 111.1280 |
C | 138.1923 |