Cleaning and preparing data
This lesson is on the use of the tidyr
and dplyr
packages that
are part of a collection called the ‘tidyverse’. These packages are for
data cleaning, preparation and visualization. Currently, data frames and
their manipulation in R are in transition from older programming
paradigms, syntax and data classes to some newer, (hopefully) better
implementations.
Here’s what we want to do:
Objectives
- How to structure your spreadsheet for analysis
- How to read a spreadsheet or text file into a data frame (or tibble)
- How to create a data frame (or tibble) from scratch
- How to interpret a data frame (or tibble)
- How to select rows and columns of a data frame (or tibble)
- How to sort a data frame (or tibble)
- What is “tidy” data?
- How to tidy common forms of “messy” data with
tidyr
Because many researchers may not have adopted the newer methods, we will
often present both the ‘traditional’ R syntax and the tidyverse syntax.
We encourage you to write new code using tidyr, dplyr and the rest of
the tidyverse.
Main Additions and Modifications to Traditional R.
tibbles vs data frames
Base R (meaning the core R libraries and functions) employs a data class
called a ‘data frame’. A data frame is a 2-dimensional array whose
columns can be any type (numeric, vector, list, string, etc.). A tibble
is also a data frame, with some refinements in the printing and
subsetting features. Pretty much anywhere a data frame is used, a tibble
can be used. Data frames can be converted to tibbles using
as.tibbles
and tibbles can be converted to data frames using
as.data.frame
. If you need to use an older package that doesn’t work
with tibbles, you can just wrap your tibble with as.data.frame
and
everything should work. We’ll see some examples shortly.
Pipes
Pipes are new to R, but the concept of a pipe is not at all new. In
fact, it has been around for decades in Unix scripting language. The
pipe works like a literal pipe - it is a connector that carries things
from one output to a new input. For example, the water that is output by
the municipal system flows via a pipe as input to your house, that is
then run through your faucet, whose output goes to the drain, and the
water output from the drain is carried via a pipe to become input to the
sewer system.
Software pipes carry the output of one program to be input to another.
Their use makes code more readable to a human. Readability to humans is
a key goal in today’s programming environments, because it makes
modifying code easier, and it reduces the possibility of bugs. The
symbol for the pipe connector in R is %>%
. (In unix the pipe symbol
is |
, and you may use this when you carry out the bioinformatics
pipeline.)
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag(): dplyr, stats
How to structure your spreadsheet for analysis
Many people in the biological sciences use excel spreadsheets to store
data. When we import a spreadsheet into a programming environment like
R, we need to be much more careful about formatting. Spreadsheets are
formatted for people to look at. We want our data in a format that is
readable for a machine.
The following are some common issues in converting spreadsheets to
usable data sets.
Do this
Not that
- Table is not rectangular
- Fancy annotations lost when exported to CSV
- Single column contains too much information
How to read a spreadsheet or text file into a data frame
Reading in a data frame from a text or csv file
This is the most common case when data is exported from a spreadsheet to
a CSV file.
Using read.csv
| X | Name | PClass | Age | Sex | Survived | SexCode |
1 | 1 | Allen, Miss Elisabeth Walton | 1st | 29 | female | 1 | 1 |
2 | 2 | Allison, Miss Helen Loraine | 1st | 2 | female | 0 | 1 |
3 | 3 | Allison, Mr Hudson Joshua Creighton | 1st | 30 | male | 0 | 0 |
4 | 4 | Allison, Mrs Hudson JC (Bessie Waldo Daniels) | 1st | 25 | female | 0 | 1 |
Warning message:
“Missing column names filled in: 'X1' [1]”Parsed with column specification:
cols(
X1 = col_integer(),
Name = col_character(),
PClass = col_character(),
Age = col_double(),
Sex = col_character(),
Survived = col_integer(),
SexCode = col_integer()
)
| X1 | Name | PClass | Age | Sex | Survived | SexCode |
1 | 1 | Allen, Miss Elisabeth Walton | 1st | 29 | female | 1 | 1 |
2 | 2 | Allison, Miss Helen Loraine | 1st | 2 | female | 0 | 1 |
3 | 3 | Allison, Mr Hudson Joshua Creighton | 1st | 30 | male | 0 | 0 |
4 | 4 | Allison, Mrs Hudson JC (Bessie Waldo Daniels) | 1st | 25 | female | 0 | 1 |
5 | 5 | Allison, Master Hudson Trevor | 1st | 0.92 | male | 1 | 0 |
6 | 6 | Anderson, Mr Harry | 1st | 47 | male | 1 | 0 |
7 | 7 | Andrews, Miss Kornelia Theodosia | 1st | 63 | female | 1 | 1 |
8 | 8 | Andrews, Mr Thomas, jr | 1st | 39 | male | 0 | 0 |
9 | 9 | Appleton, Mrs Edward Dale (Charlotte Lamson) | 1st | 58 | female | 1 | 1 |
10 | 10 | Artagaveytia, Mr Ramon | 1st | 71 | male | 0 | 0 |
11 | 11 | Astor, Colonel John Jacob | 1st | 47 | male | 0 | 0 |
12 | 12 | Astor, Mrs John Jacob (Madeleine Talmadge Force) | 1st | 19 | female | 1 | 1 |
13 | 13 | Aubert, Mrs Leontine Pauline | 1st | NA | female | 1 | 1 |
14 | 14 | Barkworth, Mr Algernon H | 1st | NA | male | 1 | 0 |
15 | 15 | Baumann, Mr John D | 1st | NA | male | 0 | 0 |
16 | 16 | Baxter, Mrs James (Helene DeLaudeniere Chaput) | 1st | 50 | female | 1 | 1 |
17 | 17 | Baxter, Mr Quigg Edmond | 1st | 24 | male | 0 | 0 |
18 | 18 | Beattie, Mr Thomson | 1st | 36 | male | 0 | 0 |
19 | 19 | Beckwith, Mr Richard Leonard | 1st | 37 | male | 1 | 0 |
20 | 20 | Beckwith, Mrs Richard Leonard (Sallie Monypeny) | 1st | 47 | female | 1 | 1 |
21 | 21 | Behr, Mr Karl Howell | 1st | 26 | male | 1 | 0 |
22 | 22 | Birnbaum, Mr Jakob | 1st | 25 | male | 0 | 0 |
23 | 23 | Bishop, Mr Dickinson H | 1st | 25 | male | 1 | 0 |
24 | 24 | Bishop, Mrs Dickinson H (Helen Walton) | 1st | 19 | female | 1 | 1 |
25 | 25 | Bjornstrm-Steffansson, Mr Mauritz Hakan | 1st | 28 | male | 1 | 0 |
26 | 26 | Blackwell, Mr Stephen Weart | 1st | 45 | male | 0 | 0 |
27 | 27 | Blank, Mr Henry | 1st | 39 | male | 1 | 0 |
28 | 28 | Bonnell, Miss Caroline | 1st | 30 | female | 1 | 1 |
29 | 29 | Bonnell, Miss Elizabeth | 1st | 58 | female | 1 | 1 |
30 | 30 | Borebank, Mr John James | 1st | NA | male | 0 | 0 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
1284 | 1284 | Vestrom, Miss Hulda Amanda Adolfina | 3rd | 14 | female | 0 | 1 |
1285 | 1285 | Vonk, Mr Jenko | 3rd | 22 | male | 0 | 0 |
1286 | 1286 | Ware, Mr Frederick | 3rd | NA | male | 0 | 0 |
1287 | 1287 | Warren, Mr Charles William | 3rd | NA | male | 0 | 0 |
1288 | 1288 | Wazli, Mr Yousif | 3rd | NA | male | 0 | 0 |
1289 | 1289 | Webber, Mr James | 3rd | NA | male | 0 | 0 |
1290 | 1290 | Wennerstrom, Mr August Edvard | 3rd | NA | male | 1 | 0 |
1291 | 1291 | Wenzel, Mr Linhart | 3rd | NA | male | 0 | 0 |
1292 | 1292 | Widegren, Mr Charles Peter | 3rd | 51 | male | 0 | 0 |
1293 | 1293 | Wiklund, Mr Jacob Alfred | 3rd | 18 | male | 0 | 0 |
1294 | 1294 | Wilkes, Mrs Ellen | 3rd | 45 | female | 1 | 1 |
1295 | 1295 | Willer, Mr Aaron | 3rd | NA | male | 0 | 0 |
1296 | 1296 | Willey, Mr Edward | 3rd | NA | male | 0 | 0 |
1297 | 1297 | Williams, Mr Howard Hugh | 3rd | NA | male | 0 | 0 |
1298 | 1298 | Williams, Mr Leslie | 3rd | 28 | male | 0 | 0 |
1299 | 1299 | Windelov, Mr Einar | 3rd | 21 | male | 0 | 0 |
1300 | 1300 | Wirz, Mr Albert | 3rd | 27 | male | 0 | 0 |
1301 | 1301 | Wiseman, Mr Phillippe | 3rd | NA | male | 0 | 0 |
1302 | 1302 | Wittevrongel, Mr Camiel | 3rd | 36 | male | 0 | 0 |
1303 | 1303 | Yalsevac, Mr Ivan | 3rd | NA | male | 1 | 0 |
1304 | 1304 | Yasbeck, Mr Antoni | 3rd | 27 | male | 0 | 0 |
1305 | 1305 | Yasbeck, Mrs Antoni | 3rd | 15 | female | 1 | 1 |
1306 | 1306 | Youssef, Mr Gerios | 3rd | NA | male | 0 | 0 |
1307 | 1307 | Zabour, Miss Hileni | 3rd | NA | female | 0 | 1 |
1308 | 1308 | Zabour, Miss Tamini | 3rd | NA | female | 0 | 1 |
1309 | 1309 | Zakarian, Mr Artun | 3rd | 27 | male | 0 | 0 |
1310 | 1310 | Zakarian, Mr Maprieder | 3rd | 26 | male | 0 | 0 |
1311 | 1311 | Zenni, Mr Philip | 3rd | 22 | male | 0 | 0 |
1312 | 1312 | Lievens, Mr Rene | 3rd | 24 | male | 0 | 0 |
1313 | 1313 | Zimmerman, Leo | 3rd | 29 | male | 0 | 0 |
Error in print(class(titanic)): object 'titanic' not found
Traceback:
1. print(class(titanic))
read.csv
is just a wrapper for read.table
with some defaults given
| X | Name | PClass | Age | Sex | Survived | SexCode |
1 | 1 | Allen, Miss Elisabeth Walton | 1st | 29 | female | 1 | 1 |
2 | 2 | Allison, Miss Helen Loraine | 1st | 2 | female | 0 | 1 |
3 | 3 | Allison, Mr Hudson Joshua Creighton | 1st | 30 | male | 0 | 0 |
4 | 4 | Allison, Mrs Hudson JC (Bessie Waldo Daniels) | 1st | 25 | female | 0 | 1 |
Let’s get rid of the first column, as it isn’t useful.
| Name | PClass | Age | Sex | Survived | SexCode |
1 | Allen, Miss Elisabeth Walton | 1st | 29 | female | 1 | 1 |
2 | Allison, Miss Helen Loraine | 1st | 2 | female | 0 | 1 |
3 | Allison, Mr Hudson Joshua Creighton | 1st | 30 | male | 0 | 0 |
4 | Allison, Mrs Hudson JC (Bessie Waldo Daniels) | 1st | 25 | female | 0 | 1 |
Alternative ising dplyr
An alternative approach using dplyr
that we encourage getting used
to as it will provides a consistent grammar for data frame
manipulations.
| Name | PClass | Age | Sex | Survived | SexCode |
1 | Allen, Miss Elisabeth Walton | 1st | 29 | female | 1 | 1 |
2 | Allison, Miss Helen Loraine | 1st | 2 | female | 0 | 1 |
3 | Allison, Mr Hudson Joshua Creighton | 1st | 30 | male | 0 | 0 |
4 | Allison, Mrs Hudson JC (Bessie Waldo Daniels) | 1st | 25 | female | 0 | 1 |
2. Write a function called peek
that takes as arguments df
and n
. It then returns n
random rows ffrom the data frame df
(without repeats). If n
is larger than the number of rows, it should
retrun the original data frame. Give n
a default value of 6.
Alternative using sample_n
from dplyr
How to create a data frame from scratch
| subjects | treatments | gene1 | gene2 |
1 | 153 | Drug | 15.5163083478864 | 11.8019177313714 |
2 | 98 | Placebo | 11.3897127792735 | 10.4411774515486 |
3 | 250 | Drug | 15.8043076138157 | 11.5690995283775 |
4 | 545 | Drug | 15.4461393252021 | 11.240600432231 |
5 | 442 | Placebo | 10.7667391418026 | 12.4025164044546 |
6 | 832 | Drug | 16.5360327391973 | 10.7263837943643 |
7 | 625 | Drug | 14.2269966318662 | 13.5853056018832 |
8 | 293 | Placebo | 9.38369841723581 | 12.1824448855273 |
9 | 32 | Drug | 13.671926122039 | 11.5003612782279 |
10 | 469 | Drug | 15.8043076138157 | 12.8664592872279 |
11 | 955 | Placebo | 10.3796050585356 | 13.6823416307802 |
12 | 443 | Placebo | 11.3897127792735 | 11.9559040521596 |
How to interpret a data frame
- '1'
- '2'
- '3'
- '4'
- '5'
- '6'
- '7'
- '8'
- '9'
- '10'
- '11'
- '12'
- 'subjects'
- 'treatments'
- 'gene1'
- 'gene2'
'data.frame': 12 obs. of 4 variables:
$ subjects : int 153 98 250 545 442 832 625 293 32 469 ...
$ treatments: Factor w/ 2 levels "Drug","Placebo": 1 2 1 1 2 1 1 2 1 1 ...
$ gene1 : num 15.5 11.4 15.8 15.4 10.8 ...
$ gene2 : num 11.8 10.4 11.6 11.2 12.4 ...
subjects treatments gene1 gene2
Min. : 32.0 Drug :7 Min. : 9.384 Min. :10.44
1st Qu.:225.8 Placebo:5 1st Qu.:11.234 1st Qu.:11.44
Median :442.5 Median :13.949 Median :11.88
Mean :428.1 Mean :13.360 Mean :12.00
3rd Qu.:565.0 3rd Qu.:15.588 3rd Qu.:12.52
Max. :955.0 Max. :16.536 Max. :13.68
| subjects | treatments | gene1 | gene2 |
1 | 153 | Drug | 15.5163083478864 | 11.8019177313714 |
2 | 98 | Placebo | 11.3897127792735 | 10.4411774515486 |
3 | 250 | Drug | 15.8043076138157 | 11.5690995283775 |
4 | 545 | Drug | 15.4461393252021 | 11.240600432231 |
5 | 442 | Placebo | 10.7667391418026 | 12.4025164044546 |
6 | 832 | Drug | 16.5360327391973 | 10.7263837943643 |
| subjects | treatments | gene1 | gene2 |
7 | 625 | Drug | 14.2269966318662 | 13.5853056018832 |
8 | 293 | Placebo | 9.38369841723581 | 12.1824448855273 |
9 | 32 | Drug | 13.671926122039 | 11.5003612782279 |
10 | 469 | Drug | 15.8043076138157 | 12.8664592872279 |
11 | 955 | Placebo | 10.3796050585356 | 13.6823416307802 |
12 | 443 | Placebo | 11.3897127792735 | 11.9559040521596 |
How to select rows and columns of a data frame
- Drug
- Placebo
- Drug
- Drug
- Placebo
- Drug
- Drug
- Placebo
- Drug
- Drug
- Placebo
- Placebo
- Drug
- Placebo
- Drug
- Drug
- Placebo
- Drug
- Drug
- Placebo
- Drug
- Drug
- Placebo
- Placebo
- Drug
- Placebo
- Drug
- Drug
- Placebo
- Drug
- Drug
- Placebo
- Drug
- Drug
- Placebo
- Placebo
| subjects | treatments | gene1 | gene2 |
2 | 98 | Placebo | 11.3897127792735 | 10.4411774515486 |
3 | 250 | Drug | 15.8043076138157 | 11.5690995283775 |
4 | 545 | Drug | 15.4461393252021 | 11.240600432231 |
| subjects | treatments | gene1 | gene2 |
2 | 98 | Placebo | 11.3897127792735 | 10.4411774515486 |
5 | 442 | Placebo | 10.7667391418026 | 12.4025164044546 |
8 | 293 | Placebo | 9.38369841723581 | 12.1824448855273 |
11 | 955 | Placebo | 10.3796050585356 | 13.6823416307802 |
12 | 443 | Placebo | 11.3897127792735 | 11.9559040521596 |
| subjects | treatments | gene1 | gene2 |
5 | 442 | Placebo | 10.7667391418026 | 12.4025164044546 |
8 | 293 | Placebo | 9.38369841723581 | 12.1824448855273 |
11 | 955 | Placebo | 10.3796050585356 | 13.6823416307802 |
How to sort a data frame
| subjects | treatments | gene1 | gene2 |
9 | 32 | Drug | 13.671926122039 | 11.5003612782279 |
2 | 98 | Placebo | 11.3897127792735 | 10.4411774515486 |
1 | 153 | Drug | 15.5163083478864 | 11.8019177313714 |
3 | 250 | Drug | 15.8043076138157 | 11.5690995283775 |
8 | 293 | Placebo | 9.38369841723581 | 12.1824448855273 |
5 | 442 | Placebo | 10.7667391418026 | 12.4025164044546 |
12 | 443 | Placebo | 11.3897127792735 | 11.9559040521596 |
10 | 469 | Drug | 15.8043076138157 | 12.8664592872279 |
4 | 545 | Drug | 15.4461393252021 | 11.240600432231 |
7 | 625 | Drug | 14.2269966318662 | 13.5853056018832 |
6 | 832 | Drug | 16.5360327391973 | 10.7263837943643 |
11 | 955 | Placebo | 10.3796050585356 | 13.6823416307802 |
| subjects | treatments | gene1 | gene2 |
8 | 293 | Placebo | 9.38369841723581 | 12.1824448855273 |
11 | 955 | Placebo | 10.3796050585356 | 13.6823416307802 |
5 | 442 | Placebo | 10.7667391418026 | 12.4025164044546 |
2 | 98 | Placebo | 11.3897127792735 | 10.4411774515486 |
12 | 443 | Placebo | 11.3897127792735 | 11.9559040521596 |
9 | 32 | Drug | 13.671926122039 | 11.5003612782279 |
7 | 625 | Drug | 14.2269966318662 | 13.5853056018832 |
4 | 545 | Drug | 15.4461393252021 | 11.240600432231 |
1 | 153 | Drug | 15.5163083478864 | 11.8019177313714 |
3 | 250 | Drug | 15.8043076138157 | 11.5690995283775 |
10 | 469 | Drug | 15.8043076138157 | 12.8664592872279 |
6 | 832 | Drug | 16.5360327391973 | 10.7263837943643 |
| subjects | treatments | gene1 | gene2 |
6 | 832 | Drug | 16.5360327391973 | 10.7263837943643 |
3 | 250 | Drug | 15.8043076138157 | 11.5690995283775 |
10 | 469 | Drug | 15.8043076138157 | 12.8664592872279 |
1 | 153 | Drug | 15.5163083478864 | 11.8019177313714 |
4 | 545 | Drug | 15.4461393252021 | 11.240600432231 |
7 | 625 | Drug | 14.2269966318662 | 13.5853056018832 |
9 | 32 | Drug | 13.671926122039 | 11.5003612782279 |
2 | 98 | Placebo | 11.3897127792735 | 10.4411774515486 |
12 | 443 | Placebo | 11.3897127792735 | 11.9559040521596 |
5 | 442 | Placebo | 10.7667391418026 | 12.4025164044546 |
11 | 955 | Placebo | 10.3796050585356 | 13.6823416307802 |
8 | 293 | Placebo | 9.38369841723581 | 12.1824448855273 |
Examples
1. Find all rows where subjects were given placebo with gene1 values
above 10 and gene2 values below 12.
| subjects | treatments | gene1 | gene2 |
2 | 98 | Placebo | 11.3897127792735 | 10.4411774515486 |
12 | 443 | Placebo | 11.3897127792735 | 11.9559040521596 |
2. What is the log ratio of gene2 to gene1 for subjects with IDs in
the range 100 to 500? Save the information for this subset in a new data
frame and sort by the log ratio in descending order.
| subjects | treatments | gene1 | gene2 | lr |
8 | 293 | Placebo | 9.38369841723581 | 12.1824448855273 | 0.261021998797103 |
5 | 442 | Placebo | 10.7667391418026 | 12.4025164044546 | 0.141437714890785 |
12 | 443 | Placebo | 11.3897127792735 | 11.9559040521596 | 0.0485146590903738 |
10 | 469 | Drug | 15.8043076138157 | 12.8664592872279 | -0.205658666561063 |
1 | 153 | Drug | 15.5163083478864 | 11.8019177313714 | -0.273629584403845 |
3 | 250 | Drug | 15.8043076138157 | 11.5690995283775 | -0.311944826627759 |
What is “tidy” data?
First we need some definitions:
- A value is a number or a string.
- A variable contains all values that measure the same underlying
attribute (e.g. height, gene expression). A fixed variable is
part of the experimental design (e.g. subject ID, treatment group)
while a measured variable refers to data collected as part of the
experiment (e.g. viral titer, gene expression level).
- An observation contains all values measured on the same unit
(e.g. person, assay)
To simplify data analysis, we want to clean and prepare tidy data,
where
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
With tidy data, a table is made up of rows and columns containing
values, where each row represents an observation and each column
represents a variable. Each table describes a type of
observation - for example, in a medical experiment we may have three
tables - demographic data about each person, gene expression data for
each person on each visit, and proteomic data for each person on each
visit.
See the tidy data
paper for a philosophy
of data.
There are many other aspects of cleaning up messy data not addressed
here. From the paper above, here are some:
Apart from tidying, there are many other tasks involved in cleaning
data: parsing dates and numbers, identifying missing values,
correcting character encodings (for international data), matching
similar but not identical values (created by typos), verifying
experimental design, and filling in structural missing values, not
to mention model-based data cleaning that identifies suspicious
values.
Examples
It is simplest to grasp the concept of tidy data via concrete examples.
Suppose we perform two different experiments on subjects Ann and Bob,
and measure some physiological attribute in each experiment. We can
record the data as a table in several ways:
Not tidy data (Table 1)
|
Expt 1 |
Expt 2 |
Ann |
175 |
45 |
Bob |
183 |
72 |
Not tidy data (Table 2)
|
Ann |
Bob |
Expt 1 |
175 |
183 |
Expt 2 |
45 |
72 |
Tidy data (Table 3)
Subject |
Expt |
Value |
Ann |
1 |
175 |
Ann |
2 |
45 |
Bob |
1 |
183 |
Bob |
2 |
72 |
Because of the uniform structure, tables in the tidy data format make it
easy to answer questions about different variables in a consistent way,
and also to merge different tables (e.g. subject demographic data with
results of laboratory assays for each subject). It is conventional to
order the columns so that fixed variables (e.g. subject) come before
measured variables.
What is a variable?
Table 4
Subject |
Gene 1 |
Gene 2 |
Ann |
175 |
45 |
Bob |
183 |
72 |
Table 1 and 4 have almost the exact same formal structure - yet Table 1
is messy while Table 4 is tidy. This is because it is natural to
consider the expression values for a particular gene across subjects as
a variable, but the same is not true for a particular experiment.
Pragmatically, whether something should be a variable or a variable
depends on its role in a statistical model. From a modeling perspective,
it is natural to consider outcome ~ gene1 * gene2
but
outcome ~ expt1 * expt2
doesn’t really make any sense.
Exercises
beaver1 and beaver2 are time series data collected on the body
temperatures of two beavers (You can read more
here)
| day | time | temp | activ |
1 | 346.00 | 840.00 | 36.33 | 0.00 |
2 | 346.00 | 850.00 | 36.34 | 0.00 |
3 | 346.00 | 900.00 | 36.35 | 0.00 |
4 | 346.00 | 910.00 | 36.42 | 0.00 |
5 | 346.00 | 920.00 | 36.55 | 0.00 |
6 | 346.00 | 930.00 | 36.69 | 0.00 |
'data.frame': 114 obs. of 4 variables:
$ day : num 346 346 346 346 346 346 346 346 346 346 ...
$ time : num 840 850 900 910 920 930 940 950 1000 1010 ...
$ temp : num 36.3 36.3 36.4 36.4 36.5 ...
$ activ: num 0 0 0 0 0 0 0 0 0 0 ...
| day | time | temp | activ |
1 | 307.00 | 930.00 | 36.58 | 0.00 |
2 | 307.00 | 940.00 | 36.73 | 0.00 |
3 | 307.00 | 950.00 | 36.93 | 0.00 |
4 | 307.00 | 1000.00 | 37.15 | 0.00 |
5 | 307.00 | 1010.00 | 37.23 | 0.00 |
6 | 307.00 | 1020.00 | 37.24 | 0.00 |
'data.frame': 100 obs. of 4 variables:
$ day : num 307 307 307 307 307 307 307 307 307 307 ...
$ time : num 930 940 950 1000 1010 1020 1030 1040 1050 1100 ...
$ temp : num 36.6 36.7 36.9 37.1 37.2 ...
$ activ: num 0 0 0 0 0 0 0 0 0 0 ...
Exercise 1: Combine the beaver1
and beaver2
data frames,
adding a new column beaver
with entries that take value 1 or 2
depending on which data frame the data came from.
| beaver | day | time | temp | activ |
1 | 1 | 346 | 840 | 36.33 | 0 |
2 | 1 | 346 | 850 | 36.34 | 0 |
3 | 1 | 346 | 900 | 36.35 | 0 |
4 | 1 | 346 | 910 | 36.42 | 0 |
Exercise 2: The data frame given below has multiple variables in a
singel column. Convert this to a tidy data frame with the following
variables = pid
, age
, date
and buffer
- the fields are
separated by a comma and a space.
| desc | values |
1 | 541-0194-1-0, 5wk, 5/Jan/09, PLA | 1.68401691250244 |
2 | 541-0215-1-0, 5wk, 1/Apr/09, PLA | -1.82541741635734 |
3 | 541-0231-6-0, 5wk, 31/Mar/09, PLA | 0.322267715332376 |
4 | 541-0235-4-0, 5wk, 22/Jan/09, PLA | 1.20301573073413 |
5 | 541-0237-8-0, 5wk, 21/Jan/09, PLA | -0.527461157819855 |
6 | 541-0248-0-0, 5wk, 1/Apr/09, PLA | -0.298129442756429 |
7 | 541-0254-6-0, 5wk, 24/Mar/09, PLA | -0.0780923846070984 |
8 | 541-0258-4-0, 5wk, 14/May/09, PLA | 0.990178319959082 |
| pid | age | date | buffer | values |
1 | 541-0194-1-0 | 5wk | 5/Jan/09 | PLA | 1.68401691250244 |
2 | 541-0215-1-0 | 5wk | 1/Apr/09 | PLA | -1.82541741635734 |
3 | 541-0231-6-0 | 5wk | 31/Mar/09 | PLA | 0.322267715332376 |
4 | 541-0235-4-0 | 5wk | 22/Jan/09 | PLA | 1.20301573073413 |
5 | 541-0237-8-0 | 5wk | 21/Jan/09 | PLA | -0.527461157819855 |
6 | 541-0248-0-0 | 5wk | 1/Apr/09 | PLA | -0.298129442756429 |
7 | 541-0254-6-0 | 5wk | 24/Mar/09 | PLA | -0.0780923846070984 |
8 | 541-0258-4-0 | 5wk | 14/May/09 | PLA | 0.990178319959082 |
Manipulating Data
We illustrate five common data manipulation techniques on tidy data.
When combined, these techniques provide power tools for slicing and
dicing your data:
filter
: slice
transform
: mutate
, transmute
, mutate_each
,
transmute_each
aggregate
: group_by
, count
, summarize
,
summarize_each
sort
: arrange
sample
: sample_n
, sample_frac
column operations
: select
, rename
String functions used:
References
Links to functions in packages
Start with a tidy data frame
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 4.9 | 3 | 1.4 | 0.2 | setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
5 | 5 | 3.6 | 1.4 | 0.2 | setosa |
6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Filter
subsetting or removing observations based on some condition
Filter on rows
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
11 | 5.4 | 3.7 | 1.5 | 0.2 | setosa |
12 | 4.8 | 3.4 | 1.6 | 0.2 | setosa |
13 | 4.8 | 3 | 1.4 | 0.1 | setosa |
14 | 4.3 | 3 | 1.1 | 0.1 | setosa |
15 | 5.8 | 4 | 1.2 | 0.2 | setosa |
16 | 5.7 | 4.4 | 1.5 | 0.4 | setosa |
17 | 5.4 | 3.9 | 1.3 | 0.4 | setosa |
18 | 5.1 | 3.5 | 1.4 | 0.3 | setosa |
19 | 5.7 | 3.8 | 1.7 | 0.3 | setosa |
20 | 5.1 | 3.8 | 1.5 | 0.3 | setosa |
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
1 | 5.4 | 3.7 | 1.5 | 0.2 | setosa |
2 | 4.8 | 3.4 | 1.6 | 0.2 | setosa |
3 | 4.8 | 3 | 1.4 | 0.1 | setosa |
4 | 4.3 | 3 | 1.1 | 0.1 | setosa |
5 | 5.8 | 4 | 1.2 | 0.2 | setosa |
6 | 5.7 | 4.4 | 1.5 | 0.4 | setosa |
7 | 5.4 | 3.9 | 1.3 | 0.4 | setosa |
8 | 5.1 | 3.5 | 1.4 | 0.3 | setosa |
9 | 5.7 | 3.8 | 1.7 | 0.3 | setosa |
10 | 5.1 | 3.8 | 1.5 | 0.3 | setosa |
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
51 | 7 | 3.2 | 4.7 | 1.4 | versicolor |
52 | 6.4 | 3.2 | 4.5 | 1.5 | versicolor |
53 | 6.9 | 3.1 | 4.9 | 1.5 | versicolor |
54 | 5.5 | 2.3 | 4 | 1.3 | versicolor |
55 | 6.5 | 2.8 | 4.6 | 1.5 | versicolor |
56 | 5.7 | 2.8 | 4.5 | 1.3 | versicolor |
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
1 | 7 | 3.2 | 4.7 | 1.4 | versicolor |
2 | 6.4 | 3.2 | 4.5 | 1.5 | versicolor |
3 | 6.9 | 3.1 | 4.9 | 1.5 | versicolor |
4 | 5.5 | 2.3 | 4 | 1.3 | versicolor |
5 | 6.5 | 2.8 | 4.6 | 1.5 | versicolor |
6 | 5.7 | 2.8 | 4.5 | 1.3 | versicolor |
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
51 | 7 | 3.2 | 4.7 | 1.4 | versicolor |
52 | 6.4 | 3.2 | 4.5 | 1.5 | versicolor |
53 | 6.9 | 3.1 | 4.9 | 1.5 | versicolor |
54 | 5.5 | 2.3 | 4 | 1.3 | versicolor |
55 | 6.5 | 2.8 | 4.6 | 1.5 | versicolor |
56 | 5.7 | 2.8 | 4.5 | 1.3 | versicolor |
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
1 | 7 | 3.2 | 4.7 | 1.4 | versicolor |
2 | 6.4 | 3.2 | 4.5 | 1.5 | versicolor |
3 | 6.9 | 3.1 | 4.9 | 1.5 | versicolor |
4 | 5.5 | 2.3 | 4 | 1.3 | versicolor |
5 | 6.5 | 2.8 | 4.6 | 1.5 | versicolor |
6 | 5.7 | 2.8 | 4.5 | 1.3 | versicolor |
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
101 | 6.3 | 3.3 | 6 | 2.5 | virginica |
102 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
103 | 7.1 | 3 | 5.9 | 2.1 | virginica |
104 | 6.3 | 2.9 | 5.6 | 1.8 | virginica |
105 | 6.5 | 3 | 5.8 | 2.2 | virginica |
106 | 7.6 | 3 | 6.6 | 2.1 | virginica |
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
1 | 6.3 | 3.3 | 6 | 2.5 | virginica |
2 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
3 | 7.1 | 3 | 5.9 | 2.1 | virginica |
4 | 6.3 | 2.9 | 5.6 | 1.8 | virginica |
5 | 6.5 | 3 | 5.8 | 2.2 | virginica |
6 | 7.6 | 3 | 6.6 | 2.1 | virginica |
Example
- Display only rows where the species is not versicolor and the petal
width is greater than the mean petal width.
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
101 | 6.3 | 3.3 | 6 | 2.5 | virginica |
102 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
103 | 7.1 | 3 | 5.9 | 2.1 | virginica |
104 | 6.3 | 2.9 | 5.6 | 1.8 | virginica |
105 | 6.5 | 3 | 5.8 | 2.2 | virginica |
106 | 7.6 | 3 | 6.6 | 2.1 | virginica |
Example
2. The iris
data set measurements are in centimeters. Convert
them all to inches rounding to 2 decimal places and using the same
column names.
Note: 1 inch = 2.54 cm.
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
1 | 2.01 | 1.38 | 0.55 | 0.08 | setosa |
2 | 1.93 | 1.18 | 0.55 | 0.08 | setosa |
3 | 1.85 | 1.26 | 0.51 | 0.08 | setosa |
4 | 1.81 | 1.22 | 0.59 | 0.08 | setosa |
5 | 1.97 | 1.42 | 0.55 | 0.08 | setosa |
6 | 2.13 | 1.54 | 0.67 | 0.16 | setosa |
Aggregate
collapsing multiple values into a single value (e.g., by summing or
taking means).
| Species | n |
1 | setosa | 50 |
2 | versicolor | 50 |
3 | virginica | 50 |
| mean(Sepal.Length) |
1 | 5.843333 |
| Sepal.Length_mean | Sepal.Width_mean | Petal.Length_mean | Petal.Width_mean | Sepal.Length_sd | Sepal.Width_sd | Petal.Length_sd | Petal.Width_sd |
1 | 5.8433333 | 3.0573333 | 3.7580000 | 1.1993333 | 0.8280661 | 0.4358663 | 1.7652982 | 0.7622377 |
| Species | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
1 | setosa | 5.006 | 3.428 | 1.462 | 0.246 |
2 | versicolor | 5.936 | 2.77 | 4.26 | 1.326 |
3 | virginica | 6.588 | 2.974 | 5.552 | 2.026 |
Example
3. Find the min and max values for each group in the PlantGrowth
data set. Name the columns smallest and largest respectively.
'data.frame': 30 obs. of 2 variables:
$ weight: num 4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
$ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...
| group | smallest | largest |
1 | ctrl | 4.17 | 6.11 |
2 | trt1 | 3.59 | 6.03 |
3 | trt2 | 4.92 | 6.31 |
Sort
changing the order of observations
Sample
select random rows from a table
Sample 10 rows without replacement
Operations on columns
Sometimes we want to “filter” on columns rather than rows. We can use
standard indexing or the dplyr
functions
Exercises
Exercise 1: Recall that we converted the following data set into a
tidy frame. Now we are informed that the pid
field is actually a
composite of expt.id
, pid
, sample.id
and replicate.id
separated by -
, while the date
field needs to be further broken
down into day
, month
and year
columns. In other words, we
want a final data set which has the variables expt.id
, pid
,
sample.id
, replicate.id
, age
,day
, month
, year
,
buffer
and value
. Discard the initial semi-colon.
Exercise 2: You are now informed that the values were in the wrong
units. You need to apply the formula y = 39 + 3*x
to recover the
original values. Round to 2 decimal points.
Exercise 3: Find the mean, min and max values when grouped by
month
.
Exercise 4: You are now informed that the assay machine was not
working properly in March. Create a new data set without any March
values.