Using dplyr
for data manipulation¶
Description
dplyr provides a flexible grammar of data manipulation. It’s the next iteration of plyr, focused on tools for working with data frames (hence the d in the name).
If you look at `dplyr
docs <https://cran.r-project.org/web/packages/dplyr/dplyr.pdf>`__, there
is a rich collection of data manipulaiton verbs provided. However, most
common tasks can be accomplished with just 6 verbs that we will cover in
this session:
select
filter
mutate a
arrange
summarize
group_by
We will also see how to construct data manipulation “sentnces” by using
these versb togetehr wtih pipes
.
In [1]:
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1 ✔ purrr 0.2.5
✔ tibble 1.4.2 ✔ dplyr 0.7.5
✔ tidyr 0.8.1 ✔ stringr 1.3.1
✔ readr 1.1.1 ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
Data¶
In [2]:
path='../josh/info/2018_pilot_metadata.tsv'
In [3]:
df <- read_tsv(path)
Parsed with column specification:
cols(
Label = col_character(),
RNA_sample_num = col_integer(),
Media = col_character(),
Strain = col_character(),
Replicate = col_integer(),
experiment_person = col_character(),
libprep_person = col_character(),
enrichment_method = col_character(),
RIN = col_double(),
concentration_fold_difference = col_double(),
`i7 index` = col_character(),
`i5 index` = col_character(),
`i5 primer` = col_character(),
`i7 primer` = col_character(),
`library#` = col_integer()
)
In [4]:
names(df)
- 'Label'
- 'RNA_sample_num'
- 'Media'
- 'Strain'
- 'Replicate'
- 'experiment_person'
- 'libprep_person'
- 'enrichment_method'
- 'RIN'
- 'concentration_fold_difference'
- 'i7 index'
- 'i5 index'
- 'i5 primer'
- 'i7 primer'
- 'library#'
Fix nmaes to be consistent¶
Note that some names use spaces between words and others use underscores. Le’ts finx this.
In [5]:
names(df) <- str_replace_all(names(df), c('[:space:]+' = '_'))
In [6]:
names(df)
- 'Label'
- 'RNA_sample_num'
- 'Media'
- 'Strain'
- 'Replicate'
- 'experiment_person'
- 'libprep_person'
- 'enrichment_method'
- 'RIN'
- 'concentration_fold_difference'
- 'i7_index'
- 'i5_index'
- 'i5_primer'
- 'i7_primer'
- 'library#'
In [7]:
dim(df)
- 51
- 15
Note: Drop some columns so table fits in browser
In [8]:
df <- df[, c(2:5, 8:15)]
In [9]:
dim(df)
- 51
- 12
In [10]:
sample_n(df, 3)
RNA_sample_num | Media | Strain | Replicate | enrichment_method | RIN | concentration_fold_difference | i7_index | i5_index | i5_primer | i7_primer | library# |
---|---|---|---|---|---|---|---|---|---|---|---|
27 | YPD | H99 | 9 | RZ | 10.0 | 3.57 | GAATTCGT | TCAGAGCC | i504 | i706 | 46 |
26 | YPD | H99 | 8 | MA | 10.0 | 2.76 | ATTACTCG | GTCAGTAC | i508 | i701 | 8 |
36 | YPD | mar1d | 12 | MA | 9.7 | 3.70 | CGCTCATT | ACGTCCTG | i507 | i703 | 24 |
In [11]:
summary(df)
RNA_sample_num Media Strain Replicate
Min. : 1.00 Length:51 Length:51 Min. : 1.000
1st Qu.: 9.50 Class :character Class :character 1st Qu.: 3.000
Median :16.00 Mode :character Mode :character Median : 4.000
Mean :19.55 Mean : 5.431
3rd Qu.:27.00 3rd Qu.: 8.000
Max. :47.00 Max. :12.000
enrichment_method RIN concentration_fold_difference
Length:51 Min. : 6.200 Min. :1.340
Class :character 1st Qu.: 9.900 1st Qu.:2.010
Mode :character Median :10.000 Median :2.850
Mean : 9.776 Mean :2.892
3rd Qu.:10.000 3rd Qu.:3.640
Max. :10.000 Max. :5.530
i7_index i5_index i5_primer i7_primer
Length:51 Length:51 Length:51 Length:51
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
library#
Min. : 1.0
1st Qu.:13.5
Median :26.0
Mean :26.0
3rd Qu.:38.5
Max. :51.0
Fixing fake numeric columns¶
Note that RNA_sample_num
, Replicate
and library#
are really
discrete factors
rather than numbrs. As #
is not allowed in an R
variable name, we need to use backticks for library#
.
(Alternatively, we cna reame to something like library_num
).
In [12]:
df %>% mutate(
RNA_sample_num=factor(RNA_sample_num),
Replicate=factor(Replicate),
`library#`=factor(`library#`)
) %>%
select_if(is.factor) %>%
names
- 'RNA_sample_num'
- 'Replicate'
- 'library#'
After checking that the transformation worked, we can save the transformed data.frame.
In [13]:
df <- df %>% mutate(
RNA_sample_num=factor(RNA_sample_num),
Replicate=factor(Replicate),
`library#`=factor(`library#`)
)
0. Pipe¶
In [14]:
df %>% head(10) %>% tail(3)
RNA_sample_num | Media | Strain | Replicate | enrichment_method | RIN | concentration_fold_difference | i7_index | i5_index | i5_primer | i7_primer | library# |
---|---|---|---|---|---|---|---|---|---|---|---|
26 | YPD | H99 | 8 | MA | 10 | 2.76 | ATTACTCG | GTCAGTAC | i508 | i701 | 8 |
2 | YPD | H99 | 2 | RZ | 10 | 1.34 | TCCGGAGA | AGGCTATA | i501 | i702 | 9 |
9 | YPD | mar1d | 3 | RZ | 10 | 2.23 | TCCGGAGA | GCCTCTAT | i502 | i702 | 10 |
Note that the above result can also be achieved with slice
In [15]:
df %>% slice(8:10)
RNA_sample_num | Media | Strain | Replicate | enrichment_method | RIN | concentration_fold_difference | i7_index | i5_index | i5_primer | i7_primer | library# |
---|---|---|---|---|---|---|---|---|---|---|---|
26 | YPD | H99 | 8 | MA | 10 | 2.76 | ATTACTCG | GTCAGTAC | i508 | i701 | 8 |
2 | YPD | H99 | 2 | RZ | 10 | 1.34 | TCCGGAGA | AGGCTATA | i501 | i702 | 9 |
9 | YPD | mar1d | 3 | RZ | 10 | 2.23 | TCCGGAGA | GCCTCTAT | i502 | i702 | 10 |
1. Select columns¶
In [16]:
df %>% select(1:3) %>% head(3)
RNA_sample_num | Media | Strain |
---|---|---|
2 | YPD | H99 |
9 | YPD | mar1d |
10 | YPD | mar1d |
In [17]:
df %>% select(c(1,3,5)) %>% head(3)
RNA_sample_num | Strain | enrichment_method |
---|---|---|
2 | H99 | MA |
9 | mar1d | MA |
10 | mar1d | MA |
In [18]:
df %>% select(c('RNA_sample_num', 'Media')) %>% head(3)
RNA_sample_num | Media |
---|---|
2 | YPD |
9 | YPD |
10 | YPD |
Selecting using string operations¶
In [20]:
df %>% select(ends_with('person')) %>% head(3)
In [21]:
df %>% select(starts_with('i')) %>% head(3)
i7_index | i5_index | i5_primer | i7_primer |
---|---|---|---|
ATTACTCG | AGGCTATA | i501 | i701 |
ATTACTCG | GCCTCTAT | i502 | i701 |
ATTACTCG | AGGATAGG | i503 | i701 |
In [22]:
df %>% select(contains('primer')) %>% head(3)
i5_primer | i7_primer |
---|---|
i501 | i701 |
i502 | i701 |
i503 | i701 |
In [23]:
df %>% select(matches('.*_.*')) %>% head(3)
RNA_sample_num | enrichment_method | concentration_fold_difference | i7_index | i5_index | i5_primer | i7_primer |
---|---|---|---|---|---|---|
2 | MA | 1.34 | ATTACTCG | AGGCTATA | i501 | i701 |
9 | MA | 2.23 | ATTACTCG | GCCTCTAT | i502 | i701 |
10 | MA | 4.37 | ATTACTCG | AGGATAGG | i503 | i701 |
Renaming columns with select¶
In [24]:
df %>%
select(c('method' = 'enrichment_method',
'fold.change' = 'concentration_fold_difference')) %>%
head(3)
method | fold.change |
---|---|
MA | 1.34 |
MA | 2.23 |
MA | 4.37 |
Scoped variants¶
In [25]:
df %>% select_if(is.numeric) %>% head(3)
RIN | concentration_fold_difference |
---|---|
10.0 | 1.34 |
10.0 | 2.23 |
9.9 | 4.37 |
In [26]:
df %>%
select_at(c('enrichment_method',
'concentration_fold_difference'), toupper) %>%
head(3)
ENRICHMENT_METHOD | CONCENTRATION_FOLD_DIFFERENCE |
---|---|
MA | 1.34 |
MA | 2.23 |
MA | 4.37 |
In [27]:
df %>% select_all(tolower) %>% head(3)
rna_sample_num | media | strain | replicate | enrichment_method | rin | concentration_fold_difference | i7_index | i5_index | i5_primer | i7_primer | library# |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | YPD | H99 | 2 | MA | 10.0 | 1.34 | ATTACTCG | AGGCTATA | i501 | i701 | 1 |
9 | YPD | mar1d | 3 | MA | 10.0 | 2.23 | ATTACTCG | GCCTCTAT | i502 | i701 | 2 |
10 | YPD | mar1d | 4 | MA | 9.9 | 4.37 | ATTACTCG | AGGATAGG | i503 | i701 | 3 |
2. Filter rows¶
In [28]:
unique(df$Media)
- 'YPD'
- 'TC'
Equality and inequality conditions¶
In [29]:
df %>% filter(Media == 'TC') %>% head(3)
RNA_sample_num | Media | Strain | Replicate | enrichment_method | RIN | concentration_fold_difference | i7_index | i5_index | i5_primer | i7_primer | library# |
---|---|---|---|---|---|---|---|---|---|---|---|
14 | TC | H99 | 2 | MA | 10.0 | 1.57 | ATTACTCG | TCAGAGCC | i504 | i701 | 4 |
15 | TC | H99 | 3 | MA | 9.9 | 2.85 | ATTACTCG | CTTCGCCT | i505 | i701 | 5 |
21 | TC | mar1d | 3 | MA | 10.0 | 1.81 | ATTACTCG | TAAGATTA | i506 | i701 | 6 |
In [30]:
df %>% filter(concentration_fold_difference > 3) %>% head(3)
RNA_sample_num | Media | Strain | Replicate | enrichment_method | RIN | concentration_fold_difference | i7_index | i5_index | i5_primer | i7_primer | library# |
---|---|---|---|---|---|---|---|---|---|---|---|
10 | YPD | mar1d | 4 | MA | 9.9 | 4.37 | ATTACTCG | AGGATAGG | i503 | i701 | 3 |
10 | YPD | mar1d | 4 | RZ | 9.9 | 4.37 | TCCGGAGA | AGGATAGG | i503 | i702 | 11 |
1 | YPD | H99 | 1 | MA | 10.0 | 3.64 | CGCTCATT | AGGCTATA | i501 | i703 | 18 |
Combining conditions¶
In [31]:
df %>%
filter(Media != 'TC',
concentration_fold_difference > 3) %>%
head(3)
RNA_sample_num | Media | Strain | Replicate | enrichment_method | RIN | concentration_fold_difference | i7_index | i5_index | i5_primer | i7_primer | library# |
---|---|---|---|---|---|---|---|---|---|---|---|
10 | YPD | mar1d | 4 | MA | 9.9 | 4.37 | ATTACTCG | AGGATAGG | i503 | i701 | 3 |
10 | YPD | mar1d | 4 | RZ | 9.9 | 4.37 | TCCGGAGA | AGGATAGG | i503 | i702 | 11 |
1 | YPD | H99 | 1 | MA | 10.0 | 3.64 | CGCTCATT | AGGCTATA | i501 | i703 | 18 |
In [32]:
df %>%
filter(Media == 'TC' |
concentration_fold_difference > 3) %>%
head(3)
RNA_sample_num | Media | Strain | Replicate | enrichment_method | RIN | concentration_fold_difference | i7_index | i5_index | i5_primer | i7_primer | library# |
---|---|---|---|---|---|---|---|---|---|---|---|
10 | YPD | mar1d | 4 | MA | 9.9 | 4.37 | ATTACTCG | AGGATAGG | i503 | i701 | 3 |
14 | TC | H99 | 2 | MA | 10.0 | 1.57 | ATTACTCG | TCAGAGCC | i504 | i701 | 4 |
15 | TC | H99 | 3 | MA | 9.9 | 2.85 | ATTACTCG | CTTCGCCT | i505 | i701 | 5 |
Filtering on string conditions¶
In [33]:
df %>% filter(str_length(Media) == 3) %>% head(3)
RNA_sample_num | Media | Strain | Replicate | enrichment_method | RIN | concentration_fold_difference | i7_index | i5_index | i5_primer | i7_primer | library# |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | YPD | H99 | 2 | MA | 10.0 | 1.34 | ATTACTCG | AGGCTATA | i501 | i701 | 1 |
9 | YPD | mar1d | 3 | MA | 10.0 | 2.23 | ATTACTCG | GCCTCTAT | i502 | i701 | 2 |
10 | YPD | mar1d | 4 | MA | 9.9 | 4.37 | ATTACTCG | AGGATAGG | i503 | i701 | 3 |
In [34]:
df %>% filter(str_detect(i7_index, '^A.+')) %>% head(3)
RNA_sample_num | Media | Strain | Replicate | enrichment_method | RIN | concentration_fold_difference | i7_index | i5_index | i5_primer | i7_primer | library# |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | YPD | H99 | 2 | MA | 10.0 | 1.34 | ATTACTCG | AGGCTATA | i501 | i701 | 1 |
9 | YPD | mar1d | 3 | MA | 10.0 | 2.23 | ATTACTCG | GCCTCTAT | i502 | i701 | 2 |
10 | YPD | mar1d | 4 | MA | 9.9 | 4.37 | ATTACTCG | AGGATAGG | i503 | i701 | 3 |
3. Arrange in ascending or descening order¶
In [35]:
df %>% arrange(concentration_fold_difference) %>% head(3)
RNA_sample_num | Media | Strain | Replicate | enrichment_method | RIN | concentration_fold_difference | i7_index | i5_index | i5_primer | i7_primer | library# |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | YPD | H99 | 2 | MA | 10 | 1.34 | ATTACTCG | AGGCTATA | i501 | i701 | 1 |
2 | YPD | H99 | 2 | RZ | 10 | 1.34 | TCCGGAGA | AGGCTATA | i501 | i702 | 9 |
2 | YPD | H99 | 2 | TOT | 10 | 1.34 | CTGAAGCT | AGGCTATA | i501 | i707 | 17 |
In [36]:
df %>% arrange(desc(concentration_fold_difference)) %>% head(3)
RNA_sample_num | Media | Strain | Replicate | enrichment_method | RIN | concentration_fold_difference | i7_index | i5_index | i5_primer | i7_primer | library# |
---|---|---|---|---|---|---|---|---|---|---|---|
24 | TC | mar1d | 6 | MA | 10.0 | 5.53 | CGCTCATT | TAAGATTA | i506 | i703 | 23 |
24 | TC | mar1d | 6 | RZ | 10.0 | 5.53 | GAGATTCC | TAAGATTA | i506 | i704 | 31 |
23 | TC | mar1d | 5 | MA | 9.9 | 4.47 | CGCTCATT | CTTCGCCT | i505 | i703 | 22 |
In [37]:
df %>%
arrange(Replicate,
desc(concentration_fold_difference))%>%
head(3)
RNA_sample_num | Media | Strain | Replicate | enrichment_method | RIN | concentration_fold_difference | i7_index | i5_index | i5_primer | i7_primer | library# |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | YPD | H99 | 1 | MA | 10 | 3.64 | CGCTCATT | AGGCTATA | i501 | i703 | 18 |
1 | YPD | H99 | 1 | RZ | 10 | 3.64 | GAGATTCC | AGGCTATA | i501 | i704 | 26 |
13 | TC | H99 | 1 | MA | 10 | 1.95 | CGCTCATT | TCAGAGCC | i504 | i703 | 21 |
Using top_n
¶
In [38]:
df %>% top_n(3, concentration_fold_difference)
RNA_sample_num | Media | Strain | Replicate | enrichment_method | RIN | concentration_fold_difference | i7_index | i5_index | i5_primer | i7_primer | library# |
---|---|---|---|---|---|---|---|---|---|---|---|
23 | TC | mar1d | 5 | MA | 9.9 | 4.47 | CGCTCATT | CTTCGCCT | i505 | i703 | 22 |
24 | TC | mar1d | 6 | MA | 10.0 | 5.53 | CGCTCATT | TAAGATTA | i506 | i703 | 23 |
23 | TC | mar1d | 5 | RZ | 9.9 | 4.47 | GAGATTCC | CTTCGCCT | i505 | i704 | 30 |
24 | TC | mar1d | 6 | RZ | 10.0 | 5.53 | GAGATTCC | TAAGATTA | i506 | i704 | 31 |
In [39]:
df %>% top_n(3, desc(concentration_fold_difference))
RNA_sample_num | Media | Strain | Replicate | enrichment_method | RIN | concentration_fold_difference | i7_index | i5_index | i5_primer | i7_primer | library# |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | YPD | H99 | 2 | MA | 10 | 1.34 | ATTACTCG | AGGCTATA | i501 | i701 | 1 |
2 | YPD | H99 | 2 | RZ | 10 | 1.34 | TCCGGAGA | AGGCTATA | i501 | i702 | 9 |
2 | YPD | H99 | 2 | TOT | 10 | 1.34 | CTGAAGCT | AGGCTATA | i501 | i707 | 17 |
4. Mutate values¶
In [40]:
df %>%
select(RNA_sample_num, concentration_fold_difference) %>%
mutate(concentration_difference=2^concentration_fold_difference) %>%
head(3)
RNA_sample_num | concentration_fold_difference | concentration_difference |
---|---|---|
2 | 1.34 | 2.531513 |
9 | 2.23 | 4.691340 |
10 | 4.37 | 20.677645 |
In [41]:
df %>%
transmute(concentration_difference=2^concentration_fold_difference) %>%
head(3)
concentration_difference |
---|
2.531513 |
4.691340 |
20.677645 |
5. Summarize¶
In [42]:
df %>% summarize(avg_fold_change=mean(concentration_fold_difference),
min_fold_change=min(concentration_fold_difference),
max_fold_change=max(concentration_fold_difference))
avg_fold_change | min_fold_change | max_fold_change |
---|---|---|
2.891961 | 1.34 | 5.53 |
In [43]:
df %>% summarize_if(is.numeric, mean)
RIN | concentration_fold_difference |
---|---|
9.776471 | 2.891961 |
Group_by¶
summariz
e is most useful when used with group_by
In [44]:
df %>%
group_by(Media, Strain, enrichment_method) %>%
summarize(mean_fold_diff=mean(concentration_fold_difference))
Media | Strain | enrichment_method | mean_fold_diff |
---|---|---|---|
TC | H99 | MA | 2.576667 |
TC | H99 | RZ | 2.576667 |
TC | mar1d | MA | 3.463333 |
TC | mar1d | RZ | 3.463333 |
YPD | H99 | MA | 2.556667 |
YPD | H99 | RZ | 2.556667 |
YPD | H99 | TOT | 1.790000 |
YPD | mar1d | MA | 3.246667 |
YPD | mar1d | RZ | 3.246667 |
In [45]:
df %>%
group_by(Media, Strain, enrichment_method) %>%
summarize_if(is.numeric, mean)
Media | Strain | enrichment_method | RIN | concentration_fold_difference |
---|---|---|---|---|
TC | H99 | MA | 9.850000 | 2.576667 |
TC | H99 | RZ | 9.850000 | 2.576667 |
TC | mar1d | MA | 9.333333 | 3.463333 |
TC | mar1d | RZ | 9.333333 | 3.463333 |
YPD | H99 | MA | 10.000000 | 2.556667 |
YPD | H99 | RZ | 10.000000 | 2.556667 |
YPD | H99 | TOT | 10.000000 | 1.790000 |
YPD | mar1d | MA | 9.866667 | 3.246667 |
YPD | mar1d | RZ | 9.866667 | 3.246667 |
In [46]:
df %>%
group_by(Media, Strain) %>%
summarize_if(is.numeric, funs(mean, sd))
Media | Strain | RIN_mean | concentration_fold_difference_mean | RIN_sd | concentration_fold_difference_sd |
---|---|---|---|---|---|
TC | H99 | 9.850000 | 2.576667 | 0.2611165 | 0.7157873 |
TC | mar1d | 9.333333 | 3.463333 | 1.4643232 | 1.3630803 |
YPD | H99 | 10.000000 | 2.403333 | 0.0000000 | 0.8593824 |
YPD | mar1d | 9.866667 | 3.246667 | 0.1669694 | 0.7410230 |
What experimental conditions produced the greatest mean fold change?
In [47]:
df %>%
group_by(Media, Strain, enrichment_method) %>%
summarize(mean_fold_diff=mean(concentration_fold_difference)) %>%
arrange(desc(mean_fold_diff)) %>%
head(1)
Media | Strain | enrichment_method | mean_fold_diff |
---|---|---|---|
TC | mar1d | MA | 3.463333 |