Using dplyr for data manipulation

Description

dplyr provides a flexible grammar of data manipulation. It’s the next iteration of plyr, focused on tools for working with data frames (hence the d in the name).

If you look at `dplyr docs <https://cran.r-project.org/web/packages/dplyr/dplyr.pdf>`__, there is a rich collection of data manipulaiton verbs provided. However, most common tasks can be accomplished with just 6 verbs that we will cover in this session:

select
filter
mutate a
arrange
summarize
group_by

We will also see how to construct data manipulation “sentnces” by using these versb togetehr wtih pipes.

In [1]:
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.5
✔ tidyr   0.8.1     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Data

In [2]:
path='../josh/info/2018_pilot_metadata.tsv'
In [3]:
df <- read_tsv(path)
Parsed with column specification:
cols(
  Label = col_character(),
  RNA_sample_num = col_integer(),
  Media = col_character(),
  Strain = col_character(),
  Replicate = col_integer(),
  experiment_person = col_character(),
  libprep_person = col_character(),
  enrichment_method = col_character(),
  RIN = col_double(),
  concentration_fold_difference = col_double(),
  `i7 index` = col_character(),
  `i5 index` = col_character(),
  `i5 primer` = col_character(),
  `i7 primer` = col_character(),
  `library#` = col_integer()
)
In [4]:
names(df)
  1. 'Label'
  2. 'RNA_sample_num'
  3. 'Media'
  4. 'Strain'
  5. 'Replicate'
  6. 'experiment_person'
  7. 'libprep_person'
  8. 'enrichment_method'
  9. 'RIN'
  10. 'concentration_fold_difference'
  11. 'i7 index'
  12. 'i5 index'
  13. 'i5 primer'
  14. 'i7 primer'
  15. 'library#'

Fix nmaes to be consistent

Note that some names use spaces between words and others use underscores. Le’ts finx this.

In [5]:
names(df) <- str_replace_all(names(df), c('[:space:]+' = '_'))
In [6]:
names(df)
  1. 'Label'
  2. 'RNA_sample_num'
  3. 'Media'
  4. 'Strain'
  5. 'Replicate'
  6. 'experiment_person'
  7. 'libprep_person'
  8. 'enrichment_method'
  9. 'RIN'
  10. 'concentration_fold_difference'
  11. 'i7_index'
  12. 'i5_index'
  13. 'i5_primer'
  14. 'i7_primer'
  15. 'library#'
In [7]:
dim(df)
  1. 51
  2. 15

Note: Drop some columns so table fits in browser

In [8]:
df <- df[, c(2:5, 8:15)]
In [9]:
dim(df)
  1. 51
  2. 12
In [10]:
sample_n(df, 3)
RNA_sample_numMediaStrainReplicateenrichment_methodRINconcentration_fold_differencei7_indexi5_indexi5_primeri7_primerlibrary#
27 YPD H99 9 RZ 10.0 3.57 GAATTCGTTCAGAGCCi504 i706 46
26 YPD H99 8 MA 10.0 2.76 ATTACTCGGTCAGTACi508 i701 8
36 YPD mar1d 12 MA 9.7 3.70 CGCTCATTACGTCCTGi507 i703 24
In [11]:
summary(df)
RNA_sample_num     Media              Strain            Replicate
 Min.   : 1.00   Length:51          Length:51          Min.   : 1.000
 1st Qu.: 9.50   Class :character   Class :character   1st Qu.: 3.000
 Median :16.00   Mode  :character   Mode  :character   Median : 4.000
 Mean   :19.55                                         Mean   : 5.431
 3rd Qu.:27.00                                         3rd Qu.: 8.000
 Max.   :47.00                                         Max.   :12.000
 enrichment_method       RIN         concentration_fold_difference
 Length:51          Min.   : 6.200   Min.   :1.340
 Class :character   1st Qu.: 9.900   1st Qu.:2.010
 Mode  :character   Median :10.000   Median :2.850
                    Mean   : 9.776   Mean   :2.892
                    3rd Qu.:10.000   3rd Qu.:3.640
                    Max.   :10.000   Max.   :5.530
   i7_index           i5_index          i5_primer          i7_primer
 Length:51          Length:51          Length:51          Length:51
 Class :character   Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character   Mode  :character



    library#
 Min.   : 1.0
 1st Qu.:13.5
 Median :26.0
 Mean   :26.0
 3rd Qu.:38.5
 Max.   :51.0

Fixing fake numeric columns

Note that RNA_sample_num, Replicate and library# are really discrete factors rather than numbrs. As # is not allowed in an R variable name, we need to use backticks for library#. (Alternatively, we cna reame to something like library_num).

In [12]:
df %>% mutate(
    RNA_sample_num=factor(RNA_sample_num),
    Replicate=factor(Replicate),
    `library#`=factor(`library#`)
) %>%
select_if(is.factor) %>%
names
  1. 'RNA_sample_num'
  2. 'Replicate'
  3. 'library#'

After checking that the transformation worked, we can save the transformed data.frame.

In [13]:
df <- df %>% mutate(
    RNA_sample_num=factor(RNA_sample_num),
    Replicate=factor(Replicate),
    `library#`=factor(`library#`)
)

0. Pipe

In [14]:
df %>% head(10) %>% tail(3)
RNA_sample_numMediaStrainReplicateenrichment_methodRINconcentration_fold_differencei7_indexi5_indexi5_primeri7_primerlibrary#
26 YPD H99 8 MA 10 2.76 ATTACTCGGTCAGTACi508 i701 8
2 YPD H99 2 RZ 10 1.34 TCCGGAGAAGGCTATAi501 i702 9
9 YPD mar1d 3 RZ 10 2.23 TCCGGAGAGCCTCTATi502 i702 10

Note that the above result can also be achieved with slice

In [15]:
df %>% slice(8:10)
RNA_sample_numMediaStrainReplicateenrichment_methodRINconcentration_fold_differencei7_indexi5_indexi5_primeri7_primerlibrary#
26 YPD H99 8 MA 10 2.76 ATTACTCGGTCAGTACi508 i701 8
2 YPD H99 2 RZ 10 1.34 TCCGGAGAAGGCTATAi501 i702 9
9 YPD mar1d 3 RZ 10 2.23 TCCGGAGAGCCTCTATi502 i702 10

1. Select columns

In [16]:
df %>% select(1:3) %>% head(3)
RNA_sample_numMediaStrain
2 YPD H99
9 YPD mar1d
10 YPD mar1d
In [17]:
df %>% select(c(1,3,5)) %>% head(3)
RNA_sample_numStrainenrichment_method
2 H99 MA
9 mar1dMA
10 mar1dMA
In [18]:
df %>% select(c('RNA_sample_num', 'Media')) %>% head(3)
RNA_sample_numMedia
2 YPD
9 YPD
10 YPD

Dropping columns

In [19]:
df %>% select(-((1:10))) %>% head(3)
i7_primerlibrary#
i7011
i7012
i7013

Selecting using string operations

In [20]:
df %>% select(ends_with('person')) %>% head(3)
In [21]:
df %>% select(starts_with('i')) %>% head(3)
i7_indexi5_indexi5_primeri7_primer
ATTACTCGAGGCTATAi501 i701
ATTACTCGGCCTCTATi502 i701
ATTACTCGAGGATAGGi503 i701
In [22]:
df %>% select(contains('primer')) %>% head(3)
i5_primeri7_primer
i501i701
i502i701
i503i701
In [23]:
df %>% select(matches('.*_.*')) %>% head(3)
RNA_sample_numenrichment_methodconcentration_fold_differencei7_indexi5_indexi5_primeri7_primer
2 MA 1.34 ATTACTCGAGGCTATAi501 i701
9 MA 2.23 ATTACTCGGCCTCTATi502 i701
10 MA 4.37 ATTACTCGAGGATAGGi503 i701

Renaming columns with select

In [24]:
df %>%
select(c('method' = 'enrichment_method',
         'fold.change' =  'concentration_fold_difference')) %>%
head(3)
methodfold.change
MA 1.34
MA 2.23
MA 4.37

Scoped variants

In [25]:
df %>% select_if(is.numeric) %>% head(3)
RINconcentration_fold_difference
10.01.34
10.02.23
9.94.37
In [26]:
df %>%
select_at(c('enrichment_method',
            'concentration_fold_difference'), toupper) %>%
head(3)
ENRICHMENT_METHODCONCENTRATION_FOLD_DIFFERENCE
MA 1.34
MA 2.23
MA 4.37
In [27]:
df %>% select_all(tolower) %>% head(3)
rna_sample_nummediastrainreplicateenrichment_methodrinconcentration_fold_differencei7_indexi5_indexi5_primeri7_primerlibrary#
2 YPD H99 2 MA 10.0 1.34 ATTACTCGAGGCTATAi501 i701 1
9 YPD mar1d 3 MA 10.0 2.23 ATTACTCGGCCTCTATi502 i701 2
10 YPD mar1d 4 MA 9.9 4.37 ATTACTCGAGGATAGGi503 i701 3

2. Filter rows

In [28]:
unique(df$Media)
  1. 'YPD'
  2. 'TC'

Equality and inequality conditions

In [29]:
df %>% filter(Media == 'TC') %>% head(3)
RNA_sample_numMediaStrainReplicateenrichment_methodRINconcentration_fold_differencei7_indexi5_indexi5_primeri7_primerlibrary#
14 TC H99 2 MA 10.0 1.57 ATTACTCGTCAGAGCCi504 i701 4
15 TC H99 3 MA 9.9 2.85 ATTACTCGCTTCGCCTi505 i701 5
21 TC mar1d 3 MA 10.0 1.81 ATTACTCGTAAGATTAi506 i701 6
In [30]:
df %>% filter(concentration_fold_difference > 3) %>% head(3)
RNA_sample_numMediaStrainReplicateenrichment_methodRINconcentration_fold_differencei7_indexi5_indexi5_primeri7_primerlibrary#
10 YPD mar1d 4 MA 9.9 4.37 ATTACTCGAGGATAGGi503 i701 3
10 YPD mar1d 4 RZ 9.9 4.37 TCCGGAGAAGGATAGGi503 i702 11
1 YPD H99 1 MA 10.0 3.64 CGCTCATTAGGCTATAi501 i703 18

Combining conditions

In [31]:
df %>%
filter(Media != 'TC',
       concentration_fold_difference > 3) %>%
head(3)
RNA_sample_numMediaStrainReplicateenrichment_methodRINconcentration_fold_differencei7_indexi5_indexi5_primeri7_primerlibrary#
10 YPD mar1d 4 MA 9.9 4.37 ATTACTCGAGGATAGGi503 i701 3
10 YPD mar1d 4 RZ 9.9 4.37 TCCGGAGAAGGATAGGi503 i702 11
1 YPD H99 1 MA 10.0 3.64 CGCTCATTAGGCTATAi501 i703 18
In [32]:
df %>%
filter(Media == 'TC' |
       concentration_fold_difference > 3) %>%
head(3)
RNA_sample_numMediaStrainReplicateenrichment_methodRINconcentration_fold_differencei7_indexi5_indexi5_primeri7_primerlibrary#
10 YPD mar1d 4 MA 9.9 4.37 ATTACTCGAGGATAGGi503 i701 3
14 TC H99 2 MA 10.0 1.57 ATTACTCGTCAGAGCCi504 i701 4
15 TC H99 3 MA 9.9 2.85 ATTACTCGCTTCGCCTi505 i701 5

Filtering on string conditions

In [33]:
df %>% filter(str_length(Media) == 3) %>% head(3)
RNA_sample_numMediaStrainReplicateenrichment_methodRINconcentration_fold_differencei7_indexi5_indexi5_primeri7_primerlibrary#
2 YPD H99 2 MA 10.0 1.34 ATTACTCGAGGCTATAi501 i701 1
9 YPD mar1d 3 MA 10.0 2.23 ATTACTCGGCCTCTATi502 i701 2
10 YPD mar1d 4 MA 9.9 4.37 ATTACTCGAGGATAGGi503 i701 3
In [34]:
df %>% filter(str_detect(i7_index, '^A.+')) %>% head(3)
RNA_sample_numMediaStrainReplicateenrichment_methodRINconcentration_fold_differencei7_indexi5_indexi5_primeri7_primerlibrary#
2 YPD H99 2 MA 10.0 1.34 ATTACTCGAGGCTATAi501 i701 1
9 YPD mar1d 3 MA 10.0 2.23 ATTACTCGGCCTCTATi502 i701 2
10 YPD mar1d 4 MA 9.9 4.37 ATTACTCGAGGATAGGi503 i701 3

3. Arrange in ascending or descening order

In [35]:
df %>% arrange(concentration_fold_difference) %>% head(3)
RNA_sample_numMediaStrainReplicateenrichment_methodRINconcentration_fold_differencei7_indexi5_indexi5_primeri7_primerlibrary#
2 YPD H99 2 MA 10 1.34 ATTACTCGAGGCTATAi501 i701 1
2 YPD H99 2 RZ 10 1.34 TCCGGAGAAGGCTATAi501 i702 9
2 YPD H99 2 TOT 10 1.34 CTGAAGCTAGGCTATAi501 i707 17
In [36]:
df %>% arrange(desc(concentration_fold_difference)) %>% head(3)
RNA_sample_numMediaStrainReplicateenrichment_methodRINconcentration_fold_differencei7_indexi5_indexi5_primeri7_primerlibrary#
24 TC mar1d 6 MA 10.0 5.53 CGCTCATTTAAGATTAi506 i703 23
24 TC mar1d 6 RZ 10.0 5.53 GAGATTCCTAAGATTAi506 i704 31
23 TC mar1d 5 MA 9.9 4.47 CGCTCATTCTTCGCCTi505 i703 22
In [37]:
df %>%
arrange(Replicate,
        desc(concentration_fold_difference))%>%
head(3)
RNA_sample_numMediaStrainReplicateenrichment_methodRINconcentration_fold_differencei7_indexi5_indexi5_primeri7_primerlibrary#
1 YPD H99 1 MA 10 3.64 CGCTCATTAGGCTATAi501 i703 18
1 YPD H99 1 RZ 10 3.64 GAGATTCCAGGCTATAi501 i704 26
13 TC H99 1 MA 10 1.95 CGCTCATTTCAGAGCCi504 i703 21

Using top_n

In [38]:
df %>% top_n(3, concentration_fold_difference)
RNA_sample_numMediaStrainReplicateenrichment_methodRINconcentration_fold_differencei7_indexi5_indexi5_primeri7_primerlibrary#
23 TC mar1d 5 MA 9.9 4.47 CGCTCATTCTTCGCCTi505 i703 22
24 TC mar1d 6 MA 10.0 5.53 CGCTCATTTAAGATTAi506 i703 23
23 TC mar1d 5 RZ 9.9 4.47 GAGATTCCCTTCGCCTi505 i704 30
24 TC mar1d 6 RZ 10.0 5.53 GAGATTCCTAAGATTAi506 i704 31
In [39]:
df %>% top_n(3, desc(concentration_fold_difference))
RNA_sample_numMediaStrainReplicateenrichment_methodRINconcentration_fold_differencei7_indexi5_indexi5_primeri7_primerlibrary#
2 YPD H99 2 MA 10 1.34 ATTACTCGAGGCTATAi501 i701 1
2 YPD H99 2 RZ 10 1.34 TCCGGAGAAGGCTATAi501 i702 9
2 YPD H99 2 TOT 10 1.34 CTGAAGCTAGGCTATAi501 i707 17

4. Mutate values

In [40]:
df %>%
select(RNA_sample_num, concentration_fold_difference) %>%
mutate(concentration_difference=2^concentration_fold_difference) %>%
head(3)
RNA_sample_numconcentration_fold_differenceconcentration_difference
2 1.34 2.531513
9 2.23 4.691340
10 4.37 20.677645
In [41]:
df %>%
transmute(concentration_difference=2^concentration_fold_difference) %>%
head(3)
concentration_difference
2.531513
4.691340
20.677645

5. Summarize

In [42]:
df %>% summarize(avg_fold_change=mean(concentration_fold_difference),
                 min_fold_change=min(concentration_fold_difference),
                 max_fold_change=max(concentration_fold_difference))
avg_fold_changemin_fold_changemax_fold_change
2.8919611.34 5.53
In [43]:
df %>% summarize_if(is.numeric, mean)
RINconcentration_fold_difference
9.7764712.891961

Group_by

summarize is most useful when used with group_by

In [44]:
df %>%
group_by(Media, Strain, enrichment_method) %>%
summarize(mean_fold_diff=mean(concentration_fold_difference))
MediaStrainenrichment_methodmean_fold_diff
TC H99 MA 2.576667
TC H99 RZ 2.576667
TC mar1d MA 3.463333
TC mar1d RZ 3.463333
YPD H99 MA 2.556667
YPD H99 RZ 2.556667
YPD H99 TOT 1.790000
YPD mar1d MA 3.246667
YPD mar1d RZ 3.246667
In [45]:
df %>%
group_by(Media, Strain, enrichment_method) %>%
summarize_if(is.numeric, mean)
MediaStrainenrichment_methodRINconcentration_fold_difference
TC H99 MA 9.8500002.576667
TC H99 RZ 9.8500002.576667
TC mar1d MA 9.3333333.463333
TC mar1d RZ 9.3333333.463333
YPD H99 MA 10.0000002.556667
YPD H99 RZ 10.0000002.556667
YPD H99 TOT 10.0000001.790000
YPD mar1d MA 9.8666673.246667
YPD mar1d RZ 9.8666673.246667
In [46]:
df %>%
group_by(Media, Strain) %>%
summarize_if(is.numeric, funs(mean, sd))
MediaStrainRIN_meanconcentration_fold_difference_meanRIN_sdconcentration_fold_difference_sd
TC H99 9.8500002.576667 0.26111650.7157873
TC mar1d 9.3333333.463333 1.46432321.3630803
YPD H99 10.0000002.403333 0.00000000.8593824
YPD mar1d 9.8666673.246667 0.16696940.7410230

What experimental conditions produced the greatest mean fold change?

In [47]:
df %>%
group_by(Media, Strain, enrichment_method) %>%
summarize(mean_fold_diff=mean(concentration_fold_difference)) %>%
arrange(desc(mean_fold_diff)) %>%
head(1)
MediaStrainenrichment_methodmean_fold_diff
TC mar1d MA 3.463333