Pilot Study: Interaction analysis with DESeq2

Synposis

In this tutorial, we will illustrate the use of the DESeq2 package for conducting interaction analysis. More specifically, for each gene we will assess the level of statistical evidence to support a treatment by strain interaction with respect to the mRNA abudnance for that gene.

In the context of our pilot experiment, for a given gene, let \(\Delta_{H99}\) denote the differential expression effect due to treatment in the H99 strain and $:raw-latex:Delta_{mar1d} denote the corresponding differential expression effect in the mar1d strain. A

If the differential expression effect is quantified by say log2 fold-change, then * testing for a differential treatment effect within the H99 strain can be formulated as testing \(H_0: \Delta_{H99} = 0\) against \(H_1: \Delta_{H99} \ne 0\). * testing for a differential treatment effect within the mar1d strain can be formulated as testing \(H_0: \Delta_{mar1d} = 0\) against \(H_1: \Delta_{mar1d} \ne 0\) * testing for a treatment by strain interaction with respect to mRNA abundance of the gene can be formulated as testing \(H_0: \Delta_{H99} = \Delta_{mar1d}\) against \(H_1: \Delta_{H99} \ne \Delta_{mar1d}\).

The (margional) differential expression hypothesis tests whether there is evidence for a fold change, while the interaction hypothesis tests whether there is a difference between the two fold changes. For example \(\Delta_{H99} >\Delta_{mar1d}\) indicates that the differential expression effect due to treatment is larger in the H99 strain compared to the mar1d strain.

Set Environment

In [1]:
source("pilot_config.R")
source("pilot_util.R")
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats

Attaching package: ‘foreach’

The following objects are masked from ‘package:purrr’:

    accumulate, when

Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:dplyr’:

    combine, intersect, setdiff, union

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, cbind, colMeans, colnames,
    colSums, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, lengths, Map, mapply, match,
    mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which, which.max, which.min


Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:dplyr’:

    first, rename

The following object is masked from ‘package:tidyr’:

    expand

The following object is masked from ‘package:base’:

    expand.grid

Loading required package: IRanges

Attaching package: ‘IRanges’

The following objects are masked from ‘package:dplyr’:

    collapse, desc, slice

The following object is masked from ‘package:purrr’:

    reduce

Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: DelayedArray
Loading required package: matrixStats

Attaching package: ‘matrixStats’

The following objects are masked from ‘package:Biobase’:

    anyMissing, rowMedians

The following object is masked from ‘package:dplyr’:

    count


Attaching package: ‘DelayedArray’

The following objects are masked from ‘package:matrixStats’:

    colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

The following object is masked from ‘package:base’:

    apply


Attaching package: ‘limma’

The following object is masked from ‘package:DESeq2’:

    plotMA

The following object is masked from ‘package:BiocGenerics’:

    plotMA


Attaching package: ‘gridExtra’

The following object is masked from ‘package:Biobase’:

    combine

The following object is masked from ‘package:BiocGenerics’:

    combine

The following object is masked from ‘package:dplyr’:

    combine


---------------------
Welcome to dendextend version 1.8.0
Type citation('dendextend') for how to cite the package.

Type browseVignettes(package = 'dendextend') for the package vignette.
The github page is: https://github.com/talgalili/dendextend/

Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
Or contact: <tal.galili@gmail.com>

        To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
---------------------


Attaching package: ‘dendextend’

The following object is masked from ‘package:stats’:

    cutree


Attaching package: ‘plotly’

The following object is masked from ‘package:IRanges’:

    slice

The following object is masked from ‘package:S4Vectors’:

    rename

The following object is masked from ‘package:ggplot2’:

    last_plot

The following object is masked from ‘package:stats’:

    filter

The following object is masked from ‘package:graphics’:

    layout

Load and create data objects

Attach data objects containing count and meta data

In [2]:
attach(file.path(OUTDIR, "HTS-Pilot-Annotated-STAR-counts.RData"))

Create columnData object

In [3]:
annomapres0 %>%
    dplyr::filter(enrichment_method == "RZ")  %>%
    DataFrame ->
    columnData
rownames(columnData) <- columnData[["Label"]]

columnData[, c("Label", "Strain", "Media")] %>% head(3)
DataFrame with 3 rows and 3 columns
              Label   Strain    Media
        <character> <factor> <factor>
1_RZ_J       1_RZ_J      H99      YPD
10_RZ_C     10_RZ_C    mar1d      YPD
11_RZ_J     11_RZ_J    mar1d      YPD

Create countData objects

In [4]:
annogenecnts0 %>%
    dplyr::select(dput(as.character(c("gene", columnData[["Label"]])))) %>%
    as.data.frame %>%
    column_to_rownames("gene") %>%
    as.matrix ->
    countData

countData %>% head(3)
c("gene", "1_RZ_J", "10_RZ_C", "11_RZ_J", "12_RZ_P", "13_RZ_J",
"14_RZ_C", "15_RZ_C", "16_RZ_P", "2_RZ_C", "21_RZ_C", "22_RZ_C",
"23_RZ_J", "24_RZ_J", "26_RZ_C", "27_RZ_P", "3_RZ_J", "35_RZ_P",
"36_RZ_J", "38_RZ_P", "4_RZ_P", "40_RZ_J", "45_RZ_P", "47_RZ_P",
"9_RZ_C")
1_RZ_J10_RZ_C11_RZ_J12_RZ_P13_RZ_J14_RZ_C15_RZ_C16_RZ_P2_RZ_C21_RZ_C27_RZ_P3_RZ_J35_RZ_P36_RZ_J38_RZ_P4_RZ_P40_RZ_J45_RZ_P47_RZ_P9_RZ_C
CNAG_00001 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
CNAG_0000220476 92 64 23018220012916812443 10715010995 51 235122112106
CNAG_00003 4024 18 34 56 53 54 40 40 41 9 24 26 4343 11 53 46 41 35

Interaction Analysis

Note that the Media by Strain interaction term has been added to the model. In the design argument, the additive model (Median + Strain) has been augmented with the interaction term (Media:Strain). The latter should appear last as it is our primary parameter of interest. In statistics, Media + Strain is referred to as the additive model while Media + Strain + Media:Strain is referred to as the multiplicative model.

In [5]:
ddsmult <- DESeqDataSetFromMatrix(countData, columnData, ~ Media + Strain + Media:Strain)
ddsmult <- estimateSizeFactors(ddsmult)
ddsres <- DESeq(ddsmult)
using pre-existing size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing

Verify the design before looking at the results

In [6]:
design(ddsres)
~Media + Strain + Media:Strain

We retrieve the top twenty genes, according to their corresponding adjusted P-values

In [7]:
results(ddsres, tidy=TRUE) %>%
    arrange(padj) %>%
    head(20) ->
    tophits

tophits
rowbaseMeanlog2FoldChangelfcSEstatpvaluepadj
CNAG_04307 490.86196 4.0231182 0.29759976 13.518553 1.215429e-419.552054e-38
CNAG_05459 636.02477 1.1440422 0.09994094 11.447183 2.429119e-309.545224e-27
CNAG_03621 1821.27006 -1.0644341 0.10659585 -9.985699 1.760551e-234.612056e-20
CNAG_00600 848.44047 1.1424338 0.11598632 9.849729 6.872879e-231.350349e-19
CNAG_04585 812.31911 -4.0407468 0.41261597 -9.792997 1.206656e-221.896623e-19
CNAG_03735 521.96604 1.6342312 0.16755798 9.753228 1.786944e-222.340599e-19
CNAG_07862 85.91321 -2.6088443 0.26841617 -9.719401 2.492446e-222.798304e-19
CNAG_02733 15.32034 -4.0236397 0.41731949 -9.641629 5.333480e-225.239478e-19
CNAG_06517 850.38128 1.8719912 0.19539060 9.580764 9.632973e-228.411726e-19
CNAG_00399 213.68536 -2.1466008 0.22469243 -9.553507 1.253786e-219.853503e-19
CNAG_02045 24.74771 -2.6789308 0.31060668 -8.624833 6.418615e-184.585809e-15
CNAG_03996 2229.18446 1.3632246 0.16083685 8.475822 2.334248e-171.528738e-14
CNAG_01681 1996.46047 2.6394255 0.31182349 8.464486 2.572852e-171.555388e-14
CNAG_00923 73.96611 -1.7320806 0.20880649 -8.295147 1.084508e-166.087964e-14
CNAG_07554 1112.80452 1.0774400 0.13029451 8.269266 1.347860e-167.061886e-14
CNAG_06764 599.44576 -0.9476336 0.12084464 -7.841751 4.443063e-152.182377e-12
CNAG_04949 351.12106 0.9334979 0.12010335 7.772455 7.697914e-153.558700e-12
CNAG_02347 24.20742 -3.2863765 0.42775993 -7.682759 1.556982e-146.797956e-12
CNAG_07856 344.40916 -1.4058418 0.18315481 -7.675703 1.645144e-146.804836e-12
CNAG_05192 50.27230 1.8595868 0.24461087 7.602225 2.910827e-141.143809e-11

Retrieve the gene symbols for the the top and twentieth ranked hits and assign them to the vector topgenes

In [8]:
topgenes <- tophits$row[c(1,20)]
topgenes
  1. 'CNAG_04307'
  2. 'CNAG_05192'

Visualize Interactions

First, we “normalize” the counts using the vst method (this is an alternative to the rlog method)

In [9]:
vstexp <- vst(ddsmult, blind=TRUE)

Next, we visualize the results using dotplots. It is noted that this is an exploratory analysis meant to help us visualize the findings.

In [11]:
p1 <- myinteractplot(vstexp, topgenes[1], "Strain")
p2 <- myinteractplot(vstexp, topgenes[2], "Strain")
grid.arrange(p1, p2, ncol = 2)
../_images/pilot_04_Pilot_DESeq2_Interaction_25_0.png

Observations:

  • CNAG_04307: We observe that in the H99 strain, treatment seemingly upregulates expression while decreasing expression in the mar1d strain. While the effect sizes are seemingly similar in absolute magnitude, they are in the opposite direction.
  • CNAG_05192: We observe that in the H99 strain, treatment seemingly upregulates expression. There is seemingly no treatment effect in the mar1d strain

Estimate treatment effect size within each strain

Next, we will show how to estimate treatment effect sizes within each strain. To keep things clean, we make a new copy of the ddsmult object and name it dds. We will add a new variable, to be named group, containing the combined Media and Strain assignments to this new copy and replace its design with group.

In [11]:
dds <- ddsmult
dds$group <- as.factor(paste0(dds$Media,dds$Strain))
design(dds) <- ~ group
ddsgrp <- DESeq(dds)
using pre-existing size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing

Let’s look at the first 5 cases. As expected, the entry in group is obtained by pasting strain status to the right of treatment status.

In [13]:
dds@colData[1:5, c("Label", "Media", "Strain", "group")]
DataFrame with 5 rows and 4 columns
              Label    Media   Strain    group
        <character> <factor> <factor> <factor>
1_RZ_J       1_RZ_J      YPD      H99   YPDH99
10_RZ_C     10_RZ_C      YPD    mar1d YPDmar1d
11_RZ_J     11_RZ_J      YPD    mar1d YPDmar1d
12_RZ_P     12_RZ_P      YPD    mar1d YPDmar1d
13_RZ_J     13_RZ_J       TC      H99    TCH99

Let’s compare the designs:

In [13]:
ddsmult@design
dds@design
~Media + Strain + Media:Strain
~group

Next, we retrieve the results for the interaction analysis (r0), the results for contrasting the treatment effect within H99 (r1) and the results for constrasting the treatment effect within mar1d (r2).

In [16]:
### Results for interaction analysis
r0<-results(ddsres, tidy=TRUE)
### Contrast treatments within H99 strain
r1<-results(ddsgrp, contrast = c("group", "TCH99", "YPDH99"), tidy=TRUE)
### Contrast treatments within mar1d strain
r2<-results(ddsgrp, contrast = c("group", "TCmar1d", "YPDmar1d"), tidy=TRUE)

Now, let’s look at the results for the two genes we had picked

In [19]:
### Look at the results for top two genes
r0 %>% filter(row %in% topgenes)
r1 %>% filter(row %in% topgenes)
r2 %>% filter(row %in% topgenes)
rowbaseMeanlog2FoldChangelfcSEstatpvaluepadj
CNAG_04307 490.8620 4.023118 0.2975998 13.518553 1.215429e-419.552054e-38
CNAG_05192 50.2723 1.859587 0.2446109 7.602225 2.910827e-141.143809e-11
rowbaseMeanlog2FoldChangelfcSEstatpvaluepadj
CNAG_04307 490.8620 2.013216 0.2107145 9.554234 1.245011e-219.241900e-20
CNAG_05192 50.2723 1.953382 0.1997372 9.779762 1.375341e-221.252967e-20
rowbaseMeanlog2FoldChangelfcSEstatpvaluepadj
CNAG_04307 490.8620 -2.00990294 0.2101548 -9.5639165 1.133828e-211.934022e-19
CNAG_05192 50.2723 0.09379498 0.1412074 0.6642357 5.065395e-015.972830e-01
  • CNAG_04307: We observe that the estimated effect size in the H99 strain (2.01) is positive. The corresponding estimated effect size (-2.01) in the mar1d strain is negative. Constrast the estimated magnitudes and estimated directions to the figure
    • The P-value for testing for a differential expression within H99 (\(H_0: \Delta_{H99} = 0\) against \(H_1: \Delta_{H99} \ne 0\)) is 9.55e-38 (appropriately presented as < 2.22e-16)
    • The P-value for testing for a differential expression within mar1d (\(H_0: -\Delta_{mar1d} = 0\) against \(H_1: \Delta_{mar1d} \ne 0\)) is 1.9e-19 (appropriately presented as < 2.22e-16)
    • The P-value for a treatment by strain interaction with respect to the mRNA abundance of CNAG_04307 is 9.55e-38 appropriately presented as < 2.22e-16)
  • CNAG_05192: We observe that the estimated effect size in the H99 strain (2.01) is positive. The estimated effect size in the mar1d strain is very small (0.093).
  • The P-value for testing for a differential expression within H99 (\(H_0: \Delta_{H99} = 0\) against \(H_1: \Delta_{H99} \ne 0\)) is 1.25e-20 (appropriately presented as < 2.22e-16)
  • The P-value for testing for a differential expression within mar1d (\(H_0: -\Delta_{mar1d} = 0\) against \(H_1: \Delta_{mar1d} \ne 0\)) is 0.51 (note that you may not conclude that there is no differential expression in the mar1d strain)
  • The P-value for a treatment by strain interaction with respect to the mRNA abundance of CNAG_04307 is 1.14e-11.