Introduction to `DESeq2`¶

This notebook serves as a tutorial for using the DESeq2 package. Please be sure to consult the excellent vignette provided by the DESeq2 package. Hopefully, we will also get a chance to review the edgeR package (which also has a very nice vignette which I suggest that you review)

Load packages¶

Load requisite R packages.

In [1]:

library(DESeq2)
library(tools)
library(limma)
library(qvalue)
library(dplyr)
options(width=100)

Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, cbind, colnames, do.call,
    duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect,
    is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
    paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
    Reduce, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit, which, which.max, which.min


Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.


Attaching package: ‘limma’

The following object is masked from ‘package:DESeq2’:

    plotMA

The following object is masked from ‘package:BiocGenerics’:

    plotMA


Attaching package: ‘dplyr’

The following object is masked from ‘package:Biobase’:

    combine

The following objects are masked from ‘package:GenomicRanges’:

    intersect, setdiff, union

The following object is masked from ‘package:GenomeInfoDb’:

    intersect

The following objects are masked from ‘package:IRanges’:

    collapse, desc, intersect, regroup, setdiff, slice, union

The following objects are masked from ‘package:S4Vectors’:

    first, intersect, rename, setdiff, setequal, union

The following objects are masked from ‘package:BiocGenerics’:

    combine, intersect, setdiff, union

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Import Data¶

The first step to any analysis is to import the data into an analysis ready format. The latter depends on the requirements of the package used for the analysis.

For this analysis, we will use the DESeq2::DESeqDataSetFromHTSeqCount. This function allows you to import count files generated by HTSeq directly into R. If you use a program other than HTSeq, you should use the DESeq2::DESeqDataSetFromMatrix function.

Let’s review the three main arguments of DESeq2::DESeqDataSetFromHTSeqCount: sampleTable, directory and design

In [2]:

#? DESeqDataSetFromHTSeqCount

First set the directory under which the HTSeq count files are stored

In [3]:

datadir<-"/home/jovyan/work/2017-HTS-materials/Materials/Statistics/08032017/Data/2015"

Next, put the filenames into a data frame

In [4]:

phdata<-data.frame(fname=list.files(path=datadir,pattern="*.csv"),stringsAsFactors=FALSE)
head(phdata)

	fname
1	7A_E.csv
2	7A_G.csv
3	7A_K.csv
4	7A_N.csv
5	7A_P.csv
6	7B_E.csv

It is always a good idea to check the dimension of the file you have read in

In [5]:

dim(phdata)

30
1

Extract the label from the filename. Why should one reorder columns? See DESeq2 help file

In [6]:

phdata <- phdata %>% transmute(sample=substr(fname,1,4),fname)
head(phdata)

	sample	fname
1	7A_E	7A_E.csv
2	7A_G	7A_G.csv
3	7A_K	7A_K.csv
4	7A_N	7A_N.csv
5	7A_P	7A_P.csv
6	7B_E	7B_E.csv

For any analysis, you should have a manifest file that maps the file names (containing the counts) to the phenotypic and experimental condition for the sample. This file should be validated and put under strict vesrion control.

For this demonstration, we will reproduce this file by parsing the file names. This is meant to help us practice R programming.

Add some design info to the data frame. We will add the treatment factor (the first character of the file name: 7 or 8), the replicate id (the second character of the file name)

Note that tools::md5sum will add the MD5 signature for each of the HTSeq count files. You should keep track of these for the purpose of conducting reproducible analysis

In [7]:

phdata <- phdata %>% mutate(trt=as.factor(substr(sample,1,1)),
                           repl=substr(sample,2,2),
                           team=substr(sample,4,4),
                           md5=tools::md5sum(file.path(datadir,fname)))
head(phdata)

	sample	fname	trt	repl	team	md5
1	7A_E	7A_E.csv	7	A	E	263e8cd3d8bbb72fac8600de41b12e73
2	7A_G	7A_G.csv	7	A	G	8e3447446c11b3cf33496de751a0e7b0
3	7A_K	7A_K.csv	7	A	K	f94f0cf208f73ea5492f95cd8dc32ab2
4	7A_N	7A_N.csv	7	A	N	93596ed64b4c4495bbb95722f6a27415
5	7A_P	7A_P.csv	7	A	P	913a10ca78999596ad51db1321511d71
6	7B_E	7B_E.csv	7	B	E	51d8902b1dee08f1af3affe6510a87a5

Q: Why did I convert the trt variable to a factor?

For this analysis, we pick the data from team E

In [8]:

phdata<- phdata %>% filter(team=="E")
phdata

	sample	fname	trt	repl	team	md5
1	7A_E	7A_E.csv	7	A	E	263e8cd3d8bbb72fac8600de41b12e73
2	7B_E	7B_E.csv	7	B	E	51d8902b1dee08f1af3affe6510a87a5
3	7C_E	7C_E.csv	7	C	E	49a549a26e58dd5c13bd931a794c0ec3
4	8A_E	8A_E.csv	8	A	E	edaaefaf7ddeb1119f29cf3d5682d4ed
5	8B_E	8B_E.csv	8	B	E	f7d52ce9ffc0e61d7e371d072ebc0a43
6	8C_E	8C_E.csv	8	C	E	1e668998cb9947b80f3e813a7428bfb5

Now, we import the counts. Note that the first argument is the sample table while the second is the directory storing the count files. The last argument specifies the design. More on this later.

In [9]:

dds<-DESeqDataSetFromHTSeqCount(sampleTable=phdata,directory=datadir,design=~ trt)

Inspect object¶

Let’s has a look at the object we have created.

In [10]:

dds

class: DESeqDataSet
dim: 4444 6
metadata(1): version
assays(1): counts
rownames(4444): gene0 gene1 ... gene998 gene999
rowData names(0):
colnames(6): 7A_E 7B_E ... 8B_E 8C_E
colData names(4): trt repl team md5

Note that this object is of class DESeqDataSet.

In [11]:

class(dds)

'DESeqDataSet'

The object dds has the following slots.

In [12]:

slotNames(dds)

'design'
'dispersionFunction'
'rowRanges'
'colData'
'assays'
'NAMES'
'elementMetadata'
'metadata'

To get a slot use @

In [13]:

dds@colData

DataFrame with 6 rows and 4 columns
          trt        repl        team                              md5
     <factor> <character> <character>                      <character>
7A_E        7           A           E 263e8cd3d8bbb72fac8600de41b12e73
7B_E        7           B           E 51d8902b1dee08f1af3affe6510a87a5
7C_E        7           C           E 49a549a26e58dd5c13bd931a794c0ec3
8A_E        8           A           E edaaefaf7ddeb1119f29cf3d5682d4ed
8B_E        8           B           E f7d52ce9ffc0e61d7e371d072ebc0a43
8C_E        8           C           E 1e668998cb9947b80f3e813a7428bfb5

Let’s look at the design

In [14]:

dds@design

~trt

The first thing you may want to do is to have a look at the raw counts you have imported. The DESeq2::counts function extracts a matrix of counts (with the genes along the rows and samples along the columns). Let us first verify the dimension of this matrix.

In [15]:

dim(counts(dds))

4444
6

Now print the raw counts for the first three genes (how can you verify this looking at the files from htseq-count)

In [16]:

head(counts(dds),3)

	7A_E	7B_E	7C_E	8A_E	8B_E	8C_E
gene0	9	12	19	21	8	10
gene1	108	119	155	193	164	196
gene10	3	2	2	6	3	7

Slots of an S4 class¶

This gives the design of the study

In [17]:

dds@dispersionFunction

function ()
NULL

This slots return gene specific information (it will be populated later)

Estimate Size Factors and Dispersion Parameters¶

You recall that DESeq requires that we have estimates for sample specific size factors and gene specific dispersion factors. More specifically, recall that DESeq models the count \(K_{ij}\) (gene \(i\), sample \(j\)) as negative binomial with mean \(\mu_{ij}\) and dispersion parameter \(\alpha_i\). Here \(\mu_{ij}=s_j q_{ij}\) where \(\log_2(q_{ij}) = \beta_{0i} + \beta_{1i} z_j\). Here \(s_j\) is the sample \(j\) specific size factor.

Size Factors¶

We begin by estimating the size factors \(s_1,\ldots,s_n\):

In [18]:

dds <- estimateSizeFactors(dds)

Now, compare the dds object to that of before applying the estimateSizeFactors() function. What has changed? What remains unchanged?

In [19]:

dds

class: DESeqDataSet
dim: 4444 6
metadata(1): version
assays(1): counts
rownames(4444): gene0 gene1 ... gene998 gene999
rowData names(0):
colnames(6): 7A_E 7B_E ... 8B_E 8C_E
colData names(5): trt repl team md5 sizeFactor

Note that there is a sizeFactor added to colData. Let’s look at it more carefully

You can also get the size factors directly (why are there six size factors?)

In [20]:

sizeFactors(dds)

7A_E: 0.814709325873651
7B_E: 0.860242585687071
7C_E: 0.982865560325398
8A_E: 1.0938375107471
8B_E: 1.15160914763126
8C_E: 1.26477648704175

It is preferable to limit the number of decimal places. Next show the size factors rounded to 3 decimal places

In [21]:

round(sizeFactors(dds),3)

7A_E: 0.815
7B_E: 0.86
7C_E: 0.983
8A_E: 1.094
8B_E: 1.152
8C_E: 1.265

Now that the size factors have been estimated, we can get “normalized” counts

In [22]:

head(counts(dds),3)
head(counts(dds,normalize=TRUE),3)

	7A_E	7B_E	7C_E	8A_E	8B_E	8C_E
gene0	9	12	19	21	8	10
gene1	108	119	155	193	164	196
gene10	3	2	2	6	3	7

	7A_E	7B_E	7C_E	8A_E	8B_E	8C_E
gene0	11.046885	13.949554	19.331230	19.198464	6.946801	7.906535
gene1	132.5626	138.3331	157.7021	176.4430	142.4094	154.9681
gene10	3.682295	2.324926	2.034866	5.485275	2.605051	5.534575

Note that these are the counts divided by the size factors. Compare the first row of the last table (“normalized” counts for gene 1) to the hand calculation below.

In [23]:

counts(dds)[1,]/sizeFactors(dds)

7A_E: 11.0468847160291
7B_E: 13.9495535325256
7C_E: 19.3312297906844
8A_E: 19.1984639342426
8B_E: 6.94680136611903
8C_E: 7.90653534632788

Exercise: How do you get the raw counts for gene “GeneID:12930116”?

In [24]:

counts(dds)["gene10",]

7A_E: 3
7B_E: 2
7C_E: 2
8A_E: 6
8B_E: 3
8C_E: 7

Exercise: How do you get the normalized counts for gene gene10 for the first 10 sample?

In [25]:

counts(dds,normalize=TRUE)["gene10",]

7A_E: 3.68229490534303
7B_E: 2.32492558875426
7C_E: 2.03486629375625
8A_E: 5.48527540978361
8B_E: 2.60505051229463
8C_E: 5.53457474242952

Exercise: Get a summary (mean, median, quantiles etc ) of the size factors

In [26]:

summary(sizeFactors(dds))

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.8147  0.8909  1.0380  1.0280  1.1370  1.2650

Before going to the next step, let’s look at the dispersionFunction slot

In [27]:

dds@dispersionFunction

function ()
NULL

Dispersion Parameters¶

Next, we get the dispersion factors \(\alpha_1,\ldots,\alpha_{m}\)

In [28]:

dds<-estimateDispersions(dds)

gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates

Now inspect the dds object again and note that the rowRanges slot has extra information (“metadata column names(0):” before versus “column names(9): baseMean baseVar ... dispOutlier dispMAP”)

In [29]:

dds

class: DESeqDataSet
dim: 4444 6
metadata(1): version
assays(2): counts mu
rownames(4444): gene0 gene1 ... gene998 gene999
rowData names(9): baseMean baseVar ... dispOutlier dispMAP
colnames(6): 7A_E 7B_E ... 8B_E 8C_E
colData names(5): trt repl team md5 sizeFactor

Note that the dispersionfunction slot is now populated

In [30]:

dds@dispersionFunction

structure(function (q)
coefs[1] + coefs[2]/q, coefficients = structure(c(0.0296449332226886,
1.43330515554225), .Names = c("asymptDisp", "extraPois")), fitType = "parametric", varLogDispEsts = 1.18849739273446, dispPriorVar = 0.54356332588623)

We can extract the gene specific dispersion factors using dispersions(). Note that there will be one number per gene. We look at the first four genes (rounded to 4 decimal places)

In [31]:

alphas<-dispersions(dds)

Verify that the number of dispersion factors equals the number of genes

In [32]:

length(alphas)

4444

Print the dispersion factors for the first 5 genes rounded to four decimal points

In [33]:

round(alphas[1:5],4)

0.1363
0.0226
0.2759
0.0688
0.1377

Extract the metadata using mcols() for the first four genes

In [34]:

mcols(dds)[1:4,]

DataFrame with 4 rows and 9 columns
    baseMean   baseVar   allZero dispGeneEst    dispFit dispersion  dispIter dispOutlier    dispMAP
   <numeric> <numeric> <logical>   <numeric>  <numeric>  <numeric> <numeric>   <logical>  <numeric>
1  13.063245  29.15647     FALSE 0.127803253 0.13936539 0.13628195         6       FALSE 0.13628195
2 150.403062 256.01201     FALSE 0.003999823 0.03917469 0.02257188         8       FALSE 0.02257188
3   3.611165   2.47365     FALSE 0.000000010 0.42655433 0.27587228         8       FALSE 0.27587228
4  24.170850  35.51715     FALSE 0.014700541 0.08894385 0.06884450         8       FALSE 0.06884450

Exercise: Provide statistical summaries of the dispersion factors

In [35]:

summary(dispersions(dds))

Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's
 0.01119  0.02864  0.05539  0.42150  0.17510 10.00000      130

Exercise: Summarize the dispersion factors using a box plot (may want to log transform)

In [36]:

boxplot(log(dispersions(dds)))

../../_images/Statistics_08032017_DESeq2-Notebook-introduction_84_0.png

Differential Expression Analysis¶

We can now conduct a differential expression analysis using the DESeq() function. Keep in mind that to get to this step, we first estimated the size factors and then the dispersion parameters.

In [37]:

ddsDE<-DESeq(dds)

using pre-existing size factors
estimating dispersions
found already estimated dispersions, replacing these
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing

We can get the results for the differential expression analysis using results()

In [38]:

myres<-results(ddsDE)

Let’s look at the results for the first four genes

In [39]:

myres[1:4,]

log2 fold change (MAP): trt 8 vs 7
Wald test p-value: trt 8 vs 7
DataFrame with 4 rows and 6 columns
          baseMean log2FoldChange     lfcSE       stat    pvalue      padj
         <numeric>      <numeric> <numeric>  <numeric> <numeric> <numeric>
gene0    13.063245     -0.3615293 0.4895581 -0.7384809 0.4602223 0.6225190
gene1   150.403062      0.1396895 0.1984781  0.7038029 0.4815555 0.6402232
gene10    3.611165      0.5992853 0.6890620  0.8697117 0.3844580 0.5526493
gene100  24.170850      0.2231204 0.3708345  0.6016711 0.5473931 0.6965989

You can get the descriptions for the columns from the DE analysis

In [40]:

data.frame(desc=mcols(myres)$description)

	desc
1	mean of normalized counts for all samples
2	log2 fold change (MAP): trt 8 vs 7
3	standard error: trt 8 vs 7
4	Wald statistic: trt 8 vs 7
5	Wald test p-value: trt 8 vs 7
6	BH adjusted p-values

or look at ?results

P-values¶

One can extract the unadjusted p-values as follows

In [41]:

pvalues<-myres$pvalue
length(pvalues)
pvalues[1:4]

4444

0.460222295066648
0.481555502006739
0.384457970194728
0.547393065612409

The BH adjusted p-values can be extracted as

In [42]:

adjp<-myres$padj
length(adjp)
adjp[1:4]

4444

0.622518959377389
0.640223160424753
0.552649254473344
0.696598867777664

Calculate BH adjusted P-values by “hand” using the p.adjust() function. Note that you will not replicate the results you get under the padj column (when looking at the first four rows)

In [43]:

pvalues<-myres$pvalue
BH<-p.adjust(pvalues,"BH")
data.frame(BH=BH[1:4],adjp=adjp[1:4])

	BH	adjp
1	0.670405	0.622519
2	0.6872980	0.6402232
3	0.6015538	0.5526493
4	0.7456737	0.6965989

The DESeq2::results function applies “independent” filtering. This enabled by default. Let’s disable and the reexamine the adjusted P-values

In [44]:

myres1<-results(ddsDE,independentFiltering = FALSE)

In [45]:

myres1

log2 fold change (MAP): trt 8 vs 7
Wald test p-value: trt 8 vs 7
DataFrame with 4444 rows and 6 columns
           baseMean log2FoldChange     lfcSE       stat     pvalue       padj
          <numeric>      <numeric> <numeric>  <numeric>  <numeric>  <numeric>
gene0     13.063245     -0.3615293 0.4895581 -0.7384809 0.46022230 0.67040501
gene1    150.403062      0.1396895 0.1984781  0.7038029 0.48155550 0.68729798
gene10     3.611165      0.5992853 0.6890620  0.8697117 0.38445797 0.60155379
gene100   24.170850      0.2231204 0.3708345  0.6016711 0.54739307 0.74567369
gene1000   7.675704     -1.1913393 0.5487654 -2.1709447 0.02993535 0.08914359
...             ...            ...       ...        ...        ...        ...
gene995   1.6197713      1.4706212 0.8162744  1.8016261 0.07160425 0.17719995
gene996  12.8488020     -0.2319250 0.4424331 -0.5242035 0.60013703 0.78035966
gene997   9.9461848     -1.0929507 0.4962577 -2.2023856 0.02763808 0.08386863
gene998  36.7337608     -0.2205173 0.3212546 -0.6864254 0.49244489 0.69827572
gene999   0.8127785      0.1074132 0.8147129  0.1318417 0.89510947 0.94515483

We can now replicate the results

In [46]:

pvalues1<-myres1$pvalue
BH1<-p.adjust(pvalues1[!is.na(pvalues)],"BH")
data.frame(BH=BH1[1:4],adjp=myres1$padj[1:4])

	BH	adjp
1	0.670405	0.670405
2	0.687298	0.687298
3	0.6015538	0.6015538
4	0.7456737	0.7456737

Subset and reorder the results¶

In [47]:

summary(myres,0.05)


out of 4314 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 615, 14%
LFC < 0 (down)   : 685, 16%
outliers [1]     : 8, 0.19%
low counts [2]   : 501, 12%
(mean count < 2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

You can sort the results by say the unadjusted P-values

In [48]:

myres[order(myres[["pvalue"]])[1:4],]

log2 fold change (MAP): trt 8 vs 7
Wald test p-value: trt 8 vs 7
DataFrame with 4 rows and 6 columns
          baseMean log2FoldChange     lfcSE      stat        pvalue          padj
         <numeric>      <numeric> <numeric> <numeric>     <numeric>     <numeric>
gene1209 1110.3199      -3.347158 0.1492045 -22.43336 1.860541e-111 7.079359e-108
gene4405  436.6800       4.443710 0.2144262  20.72373  2.116499e-95  4.026639e-92
gene3312  488.4901       4.659130 0.2362760  19.71902  1.480824e-86  1.878178e-83
gene3317  322.3925       5.043525 0.2597286  19.41844  5.390243e-84  5.127469e-81

To get the list of genes with unadjusted P-values < 0.00001 and absolute log2 FC of more than 4

In [49]:

subset(myres,pvalue<0.00001&abs(log2FoldChange)>4)

log2 fold change (MAP): trt 8 vs 7
Wald test p-value: trt 8 vs 7
DataFrame with 24 rows and 6 columns
          baseMean log2FoldChange     lfcSE      stat       pvalue         padj
         <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
gene1300 175.91020       4.761123 0.4248758  11.20592 3.813880e-29 3.023294e-27
gene2034 128.44386       4.061622 0.3738799  10.86344 1.721393e-27 1.235830e-25
gene2127  82.92247       4.353625 0.3826230  11.37837 5.359438e-30 4.338864e-28
gene2128 120.48719       4.474566 0.3303883  13.54335 8.673550e-42 1.434907e-39
gene2179 436.50102       5.055035 0.2805605  18.01763 1.416826e-72 1.078204e-69
...            ...            ...       ...       ...          ...          ...
gene4405 436.68000       4.443710 0.2144262 20.723727 2.116499e-95 4.026639e-92
gene585   27.68248       5.127728 0.6402093  8.009455 1.152179e-15 3.535517e-14
gene594  121.06565       5.862144 0.4591564 12.767207 2.499646e-37 3.396840e-35
gene595  106.31889       4.798168 0.3853057 12.452885 1.348798e-35 1.466336e-33
gene596   69.53289       4.007674 0.4010769  9.992284 1.647420e-23 9.084685e-22

To get the list of genes with unadjusted P-values < 0.00001 and upregulated genes with log2 FC of more than 4

In [50]:

subset(myres,pvalue<0.00001&log2FoldChange>4)

log2 fold change (MAP): trt 8 vs 7
Wald test p-value: trt 8 vs 7
DataFrame with 24 rows and 6 columns
          baseMean log2FoldChange     lfcSE      stat       pvalue         padj
         <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
gene1300 175.91020       4.761123 0.4248758  11.20592 3.813880e-29 3.023294e-27
gene2034 128.44386       4.061622 0.3738799  10.86344 1.721393e-27 1.235830e-25
gene2127  82.92247       4.353625 0.3826230  11.37837 5.359438e-30 4.338864e-28
gene2128 120.48719       4.474566 0.3303883  13.54335 8.673550e-42 1.434907e-39
gene2179 436.50102       5.055035 0.2805605  18.01763 1.416826e-72 1.078204e-69
...            ...            ...       ...       ...          ...          ...
gene4405 436.68000       4.443710 0.2144262 20.723727 2.116499e-95 4.026639e-92
gene585   27.68248       5.127728 0.6402093  8.009455 1.152179e-15 3.535517e-14
gene594  121.06565       5.862144 0.4591564 12.767207 2.499646e-37 3.396840e-35
gene595  106.31889       4.798168 0.3853057 12.452885 1.348798e-35 1.466336e-33
gene596   69.53289       4.007674 0.4010769  9.992284 1.647420e-23 9.084685e-22

The P-values for the four top genes are beyond machine precision. You can use the format.pval() function to properly format the P-values. PLEASE promote ending the practice of publishing P-values below machine precision. (that would be akin to stating the weight of an object that weighs less than one pound with scale that whose minimum weight spec is 1lbs).

In [51]:

myres$pval=format.pval(myres$pvalue)
myres[order(myres[["pvalue"]])[1:4],]

log2 fold change (MAP): trt 8 vs 7
Wald test p-value: trt 8 vs 7
DataFrame with 4 rows and 7 columns
          baseMean log2FoldChange     lfcSE      stat        pvalue          padj        pval
         <numeric>      <numeric> <numeric> <numeric>     <numeric>     <numeric> <character>
gene1209 1110.3199      -3.347158 0.1492045 -22.43336 1.860541e-111 7.079359e-108  < 2.22e-16
gene4405  436.6800       4.443710 0.2144262  20.72373  2.116499e-95  4.026639e-92  < 2.22e-16
gene3312  488.4901       4.659130 0.2362760  19.71902  1.480824e-86  1.878178e-83  < 2.22e-16
gene3317  322.3925       5.043525 0.2597286  19.41844  5.390243e-84  5.127469e-81  < 2.22e-16

Let’s look at a volcano plot

In [52]:

plot(myres$log2FoldChange,-log10(myres$padj),pch=19,cex=0.3,xlab="Log2 FC",ylab="-log10(BH Adjusted P-value)")

../../_images/Statistics_08032017_DESeq2-Notebook-introduction_118_0.png

Extract results for genes GeneID:12932226 and GeneID:12930116

In [53]:

myres[c("gene3317","gene10"),]

log2 fold change (MAP): trt 8 vs 7
Wald test p-value: trt 8 vs 7
DataFrame with 2 rows and 7 columns
           baseMean log2FoldChange     lfcSE       stat       pvalue         padj        pval
          <numeric>      <numeric> <numeric>  <numeric>    <numeric>    <numeric> <character>
gene3317 322.392539      5.0435248 0.2597286 19.4184401 5.390243e-84 5.127469e-81  < 2.22e-16
gene10     3.611165      0.5992853 0.6890620  0.8697117 3.844580e-01 5.526493e-01  0.38445797

Exercise: Annotate the hits with adjusted P-values < 0.05 and absolute log2 FC greater than 2 in red

In [54]:

plot(myres$log2FoldChange,-log10(myres$padj),pch=19,cex=0.3,xlab="Log2 FC",ylab="-log10(BH Adjusted P-value)",col=ifelse(myres$padj<0.05&abs(myres$log2FoldChange)>2,"red","black"))

../../_images/Statistics_08032017_DESeq2-Notebook-introduction_122_0.png

Converting/Normalizing Counts to “Expressions”¶

Normalized Counts¶

We have already shown how to “normalize” the counts using the estimated size factors

In [55]:

head(counts(dds,normalize=TRUE),3)

	7A_E	7B_E	7C_E	8A_E	8B_E	8C_E
gene0	11.046885	13.949554	19.331230	19.198464	6.946801	7.906535
gene1	132.5626	138.3331	157.7021	176.4430	142.4094	154.9681
gene10	3.682295	2.324926	2.034866	5.485275	2.605051	5.534575

Plot the counts stratified by treatment for the 2nd gene

In [56]:

 plotCounts(dds, 2,intgroup="trt")

../../_images/Statistics_08032017_DESeq2-Notebook-introduction_128_0.png

Or alternatively (better)

In [57]:

 plotCounts(dds, "gene10",intgroup="trt")

../../_images/Statistics_08032017_DESeq2-Notebook-introduction_130_0.png

Now get this plot for the top hit

In [58]:

plotCounts(dds, "gene1209",intgroup="trt")

../../_images/Statistics_08032017_DESeq2-Notebook-introduction_132_0.png

FPM¶

Another approach is to FPM: fragments per million mapped fragments

In [59]:

head(fpm(dds),3)

	7A_E	7B_E	7C_E	8A_E	8B_E	8C_E
gene0	7.563666	9.551087	13.235854	13.144951	4.756389	5.413507
gene1	90.76399	94.71495	107.97671	120.80836	97.50598	106.10474
gene10	2.521222	1.591848	1.393248	3.755700	1.783646	3.789455

Let’s calculate the FPM manually. For gene \(i\) sample \(j\), the FPM is defined as \(\frac{K_{ij}}{D_j}\times 10^{6}\) where \(D_j=\sum_{i=1} K_{ij}\) is the read depth for sample \(j\). First get the read depth for each sample

In [60]:

D<-colSums(counts(dds))
D

7A_E: 1224153
7B_E: 1294576
7C_E: 1493765
8A_E: 1492923
8B_E: 1581586
8C_E: 1736478

By default, the fpm() function uses a robust approach. We will disable this right now as to replicate the standard FPM. Let’s look at gene 1

In [61]:

fpm1<-fpm(dds,robust=FALSE)[1,]
fpm1

7A_E: 7.35202217369888
7B_E: 9.26944420412552
7C_E: 12.7195375443929
8A_E: 14.0663651105918
8B_E: 5.05821371711687
8C_E: 5.75878300790451

Now get the raw counts for gene 1

In [62]:

cnt1<-counts(dds)[1,]
cnt1

7A_E: 9
7B_E: 12
7C_E: 19
8A_E: 21
8B_E: 8
8C_E: 10

Now calculate the FPM for gene 1

In [63]:

myfpm1<-cnt1/D*1e6
myfpm1

7A_E: 7.35202217369888
7B_E: 9.26944420412552
7C_E: 12.7195375443929
8A_E: 14.0663651105918
8B_E: 5.05821371711687
8C_E: 5.75878300790451

This is how you check if two numeric columns are “equal”? One approach is to calculate the maximum absoute difference

In [64]:

max(abs(fpm1-myfpm1))

0

The above approach is also helpful in establishing if the difference is “small”. Another approach to test for equality to use the all.equal() function

In [65]:

all.equal(fpm1,myfpm1)

TRUE

It is generally a bad idea to compare numeric vectors using == (e.g., fpm1==myfpm1)

FPKM¶

To calculate the FPKM (fragments per kilobase per million mapped fragments) we need to add annotation to assign the feature lengths. More specifically, for gene \(i\) sample \(j\), the FPKM is defined as \(\frac{K_{ij}}{\ell_i D_j}\times 10^3 \times 10^{6}\) where \(\ell_i\) is the “length” of gene \(i\) (fragments for each \(10^3\) bases in the gene for every \(\frac{D_j}{10^6}\) fragments. More on this later.

Regularized log transformation¶

The regularized log transform can be obtained using the rlog() function. Note that an important argument for this function is blind (TRUE by default). The default “blinds” the normalization to the design. This is very important so as to not bias the analyses (e.g. class discovery)

In [66]:

rld<-rlog(dds,blind=TRUE)

Hierarchical clustering using rlog transformation

In [67]:

dists<-dist(t(assay(rld)))
plot(hclust(dists))

../../_images/Statistics_08032017_DESeq2-Notebook-introduction_155_0.png

PC Analysis using the rlog transformation

In [68]:

 plotPCA(rld,intgroup="trt")

Data type cannot be displayed:

../../_images/Statistics_08032017_DESeq2-Notebook-introduction_157_1.png

Variance Stabilizing Transformation (vst) and mean-variance modelling at the observational level (voom)¶

Two other normalization approaches for RNA-Seq count data are provided by the functions DESeq2::vst and limma::voom (note that for the latter one needs the limma package).

From ? DESeq2::vst “This function calculates a variance stabilizing transformation (VST) from the fitted dispersion-mean relation(s) and then transforms the count data (normalized by division by the size factors or normalization factors), yielding a matrix of values which are now approximately homoskedastic (having constant variance along the range of mean values). The transformation also normalizes with respect to library size.”

Compared to DESeq2::rlog “The ‘rlog’ is less sensitive to size factors, which can be an issue when size factors vary widely. These transformations are useful when checking for outliers or as input for machine learning techniques such as clustering or linear discriminant analysis.”

From ? limma::voom “Transform count data to log2-counts per million (logCPM), estimate the mean-variance relationship and use this to compute appropriate observation-level weights. The data are then ready for linear modelling.”

Get VST transformation

In [69]:

VST<-vst(dds)
class(VST)

'DESeqTransform'

Get the VST matrix

In [70]:

VSTmat<-assay(VST)
dim(VSTmat)
VSTmat[1:10,]

4444
6

	7A_E	7B_E	7C_E	8A_E	8B_E	8C_E
gene0	3.923730	4.180475	4.559309	4.551088	3.452826	3.578618
gene1	7.097850	7.157394	7.340988	7.498825	7.198022	7.316447
gene10	2.901962	2.572007	2.486853	3.234850	2.648356	3.242828
gene100	5.077572	4.083261	4.843068	4.809872	4.964222	4.998826
gene1000	4.143823	3.978753	3.607448	3.077093	2.856875	3.108463
gene1001	2.203473	2.859947	2.760425	3.077093	2.856875	3.108463
gene1002	5.446930	5.335202	4.736260	4.945204	4.175947	4.678644
gene1003	6.680922	6.914822	6.352627	6.582342	6.552185	6.574850
gene1004	2.203473	1.155868	2.113095	2.064865	1.155868	2.003037
gene1005	10.071462	10.519630	9.903560	8.758925	9.039811	8.859864

Get voom transformation (note that according to ? limma::voom, the function is expecting raw counts

In [71]:

VOOM<-limma::voom(counts(dds))

Get the VOOM matrix

In [72]:

VOOMmat<-VOOM$E
dim(VOOMmat)
VOOMmat[1:10,]

4444
6

	7A_E	7B_E	7C_E	8A_E	8B_E	8C_E
gene0	2.956142	3.271375	3.706448	3.848124	2.426090	2.596152
gene1	6.469766	6.528386	6.701817	7.018049	6.700571	6.822221
gene10	1.5155699	0.9494473	0.7429739	2.1222990	1.1459820	2.1107256
gene100	4.322925	3.151081	4.035756	4.149780	4.315907	4.353582
gene1000	3.231777	3.019837	2.508509	1.881291	1.508552	1.904275
gene1001	0.2931774	1.4348741	1.2284008	1.8812909	1.5085521	1.9042748
gene1002	4.730583	4.604799	3.912899	4.304502	3.383021	3.985195
gene1003	6.039132	6.278571	5.678434	6.072911	6.032114	6.055584
gene1004	0.293177439	-1.372480778	0.006008335	0.006821776	-1.661372918	-0.211202463
gene1005	9.478879	9.926155	9.293721	8.296841	8.571048	8.389949

Get Session Information¶

In [73]:

sessionInfo()

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 [1] tools     parallel  stats4    stats     graphics  grDevices utils     datasets  methods
[10] base

other attached packages:
 [1] dplyr_0.5.0                qvalue_2.6.0               limma_3.30.13
 [4] DESeq2_1.14.1              SummarizedExperiment_1.4.0 Biobase_2.34.0
 [7] GenomicRanges_1.26.4       GenomeInfoDb_1.10.3        IRanges_2.8.2
[10] S4Vectors_0.12.2           BiocGenerics_0.20.0

loaded via a namespace (and not attached):
 [1] locfit_1.5-9.1       Rcpp_0.12.8          lattice_0.20-34      assertthat_0.1
 [5] digest_0.6.10        IRdisplay_0.4.3      R6_2.2.0             plyr_1.8.4
 [9] repr_0.7             backports_1.1.0      acepack_1.4.1        RSQLite_1.0.0
[13] evaluate_0.10        ggplot2_2.2.1        zlibbioc_1.20.0      lazyeval_0.2.0
[17] uuid_0.1-2           data.table_1.10.4    annotate_1.52.1      rpart_4.1-10
[21] Matrix_1.2-7.1       checkmate_1.8.3      labeling_0.3         splines_3.3.1
[25] BiocParallel_1.8.2   geneplotter_1.52.0   stringr_1.0.0        foreign_0.8-67
[29] htmlwidgets_0.9      RCurl_1.95-4.8       munsell_0.4.3        base64enc_0.1-3
[33] htmltools_0.3.5      nnet_7.3-12          tibble_1.2           gridExtra_2.2.1
[37] htmlTable_1.9        Hmisc_4.0-3          XML_3.98-1.9         crayon_1.3.1
[41] bitops_1.0-6         grid_3.3.1           jsonlite_1.1         xtable_1.8-2
[45] gtable_0.2.0         DBI_0.5-1            magrittr_1.5         scales_0.4.1
[49] stringi_1.1.2        reshape2_1.4.2       XVector_0.14.1       genefilter_1.56.0
[53] latticeExtra_0.6-28  IRkernel_0.7         Formula_1.2-2        RColorBrewer_1.1-2
[57] survival_2.41-3      AnnotationDbi_1.36.2 colorspace_1.3-1     cluster_2.0.5
[61] memoise_1.0.0        pbdZMQ_0.2-3         knitr_1.15.1

In [ ]:

Introduction to `DESeq2`¶

Load packages¶

Import Data¶

Inspect object¶

Slots of an S4 class¶

Estimate Size Factors and Dispersion Parameters¶

Size Factors¶

Dispersion Parameters¶

Differential Expression Analysis¶

P-values¶

Subset and reorder the results¶

Converting/Normalizing Counts to “Expressions”¶

Normalized Counts¶

FPM¶

FPKM¶

Regularized log transformation¶

Variance Stabilizing Transformation (vst) and mean-variance modelling at the observational level (voom)¶

Get Session Information¶

Page contents

Previous page

Next page

This Page

Introduction to DESeq2¶

Load packages¶

Import Data¶

Inspect object¶

Slots of an S4 class¶

Estimate Size Factors and Dispersion Parameters¶

Size Factors¶

Dispersion Parameters¶

Differential Expression Analysis¶

P-values¶

Subset and reorder the results¶

Converting/Normalizing Counts to “Expressions”¶

Normalized Counts¶

FPM¶

FPKM¶

Regularized log transformation¶

Variance Stabilizing Transformation (vst) and mean-variance modelling at the observational level (voom)¶

Get Session Information¶

Introduction to `DESeq2`¶