HTS Course: DESeq2 Analysis Outline for 2017 Pilot Data¶
Data Import¶
Load packages
In [7]:
library(tidyverse)
library(tools)
library(DESeq2)
In [39]:
cntdir<-"/home/jovyan/work/2017-HTS-materials/Data_Info_and_Results/2017_HTS/counts/HTS_2017_pilot"
cntfiles<-list.files(cntdir)
cntdf<-tibble(cntfiles,md5sum=tools::md5sum(file.path(cntdir,cntfiles)))
cntdf
cntfiles | md5sum | |
---|---|---|
1 | Pst-kb1_S1.csv | 7cab95ffd43eccabd4bff0f1e6b4ac5f |
2 | Pst-kb2_S2.csv | 59633fe06424e54937c34a56c5e24477 |
3 | Pst-kb3_S3.csv | 2b576a5c270e7638b5072fa3776ecc83 |
4 | Pst-mm1_S4.csv | 072a43688309a087125c565c9f41cf54 |
5 | Pst-mm3_S5.csv | 3e255f6af17a313b6839a1f4f2595fdc |
6 | Pst-mm4_S6.csv | dcac1408050aa686c7350d6a9b70d737 |
Identify the csv file containing the meta data
In [17]:
metafile<-"/home/jovyan/work/2017-HTS-materials/Data_Info_and_Results/2017_HTS/info/full_metadata.csv"
tools::md5sum(metafile)
/home/jovyan/work/2017-HTS-materials/Data_Info_and_Results/2017_HTS/info/full_metadata.csv: 'c2dddfedfd9107615ddf0d62f0c6453e'
Import the csv file
In [18]:
metadata<-read_csv(metafile)
Parsed with column specification:
cols(
library_num = col_integer(),
miseq_id = col_character(),
miseq_filename = col_character(),
nextseq_id = col_character(),
nextseq_filename = col_character(),
group = col_character(),
genotype = col_character(),
rna_prep = col_character(),
rep = col_double(),
media = col_character(),
rin = col_double(),
input_sample = col_character(),
i7_index_id = col_character(),
index = col_character(),
notes = col_character()
)
Check the dimension annd first 6 rows of the of the imported file
In [4]:
dim(metadata)
- 54
- 15
In [5]:
head(metadata)
library_num | miseq_id | miseq_filename | nextseq_id | nextseq_filename | group | genotype | rna_prep | rep | media | rin | input_sample | i7_index_id | index | notes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1_A | 1-A_S1.csv | JG-1 | JG-L1_S1.csv | A | Pst | new | 1 | kb | 10 | Pst-kb1-new | P49-E1 | AAGACCGT | NA |
2 | 2 | 2_A | 2-A_S2.csv | JG-2 | JG-L2_S2.csv | A | Pst | new | 2 | kb | 10 | Pst-kb2-new | P50-E2 | TTGCGAGA | NA |
3 | 3 | 3_A | 3-A_S3.csv | JG-3 | JG-L3_S3.csv | A | Pst | new | 3 | kb | 10 | Pst-kb3-new | P51-E3 | GCAATTCC | NA |
4 | 4 | 4_A | 4-A_S4.csv | JG-4 | JG-L4_S4.csv | A | Pst | new | 1 | mm | 10 | Pst-mm1-new | P52-E4 | GAATCCGT | NA |
5 | 5 | 5_A | 5-A_S5.csv | JG-5 | JG-L5_S5.csv | A | Pst | new | 2 | mm | 10 | Pst-mm2-new | P53-E5 | CCGCTTAA | NA |
6 | 6 | 6_A | 6-A_S6.csv | JG-6 | JG-L6_S6.csv | A | Pst | new | 3 | mm | 10 | Pst-mm3-new | P54-E6 | TACCTGCA | NA |
The group column can be used to subset the metadata
In [19]:
table(metadata$group,exclude=NULL)
A B C D E F G H pilot <NA>
6 6 6 6 6 6 6 6 6 0
We will look at the Pilot data
In [20]:
mydata <- metadata %>% filter(group=="pilot")
In [9]:
mydata
library_num | miseq_id | miseq_filename | nextseq_id | nextseq_filename | group | genotype | rna_prep | rep | media | rin | input_sample | i7_index_id | index | notes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 49 | Pst-kb1_S1 | Pst-kb1_S1.csv | NA | NA | pilot | Pst | old | 1 | kb | 8.4 | Pst-kb1-old | P1-A1 | TTACCGAC | NA |
2 | 50 | Pst-kb2_S2 | Pst-kb2_S2.csv | NA | NA | pilot | Pst | old | 2 | kb | 9 | Pst-kb2-old | P2-A2 | AGTGACCT | NA |
3 | 51 | Pst-kb3_S3 | Pst-kb3_S3.csv | NA | NA | pilot | Pst | old | 3 | kb | 8.7 | Pst-kb3-old | P3-A3 | TCGGATTC | NA |
4 | 52 | Pst-mm1_S4 | Pst-mm1_S4.csv | NA | NA | pilot | Pst | old | 1 | mm | 7.9 | Pst-mm1-old | P4-A4 | CAAGGTAC | NA |
5 | 53 | Pst-mm3_S5 | Pst-mm3_S5.csv | NA | NA | pilot | Pst | old | 3 | mm | 6.3 | Pst-mm3-old | P5-A5 | TCCTCATG | NA |
6 | 54 | Pst-mm4_S6 | Pst-mm4_S6.csv | NA | NA | pilot | Pst | old | 4 | mm | 7.7 | Pst-mm4-old | P6-A6 | GTCAGTCA | NA |
To create a sample data frame for DESeq2, be sure that the first and second columns are the sample and file name respectively
In [ ]:
In [46]:
sampledata<- mydata %>% transmute(sampname=input_sample,
fname=miseq_filename,
trt=as.factor(media),
genotype,
group,
rna_prep,md5=md5sum(file.path(cntdir,miseq_filename))) %>% as.data.frame
In [32]:
sampledata
sampname | fname | trt | genotype | group | rna_prep | md5 | |
---|---|---|---|---|---|---|---|
1 | Pst-kb1-old | Pst-kb1_S1.csv | kb | Pst | pilot | old | 7cab95ffd43eccabd4bff0f1e6b4ac5f |
2 | Pst-kb2-old | Pst-kb2_S2.csv | kb | Pst | pilot | old | 59633fe06424e54937c34a56c5e24477 |
3 | Pst-kb3-old | Pst-kb3_S3.csv | kb | Pst | pilot | old | 2b576a5c270e7638b5072fa3776ecc83 |
4 | Pst-mm1-old | Pst-mm1_S4.csv | mm | Pst | pilot | old | 072a43688309a087125c565c9f41cf54 |
5 | Pst-mm3-old | Pst-mm3_S5.csv | mm | Pst | pilot | old | 3e255f6af17a313b6839a1f4f2595fdc |
6 | Pst-mm4-old | Pst-mm4_S6.csv | mm | Pst | pilot | old | dcac1408050aa686c7350d6a9b70d737 |
Overview of analysis¶
Import data as a DESeqData object
In [50]:
dds<-DESeqDataSetFromHTSeqCount(sampleTable = sampledata, directory =cntdir, design=~trt )
Estimate and inspect size factors for the libraries
In [51]:
dds<-estimateSizeFactors(dds)
In [52]:
sizeFactors(dds)
- Pst-kb1-old
- 1.55592063666617
- Pst-kb2-old
- 1.3861043989566
- Pst-kb3-old
- 1.03775079182795
- Pst-mm1-old
- 0.50339693701456
- Pst-mm3-old
- 1.45624118576326
- Pst-mm4-old
- 0.677278949795447
Estimate gene specific dispersions
In [53]:
dds<-estimateDispersions(dds)
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
Conduct differential expression analysis
In [54]:
ddsDE<-DESeq(dds)
using pre-existing size factors
estimating dispersions
found already estimated dispersions, replacing these
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
In [59]:
myres<-results(ddsDE)
myres
log2 fold change (MAP): trt mm vs kb
Wald test p-value: trt mm vs kb
DataFrame with 5842 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
PSPTOA0001 51.383556 1.1074926 0.2955247 3.7475462 0.000178573
PSPTOA0002 15.437549 0.4056040 0.4615707 0.8787473 0.379538281
PSPTOA0003 5.172905 -0.5142773 0.6847216 -0.7510751 0.452607434
PSPTOA0004 35.440383 0.9753153 0.3390883 2.8762869 0.004023838
PSPTOA0005 12.558815 0.5673048 0.4668771 1.2151051 0.224326014
... ... ... ... ... ...
PSPTO_t59 0.3895036 0.57087886 1.1528334 0.49519631 6.204616e-01
PSPTO_t60 1.9232968 -1.93855730 1.0311783 -1.87994381 6.011574e-02
PSPTO_t61 0.1606037 -0.06597916 1.0382954 -0.06354566 9.493320e-01
PSPTO_t62 19.3487956 -3.66099758 0.6020517 -6.08086929 1.195327e-09
PSPTO_t63 19.2699434 -3.15820294 0.5531821 -5.70915614 1.135377e-08
padj
<numeric>
PSPTOA0001 0.001019702
PSPTOA0002 0.519773470
PSPTOA0003 0.588250598
PSPTOA0004 0.014614985
PSPTOA0005 0.353660766
... ...
PSPTO_t59 NA
PSPTO_t60 NA
PSPTO_t61 NA
PSPTO_t62 1.535772e-08
PSPTO_t63 1.303432e-07
Graphical over of the results
In [61]:
plot(myres$log2FoldChange,-log10(myres$padj),pch=19,cex=0.3,xlab="Log2 FC",ylab="-log10(BH Adjusted P-value)",col=ifelse(myres$padj<0.05&abs(myres$log2FoldChange)>2,"red","black"))

In [56]:
sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] tools parallel stats4 stats graphics grDevices utils
[8] datasets methods base
other attached packages:
[1] DESeq2_1.14.1 SummarizedExperiment_1.4.0
[3] Biobase_2.34.0 GenomicRanges_1.26.4
[5] GenomeInfoDb_1.10.3 IRanges_2.8.2
[7] S4Vectors_0.12.2 BiocGenerics_0.20.0
[9] dplyr_0.5.0 purrr_0.2.2
[11] readr_1.0.0 tidyr_0.6.0
[13] tibble_1.2 ggplot2_2.2.1
[15] tidyverse_1.0.0
loaded via a namespace (and not attached):
[1] locfit_1.5-9.1 Rcpp_0.12.8 lattice_0.20-34
[4] assertthat_0.1 digest_0.6.10 IRdisplay_0.4.3
[7] R6_2.2.0 plyr_1.8.4 repr_0.7
[10] backports_1.1.0 acepack_1.4.1 RSQLite_1.0.0
[13] evaluate_0.10 zlibbioc_1.20.0 lazyeval_0.2.0
[16] uuid_0.1-2 annotate_1.52.1 data.table_1.10.4
[19] rpart_4.1-10 Matrix_1.2-7.1 checkmate_1.8.3
[22] splines_3.3.1 BiocParallel_1.8.2 geneplotter_1.52.0
[25] stringr_1.0.0 foreign_0.8-67 htmlwidgets_0.9
[28] RCurl_1.95-4.8 munsell_0.4.3 base64enc_0.1-3
[31] htmltools_0.3.5 nnet_7.3-12 gridExtra_2.2.1
[34] htmlTable_1.9 Hmisc_4.0-3 XML_3.98-1.9
[37] crayon_1.3.1 bitops_1.0-6 grid_3.3.1
[40] xtable_1.8-2 jsonlite_1.1 gtable_0.2.0
[43] DBI_0.5-1 magrittr_1.5 scales_0.4.1
[46] stringi_1.1.2 genefilter_1.56.0 XVector_0.14.1
[49] latticeExtra_0.6-28 IRkernel_0.7 Formula_1.2-2
[52] RColorBrewer_1.1-2 survival_2.41-3 AnnotationDbi_1.36.2
[55] colorspace_1.3-1 cluster_2.0.5 memoise_1.0.0
[58] pbdZMQ_0.2-3 knitr_1.15.1
In [ ]: