Import count data using basic tools¶
We want to read in the count files using more basic tools (important if the count files are not generated fro,m htseq-count). We will import the files as matrices. We start with some of the preliminary steps
In [1]:
library(tools)
In [2]:
phfile="sampletable.txt"
cntdir="COUNTS"
In [3]:
md5sum(phfile)
phdata=read.table(phfile,sep=",",stringsAsFactor=FALSE)
Out[3]:
sampletable.txt: '10cdaaa6e5fc76ab92b6d845d33a2a2f'
In [4]:
colnames(phdata)=c("filename","sampid","trt")
phdata[["md5sum"]]=md5sum(file.path(cntdir,phdata[["filename"]]))
phdata
Out[4]:
filename | sampid | trt | md5sum | |
---|---|---|---|---|
1 | AGTCAA_counts.tsv | AGTCAA | 0 | a9eaa959aba1b02b3831583c2a9751c8 |
2 | AGTTCC_counts.tsv | AGTTCC | 0 | 4183767e4eeb75dc582bcf438af13500 |
3 | ATGTCA_counts.tsv | ATGTCA | 0 | 26fbba06520758e5a3acd9bd432ebed4 |
4 | CCGTCC_counts.tsv | CCGTCC | 1 | 50036a88fd48645f740a31f4f4352cfb |
5 | GTCCGC_counts.tsv | GTCCGC | 1 | bb1cecd886127159157e9431d072cad5 |
6 | GTGAAA_counts.tsv | GTGAAA | 1 | fa544c0a076eedb54937c7189f4e1fbc |
In [5]:
phdata=phdata[c("sampid","filename","trt","md5sum")]
phdata
Out[5]:
sampid | filename | trt | md5sum | |
---|---|---|---|---|
1 | AGTCAA | AGTCAA_counts.tsv | 0 | a9eaa959aba1b02b3831583c2a9751c8 |
2 | AGTTCC | AGTTCC_counts.tsv | 0 | 4183767e4eeb75dc582bcf438af13500 |
3 | ATGTCA | ATGTCA_counts.tsv | 0 | 26fbba06520758e5a3acd9bd432ebed4 |
4 | CCGTCC | CCGTCC_counts.tsv | 1 | 50036a88fd48645f740a31f4f4352cfb |
5 | GTCCGC | GTCCGC_counts.tsv | 1 | bb1cecd886127159157e9431d072cad5 |
6 | GTGAAA | GTGAAA_counts.tsv | 1 | fa544c0a076eedb54937c7189f4e1fbc |
In [6]:
phdata[["trt"]]=as.factor(phdata[["trt"]])
This function reads the data from one file and returns the results as a matrix. It is designed to only keep the rows for which the gene id starts with the prefix “GeneID”
In [7]:
readcnts=function(fname,sep="\t",prefix="GeneID",collab="V1",header=FALSE)
{
### Import text file
dat=read.table(fname,sep=sep,header=header,stringsAsFactor=FALSE)
### Only keep rows for which the gene id matches with the prefix
dat=dat[substr(dat[[collab]],1,nchar(prefix))==prefix,]
return(dat)
}
Let’s read in a file
In [8]:
file1=readcnts("COUNTS/AGTCAA_counts.tsv")
In [9]:
dim(file1)
Out[9]:
- 4436
- 2
In [10]:
head(file1)
Out[10]:
V1 | V2 | |
---|---|---|
1 | GeneID:12930114 | 118 |
2 | GeneID:12930115 | 30 |
3 | GeneID:12930116 | 15 |
4 | GeneID:12930117 | 12 |
5 | GeneID:12930118 | 122 |
6 | GeneID:12930119 | 60 |
Now read in all files
In [11]:
## Read in the first file
countdat=readcnts(file.path(cntdir,phdata$filename[1]))
### Assign the sample id as the column name
names(countdat)[2]=phdata$sampid[1]
### Repeat the last two steps for files 2 ... 6 and merge
### along each step
for(i in 2:nrow(phdata))
{
dat2=readcnts(file.path(cntdir,phdata$filename[i]))
names(dat2)[2]=phdata$sampid[i]
countdat=merge(countdat,dat2,by="V1")
}
geneid=countdat$V1
countdat=as.matrix(countdat[,-1])
row.names(countdat)=geneid
In [12]:
head(countdat)
Out[12]:
AGTCAA | AGTTCC | ATGTCA | CCGTCC | GTCCGC | GTGAAA | |
---|---|---|---|---|---|---|
GeneID:12930114 | 118 | 137 | 149 | 120 | 161 | 174 |
GeneID:12930115 | 30 | 42 | 25 | 18 | 32 | 34 |
GeneID:12930116 | 15 | 55 | 37 | 49 | 36 | 27 |
GeneID:12930117 | 12 | 12 | 13 | 11 | 7 | 6 |
GeneID:12930118 | 122 | 137 | 94 | 48 | 131 | 69 |
GeneID:12930119 | 60 | 88 | 78 | 53 | 43 | 29 |
In [13]:
library(DESeq2)
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:parallel’:
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from ‘package:stats’:
IQR, mad, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, append, as.data.frame, as.vector, cbind, colnames,
do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl,
intersect, is.unsorted, lapply, lengths, Map, mapply, match, mget,
order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
union, unique, unlist, unsplit
Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: Rcpp
Loading required package: RcppArmadillo
In [14]:
dds=DESeqDataSetFromMatrix(countdat,DataFrame(phdata),design=~trt)
In [15]:
dds
Out[15]:
class: DESeqDataSet
dim: 4436 6
metadata(0):
assays(1): counts
rownames(4436): GeneID:12930114 GeneID:12930115 ... GeneID:13406005
GeneID:13406006
rowRanges metadata column names(0):
colnames(6): AGTCAA AGTTCC ... GTCCGC GTGAAA
colData names(4): sampid filename trt md5sum
In [16]:
sessionInfo()
Out[16]:
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.3 (El Capitan)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 tools stats graphics grDevices utils
[8] datasets methods base
other attached packages:
[1] DESeq2_1.10.1 RcppArmadillo_0.6.500.4.0
[3] Rcpp_0.12.3 SummarizedExperiment_1.0.2
[5] Biobase_2.30.0 GenomicRanges_1.22.4
[7] GenomeInfoDb_1.6.3 IRanges_2.4.6
[9] S4Vectors_0.8.11 BiocGenerics_0.16.1
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-2 plyr_1.8.3 futile.logger_1.4.1
[4] XVector_0.10.0 base64enc_0.1-3 futile.options_1.0.0
[7] zlibbioc_1.16.0 rpart_4.1-10 digest_0.6.9
[10] uuid_0.1-2 RSQLite_1.0.0 annotate_1.48.0
[13] jsonlite_0.9.19 evaluate_0.8 gtable_0.1.2
[16] lattice_0.20-33 DBI_0.3.1 IRdisplay_0.3
[19] IRkernel_0.5 gridExtra_2.0.0 rzmq_0.7.7
[22] genefilter_1.52.1 cluster_2.0.3 repr_0.4
[25] stringr_1.0.0 locfit_1.5-9.1 nnet_7.3-12
[28] grid_3.2.3 AnnotationDbi_1.32.3 XML_3.98-1.3
[31] survival_2.38-3 BiocParallel_1.4.3 foreign_0.8-66
[34] latticeExtra_0.6-26 Formula_1.2-1 geneplotter_1.48.0
[37] ggplot2_2.0.0 lambda.r_1.1.7 magrittr_1.5
[40] Hmisc_3.17-1 scales_0.3.0 splines_3.2.3
[43] xtable_1.8-2 colorspace_1.2-6 stringi_1.0-1
[46] acepack_1.3-3.3 munsell_0.4.2
In [ ]: