Import count data using basic tools¶

We want to read in the count files using more basic tools (important if the count files are not generated fro,m htseq-count). We will import the files as matrices. We start with some of the preliminary steps

In [1]:

library(tools)

In [2]:

phfile="sampletable.txt"
cntdir="COUNTS"

In [3]:

md5sum(phfile)
phdata=read.table(phfile,sep=",",stringsAsFactor=FALSE)

Out[3]:

sampletable.txt: '10cdaaa6e5fc76ab92b6d845d33a2a2f'

In [4]:

colnames(phdata)=c("filename","sampid","trt")
phdata[["md5sum"]]=md5sum(file.path(cntdir,phdata[["filename"]]))
phdata

Out[4]:

	filename	sampid	trt	md5sum
1	AGTCAA_counts.tsv	AGTCAA	0	a9eaa959aba1b02b3831583c2a9751c8
2	AGTTCC_counts.tsv	AGTTCC	0	4183767e4eeb75dc582bcf438af13500
3	ATGTCA_counts.tsv	ATGTCA	0	26fbba06520758e5a3acd9bd432ebed4
4	CCGTCC_counts.tsv	CCGTCC	1	50036a88fd48645f740a31f4f4352cfb
5	GTCCGC_counts.tsv	GTCCGC	1	bb1cecd886127159157e9431d072cad5
6	GTGAAA_counts.tsv	GTGAAA	1	fa544c0a076eedb54937c7189f4e1fbc

In [5]:

phdata=phdata[c("sampid","filename","trt","md5sum")]
phdata

Out[5]:

	sampid	filename	trt	md5sum
1	AGTCAA	AGTCAA_counts.tsv	0	a9eaa959aba1b02b3831583c2a9751c8
2	AGTTCC	AGTTCC_counts.tsv	0	4183767e4eeb75dc582bcf438af13500
3	ATGTCA	ATGTCA_counts.tsv	0	26fbba06520758e5a3acd9bd432ebed4
4	CCGTCC	CCGTCC_counts.tsv	1	50036a88fd48645f740a31f4f4352cfb
5	GTCCGC	GTCCGC_counts.tsv	1	bb1cecd886127159157e9431d072cad5
6	GTGAAA	GTGAAA_counts.tsv	1	fa544c0a076eedb54937c7189f4e1fbc

In [6]:

phdata[["trt"]]=as.factor(phdata[["trt"]])

This function reads the data from one file and returns the results as a matrix. It is designed to only keep the rows for which the gene id starts with the prefix “GeneID”

In [7]:

readcnts=function(fname,sep="\t",prefix="GeneID",collab="V1",header=FALSE)
    {
        ### Import text file
        dat=read.table(fname,sep=sep,header=header,stringsAsFactor=FALSE)
        ### Only keep rows for which the gene id matches with the prefix
        dat=dat[substr(dat[[collab]],1,nchar(prefix))==prefix,]
        return(dat)
    }

Let’s read in a file

In [8]:

file1=readcnts("COUNTS/AGTCAA_counts.tsv")

In [9]:

dim(file1)

Out[9]:

4436
2

In [10]:

head(file1)

Out[10]:

	V1	V2
1	GeneID:12930114	118
2	GeneID:12930115	30
3	GeneID:12930116	15
4	GeneID:12930117	12
5	GeneID:12930118	122
6	GeneID:12930119	60

Now read in all files

In [11]:

## Read in the first file
countdat=readcnts(file.path(cntdir,phdata$filename[1]))
### Assign the sample id as the column name
names(countdat)[2]=phdata$sampid[1]
### Repeat the last two steps for files 2 ... 6 and merge
### along each step
for(i in 2:nrow(phdata))
    {
        dat2=readcnts(file.path(cntdir,phdata$filename[i]))
        names(dat2)[2]=phdata$sampid[i]
        countdat=merge(countdat,dat2,by="V1")
    }

geneid=countdat$V1
countdat=as.matrix(countdat[,-1])
row.names(countdat)=geneid

In [12]:

head(countdat)

Out[12]:

	AGTCAA	AGTTCC	ATGTCA	CCGTCC	GTCCGC	GTGAAA
GeneID:12930114	118	137	149	120	161	174
GeneID:12930115	30	42	25	18	32	34
GeneID:12930116	15	55	37	49	36	27
GeneID:12930117	12	12	13	11	7	6
GeneID:12930118	122	137	94	48	131	69
GeneID:12930119	60	88	78	53	43	29

In [13]:

library(DESeq2)

Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, as.vector, cbind, colnames,
    do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl,
    intersect, is.unsorted, lapply, lengths, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unlist, unsplit

Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: Rcpp
Loading required package: RcppArmadillo

In [14]:

dds=DESeqDataSetFromMatrix(countdat,DataFrame(phdata),design=~trt)

In [15]:

dds

Out[15]:

class: DESeqDataSet
dim: 4436 6
metadata(0):
assays(1): counts
rownames(4436): GeneID:12930114 GeneID:12930115 ... GeneID:13406005
  GeneID:13406006
rowRanges metadata column names(0):
colnames(6): AGTCAA AGTTCC ... GTCCGC GTGAAA
colData names(4): sampid filename trt md5sum

In [16]:

sessionInfo()

Out[16]:

R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.3 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] parallel  stats4    tools     stats     graphics  grDevices utils
 [8] datasets  methods   base

other attached packages:
 [1] DESeq2_1.10.1              RcppArmadillo_0.6.500.4.0
 [3] Rcpp_0.12.3                SummarizedExperiment_1.0.2
 [5] Biobase_2.30.0             GenomicRanges_1.22.4
 [7] GenomeInfoDb_1.6.3         IRanges_2.4.6
 [9] S4Vectors_0.8.11           BiocGenerics_0.16.1

loaded via a namespace (and not attached):
 [1] RColorBrewer_1.1-2   plyr_1.8.3           futile.logger_1.4.1
 [4] XVector_0.10.0       base64enc_0.1-3      futile.options_1.0.0
 [7] zlibbioc_1.16.0      rpart_4.1-10         digest_0.6.9
[10] uuid_0.1-2           RSQLite_1.0.0        annotate_1.48.0
[13] jsonlite_0.9.19      evaluate_0.8         gtable_0.1.2
[16] lattice_0.20-33      DBI_0.3.1            IRdisplay_0.3
[19] IRkernel_0.5         gridExtra_2.0.0      rzmq_0.7.7
[22] genefilter_1.52.1    cluster_2.0.3        repr_0.4
[25] stringr_1.0.0        locfit_1.5-9.1       nnet_7.3-12
[28] grid_3.2.3           AnnotationDbi_1.32.3 XML_3.98-1.3
[31] survival_2.38-3      BiocParallel_1.4.3   foreign_0.8-66
[34] latticeExtra_0.6-26  Formula_1.2-1        geneplotter_1.48.0
[37] ggplot2_2.0.0        lambda.r_1.1.7       magrittr_1.5
[40] Hmisc_3.17-1         scales_0.3.0         splines_3.2.3
[43] xtable_1.8-2         colorspace_1.2-6     stringi_1.0-1
[46] acepack_1.3-3.3      munsell_0.4.2

In [ ]:

Import count data using basic tools¶

Previous page

This Page