Set up environment¶
In [1]:
source("pilot_config.R")
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag(): dplyr, stats
Attaching package: ‘foreach’
The following objects are masked from ‘package:purrr’:
accumulate, when
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:parallel’:
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from ‘package:dplyr’:
combine, intersect, setdiff, union
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, append, as.data.frame, cbind, colMeans, colnames,
colSums, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
grepl, intersect, is.unsorted, lapply, lengths, Map, mapply, match,
mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
rbind, Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which, which.max, which.min
Attaching package: ‘S4Vectors’
The following objects are masked from ‘package:dplyr’:
first, rename
The following object is masked from ‘package:tidyr’:
expand
The following object is masked from ‘package:base’:
expand.grid
Loading required package: IRanges
Attaching package: ‘IRanges’
The following objects are masked from ‘package:dplyr’:
collapse, desc, slice
The following object is masked from ‘package:purrr’:
reduce
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: DelayedArray
Loading required package: matrixStats
Attaching package: ‘matrixStats’
The following objects are masked from ‘package:Biobase’:
anyMissing, rowMedians
The following object is masked from ‘package:dplyr’:
count
Attaching package: ‘DelayedArray’
The following objects are masked from ‘package:matrixStats’:
colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
The following object is masked from ‘package:base’:
apply
Attaching package: ‘limma’
The following object is masked from ‘package:DESeq2’:
plotMA
The following object is masked from ‘package:BiocGenerics’:
plotMA
Attaching package: ‘gridExtra’
The following object is masked from ‘package:Biobase’:
combine
The following object is masked from ‘package:BiocGenerics’:
combine
The following object is masked from ‘package:dplyr’:
combine
---------------------
Welcome to dendextend version 1.8.0
Type citation('dendextend') for how to cite the package.
Type browseVignettes(package = 'dendextend') for the package vignette.
The github page is: https://github.com/talgalili/dendextend/
Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
Or contact: <tal.galili@gmail.com>
To suppress this message use: suppressPackageStartupMessages(library(dendextend))
---------------------
Attaching package: ‘dendextend’
The following object is masked from ‘package:stats’:
cutree
Attaching package: ‘plotly’
The following object is masked from ‘package:IRanges’:
slice
The following object is masked from ‘package:S4Vectors’:
rename
The following object is masked from ‘package:ggplot2’:
last_plot
The following object is masked from ‘package:stats’:
filter
The following object is masked from ‘package:graphics’:
layout
In [2]:
# get files of all star output files
stardirs <- list.files(DATDIR)
Read in the count data output from STAR¶
There are 204 files under this folder. Each file is generated from a fastq file using STAR.
In [3]:
length(stardirs)
In [4]:
head(stardirs)
- '1_MA_J_S18_L001_ReadsPerGene.out.tab'
- '1_MA_J_S18_L002_ReadsPerGene.out.tab'
- '1_MA_J_S18_L003_ReadsPerGene.out.tab'
- '1_MA_J_S18_L004_ReadsPerGene.out.tab'
- '1_RZ_J_S26_L001_ReadsPerGene.out.tab'
- '1_RZ_J_S26_L002_ReadsPerGene.out.tab'
View the head of one file (1_MA_J_S18_L001_ReadsPerGene.out.tab). Notice that there are four columns and the columns we want is the first and the fourth columns.
In [5]:
cmdstr <- paste("head", file.path(DATDIR, stardirs[1]))
cmdout <- system(cmdstr, intern = TRUE)
str_split(cmdout, pattern = "\t")
- 'N_unmapped'
- '2690'
- '2690'
- '2690'
- 'N_multimapping'
- '66100'
- '66100'
- '66100'
- 'N_noFeature'
- '10626'
- '2238382'
- '20347'
- 'N_ambiguous'
- '173170'
- '1622'
- '647'
- 'CNAG_04548'
- '0'
- '0'
- '0'
- 'CNAG_07303'
- '0'
- '0'
- '0'
- 'CNAG_07304'
- '8'
- '0'
- '8'
- 'CNAG_00001'
- '0'
- '0'
- '0'
- 'CNAG_07305'
- '0'
- '0'
- '0'
- 'CNAG_00002'
- '66'
- '0'
- '66'
Construct a matrix that gathers all the count files¶
helper functions to read the count files
In [6]:
mycombine <- function(df1, df2) {
# Combine two data frames by gene names
#
# Args:
# df1 (Dataframe): the first count data
# df2 (Dataframe): the second count data
#
# Returns:
# (Dataframe) The combined data frame of df1 and df2
full_join(df1, df2, by = "gene")
}
myfile <- function(filedir, filename) {
# Get the absolute paths of a file
#
# Args:
# filedir (Character): the directory of the folder
# filename (Character): the filename
#
# Returns:
# (Character) the directory of the input file
file.path(filedir, filename)
}
# Data type for each column
coltypes <- list(col_character(), col_integer(), col_integer(), col_integer())
read the count files and combine them
In [7]:
out <- foreach(stardir = stardirs, .combine = mycombine) %do% {
# get a directory of each count file
cntfile <- myfile(DATDIR, stardir)
# read in the count file
readr::read_tsv(cntfile, col_names = FALSE, col_types = coltypes) %>%
dplyr::select(X1, X4) %>% # get the 1st and 4th columns
dplyr::rename_(.dots=setNames(names(.), c("gene",stardir)))
}
There are 205 columns (204 count files + 1 rowname)
In [8]:
dim(out)
- 8501
- 205
In [9]:
out[1:6, 1:6]
gene | 1_MA_J_S18_L001_ReadsPerGene.out.tab | 1_MA_J_S18_L002_ReadsPerGene.out.tab | 1_MA_J_S18_L003_ReadsPerGene.out.tab | 1_MA_J_S18_L004_ReadsPerGene.out.tab | 1_RZ_J_S26_L001_ReadsPerGene.out.tab |
---|---|---|---|---|---|
N_unmapped | 2690 | 2684 | 2672 | 2585 | 7218 |
N_multimapping | 66100 | 65234 | 66538 | 65066 | 395848 |
N_noFeature | 20347 | 20004 | 20549 | 20505 | 768146 |
N_ambiguous | 647 | 652 | 697 | 616 | 1431 |
CNAG_04548 | 0 | 0 | 0 | 1 | 0 |
CNAG_07303 | 0 | 0 | 0 | 0 | 0 |
Arrange the results from the count files¶
Separate the first four rows (-> nmisc) and others (-> genecounts)¶
We see that our count matrix contains both summarizing counts and
specific counts of each gene for each sample. Note that our ‘out’ matrix
is in biological format (i.e. the samples are the columns and the
variables are the rows). Let’s split this matrix up into two matrices:
nmisc
and genecounts
.
For nmisc
, we will take the first 4 rows of out
since those are
the summarizing features. Next, we want to transform the data frame so
that it is in statistical format (the samples are the rows and the
feature types are the columns). Using a combination of gather and
spread, we can transpose our matrix into the desired format.
In [10]:
### Gather and spread the first four rows
out %>%
dplyr::slice(1:4) %>%
gather(expid, value, -gene) %>%
spread(gene, value) %>%
rename_(.dots = setNames(names(.), c("expid", "namb", "nmulti", "nnofeat","nunmap"))) ->
nmisc
In [11]:
nmisc %>% head
expid | namb | nmulti | nnofeat | nunmap |
---|---|---|---|---|
1_MA_J_S18_L001_ReadsPerGene.out.tab | 647 | 66100 | 20347 | 2690 |
1_MA_J_S18_L002_ReadsPerGene.out.tab | 652 | 65234 | 20004 | 2684 |
1_MA_J_S18_L003_ReadsPerGene.out.tab | 697 | 66538 | 20549 | 2672 |
1_MA_J_S18_L004_ReadsPerGene.out.tab | 616 | 65066 | 20505 | 2585 |
1_RZ_J_S26_L001_ReadsPerGene.out.tab | 1431 | 395848 | 768146 | 7218 |
1_RZ_J_S26_L002_ReadsPerGene.out.tab | 1337 | 388079 | 755654 | 7022 |
To obtain the counts for specific genes, we will use the rest of our
out
matrix since it contains the gene counts. However, we still want
to transform the data frame into statistical format, which we will
accomplish using gather and spread.
In [12]:
### Gather and spread the genes to get a count matrix
out %>%
dplyr::slice(-(1:4)) %>%
gather(expid, value, -gene) %>%
spread(gene, value) -> genecounts
In [13]:
genecounts[1:6,1:6]
expid | CNAG_00001 | CNAG_00002 | CNAG_00003 | CNAG_00004 | CNAG_00005 |
---|---|---|---|---|---|
1_MA_J_S18_L001_ReadsPerGene.out.tab | 0 | 66 | 38 | 74 | 33 |
1_MA_J_S18_L002_ReadsPerGene.out.tab | 0 | 59 | 25 | 79 | 25 |
1_MA_J_S18_L003_ReadsPerGene.out.tab | 0 | 74 | 27 | 79 | 32 |
1_MA_J_S18_L004_ReadsPerGene.out.tab | 0 | 66 | 22 | 69 | 24 |
1_RZ_J_S26_L001_ReadsPerGene.out.tab | 0 | 50 | 16 | 51 | 26 |
1_RZ_J_S26_L002_ReadsPerGene.out.tab | 0 | 45 | 7 | 51 | 31 |
For each samples, sum up all the counts¶
We can create a variable denoting the number of total genes mapped for each sample by summing across the rows.
In [14]:
### Sum across the rows for a total gene count variable
genecounts %>%
mutate(ngenemap = rowSums(.[-1])) %>%
select(expid, ngenemap) -> ngene
Summarize the results¶
We will create a comprehensive data frame mapresults
which will
combine ngene
with nmisc
. This data frame will have summarizing
mapping features in addition to proportion features.
In [15]:
### Merge in the 4 misc counts and add summaries
ngene %>%
full_join(nmisc, by = "expid") %>%
mutate(depth = as.integer(ngenemap + namb + nmulti + nnofeat + nunmap)) %>%
mutate(prop.gene = ngenemap / depth) %>%
mutate(prop.nofeat = nnofeat / depth) %>%
mutate(prop.unique = (ngenemap + nnofeat) / depth) ->
mapresults
Store the results¶
In [16]:
head(mapresults)
expid | ngenemap | namb | nmulti | nnofeat | nunmap | depth | prop.gene | prop.nofeat | prop.unique |
---|---|---|---|---|---|---|---|---|---|
1_MA_J_S18_L001_ReadsPerGene.out.tab | 2399781 | 647 | 66100 | 20347 | 2690 | 2489565 | 0.9639359 | 0.008172914 | 0.9721088 |
1_MA_J_S18_L002_ReadsPerGene.out.tab | 2362228 | 652 | 65234 | 20004 | 2684 | 2450802 | 0.9638592 | 0.008162226 | 0.9720214 |
1_MA_J_S18_L003_ReadsPerGene.out.tab | 2436776 | 697 | 66538 | 20549 | 2672 | 2527232 | 0.9642075 | 0.008131030 | 0.9723385 |
1_MA_J_S18_L004_ReadsPerGene.out.tab | 2417485 | 616 | 65066 | 20505 | 2585 | 2506257 | 0.9645798 | 0.008181523 | 0.9727614 |
1_RZ_J_S26_L001_ReadsPerGene.out.tab | 2366742 | 1431 | 395848 | 768146 | 7218 | 3539385 | 0.6686874 | 0.217028100 | 0.8857155 |
1_RZ_J_S26_L002_ReadsPerGene.out.tab | 2331658 | 1337 | 388079 | 755654 | 7022 | 3483750 | 0.6692954 | 0.216908217 | 0.8862037 |
In [17]:
outfile <- file.path(OUTDIR, "hts-pilot-2018.RData")
save(mapresults, genecounts, file = outfile)
In [18]:
tools::md5sum(path.expand(outfile))
The End¶
In [19]:
sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)
Matrix products: default
BLAS: /opt/conda/lib/R/lib/libRblas.so
LAPACK: /opt/conda/lib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] tools parallel stats4 stats graphics grDevices utils
[8] datasets methods base
other attached packages:
[1] bindrcpp_0.2 plotly_4.8.0
[3] dendextend_1.8.0 gridExtra_2.3
[5] RColorBrewer_1.1-2 qvalue_2.10.0
[7] limma_3.34.9 DESeq2_1.18.1
[9] SummarizedExperiment_1.8.0 DelayedArray_0.4.1
[11] matrixStats_0.53.1 Biobase_2.38.0
[13] GenomicRanges_1.30.3 GenomeInfoDb_1.14.0
[15] IRanges_2.12.0 S4Vectors_0.16.0
[17] BiocGenerics_0.24.0 haven_1.1.1
[19] stringr_1.3.0 foreach_1.4.4
[21] dplyr_0.7.4 purrr_0.2.4
[23] readr_1.1.1 tidyr_0.8.1
[25] tibble_1.4.2 ggplot2_3.0.0
[27] tidyverse_1.1.1
loaded via a namespace (and not attached):
[1] colorspace_1.3-2 class_7.3-14 modeltools_0.2-22
[4] mclust_5.4.1 IRdisplay_0.4.4 htmlTable_1.9
[7] XVector_0.18.0 base64enc_0.1-3 flexmix_2.3-14
[10] bit64_0.9-5 mvtnorm_1.0-8 AnnotationDbi_1.40.0
[13] lubridate_1.7.4 xml2_1.2.0 codetools_0.2-15
[16] splines_3.4.1 mnormt_1.5-5 robustbase_0.92-7
[19] geneplotter_1.56.0 knitr_1.20 IRkernel_0.8.11
[22] Formula_1.2-1 jsonlite_1.5 broom_0.4.4
[25] annotate_1.56.0 kernlab_0.9-25 cluster_2.0.6
[28] compiler_3.4.1 httr_1.3.1 backports_1.1.2
[31] assertthat_0.2.0 Matrix_1.2-14 lazyeval_0.2.1
[34] acepack_1.4.1 htmltools_0.3.6 gtable_0.2.0
[37] glue_1.2.0 GenomeInfoDbData_0.99.0 reshape2_1.4.3
[40] Rcpp_0.12.15 trimcluster_0.1-2.1 cellranger_1.1.0
[43] nlme_3.1-131 fpc_2.1-11.1 iterators_1.0.9
[46] psych_1.8.4 rvest_0.3.2 XML_3.98-1.11
[49] DEoptimR_1.0-8 MASS_7.3-48 zlibbioc_1.24.0
[52] scales_0.5.0 hms_0.3 memoise_1.1.0
[55] rpart_4.1-13 latticeExtra_0.6-28 stringi_1.1.7
[58] RSQLite_2.0 genefilter_1.60.0 checkmate_1.8.5
[61] BiocParallel_1.12.0 repr_0.15.0 prabclus_2.2-6
[64] rlang_0.2.1 pkgconfig_2.0.1 bitops_1.0-6
[67] evaluate_0.10.1 lattice_0.20-34 bindr_0.1.1
[70] htmlwidgets_1.2 tidyselect_0.2.4 bit_1.1-12
[73] plyr_1.8.4 magrittr_1.5 R6_2.2.2
[76] Hmisc_4.0-3 pbdZMQ_0.3-2 DBI_1.0.0
[79] pillar_1.2.2 whisker_0.3-2 foreign_0.8-67
[82] withr_2.1.1 survival_2.40-1 RCurl_1.95-4.8
[85] nnet_7.3-12 modelr_0.1.2 crayon_1.3.4
[88] uuid_0.1-2 viridis_0.5.1 locfit_1.5-9.1
[91] grid_3.4.1 readxl_1.1.0 data.table_1.10.4
[94] blob_1.1.1 forcats_0.3.0 diptest_0.75-7
[97] digest_0.6.15 xtable_1.8-2 munsell_0.5.0
[100] viridisLite_0.3.0