Set up environment

In [1]:
source("pilot_config.R")
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats

Attaching package: ‘foreach’

The following objects are masked from ‘package:purrr’:

    accumulate, when

Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:dplyr’:

    combine, intersect, setdiff, union

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, cbind, colMeans, colnames,
    colSums, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, lengths, Map, mapply, match,
    mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which, which.max, which.min


Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:dplyr’:

    first, rename

The following object is masked from ‘package:tidyr’:

    expand

The following object is masked from ‘package:base’:

    expand.grid

Loading required package: IRanges

Attaching package: ‘IRanges’

The following objects are masked from ‘package:dplyr’:

    collapse, desc, slice

The following object is masked from ‘package:purrr’:

    reduce

Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: DelayedArray
Loading required package: matrixStats

Attaching package: ‘matrixStats’

The following objects are masked from ‘package:Biobase’:

    anyMissing, rowMedians

The following object is masked from ‘package:dplyr’:

    count


Attaching package: ‘DelayedArray’

The following objects are masked from ‘package:matrixStats’:

    colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

The following object is masked from ‘package:base’:

    apply


Attaching package: ‘limma’

The following object is masked from ‘package:DESeq2’:

    plotMA

The following object is masked from ‘package:BiocGenerics’:

    plotMA


Attaching package: ‘gridExtra’

The following object is masked from ‘package:Biobase’:

    combine

The following object is masked from ‘package:BiocGenerics’:

    combine

The following object is masked from ‘package:dplyr’:

    combine


---------------------
Welcome to dendextend version 1.8.0
Type citation('dendextend') for how to cite the package.

Type browseVignettes(package = 'dendextend') for the package vignette.
The github page is: https://github.com/talgalili/dendextend/

Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
Or contact: <tal.galili@gmail.com>

        To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
---------------------


Attaching package: ‘dendextend’

The following object is masked from ‘package:stats’:

    cutree


Attaching package: ‘plotly’

The following object is masked from ‘package:IRanges’:

    slice

The following object is masked from ‘package:S4Vectors’:

    rename

The following object is masked from ‘package:ggplot2’:

    last_plot

The following object is masked from ‘package:stats’:

    filter

The following object is masked from ‘package:graphics’:

    layout

In [2]:
# get files of all star output files
stardirs <- list.files(DATDIR)

Read in the count data output from STAR

There are 204 files under this folder. Each file is generated from a fastq file using STAR.

In [3]:
length(stardirs)
204
In [4]:
head(stardirs)
  1. '1_MA_J_S18_L001_ReadsPerGene.out.tab'
  2. '1_MA_J_S18_L002_ReadsPerGene.out.tab'
  3. '1_MA_J_S18_L003_ReadsPerGene.out.tab'
  4. '1_MA_J_S18_L004_ReadsPerGene.out.tab'
  5. '1_RZ_J_S26_L001_ReadsPerGene.out.tab'
  6. '1_RZ_J_S26_L002_ReadsPerGene.out.tab'

View the head of one file (1_MA_J_S18_L001_ReadsPerGene.out.tab). Notice that there are four columns and the columns we want is the first and the fourth columns.

In [5]:
cmdstr <- paste("head", file.path(DATDIR, stardirs[1]))
cmdout <- system(cmdstr, intern = TRUE)
str_split(cmdout, pattern = "\t")
    1. 'N_unmapped'
    2. '2690'
    3. '2690'
    4. '2690'
    1. 'N_multimapping'
    2. '66100'
    3. '66100'
    4. '66100'
    1. 'N_noFeature'
    2. '10626'
    3. '2238382'
    4. '20347'
    1. 'N_ambiguous'
    2. '173170'
    3. '1622'
    4. '647'
    1. 'CNAG_04548'
    2. '0'
    3. '0'
    4. '0'
    1. 'CNAG_07303'
    2. '0'
    3. '0'
    4. '0'
    1. 'CNAG_07304'
    2. '8'
    3. '0'
    4. '8'
    1. 'CNAG_00001'
    2. '0'
    3. '0'
    4. '0'
    1. 'CNAG_07305'
    2. '0'
    3. '0'
    4. '0'
    1. 'CNAG_00002'
    2. '66'
    3. '0'
    4. '66'

Construct a matrix that gathers all the count files

helper functions to read the count files

In [6]:
mycombine <- function(df1, df2) {
    # Combine two data frames by gene names
    #
    # Args:
    #   df1 (Dataframe): the first count data
    #   df2 (Dataframe): the second count data
    #
    # Returns:
    #   (Dataframe) The combined data frame of df1 and df2
    full_join(df1, df2, by = "gene")
}

myfile <- function(filedir, filename) {
    # Get the absolute paths of a file
    #
    # Args:
    #   filedir  (Character): the directory of the folder
    #   filename (Character): the filename
    #
    # Returns:
    #   (Character) the directory of the input file
    file.path(filedir, filename)
}

# Data type for each column
coltypes <- list(col_character(), col_integer(), col_integer(), col_integer())

read the count files and combine them

In [7]:
out <- foreach(stardir = stardirs, .combine = mycombine) %do% {

    # get a directory of each count file
    cntfile <- myfile(DATDIR, stardir)

    # read in the count file
    readr::read_tsv(cntfile, col_names = FALSE, col_types = coltypes) %>%
        dplyr::select(X1, X4) %>% # get the 1st and 4th columns
            dplyr::rename_(.dots=setNames(names(.), c("gene",stardir)))
}

There are 205 columns (204 count files + 1 rowname)

In [8]:
dim(out)
  1. 8501
  2. 205
In [9]:
out[1:6, 1:6]
gene1_MA_J_S18_L001_ReadsPerGene.out.tab1_MA_J_S18_L002_ReadsPerGene.out.tab1_MA_J_S18_L003_ReadsPerGene.out.tab1_MA_J_S18_L004_ReadsPerGene.out.tab1_RZ_J_S26_L001_ReadsPerGene.out.tab
N_unmapped 2690 2684 2672 2585 7218
N_multimapping66100 65234 66538 65066 395848
N_noFeature 20347 20004 20549 20505 768146
N_ambiguous 647 652 697 616 1431
CNAG_04548 0 0 0 1 0
CNAG_07303 0 0 0 0 0

Arrange the results from the count files

Separate the first four rows (-> nmisc) and others (-> genecounts)

We see that our count matrix contains both summarizing counts and specific counts of each gene for each sample. Note that our ‘out’ matrix is in biological format (i.e. the samples are the columns and the variables are the rows). Let’s split this matrix up into two matrices: nmisc and genecounts.

For nmisc, we will take the first 4 rows of out since those are the summarizing features. Next, we want to transform the data frame so that it is in statistical format (the samples are the rows and the feature types are the columns). Using a combination of gather and spread, we can transpose our matrix into the desired format.

In [10]:
### Gather and spread the first four rows
out %>%
    dplyr::slice(1:4) %>%
    gather(expid, value, -gene) %>%
    spread(gene, value) %>%
    rename_(.dots = setNames(names(.), c("expid", "namb", "nmulti", "nnofeat","nunmap"))) ->
    nmisc
In [11]:
nmisc %>% head
expidnambnmultinnofeatnunmap
1_MA_J_S18_L001_ReadsPerGene.out.tab 647 66100 20347 2690
1_MA_J_S18_L002_ReadsPerGene.out.tab 652 65234 20004 2684
1_MA_J_S18_L003_ReadsPerGene.out.tab 697 66538 20549 2672
1_MA_J_S18_L004_ReadsPerGene.out.tab 616 65066 20505 2585
1_RZ_J_S26_L001_ReadsPerGene.out.tab1431 395848 768146 7218
1_RZ_J_S26_L002_ReadsPerGene.out.tab1337 388079 755654 7022

To obtain the counts for specific genes, we will use the rest of our out matrix since it contains the gene counts. However, we still want to transform the data frame into statistical format, which we will accomplish using gather and spread.

In [12]:
### Gather and spread the genes to get a count matrix
out %>%
    dplyr::slice(-(1:4)) %>%
    gather(expid, value, -gene) %>%
    spread(gene, value) -> genecounts
In [13]:
genecounts[1:6,1:6]
expidCNAG_00001CNAG_00002CNAG_00003CNAG_00004CNAG_00005
1_MA_J_S18_L001_ReadsPerGene.out.tab0 66 38 74 33
1_MA_J_S18_L002_ReadsPerGene.out.tab0 59 25 79 25
1_MA_J_S18_L003_ReadsPerGene.out.tab0 74 27 79 32
1_MA_J_S18_L004_ReadsPerGene.out.tab0 66 22 69 24
1_RZ_J_S26_L001_ReadsPerGene.out.tab0 50 16 51 26
1_RZ_J_S26_L002_ReadsPerGene.out.tab0 45 7 51 31

For each samples, sum up all the counts

We can create a variable denoting the number of total genes mapped for each sample by summing across the rows.

In [14]:
### Sum across the rows for a total gene count variable
genecounts %>%
    mutate(ngenemap = rowSums(.[-1])) %>%
    select(expid, ngenemap) -> ngene

Summarize the results

We will create a comprehensive data frame mapresults which will combine ngene with nmisc. This data frame will have summarizing mapping features in addition to proportion features.

In [15]:
### Merge in the 4 misc counts and add summaries
ngene %>%
    full_join(nmisc, by = "expid") %>%
    mutate(depth = as.integer(ngenemap + namb + nmulti + nnofeat + nunmap)) %>%
    mutate(prop.gene = ngenemap / depth) %>%
    mutate(prop.nofeat = nnofeat / depth) %>%
    mutate(prop.unique = (ngenemap + nnofeat) / depth) ->
    mapresults

Store the results

In [16]:
head(mapresults)
expidngenemapnambnmultinnofeatnunmapdepthprop.geneprop.nofeatprop.unique
1_MA_J_S18_L001_ReadsPerGene.out.tab2399781 647 66100 20347 2690 2489565 0.9639359 0.008172914 0.9721088
1_MA_J_S18_L002_ReadsPerGene.out.tab2362228 652 65234 20004 2684 2450802 0.9638592 0.008162226 0.9720214
1_MA_J_S18_L003_ReadsPerGene.out.tab2436776 697 66538 20549 2672 2527232 0.9642075 0.008131030 0.9723385
1_MA_J_S18_L004_ReadsPerGene.out.tab2417485 616 65066 20505 2585 2506257 0.9645798 0.008181523 0.9727614
1_RZ_J_S26_L001_ReadsPerGene.out.tab2366742 1431 395848 768146 7218 3539385 0.6686874 0.217028100 0.8857155
1_RZ_J_S26_L002_ReadsPerGene.out.tab2331658 1337 388079 755654 7022 3483750 0.6692954 0.216908217 0.8862037
In [17]:
outfile <- file.path(OUTDIR, "hts-pilot-2018.RData")
save(mapresults, genecounts, file = outfile)
In [18]:
tools::md5sum(path.expand(outfile))
/home/jovyan/work/scratch/analysis_output/out/hts-pilot-2018.RData: 'e8d7991f674b7a9284abae8f38656c2e'

The End

In [19]:
sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

Matrix products: default
BLAS: /opt/conda/lib/R/lib/libRblas.so
LAPACK: /opt/conda/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 [1] tools     parallel  stats4    stats     graphics  grDevices utils
 [8] datasets  methods   base

other attached packages:
 [1] bindrcpp_0.2               plotly_4.8.0
 [3] dendextend_1.8.0           gridExtra_2.3
 [5] RColorBrewer_1.1-2         qvalue_2.10.0
 [7] limma_3.34.9               DESeq2_1.18.1
 [9] SummarizedExperiment_1.8.0 DelayedArray_0.4.1
[11] matrixStats_0.53.1         Biobase_2.38.0
[13] GenomicRanges_1.30.3       GenomeInfoDb_1.14.0
[15] IRanges_2.12.0             S4Vectors_0.16.0
[17] BiocGenerics_0.24.0        haven_1.1.1
[19] stringr_1.3.0              foreach_1.4.4
[21] dplyr_0.7.4                purrr_0.2.4
[23] readr_1.1.1                tidyr_0.8.1
[25] tibble_1.4.2               ggplot2_3.0.0
[27] tidyverse_1.1.1

loaded via a namespace (and not attached):
  [1] colorspace_1.3-2        class_7.3-14            modeltools_0.2-22
  [4] mclust_5.4.1            IRdisplay_0.4.4         htmlTable_1.9
  [7] XVector_0.18.0          base64enc_0.1-3         flexmix_2.3-14
 [10] bit64_0.9-5             mvtnorm_1.0-8           AnnotationDbi_1.40.0
 [13] lubridate_1.7.4         xml2_1.2.0              codetools_0.2-15
 [16] splines_3.4.1           mnormt_1.5-5            robustbase_0.92-7
 [19] geneplotter_1.56.0      knitr_1.20              IRkernel_0.8.11
 [22] Formula_1.2-1           jsonlite_1.5            broom_0.4.4
 [25] annotate_1.56.0         kernlab_0.9-25          cluster_2.0.6
 [28] compiler_3.4.1          httr_1.3.1              backports_1.1.2
 [31] assertthat_0.2.0        Matrix_1.2-14           lazyeval_0.2.1
 [34] acepack_1.4.1           htmltools_0.3.6         gtable_0.2.0
 [37] glue_1.2.0              GenomeInfoDbData_0.99.0 reshape2_1.4.3
 [40] Rcpp_0.12.15            trimcluster_0.1-2.1     cellranger_1.1.0
 [43] nlme_3.1-131            fpc_2.1-11.1            iterators_1.0.9
 [46] psych_1.8.4             rvest_0.3.2             XML_3.98-1.11
 [49] DEoptimR_1.0-8          MASS_7.3-48             zlibbioc_1.24.0
 [52] scales_0.5.0            hms_0.3                 memoise_1.1.0
 [55] rpart_4.1-13            latticeExtra_0.6-28     stringi_1.1.7
 [58] RSQLite_2.0             genefilter_1.60.0       checkmate_1.8.5
 [61] BiocParallel_1.12.0     repr_0.15.0             prabclus_2.2-6
 [64] rlang_0.2.1             pkgconfig_2.0.1         bitops_1.0-6
 [67] evaluate_0.10.1         lattice_0.20-34         bindr_0.1.1
 [70] htmlwidgets_1.2         tidyselect_0.2.4        bit_1.1-12
 [73] plyr_1.8.4              magrittr_1.5            R6_2.2.2
 [76] Hmisc_4.0-3             pbdZMQ_0.3-2            DBI_1.0.0
 [79] pillar_1.2.2            whisker_0.3-2           foreign_0.8-67
 [82] withr_2.1.1             survival_2.40-1         RCurl_1.95-4.8
 [85] nnet_7.3-12             modelr_0.1.2            crayon_1.3.4
 [88] uuid_0.1-2              viridis_0.5.1           locfit_1.5-9.1
 [91] grid_3.4.1              readxl_1.1.0            data.table_1.10.4
 [94] blob_1.1.1              forcats_0.3.0           diptest_0.75-7
 [97] digest_0.6.15           xtable_1.8-2            munsell_0.5.0
[100] viridisLite_0.3.0