Set up environment¶

In [1]:

source("pilot_config.R")

Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats

Attaching package: ‘foreach’

The following objects are masked from ‘package:purrr’:

    accumulate, when

Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:dplyr’:

    combine, intersect, setdiff, union

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, cbind, colMeans, colnames,
    colSums, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, lengths, Map, mapply, match,
    mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which, which.max, which.min


Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:dplyr’:

    first, rename

The following object is masked from ‘package:tidyr’:

    expand

The following object is masked from ‘package:base’:

    expand.grid

Loading required package: IRanges

Attaching package: ‘IRanges’

The following objects are masked from ‘package:dplyr’:

    collapse, desc, slice

The following object is masked from ‘package:purrr’:

    reduce

Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: DelayedArray
Loading required package: matrixStats

Attaching package: ‘matrixStats’

The following objects are masked from ‘package:Biobase’:

    anyMissing, rowMedians

The following object is masked from ‘package:dplyr’:

    count


Attaching package: ‘DelayedArray’

The following objects are masked from ‘package:matrixStats’:

    colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

The following object is masked from ‘package:base’:

    apply


Attaching package: ‘limma’

The following object is masked from ‘package:DESeq2’:

    plotMA

The following object is masked from ‘package:BiocGenerics’:

    plotMA


Attaching package: ‘gridExtra’

The following object is masked from ‘package:Biobase’:

    combine

The following object is masked from ‘package:BiocGenerics’:

    combine

The following object is masked from ‘package:dplyr’:

    combine


---------------------
Welcome to dendextend version 1.8.0
Type citation('dendextend') for how to cite the package.

Type browseVignettes(package = 'dendextend') for the package vignette.
The github page is: https://github.com/talgalili/dendextend/

Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
Or contact: <tal.galili@gmail.com>

        To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
---------------------


Attaching package: ‘dendextend’

The following object is masked from ‘package:stats’:

    cutree


Attaching package: ‘plotly’

The following object is masked from ‘package:IRanges’:

    slice

The following object is masked from ‘package:S4Vectors’:

    rename

The following object is masked from ‘package:ggplot2’:

    last_plot

The following object is masked from ‘package:stats’:

    filter

The following object is masked from ‘package:graphics’:

    layout

In [2]:

# get files of all star output files
stardirs <- list.files(DATDIR)

Read in the count data output from STAR¶

There are 204 files under this folder. Each file is generated from a fastq file using STAR.

In [3]:

length(stardirs)

204

In [4]:

head(stardirs)

'1_MA_J_S18_L001_ReadsPerGene.out.tab'
'1_MA_J_S18_L002_ReadsPerGene.out.tab'
'1_MA_J_S18_L003_ReadsPerGene.out.tab'
'1_MA_J_S18_L004_ReadsPerGene.out.tab'
'1_RZ_J_S26_L001_ReadsPerGene.out.tab'
'1_RZ_J_S26_L002_ReadsPerGene.out.tab'

View the head of one file (1_MA_J_S18_L001_ReadsPerGene.out.tab). Notice that there are four columns and the columns we want is the first and the fourth columns.

In [5]:

cmdstr <- paste("head", file.path(DATDIR, stardirs[1]))
cmdout <- system(cmdstr, intern = TRUE)
str_split(cmdout, pattern = "\t")

1. 'N_unmapped'
2. '2690'
3. '2690'
4. '2690'
1. 'N_multimapping'
2. '66100'
3. '66100'
4. '66100'
1. 'N_noFeature'
2. '10626'
3. '2238382'
4. '20347'
1. 'N_ambiguous'
2. '173170'
3. '1622'
4. '647'
1. 'CNAG_04548'
2. '0'
3. '0'
4. '0'
1. 'CNAG_07303'
2. '0'
3. '0'
4. '0'
1. 'CNAG_07304'
2. '8'
3. '0'
4. '8'
1. 'CNAG_00001'
2. '0'
3. '0'
4. '0'
1. 'CNAG_07305'
2. '0'
3. '0'
4. '0'
1. 'CNAG_00002'
2. '66'
3. '0'
4. '66'

Construct a matrix that gathers all the count files¶

helper functions to read the count files

In [6]:

mycombine <- function(df1, df2) {
    # Combine two data frames by gene names
    #
    # Args:
    #   df1 (Dataframe): the first count data
    #   df2 (Dataframe): the second count data
    #
    # Returns:
    #   (Dataframe) The combined data frame of df1 and df2
    full_join(df1, df2, by = "gene")
}

myfile <- function(filedir, filename) {
    # Get the absolute paths of a file
    #
    # Args:
    #   filedir  (Character): the directory of the folder
    #   filename (Character): the filename
    #
    # Returns:
    #   (Character) the directory of the input file
    file.path(filedir, filename)
}

# Data type for each column
coltypes <- list(col_character(), col_integer(), col_integer(), col_integer())

read the count files and combine them

In [7]:

out <- foreach(stardir = stardirs, .combine = mycombine) %do% {

    # get a directory of each count file
    cntfile <- myfile(DATDIR, stardir)

    # read in the count file
    readr::read_tsv(cntfile, col_names = FALSE, col_types = coltypes) %>%
        dplyr::select(X1, X4) %>% # get the 1st and 4th columns
            dplyr::rename_(.dots=setNames(names(.), c("gene",stardir)))
}

There are 205 columns (204 count files + 1 rowname)

In [8]:

dim(out)

8501
205

In [9]:

out[1:6, 1:6]

gene	1_MA_J_S18_L001_ReadsPerGene.out.tab	1_MA_J_S18_L002_ReadsPerGene.out.tab	1_MA_J_S18_L003_ReadsPerGene.out.tab	1_MA_J_S18_L004_ReadsPerGene.out.tab	1_RZ_J_S26_L001_ReadsPerGene.out.tab
N_unmapped	2690	2684	2672	2585	7218
N_multimapping	66100	65234	66538	65066	395848
N_noFeature	20347	20004	20549	20505	768146
N_ambiguous	647	652	697	616	1431
CNAG_04548	0	0	0	1	0
CNAG_07303	0	0	0	0	0

Arrange the results from the count files¶

Separate the first four rows (-> nmisc) and others (-> genecounts)¶

We see that our count matrix contains both summarizing counts and specific counts of each gene for each sample. Note that our ‘out’ matrix is in biological format (i.e. the samples are the columns and the variables are the rows). Let’s split this matrix up into two matrices: nmisc and genecounts.

For nmisc, we will take the first 4 rows of out since those are the summarizing features. Next, we want to transform the data frame so that it is in statistical format (the samples are the rows and the feature types are the columns). Using a combination of gather and spread, we can transpose our matrix into the desired format.

In [10]:

### Gather and spread the first four rows
out %>%
    dplyr::slice(1:4) %>%
    gather(expid, value, -gene) %>%
    spread(gene, value) %>%
    rename_(.dots = setNames(names(.), c("expid", "namb", "nmulti", "nnofeat","nunmap"))) ->
    nmisc

In [11]:

nmisc %>% head

expid	namb	nmulti	nnofeat	nunmap
1_MA_J_S18_L001_ReadsPerGene.out.tab	647	66100	20347	2690
1_MA_J_S18_L002_ReadsPerGene.out.tab	652	65234	20004	2684
1_MA_J_S18_L003_ReadsPerGene.out.tab	697	66538	20549	2672
1_MA_J_S18_L004_ReadsPerGene.out.tab	616	65066	20505	2585
1_RZ_J_S26_L001_ReadsPerGene.out.tab	1431	395848	768146	7218
1_RZ_J_S26_L002_ReadsPerGene.out.tab	1337	388079	755654	7022

To obtain the counts for specific genes, we will use the rest of our out matrix since it contains the gene counts. However, we still want to transform the data frame into statistical format, which we will accomplish using gather and spread.

In [12]:

### Gather and spread the genes to get a count matrix
out %>%
    dplyr::slice(-(1:4)) %>%
    gather(expid, value, -gene) %>%
    spread(gene, value) -> genecounts

In [13]:

genecounts[1:6,1:6]

expid	CNAG_00002	CNAG_00003	CNAG_00004	CNAG_00005
1_MA_J_S18_L001_ReadsPerGene.out.tab	66	38	74	33
1_MA_J_S18_L002_ReadsPerGene.out.tab	59	25	79	25
1_MA_J_S18_L003_ReadsPerGene.out.tab	74	27	79	32
1_MA_J_S18_L004_ReadsPerGene.out.tab	66	22	69	24
1_RZ_J_S26_L001_ReadsPerGene.out.tab	50	16	51	26
1_RZ_J_S26_L002_ReadsPerGene.out.tab	45	7	51	31

For each samples, sum up all the counts¶

We can create a variable denoting the number of total genes mapped for each sample by summing across the rows.

In [14]:

### Sum across the rows for a total gene count variable
genecounts %>%
    mutate(ngenemap = rowSums(.[-1])) %>%
    select(expid, ngenemap) -> ngene

Summarize the results¶

We will create a comprehensive data frame mapresults which will combine ngene with nmisc. This data frame will have summarizing mapping features in addition to proportion features.

In [15]:

### Merge in the 4 misc counts and add summaries
ngene %>%
    full_join(nmisc, by = "expid") %>%
    mutate(depth = as.integer(ngenemap + namb + nmulti + nnofeat + nunmap)) %>%
    mutate(prop.gene = ngenemap / depth) %>%
    mutate(prop.nofeat = nnofeat / depth) %>%
    mutate(prop.unique = (ngenemap + nnofeat) / depth) ->
    mapresults

Store the results¶

In [16]:

head(mapresults)

expid	ngenemap	namb	nmulti	nnofeat	nunmap	depth	prop.gene	prop.nofeat	prop.unique
1_MA_J_S18_L001_ReadsPerGene.out.tab	2399781	647	66100	20347	2690	2489565	0.9639359	0.008172914	0.9721088
1_MA_J_S18_L002_ReadsPerGene.out.tab	2362228	652	65234	20004	2684	2450802	0.9638592	0.008162226	0.9720214
1_MA_J_S18_L003_ReadsPerGene.out.tab	2436776	697	66538	20549	2672	2527232	0.9642075	0.008131030	0.9723385
1_MA_J_S18_L004_ReadsPerGene.out.tab	2417485	616	65066	20505	2585	2506257	0.9645798	0.008181523	0.9727614
1_RZ_J_S26_L001_ReadsPerGene.out.tab	2366742	1431	395848	768146	7218	3539385	0.6686874	0.217028100	0.8857155
1_RZ_J_S26_L002_ReadsPerGene.out.tab	2331658	1337	388079	755654	7022	3483750	0.6692954	0.216908217	0.8862037

In [17]:

outfile <- file.path(OUTDIR, "hts-pilot-2018.RData")
save(mapresults, genecounts, file = outfile)

In [18]:

tools::md5sum(path.expand(outfile))

/home/jovyan/work/scratch/analysis_output/out/hts-pilot-2018.RData: 'e8d7991f674b7a9284abae8f38656c2e'

The End¶

In [19]:

sessionInfo()

R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

Matrix products: default
BLAS: /opt/conda/lib/R/lib/libRblas.so
LAPACK: /opt/conda/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 [1] tools     parallel  stats4    stats     graphics  grDevices utils
 [8] datasets  methods   base

other attached packages:
 [1] bindrcpp_0.2               plotly_4.8.0
 [3] dendextend_1.8.0           gridExtra_2.3
 [5] RColorBrewer_1.1-2         qvalue_2.10.0
 [7] limma_3.34.9               DESeq2_1.18.1
 [9] SummarizedExperiment_1.8.0 DelayedArray_0.4.1
[11] matrixStats_0.53.1         Biobase_2.38.0
[13] GenomicRanges_1.30.3       GenomeInfoDb_1.14.0
[15] IRanges_2.12.0             S4Vectors_0.16.0
[17] BiocGenerics_0.24.0        haven_1.1.1
[19] stringr_1.3.0              foreach_1.4.4
[21] dplyr_0.7.4                purrr_0.2.4
[23] readr_1.1.1                tidyr_0.8.1
[25] tibble_1.4.2               ggplot2_3.0.0
[27] tidyverse_1.1.1

loaded via a namespace (and not attached):
  [1] colorspace_1.3-2        class_7.3-14            modeltools_0.2-22
  [4] mclust_5.4.1            IRdisplay_0.4.4         htmlTable_1.9
  [7] XVector_0.18.0          base64enc_0.1-3         flexmix_2.3-14
 [10] bit64_0.9-5             mvtnorm_1.0-8           AnnotationDbi_1.40.0
 [13] lubridate_1.7.4         xml2_1.2.0              codetools_0.2-15
 [16] splines_3.4.1           mnormt_1.5-5            robustbase_0.92-7
 [19] geneplotter_1.56.0      knitr_1.20              IRkernel_0.8.11
 [22] Formula_1.2-1           jsonlite_1.5            broom_0.4.4
 [25] annotate_1.56.0         kernlab_0.9-25          cluster_2.0.6
 [28] compiler_3.4.1          httr_1.3.1              backports_1.1.2
 [31] assertthat_0.2.0        Matrix_1.2-14           lazyeval_0.2.1
 [34] acepack_1.4.1           htmltools_0.3.6         gtable_0.2.0
 [37] glue_1.2.0              GenomeInfoDbData_0.99.0 reshape2_1.4.3
 [40] Rcpp_0.12.15            trimcluster_0.1-2.1     cellranger_1.1.0
 [43] nlme_3.1-131            fpc_2.1-11.1            iterators_1.0.9
 [46] psych_1.8.4             rvest_0.3.2             XML_3.98-1.11
 [49] DEoptimR_1.0-8          MASS_7.3-48             zlibbioc_1.24.0
 [52] scales_0.5.0            hms_0.3                 memoise_1.1.0
 [55] rpart_4.1-13            latticeExtra_0.6-28     stringi_1.1.7
 [58] RSQLite_2.0             genefilter_1.60.0       checkmate_1.8.5
 [61] BiocParallel_1.12.0     repr_0.15.0             prabclus_2.2-6
 [64] rlang_0.2.1             pkgconfig_2.0.1         bitops_1.0-6
 [67] evaluate_0.10.1         lattice_0.20-34         bindr_0.1.1
 [70] htmlwidgets_1.2         tidyselect_0.2.4        bit_1.1-12
 [73] plyr_1.8.4              magrittr_1.5            R6_2.2.2
 [76] Hmisc_4.0-3             pbdZMQ_0.3-2            DBI_1.0.0
 [79] pillar_1.2.2            whisker_0.3-2           foreign_0.8-67
 [82] withr_2.1.1             survival_2.40-1         RCurl_1.95-4.8
 [85] nnet_7.3-12             modelr_0.1.2            crayon_1.3.4
 [88] uuid_0.1-2              viridis_0.5.1           locfit_1.5-9.1
 [91] grid_3.4.1              readxl_1.1.0            data.table_1.10.4
 [94] blob_1.1.1              forcats_0.3.0           diptest_0.75-7
 [97] digest_0.6.15           xtable_1.8-2            munsell_0.5.0
[100] viridisLite_0.3.0