Set up environment

In [1]:
source("pilot_config.R")
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats

Attaching package: ‘foreach’

The following objects are masked from ‘package:purrr’:

    accumulate, when

Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:dplyr’:

    combine, intersect, setdiff, union

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, cbind, colMeans, colnames,
    colSums, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, lengths, Map, mapply, match,
    mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which, which.max, which.min


Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:dplyr’:

    first, rename

The following object is masked from ‘package:tidyr’:

    expand

The following object is masked from ‘package:base’:

    expand.grid

Loading required package: IRanges

Attaching package: ‘IRanges’

The following objects are masked from ‘package:dplyr’:

    collapse, desc, slice

The following object is masked from ‘package:purrr’:

    reduce

Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: DelayedArray
Loading required package: matrixStats

Attaching package: ‘matrixStats’

The following objects are masked from ‘package:Biobase’:

    anyMissing, rowMedians

The following object is masked from ‘package:dplyr’:

    count


Attaching package: ‘DelayedArray’

The following objects are masked from ‘package:matrixStats’:

    colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

The following object is masked from ‘package:base’:

    apply


Attaching package: ‘limma’

The following object is masked from ‘package:DESeq2’:

    plotMA

The following object is masked from ‘package:BiocGenerics’:

    plotMA


Attaching package: ‘gridExtra’

The following object is masked from ‘package:Biobase’:

    combine

The following object is masked from ‘package:BiocGenerics’:

    combine

The following object is masked from ‘package:dplyr’:

    combine


---------------------
Welcome to dendextend version 1.8.0
Type citation('dendextend') for how to cite the package.

Type browseVignettes(package = 'dendextend') for the package vignette.
The github page is: https://github.com/talgalili/dendextend/

Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
Or contact: <tal.galili@gmail.com>

        To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
---------------------


Attaching package: ‘dendextend’

The following object is masked from ‘package:stats’:

    cutree


Attaching package: ‘plotly’

The following object is masked from ‘package:IRanges’:

    slice

The following object is masked from ‘package:S4Vectors’:

    rename

The following object is masked from ‘package:ggplot2’:

    last_plot

The following object is masked from ‘package:stats’:

    filter

The following object is masked from ‘package:graphics’:

    layout

Recall that after the tutorial one, we have created the hts-pilot-2018.RData.

scratch
└── analysis_output
    ├── out
    │   └── hts-pilot-2018.RData
    └── img
In [2]:
OUTDIR
'/home/jovyan/work/scratch/analysis_output/out'
In [3]:
# file directory
cntfile <- file.path(OUTDIR, "hts-pilot-2018.RData")

Read in results

In [4]:
### Import count data
attach(cntfile)
tools::md5sum(cntfile)
/home/jovyan/work/scratch/analysis_output/out/hts-pilot-2018.RData: 'e8d7991f674b7a9284abae8f38656c2e'
In [5]:
### Import metadata
tools::md5sum(METADTFILE)

mtdf <- readr::read_tsv(METADTFILE) %>%
    mutate_at(vars(
        `Label`,
        `Strain`,
        `Media`,
        `enrichment_method`,
        `library#`,
        `libprep_person`,
        `experiment_person`
    ), factor)
/home/jovyan/work/HTS2018-notebooks/josh/info/2018_pilot_metadata_anon.tsv: '20b31fdc4487d0cd8cc3c9300584faca'
Parsed with column specification:
cols(
  Label = col_character(),
  RNA_sample_num = col_integer(),
  Media = col_character(),
  Strain = col_character(),
  Replicate = col_integer(),
  experiment_person = col_character(),
  libprep_person = col_character(),
  enrichment_method = col_character(),
  RIN = col_double(),
  concentration_fold_difference = col_double(),
  `i7 index` = col_character(),
  `i5 index` = col_character(),
  `i5 primer` = col_character(),
  `i7 primer` = col_character(),
  `library#` = col_integer()
)

Recall that there are 204 samples and 8498 genes

In [6]:
head(mtdf)
LabelRNA_sample_numMediaStrainReplicateexperiment_personlibprep_personenrichment_methodRINconcentration_fold_differencei7 indexi5 indexi5 primeri7 primerlibrary#
2_MA_C 2 YPD H99 2 expA prepA MA 10.0 1.34 ATTACTCGAGGCTATAi501 i701 1
9_MA_C 9 YPD mar1d 3 expA prepA MA 10.0 2.23 ATTACTCGGCCTCTATi502 i701 2
10_MA_C 10 YPD mar1d 4 expA prepA MA 9.9 4.37 ATTACTCGAGGATAGGi503 i701 3
14_MA_C 14 TC H99 2 expA prepA MA 10.0 1.57 ATTACTCGTCAGAGCCi504 i701 4
15_MA_C 15 TC H99 3 expA prepA MA 9.9 2.85 ATTACTCGCTTCGCCTi505 i701 5
21_MA_C 21 TC mar1d 3 expA prepA MA 10.0 1.81 ATTACTCGTAAGATTAi506 i701 6

Check the label between metadata and mapping results

Recall that we got two data frames in the previous tutorial:
- genecounts: gene counts for each sample - Note: We will need to convert it into gene counts for each library later - mapresults: the mapping results - Note: There are 204 samples

The metadata (mtdf) contains the information of each sample. Here we need to make sure if the label in the metadata matches the sample names we have in mapresults and genecount

The code chunk below allows us to check to see if we can match the labels to the those in the metadata file. There should not be any output from the code chunk.

In [7]:
### Use setdiff to check to see if we can match the labels to the those in the metadata file
myregex <- "_[A-Z](100|[1-9][0-9]?)_L00[1-4].*"

mapresults$expid %>%
    str_replace(myregex, "") %>%
    setdiff(mtdf$Label)

mtdf$Label %>%
    setdiff(str_replace(mapresults$expid, myregex, ""))

Construct gene count matrix for each library

Add the “Label” to the count matrix and mapping results, then merge in phenotype data (by Label)

In [8]:
### Add "Label" to genecounts
genecounts %>%
    mutate(Label = str_replace(expid, myregex, "")) ->
    annogenecnts
In [9]:
### Collapse the gene counts within each label
annogenecnts %>%
    select(-expid) %>%
    group_by(Label) %>%
    summarize_all(sum) %>%
    gather(gene, value, -Label) %>%
    spread(Label, value) ->
    annogenecnts0

Show the resulting data frame in each step

In [10]:
genecounts[1:6, 1:6]
expidCNAG_00001CNAG_00002CNAG_00003CNAG_00004CNAG_00005
1_MA_J_S18_L001_ReadsPerGene.out.tab0 66 38 74 33
1_MA_J_S18_L002_ReadsPerGene.out.tab0 59 25 79 25
1_MA_J_S18_L003_ReadsPerGene.out.tab0 74 27 79 32
1_MA_J_S18_L004_ReadsPerGene.out.tab0 66 22 69 24
1_RZ_J_S26_L001_ReadsPerGene.out.tab0 50 16 51 26
1_RZ_J_S26_L002_ReadsPerGene.out.tab0 45 7 51 31
In [11]:
annogenecnts[1:6, c(colnames(annogenecnts)[1:6], "Label")]
expidCNAG_00001CNAG_00002CNAG_00003CNAG_00004CNAG_00005Label
1_MA_J_S18_L001_ReadsPerGene.out.tab0 66 38 74 33 1_MA_J
1_MA_J_S18_L002_ReadsPerGene.out.tab0 59 25 79 25 1_MA_J
1_MA_J_S18_L003_ReadsPerGene.out.tab0 74 27 79 32 1_MA_J
1_MA_J_S18_L004_ReadsPerGene.out.tab0 66 22 69 24 1_MA_J
1_RZ_J_S26_L001_ReadsPerGene.out.tab0 50 16 51 26 1_RZ_J
1_RZ_J_S26_L002_ReadsPerGene.out.tab0 45 7 51 31 1_RZ_J
In [12]:
annogenecnts0[1:6, 1:6]
gene1_MA_J1_RZ_J10_MA_C10_RZ_C11_MA_J
CNAG_00001 0 0 0 0 0
CNAG_00002 265 204 269 76 130
CNAG_00003 112 40 171 24 124
CNAG_00004 301 207 407 141 272
CNAG_00005 114 125 50 25 38
CNAG_000061904 1295 3571 1015 2073

Metadata

Add “Label” to read map results and merge in phenotype data (-> annomapres)

In [13]:
mapresults %>%
    mutate(Label = str_replace(expid, myregex, "")) %>%
    full_join(mtdf, by = "Label") ->
    annomapres
Warning message:
“Column `Label` joining character vector and factor, coercing into character vector”
In [14]:
grpvars <- vars(Label, Strain, Media, experiment_person, libprep_person, enrichment_method)
sumvars <- vars(prop.gene, prop.nofeat, prop.unique, depth)

annomapres %>%
    group_by_at(grpvars) %>%
    summarize_at(sumvars, mean) ->
    annomapres0
In [15]:
head(annomapres)
expidngenemapnambnmultinnofeatnunmapdepthprop.geneprop.nofeatprop.uniqueexperiment_personlibprep_personenrichment_methodRINconcentration_fold_differencei7 indexi5 indexi5 primeri7 primerlibrary#
1_MA_J_S18_L001_ReadsPerGene.out.tab2399781 647 66100 20347 2690 2489565 0.9639359 0.008172914 0.9721088 expA prepB MA 10 3.64 CGCTCATT AGGCTATA i501 i703 18
1_MA_J_S18_L002_ReadsPerGene.out.tab2362228 652 65234 20004 2684 2450802 0.9638592 0.008162226 0.9720214 expA prepB MA 10 3.64 CGCTCATT AGGCTATA i501 i703 18
1_MA_J_S18_L003_ReadsPerGene.out.tab2436776 697 66538 20549 2672 2527232 0.9642075 0.008131030 0.9723385 expA prepB MA 10 3.64 CGCTCATT AGGCTATA i501 i703 18
1_MA_J_S18_L004_ReadsPerGene.out.tab2417485 616 65066 20505 2585 2506257 0.9645798 0.008181523 0.9727614 expA prepB MA 10 3.64 CGCTCATT AGGCTATA i501 i703 18
1_RZ_J_S26_L001_ReadsPerGene.out.tab2366742 1431 395848 768146 7218 3539385 0.6686874 0.217028100 0.8857155 expA prepB RZ 10 3.64 GAGATTCC AGGCTATA i501 i704 26
1_RZ_J_S26_L002_ReadsPerGene.out.tab2331658 1337 388079 755654 7022 3483750 0.6692954 0.216908217 0.8862037 expA prepB RZ 10 3.64 GAGATTCC AGGCTATA i501 i704 26
In [16]:
head(annomapres0)
LabelStrainMediaexperiment_personlibprep_personenrichment_methodprop.geneprop.nofeatprop.uniquedepth
1_MA_J H99 YPD expA prepB MA 0.9641456 0.0081619230.9723075 2493464
1_RZ_J H99 YPD expA prepB RZ 0.6689001 0.2170956210.8859957 3541358
10_MA_C mar1d YPD expA prepA MA 0.9618651 0.0098185730.9716837 3282785
10_RZ_C mar1d YPD expA prepA RZ 0.7497438 0.2006516860.9503955 1742594
11_MA_J mar1d YPD expA prepB MA 0.9669597 0.0087178980.9756776 2062181
11_RZ_J mar1d YPD expA prepB RZ 0.7030020 0.1955471510.8985491 2621913

Store the results

In [17]:
outfile <- file.path(OUTDIR, "HTS-Pilot-Annotated-STAR-counts.RData")
save(mtdf, annogenecnts0, annomapres0, annogenecnts, annomapres, file = outfile)
tools::md5sum(outfile)
/home/jovyan/work/scratch/analysis_output/out/HTS-Pilot-Annotated-STAR-counts.RData: '79338a9ad2efa4bcd1e6e9fd2a2b0ef7'

Visualize the mapping results

In [18]:
### Figures for mapping results
mygeom <-  geom_point(position = position_jitter(w = 0.3, h = 0))
mypal <- scale_colour_manual(name="",  values =brewer.pal(3,"Set1"))
mytheme <-  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + theme_bw()
myfacet <- facet_grid(Strain~ Media, drop=TRUE, scales="free_x", space="free")

Show the fraction of unique mapped reads among all reads (prop.unique)

In [19]:
options(repr.plot.width = 9, repr.plot.height = 6)

p1 <- ggplot(annomapres,
             aes(x = factor(Label),
                 y = prop.unique,
                 shape = Strain,
                 color = Media)) +
        myfacet +
        mygeom +
        mytheme +
        mypal

print(p1)
../_images/pilot_02_Plot_Counts_33_0.png

Show the fraction of reads mapped to genes (prop.gene)

In [20]:
options(repr.plot.width = 9, repr.plot.height = 6)

p2 <- ggplot(annomapres,
             aes(x = Label,
                 y = prop.gene,
                 shape = Strain,
                 color = Media)) +
        myfacet +
        mygeom +
        mytheme +
        mypal

print(p2)
../_images/pilot_02_Plot_Counts_35_0.png

Show the fraction of reads categorized as “no feature” (prop.nofeat)

In [21]:
options(repr.plot.width = 9, repr.plot.height = 6)

p3 <- ggplot(annomapres,
             aes(x = Label,
                 y = prop.nofeat,
                 shape = Strain,
                 color = Media))+
        myfacet +
        mygeom +
        mytheme +
        mypal

print(p3)
../_images/pilot_02_Plot_Counts_37_0.png

Show the number of all the reads in each sample

(Note: depth = ngenemap + namb + nmulti + nnofeat + nunmap)

In [22]:
options(repr.plot.width = 9, repr.plot.height = 6)

p4 <- ggplot(annomapres,
             aes(x = Label,
                 y = depth,
                 shape = Strain,
                 color = Media))+
        myfacet +
        mygeom +
        mytheme +
        mypal

print(p4)
../_images/pilot_02_Plot_Counts_39_0.png

Store the plots

In [23]:
png(file.path(IMGDIR, "p1.png"), height = 480 * 2, width = 480 * 2)
plot(p1)
graphics.off()

png(file.path(IMGDIR, "p2.png"), height = 480 * 2, width = 480 * 2)
plot(p2)
graphics.off()

png(file.path(IMGDIR, "p3.png"), height = 480 * 2, width = 480 * 2)
plot(p3)
graphics.off()

png(file.path(IMGDIR, "p4.png"), height = 480 * 2, width = 480 * 2)
plot(p4)
graphics.off()

The End

In [24]:
sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

Matrix products: default
BLAS: /opt/conda/lib/R/lib/libRblas.so
LAPACK: /opt/conda/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 [1] tools     parallel  stats4    stats     graphics  grDevices utils
 [8] datasets  methods   base

other attached packages:
 [1] bindrcpp_0.2               plotly_4.8.0
 [3] dendextend_1.8.0           gridExtra_2.3
 [5] RColorBrewer_1.1-2         qvalue_2.10.0
 [7] limma_3.34.9               DESeq2_1.18.1
 [9] SummarizedExperiment_1.8.0 DelayedArray_0.4.1
[11] matrixStats_0.53.1         Biobase_2.38.0
[13] GenomicRanges_1.30.3       GenomeInfoDb_1.14.0
[15] IRanges_2.12.0             S4Vectors_0.16.0
[17] BiocGenerics_0.24.0        haven_1.1.1
[19] stringr_1.3.0              foreach_1.4.4
[21] dplyr_0.7.4                purrr_0.2.4
[23] readr_1.1.1                tidyr_0.8.1
[25] tibble_1.4.2               ggplot2_3.0.0
[27] tidyverse_1.1.1

loaded via a namespace (and not attached):
  [1] colorspace_1.3-2        class_7.3-14            modeltools_0.2-22
  [4] mclust_5.4.1            IRdisplay_0.4.4         htmlTable_1.9
  [7] XVector_0.18.0          base64enc_0.1-3         flexmix_2.3-14
 [10] bit64_0.9-5             mvtnorm_1.0-8           AnnotationDbi_1.40.0
 [13] lubridate_1.7.4         xml2_1.2.0              codetools_0.2-15
 [16] splines_3.4.1           mnormt_1.5-5            robustbase_0.92-7
 [19] geneplotter_1.56.0      knitr_1.20              IRkernel_0.8.11
 [22] Formula_1.2-1           jsonlite_1.5            broom_0.4.4
 [25] annotate_1.56.0         kernlab_0.9-25          cluster_2.0.6
 [28] compiler_3.4.1          httr_1.3.1              backports_1.1.2
 [31] assertthat_0.2.0        Matrix_1.2-14           lazyeval_0.2.1
 [34] acepack_1.4.1           htmltools_0.3.6         gtable_0.2.0
 [37] glue_1.2.0              GenomeInfoDbData_0.99.0 reshape2_1.4.3
 [40] Rcpp_0.12.15            trimcluster_0.1-2.1     cellranger_1.1.0
 [43] nlme_3.1-131            fpc_2.1-11.1            iterators_1.0.9
 [46] psych_1.8.4             rvest_0.3.2             XML_3.98-1.11
 [49] DEoptimR_1.0-8          MASS_7.3-48             zlibbioc_1.24.0
 [52] scales_0.5.0            hms_0.3                 memoise_1.1.0
 [55] rpart_4.1-13            latticeExtra_0.6-28     stringi_1.1.7
 [58] RSQLite_2.0             genefilter_1.60.0       checkmate_1.8.5
 [61] BiocParallel_1.12.0     repr_0.15.0             prabclus_2.2-6
 [64] rlang_0.2.1             pkgconfig_2.0.1         bitops_1.0-6
 [67] evaluate_0.10.1         lattice_0.20-34         bindr_0.1.1
 [70] labeling_0.3            htmlwidgets_1.2         tidyselect_0.2.4
 [73] bit_1.1-12              plyr_1.8.4              magrittr_1.5
 [76] R6_2.2.2                Hmisc_4.0-3             pbdZMQ_0.3-2
 [79] DBI_1.0.0               pillar_1.2.2            whisker_0.3-2
 [82] foreign_0.8-67          withr_2.1.1             survival_2.40-1
 [85] RCurl_1.95-4.8          nnet_7.3-12             modelr_0.1.2
 [88] crayon_1.3.4            uuid_0.1-2              viridis_0.5.1
 [91] locfit_1.5-9.1          grid_3.4.1              readxl_1.1.0
 [94] data.table_1.10.4       blob_1.1.1              forcats_0.3.0
 [97] diptest_0.75-7          digest_0.6.15           xtable_1.8-2
[100] munsell_0.5.0           viridisLite_0.3.0