About the notebook

This notebook sets up the pilot count analysis: we create the folders for our output and check that the input files we need are in place.

In [1]:
# treat any reference to an unset variable as an error
set -u

About the purpose of the analysis

This series of tutorials covers three steps:

1. Create the count matrix from the STAR output.
2. Use the DESeq package to normalize the count matrix.
3. Run a pathway analysis.

Create directories

Below is an illustration of our folder structure.

scratch
└── bioinf_intro
    └── analysis_output
        ├── out -> the folder to store all our output data files
        └── img -> the folder to store all our images

Set the path of each folder

In [2]:
CURDIR="$HOME/work/scratch/analysis_output"
OUTDIR="${CURDIR}/out"
IMGDIR="${CURDIR}/img"

Create the directories

In [3]:
mkdir -p "$OUTDIR"
mkdir -p "$IMGDIR"

Check that the directories were created correctly

In [4]:
ls "$CURDIR"
img  out
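
Listing the parent folder works for a quick visual check. As a minimal sketch of a stricter test, the helper below reports whether a given directory exists. The function name `check_dir` and the throwaway `DEMO_DIR` tree are assumptions made for illustration; they are not part of the original notebook.

```bash
# Hypothetical helper (not in the original notebook):
# print OK if the directory exists, otherwise print MISSING and return 1.
check_dir() {
    if [ -d "$1" ]; then
        echo "OK: $1"
    else
        echo "MISSING: $1" >&2
        return 1
    fi
}

# Demonstration on a temporary tree that mirrors the out/ and img/ layout.
DEMO_DIR="$(mktemp -d)"
mkdir -p "$DEMO_DIR/out" "$DEMO_DIR/img"
check_dir "$DEMO_DIR/out"
check_dir "$DEMO_DIR/img"
```

In the notebook itself, the same helper could be called as `check_dir "$OUTDIR"` and `check_dir "$IMGDIR"`.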

Check all the files we need

The files we need throughout this series of tutorials are the count files and the metadata, which you have been using during Cliburn’s lecture.

In [5]:
DATDIR="/data/hts2018_pilot/star_counts"
METADTFILE="$HOME/work/HTS2018-notebooks/josh/info/2018_pilot_metadata_anon.tsv"

The metadata for the samples in the pilot data:

In [6]:
head "$METADTFILE"
Label   RNA_sample_num  Media   Strain  Replicate       experiment_person       libprep_person  enrichment_method       RIN     concentration_fold_difference   i7 index        i5 index        i5 primer       i7 primer       library#
2_MA_C  2       YPD     H99     2       expA    prepA   MA      10      1.34    ATTACTCG        AGGCTATA        i501    i701    1
9_MA_C  9       YPD     mar1d   3       expA    prepA   MA      10      2.23    ATTACTCG        GCCTCTAT        i502    i701    2
10_MA_C 10      YPD     mar1d   4       expA    prepA   MA      9.9     4.37    ATTACTCG        AGGATAGG        i503    i701    3
14_MA_C 14      TC      H99     2       expA    prepA   MA      10      1.57    ATTACTCG        TCAGAGCC        i504    i701    4
15_MA_C 15      TC      H99     3       expA    prepA   MA      9.9     2.85    ATTACTCG        CTTCGCCT        i505    i701    5
21_MA_C 21      TC      mar1d   3       expA    prepA   MA      10      1.81    ATTACTCG        TAAGATTA        i506    i701    6
22_MA_C 22      TC      mar1d   4       expA    prepA   MA      9.9     2.01    ATTACTCG        ACGTCCTG        i507    i701    7
26_MA_C 26      YPD     H99     8       expB    prepA   MA      10      2.76    ATTACTCG        GTCAGTAC        i508    i701    8
2_RZ_C  2       YPD     H99     2       expA    prepA   RZ      10      1.34    TCCGGAGA        AGGCTATA        i501    i702    9
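
The metadata header is wide, so it can help to print the column names one per line with their positions. The sketch below does this for any tab-separated file. The function name `list_columns` and the small demo file are assumptions for illustration; the demo uses only the first four column names from the header shown above.

```bash
# Hypothetical helper (not in the original notebook):
# print the header fields of a tab-separated file, one per line, numbered.
list_columns() {
    head -n 1 "$1" | tr '\t' '\n' | nl
}

# Demonstration on a tiny stand-in file with four of the metadata columns.
DEMO_TSV="$(mktemp)"
printf 'Label\tRNA_sample_num\tMedia\tStrain\n2_MA_C\t2\tYPD\tH99\n' > "$DEMO_TSV"
list_columns "$DEMO_TSV"
```

In the notebook, `list_columns "$METADTFILE"` would number every column of the real metadata file, which makes it easier to pick fields by position with tools such as `cut -f`.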

The count files produced by the STAR alignment, which was explained in Josh’s lecture:

In [7]:
# there should be 204 files in this directory (51 libraries x 4 lanes)
ls "$DATDIR" | wc -l
204

In [8]:
ls "$DATDIR"
10_MA_C_S3_L001_ReadsPerGene.out.tab   26_RZ_C_S16_L003_ReadsPerGene.out.tab
10_MA_C_S3_L002_ReadsPerGene.out.tab   26_RZ_C_S16_L004_ReadsPerGene.out.tab
10_MA_C_S3_L003_ReadsPerGene.out.tab   27_MA_P_S38_L001_ReadsPerGene.out.tab
10_MA_C_S3_L004_ReadsPerGene.out.tab   27_MA_P_S38_L002_ReadsPerGene.out.tab
10_RZ_C_S11_L001_ReadsPerGene.out.tab  27_MA_P_S38_L003_ReadsPerGene.out.tab
10_RZ_C_S11_L002_ReadsPerGene.out.tab  27_MA_P_S38_L004_ReadsPerGene.out.tab
10_RZ_C_S11_L003_ReadsPerGene.out.tab  27_RZ_P_S46_L001_ReadsPerGene.out.tab
10_RZ_C_S11_L004_ReadsPerGene.out.tab  27_RZ_P_S46_L002_ReadsPerGene.out.tab
11_MA_J_S20_L001_ReadsPerGene.out.tab  27_RZ_P_S46_L003_ReadsPerGene.out.tab
11_MA_J_S20_L002_ReadsPerGene.out.tab  27_RZ_P_S46_L004_ReadsPerGene.out.tab
11_MA_J_S20_L003_ReadsPerGene.out.tab  2_MA_C_S1_L001_ReadsPerGene.out.tab
11_MA_J_S20_L004_ReadsPerGene.out.tab  2_MA_C_S1_L002_ReadsPerGene.out.tab
11_RZ_J_S28_L001_ReadsPerGene.out.tab  2_MA_C_S1_L003_ReadsPerGene.out.tab
11_RZ_J_S28_L002_ReadsPerGene.out.tab  2_MA_C_S1_L004_ReadsPerGene.out.tab
11_RZ_J_S28_L003_ReadsPerGene.out.tab  2_RZ_C_S9_L001_ReadsPerGene.out.tab
11_RZ_J_S28_L004_ReadsPerGene.out.tab  2_RZ_C_S9_L002_ReadsPerGene.out.tab
12_MA_P_S36_L001_ReadsPerGene.out.tab  2_RZ_C_S9_L003_ReadsPerGene.out.tab
12_MA_P_S36_L002_ReadsPerGene.out.tab  2_RZ_C_S9_L004_ReadsPerGene.out.tab
12_MA_P_S36_L003_ReadsPerGene.out.tab  2_TOT_C_S17_L001_ReadsPerGene.out.tab
12_MA_P_S36_L004_ReadsPerGene.out.tab  2_TOT_C_S17_L002_ReadsPerGene.out.tab
12_RZ_P_S44_L001_ReadsPerGene.out.tab  2_TOT_C_S17_L003_ReadsPerGene.out.tab
12_RZ_P_S44_L002_ReadsPerGene.out.tab  2_TOT_C_S17_L004_ReadsPerGene.out.tab
12_RZ_P_S44_L003_ReadsPerGene.out.tab  35_MA_P_S39_L001_ReadsPerGene.out.tab
12_RZ_P_S44_L004_ReadsPerGene.out.tab  35_MA_P_S39_L002_ReadsPerGene.out.tab
13_MA_J_S21_L001_ReadsPerGene.out.tab  35_MA_P_S39_L003_ReadsPerGene.out.tab
13_MA_J_S21_L002_ReadsPerGene.out.tab  35_MA_P_S39_L004_ReadsPerGene.out.tab
13_MA_J_S21_L003_ReadsPerGene.out.tab  35_RZ_P_S47_L001_ReadsPerGene.out.tab
13_MA_J_S21_L004_ReadsPerGene.out.tab  35_RZ_P_S47_L002_ReadsPerGene.out.tab
13_RZ_J_S29_L001_ReadsPerGene.out.tab  35_RZ_P_S47_L003_ReadsPerGene.out.tab
13_RZ_J_S29_L002_ReadsPerGene.out.tab  35_RZ_P_S47_L004_ReadsPerGene.out.tab
13_RZ_J_S29_L003_ReadsPerGene.out.tab  36_MA_J_S24_L001_ReadsPerGene.out.tab
13_RZ_J_S29_L004_ReadsPerGene.out.tab  36_MA_J_S24_L002_ReadsPerGene.out.tab
14_MA_C_S4_L001_ReadsPerGene.out.tab   36_MA_J_S24_L003_ReadsPerGene.out.tab
14_MA_C_S4_L002_ReadsPerGene.out.tab   36_MA_J_S24_L004_ReadsPerGene.out.tab
14_MA_C_S4_L003_ReadsPerGene.out.tab   36_RZ_J_S32_L001_ReadsPerGene.out.tab
14_MA_C_S4_L004_ReadsPerGene.out.tab   36_RZ_J_S32_L002_ReadsPerGene.out.tab
14_RZ_C_S12_L001_ReadsPerGene.out.tab  36_RZ_J_S32_L003_ReadsPerGene.out.tab
14_RZ_C_S12_L002_ReadsPerGene.out.tab  36_RZ_J_S32_L004_ReadsPerGene.out.tab
14_RZ_C_S12_L003_ReadsPerGene.out.tab  38_MA_P_S40_L001_ReadsPerGene.out.tab
14_RZ_C_S12_L004_ReadsPerGene.out.tab  38_MA_P_S40_L002_ReadsPerGene.out.tab
15_MA_C_S5_L001_ReadsPerGene.out.tab   38_MA_P_S40_L003_ReadsPerGene.out.tab
15_MA_C_S5_L002_ReadsPerGene.out.tab   38_MA_P_S40_L004_ReadsPerGene.out.tab
15_MA_C_S5_L003_ReadsPerGene.out.tab   38_RZ_P_S48_L001_ReadsPerGene.out.tab
15_MA_C_S5_L004_ReadsPerGene.out.tab   38_RZ_P_S48_L002_ReadsPerGene.out.tab
15_RZ_C_S13_L001_ReadsPerGene.out.tab  38_RZ_P_S48_L003_ReadsPerGene.out.tab
15_RZ_C_S13_L002_ReadsPerGene.out.tab  38_RZ_P_S48_L004_ReadsPerGene.out.tab
15_RZ_C_S13_L003_ReadsPerGene.out.tab  3_MA_J_S19_L001_ReadsPerGene.out.tab
15_RZ_C_S13_L004_ReadsPerGene.out.tab  3_MA_J_S19_L002_ReadsPerGene.out.tab
16_MA_P_S37_L001_ReadsPerGene.out.tab  3_MA_J_S19_L003_ReadsPerGene.out.tab
16_MA_P_S37_L002_ReadsPerGene.out.tab  3_MA_J_S19_L004_ReadsPerGene.out.tab
16_MA_P_S37_L003_ReadsPerGene.out.tab  3_RZ_J_S27_L001_ReadsPerGene.out.tab
16_MA_P_S37_L004_ReadsPerGene.out.tab  3_RZ_J_S27_L002_ReadsPerGene.out.tab
16_RZ_P_S45_L001_ReadsPerGene.out.tab  3_RZ_J_S27_L003_ReadsPerGene.out.tab
16_RZ_P_S45_L002_ReadsPerGene.out.tab  3_RZ_J_S27_L004_ReadsPerGene.out.tab
16_RZ_P_S45_L003_ReadsPerGene.out.tab  3_TOT_J_S34_L001_ReadsPerGene.out.tab
16_RZ_P_S45_L004_ReadsPerGene.out.tab  3_TOT_J_S34_L002_ReadsPerGene.out.tab
1_MA_J_S18_L001_ReadsPerGene.out.tab   3_TOT_J_S34_L003_ReadsPerGene.out.tab
1_MA_J_S18_L002_ReadsPerGene.out.tab   3_TOT_J_S34_L004_ReadsPerGene.out.tab
1_MA_J_S18_L003_ReadsPerGene.out.tab   40_MA_J_S25_L001_ReadsPerGene.out.tab
1_MA_J_S18_L004_ReadsPerGene.out.tab   40_MA_J_S25_L002_ReadsPerGene.out.tab
1_RZ_J_S26_L001_ReadsPerGene.out.tab   40_MA_J_S25_L003_ReadsPerGene.out.tab
1_RZ_J_S26_L002_ReadsPerGene.out.tab   40_MA_J_S25_L004_ReadsPerGene.out.tab
1_RZ_J_S26_L003_ReadsPerGene.out.tab   40_RZ_J_S33_L001_ReadsPerGene.out.tab
1_RZ_J_S26_L004_ReadsPerGene.out.tab   40_RZ_J_S33_L002_ReadsPerGene.out.tab
21_MA_C_S6_L001_ReadsPerGene.out.tab   40_RZ_J_S33_L003_ReadsPerGene.out.tab
21_MA_C_S6_L002_ReadsPerGene.out.tab   40_RZ_J_S33_L004_ReadsPerGene.out.tab
21_MA_C_S6_L003_ReadsPerGene.out.tab   45_MA_P_S41_L001_ReadsPerGene.out.tab
21_MA_C_S6_L004_ReadsPerGene.out.tab   45_MA_P_S41_L002_ReadsPerGene.out.tab
21_RZ_C_S14_L001_ReadsPerGene.out.tab  45_MA_P_S41_L003_ReadsPerGene.out.tab
21_RZ_C_S14_L002_ReadsPerGene.out.tab  45_MA_P_S41_L004_ReadsPerGene.out.tab
21_RZ_C_S14_L003_ReadsPerGene.out.tab  45_RZ_P_S49_L001_ReadsPerGene.out.tab
21_RZ_C_S14_L004_ReadsPerGene.out.tab  45_RZ_P_S49_L002_ReadsPerGene.out.tab
22_MA_C_S7_L001_ReadsPerGene.out.tab   45_RZ_P_S49_L003_ReadsPerGene.out.tab
22_MA_C_S7_L002_ReadsPerGene.out.tab   45_RZ_P_S49_L004_ReadsPerGene.out.tab
22_MA_C_S7_L003_ReadsPerGene.out.tab   47_MA_P_S42_L001_ReadsPerGene.out.tab
22_MA_C_S7_L004_ReadsPerGene.out.tab   47_MA_P_S42_L002_ReadsPerGene.out.tab
22_RZ_C_S15_L001_ReadsPerGene.out.tab  47_MA_P_S42_L003_ReadsPerGene.out.tab
22_RZ_C_S15_L002_ReadsPerGene.out.tab  47_MA_P_S42_L004_ReadsPerGene.out.tab
22_RZ_C_S15_L003_ReadsPerGene.out.tab  47_RZ_P_S50_L001_ReadsPerGene.out.tab
22_RZ_C_S15_L004_ReadsPerGene.out.tab  47_RZ_P_S50_L002_ReadsPerGene.out.tab
23_MA_J_S22_L001_ReadsPerGene.out.tab  47_RZ_P_S50_L003_ReadsPerGene.out.tab
23_MA_J_S22_L002_ReadsPerGene.out.tab  47_RZ_P_S50_L004_ReadsPerGene.out.tab
23_MA_J_S22_L003_ReadsPerGene.out.tab  4_MA_P_S35_L001_ReadsPerGene.out.tab
23_MA_J_S22_L004_ReadsPerGene.out.tab  4_MA_P_S35_L002_ReadsPerGene.out.tab
23_RZ_J_S30_L001_ReadsPerGene.out.tab  4_MA_P_S35_L003_ReadsPerGene.out.tab
23_RZ_J_S30_L002_ReadsPerGene.out.tab  4_MA_P_S35_L004_ReadsPerGene.out.tab
23_RZ_J_S30_L003_ReadsPerGene.out.tab  4_RZ_P_S43_L001_ReadsPerGene.out.tab
23_RZ_J_S30_L004_ReadsPerGene.out.tab  4_RZ_P_S43_L002_ReadsPerGene.out.tab
24_MA_J_S23_L001_ReadsPerGene.out.tab  4_RZ_P_S43_L003_ReadsPerGene.out.tab
24_MA_J_S23_L002_ReadsPerGene.out.tab  4_RZ_P_S43_L004_ReadsPerGene.out.tab
24_MA_J_S23_L003_ReadsPerGene.out.tab  4_TOT_P_S51_L001_ReadsPerGene.out.tab
24_MA_J_S23_L004_ReadsPerGene.out.tab  4_TOT_P_S51_L002_ReadsPerGene.out.tab
24_RZ_J_S31_L001_ReadsPerGene.out.tab  4_TOT_P_S51_L003_ReadsPerGene.out.tab
24_RZ_J_S31_L002_ReadsPerGene.out.tab  4_TOT_P_S51_L004_ReadsPerGene.out.tab
24_RZ_J_S31_L003_ReadsPerGene.out.tab  9_MA_C_S2_L001_ReadsPerGene.out.tab
24_RZ_J_S31_L004_ReadsPerGene.out.tab  9_MA_C_S2_L002_ReadsPerGene.out.tab
26_MA_C_S8_L001_ReadsPerGene.out.tab   9_MA_C_S2_L003_ReadsPerGene.out.tab
26_MA_C_S8_L002_ReadsPerGene.out.tab   9_MA_C_S2_L004_ReadsPerGene.out.tab
26_MA_C_S8_L003_ReadsPerGene.out.tab   9_RZ_C_S10_L001_ReadsPerGene.out.tab
26_MA_C_S8_L004_ReadsPerGene.out.tab   9_RZ_C_S10_L002_ReadsPerGene.out.tab
26_RZ_C_S16_L001_ReadsPerGene.out.tab  9_RZ_C_S10_L003_ReadsPerGene.out.tab
26_RZ_C_S16_L002_ReadsPerGene.out.tab  9_RZ_C_S10_L004_ReadsPerGene.out.tab
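
Each file name in the listing above ends in `_L001` to `_L004` followed by `_ReadsPerGene.out.tab`, so every library should contribute four files, one per lane. As a rough sketch, the helper below strips the lane suffix and counts the files per library. The function name `lanes_per_library` and the mock directory are assumptions for illustration, not part of the original notebook.

```bash
# Hypothetical helper (not in the original notebook):
# print "<number of files> <library prefix>" for each library in a directory.
lanes_per_library() {
    ls "$1" \
        | grep '_ReadsPerGene\.out\.tab$' \
        | sed -E 's/_L[0-9]{3}_ReadsPerGene\.out\.tab$//' \
        | sort \
        | uniq -c \
        | awk '{print $1, $2}'
}

# Demonstration on mock files: one complete library and one missing a lane.
DEMO_COUNTS="$(mktemp -d)"
for lane in L001 L002 L003 L004; do
    touch "$DEMO_COUNTS/2_MA_C_S1_${lane}_ReadsPerGene.out.tab"
done
for lane in L001 L002 L003; do
    touch "$DEMO_COUNTS/9_MA_C_S2_${lane}_ReadsPerGene.out.tab"
done
lanes_per_library "$DEMO_COUNTS"
```

On the mock directory this prints `4 2_MA_C_S1` and `3 9_MA_C_S2`, flagging the second library as incomplete. In the notebook, `lanes_per_library "$DATDIR" | awk '$1 != 4'` would list any library that does not have exactly four lane files.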