About the notebook
This notebook sets up the pilot count analysis: we create the folders for the output and check that the input files we need are in place.
In [1]:
set -u
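The `set -u` option makes bash treat a reference to an unset variable as an error instead of silently expanding it to an empty string, which guards against typos in the directory variables defined later. Below is a minimal sketch of the effect. The misspelled variable name `OUTDRI` is hypothetical, and each case runs in a child `bash -c` process so that the error does not end the current session.

```shell
# Hypothetical example: OUTDRI is a deliberate misspelling of OUTDIR and is never set.
unset OUTDRI

# Without -u, the unset variable silently expands to an empty string.
without_u=$(bash -c 'echo "path=[${OUTDRI}/out]"')
echo "$without_u"

# With -u, the same reference is an error and the child shell exits with a non-zero status.
if bash -c 'set -u; echo "path=[${OUTDRI}/out]"' 2>/dev/null; then
    with_u_status="ok"
else
    with_u_status="failed"
fi
echo "with set -u: $with_u_status"
```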
About the purpose of the analysis
This series of tutorials goes through the following steps:
1. create the count matrix from the STAR output
2. use the DESeq package to normalize the count matrix
3. pathway analysis
Create directories
Below is an illustration of our folder structure.
scratch
└── bioinf_intro
    └── analysis_output
        ├── out -> the folder to store all our output data files
        └── img -> the folder to store all our images
Set the path of each folder:
In [2]:
CURDIR="$HOME/work/scratch/analysis_output"
OUTDIR="${CURDIR}/out"
IMGDIR="${CURDIR}/img"
Create the directories:
In [3]:
mkdir -p "$OUTDIR"
mkdir -p "$IMGDIR"
Check that the directories were created correctly:
In [4]:
ls "$CURDIR"
img out
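Listing `$CURDIR` confirms the folders by eye. A scripted check could look like the sketch below. To stay self-contained it runs against a temporary directory created with `mktemp -d` rather than the real `$CURDIR`, and the helper name `check_dir` is hypothetical.

```shell
# check_dir is a hypothetical helper: it succeeds only if the path
# is an existing, writable directory.
check_dir() {
    if [ -d "$1" ] && [ -w "$1" ]; then
        echo "OK: $1"
    else
        echo "MISSING or not writable: $1" >&2
        return 1
    fi
}

# Stand-in for $CURDIR so the sketch can run anywhere.
DEMO_CURDIR="$(mktemp -d)"
mkdir -p "${DEMO_CURDIR}/out" "${DEMO_CURDIR}/img"

check_dir "${DEMO_CURDIR}/out"
check_dir "${DEMO_CURDIR}/img"
```

With the variables from this notebook, the calls would be `check_dir "$OUTDIR"` and `check_dir "$IMGDIR"`.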
Check all the files we need
The files we need throughout this series of tutorials are the count data and the sample metadata, which you used during Cliburn’s lecture.
In [5]:
DATDIR="/data/hts2018_pilot/star_counts"
METADTFILE="$HOME/work/HTS2018-notebooks/josh/info/2018_pilot_metadata_anon.tsv"
The metadata for the samples in the pilot data:
In [6]:
head "$METADTFILE"
Label RNA_sample_num Media Strain Replicate experiment_person libprep_person enrichment_method RIN concentration_fold_difference i7 index i5 index i5 primer i7 primer library#
2_MA_C 2 YPD H99 2 expA prepA MA 10 1.34 ATTACTCG AGGCTATA i501 i701 1
9_MA_C 9 YPD mar1d 3 expA prepA MA 10 2.23 ATTACTCG GCCTCTAT i502 i701 2
10_MA_C 10 YPD mar1d 4 expA prepA MA 9.9 4.37 ATTACTCG AGGATAGG i503 i701 3
14_MA_C 14 TC H99 2 expA prepA MA 10 1.57 ATTACTCG TCAGAGCC i504 i701 4
15_MA_C 15 TC H99 3 expA prepA MA 9.9 2.85 ATTACTCG CTTCGCCT i505 i701 5
21_MA_C 21 TC mar1d 3 expA prepA MA 10 1.81 ATTACTCG TAAGATTA i506 i701 6
22_MA_C 22 TC mar1d 4 expA prepA MA 9.9 2.01 ATTACTCG ACGTCCTG i507 i701 7
26_MA_C 26 YPD H99 8 expB prepA MA 10 2.76 ATTACTCG GTCAGTAC i508 i701 8
2_RZ_C 2 YPD H99 2 expA prepA RZ 10 1.34 TCCGGAGA AGGCTATA i501 i702 9
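Several column names contain spaces (for example `i7 index`), so the columns are easier to tell apart when listed one per line. The sketch below assumes the metadata file is tab-separated, as its `.tsv` extension suggests. To stay self-contained it works on a temporary demo file that holds only the header and the first row shown above; in the notebook, `"$METADTFILE"` would take the place of `"$DEMO_META"`.

```shell
# Demo file: the header and first row shown above, assumed to be tab-separated
# (an assumption based on the .tsv extension).
DEMO_META="$(mktemp)"
printf 'Label\tRNA_sample_num\tMedia\tStrain\tReplicate\texperiment_person\tlibprep_person\tenrichment_method\tRIN\tconcentration_fold_difference\ti7 index\ti5 index\ti5 primer\ti7 primer\tlibrary#\n' > "$DEMO_META"
printf '2_MA_C\t2\tYPD\tH99\t2\texpA\tprepA\tMA\t10\t1.34\tATTACTCG\tAGGCTATA\ti501\ti701\t1\n' >> "$DEMO_META"

# Number the column names, one per line.
head -n 1 "$DEMO_META" | tr '\t' '\n' | cat -n

# Count the fields on every line; a single distinct value means
# every row has the same number of columns as the header.
n_fields=$(awk -F '\t' '{ print NF }' "$DEMO_META" | sort -u)
echo "fields per line: $n_fields"
```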
The count files output by the STAR alignment, which was explained in Josh’s lecture:
In [7]:
# there should be 204 count files in this directory (51 libraries x 4 lanes)
ls "$DATDIR" | wc -l
204
In [8]:
ls $DATDIR
10_MA_C_S3_L001_ReadsPerGene.out.tab 26_RZ_C_S16_L003_ReadsPerGene.out.tab
10_MA_C_S3_L002_ReadsPerGene.out.tab 26_RZ_C_S16_L004_ReadsPerGene.out.tab
10_MA_C_S3_L003_ReadsPerGene.out.tab 27_MA_P_S38_L001_ReadsPerGene.out.tab
10_MA_C_S3_L004_ReadsPerGene.out.tab 27_MA_P_S38_L002_ReadsPerGene.out.tab
10_RZ_C_S11_L001_ReadsPerGene.out.tab 27_MA_P_S38_L003_ReadsPerGene.out.tab
10_RZ_C_S11_L002_ReadsPerGene.out.tab 27_MA_P_S38_L004_ReadsPerGene.out.tab
10_RZ_C_S11_L003_ReadsPerGene.out.tab 27_RZ_P_S46_L001_ReadsPerGene.out.tab
10_RZ_C_S11_L004_ReadsPerGene.out.tab 27_RZ_P_S46_L002_ReadsPerGene.out.tab
11_MA_J_S20_L001_ReadsPerGene.out.tab 27_RZ_P_S46_L003_ReadsPerGene.out.tab
11_MA_J_S20_L002_ReadsPerGene.out.tab 27_RZ_P_S46_L004_ReadsPerGene.out.tab
11_MA_J_S20_L003_ReadsPerGene.out.tab 2_MA_C_S1_L001_ReadsPerGene.out.tab
11_MA_J_S20_L004_ReadsPerGene.out.tab 2_MA_C_S1_L002_ReadsPerGene.out.tab
11_RZ_J_S28_L001_ReadsPerGene.out.tab 2_MA_C_S1_L003_ReadsPerGene.out.tab
11_RZ_J_S28_L002_ReadsPerGene.out.tab 2_MA_C_S1_L004_ReadsPerGene.out.tab
11_RZ_J_S28_L003_ReadsPerGene.out.tab 2_RZ_C_S9_L001_ReadsPerGene.out.tab
11_RZ_J_S28_L004_ReadsPerGene.out.tab 2_RZ_C_S9_L002_ReadsPerGene.out.tab
12_MA_P_S36_L001_ReadsPerGene.out.tab 2_RZ_C_S9_L003_ReadsPerGene.out.tab
12_MA_P_S36_L002_ReadsPerGene.out.tab 2_RZ_C_S9_L004_ReadsPerGene.out.tab
12_MA_P_S36_L003_ReadsPerGene.out.tab 2_TOT_C_S17_L001_ReadsPerGene.out.tab
12_MA_P_S36_L004_ReadsPerGene.out.tab 2_TOT_C_S17_L002_ReadsPerGene.out.tab
12_RZ_P_S44_L001_ReadsPerGene.out.tab 2_TOT_C_S17_L003_ReadsPerGene.out.tab
12_RZ_P_S44_L002_ReadsPerGene.out.tab 2_TOT_C_S17_L004_ReadsPerGene.out.tab
12_RZ_P_S44_L003_ReadsPerGene.out.tab 35_MA_P_S39_L001_ReadsPerGene.out.tab
12_RZ_P_S44_L004_ReadsPerGene.out.tab 35_MA_P_S39_L002_ReadsPerGene.out.tab
13_MA_J_S21_L001_ReadsPerGene.out.tab 35_MA_P_S39_L003_ReadsPerGene.out.tab
13_MA_J_S21_L002_ReadsPerGene.out.tab 35_MA_P_S39_L004_ReadsPerGene.out.tab
13_MA_J_S21_L003_ReadsPerGene.out.tab 35_RZ_P_S47_L001_ReadsPerGene.out.tab
13_MA_J_S21_L004_ReadsPerGene.out.tab 35_RZ_P_S47_L002_ReadsPerGene.out.tab
13_RZ_J_S29_L001_ReadsPerGene.out.tab 35_RZ_P_S47_L003_ReadsPerGene.out.tab
13_RZ_J_S29_L002_ReadsPerGene.out.tab 35_RZ_P_S47_L004_ReadsPerGene.out.tab
13_RZ_J_S29_L003_ReadsPerGene.out.tab 36_MA_J_S24_L001_ReadsPerGene.out.tab
13_RZ_J_S29_L004_ReadsPerGene.out.tab 36_MA_J_S24_L002_ReadsPerGene.out.tab
14_MA_C_S4_L001_ReadsPerGene.out.tab 36_MA_J_S24_L003_ReadsPerGene.out.tab
14_MA_C_S4_L002_ReadsPerGene.out.tab 36_MA_J_S24_L004_ReadsPerGene.out.tab
14_MA_C_S4_L003_ReadsPerGene.out.tab 36_RZ_J_S32_L001_ReadsPerGene.out.tab
14_MA_C_S4_L004_ReadsPerGene.out.tab 36_RZ_J_S32_L002_ReadsPerGene.out.tab
14_RZ_C_S12_L001_ReadsPerGene.out.tab 36_RZ_J_S32_L003_ReadsPerGene.out.tab
14_RZ_C_S12_L002_ReadsPerGene.out.tab 36_RZ_J_S32_L004_ReadsPerGene.out.tab
14_RZ_C_S12_L003_ReadsPerGene.out.tab 38_MA_P_S40_L001_ReadsPerGene.out.tab
14_RZ_C_S12_L004_ReadsPerGene.out.tab 38_MA_P_S40_L002_ReadsPerGene.out.tab
15_MA_C_S5_L001_ReadsPerGene.out.tab 38_MA_P_S40_L003_ReadsPerGene.out.tab
15_MA_C_S5_L002_ReadsPerGene.out.tab 38_MA_P_S40_L004_ReadsPerGene.out.tab
15_MA_C_S5_L003_ReadsPerGene.out.tab 38_RZ_P_S48_L001_ReadsPerGene.out.tab
15_MA_C_S5_L004_ReadsPerGene.out.tab 38_RZ_P_S48_L002_ReadsPerGene.out.tab
15_RZ_C_S13_L001_ReadsPerGene.out.tab 38_RZ_P_S48_L003_ReadsPerGene.out.tab
15_RZ_C_S13_L002_ReadsPerGene.out.tab 38_RZ_P_S48_L004_ReadsPerGene.out.tab
15_RZ_C_S13_L003_ReadsPerGene.out.tab 3_MA_J_S19_L001_ReadsPerGene.out.tab
15_RZ_C_S13_L004_ReadsPerGene.out.tab 3_MA_J_S19_L002_ReadsPerGene.out.tab
16_MA_P_S37_L001_ReadsPerGene.out.tab 3_MA_J_S19_L003_ReadsPerGene.out.tab
16_MA_P_S37_L002_ReadsPerGene.out.tab 3_MA_J_S19_L004_ReadsPerGene.out.tab
16_MA_P_S37_L003_ReadsPerGene.out.tab 3_RZ_J_S27_L001_ReadsPerGene.out.tab
16_MA_P_S37_L004_ReadsPerGene.out.tab 3_RZ_J_S27_L002_ReadsPerGene.out.tab
16_RZ_P_S45_L001_ReadsPerGene.out.tab 3_RZ_J_S27_L003_ReadsPerGene.out.tab
16_RZ_P_S45_L002_ReadsPerGene.out.tab 3_RZ_J_S27_L004_ReadsPerGene.out.tab
16_RZ_P_S45_L003_ReadsPerGene.out.tab 3_TOT_J_S34_L001_ReadsPerGene.out.tab
16_RZ_P_S45_L004_ReadsPerGene.out.tab 3_TOT_J_S34_L002_ReadsPerGene.out.tab
1_MA_J_S18_L001_ReadsPerGene.out.tab 3_TOT_J_S34_L003_ReadsPerGene.out.tab
1_MA_J_S18_L002_ReadsPerGene.out.tab 3_TOT_J_S34_L004_ReadsPerGene.out.tab
1_MA_J_S18_L003_ReadsPerGene.out.tab 40_MA_J_S25_L001_ReadsPerGene.out.tab
1_MA_J_S18_L004_ReadsPerGene.out.tab 40_MA_J_S25_L002_ReadsPerGene.out.tab
1_RZ_J_S26_L001_ReadsPerGene.out.tab 40_MA_J_S25_L003_ReadsPerGene.out.tab
1_RZ_J_S26_L002_ReadsPerGene.out.tab 40_MA_J_S25_L004_ReadsPerGene.out.tab
1_RZ_J_S26_L003_ReadsPerGene.out.tab 40_RZ_J_S33_L001_ReadsPerGene.out.tab
1_RZ_J_S26_L004_ReadsPerGene.out.tab 40_RZ_J_S33_L002_ReadsPerGene.out.tab
21_MA_C_S6_L001_ReadsPerGene.out.tab 40_RZ_J_S33_L003_ReadsPerGene.out.tab
21_MA_C_S6_L002_ReadsPerGene.out.tab 40_RZ_J_S33_L004_ReadsPerGene.out.tab
21_MA_C_S6_L003_ReadsPerGene.out.tab 45_MA_P_S41_L001_ReadsPerGene.out.tab
21_MA_C_S6_L004_ReadsPerGene.out.tab 45_MA_P_S41_L002_ReadsPerGene.out.tab
21_RZ_C_S14_L001_ReadsPerGene.out.tab 45_MA_P_S41_L003_ReadsPerGene.out.tab
21_RZ_C_S14_L002_ReadsPerGene.out.tab 45_MA_P_S41_L004_ReadsPerGene.out.tab
21_RZ_C_S14_L003_ReadsPerGene.out.tab 45_RZ_P_S49_L001_ReadsPerGene.out.tab
21_RZ_C_S14_L004_ReadsPerGene.out.tab 45_RZ_P_S49_L002_ReadsPerGene.out.tab
22_MA_C_S7_L001_ReadsPerGene.out.tab 45_RZ_P_S49_L003_ReadsPerGene.out.tab
22_MA_C_S7_L002_ReadsPerGene.out.tab 45_RZ_P_S49_L004_ReadsPerGene.out.tab
22_MA_C_S7_L003_ReadsPerGene.out.tab 47_MA_P_S42_L001_ReadsPerGene.out.tab
22_MA_C_S7_L004_ReadsPerGene.out.tab 47_MA_P_S42_L002_ReadsPerGene.out.tab
22_RZ_C_S15_L001_ReadsPerGene.out.tab 47_MA_P_S42_L003_ReadsPerGene.out.tab
22_RZ_C_S15_L002_ReadsPerGene.out.tab 47_MA_P_S42_L004_ReadsPerGene.out.tab
22_RZ_C_S15_L003_ReadsPerGene.out.tab 47_RZ_P_S50_L001_ReadsPerGene.out.tab
22_RZ_C_S15_L004_ReadsPerGene.out.tab 47_RZ_P_S50_L002_ReadsPerGene.out.tab
23_MA_J_S22_L001_ReadsPerGene.out.tab 47_RZ_P_S50_L003_ReadsPerGene.out.tab
23_MA_J_S22_L002_ReadsPerGene.out.tab 47_RZ_P_S50_L004_ReadsPerGene.out.tab
23_MA_J_S22_L003_ReadsPerGene.out.tab 4_MA_P_S35_L001_ReadsPerGene.out.tab
23_MA_J_S22_L004_ReadsPerGene.out.tab 4_MA_P_S35_L002_ReadsPerGene.out.tab
23_RZ_J_S30_L001_ReadsPerGene.out.tab 4_MA_P_S35_L003_ReadsPerGene.out.tab
23_RZ_J_S30_L002_ReadsPerGene.out.tab 4_MA_P_S35_L004_ReadsPerGene.out.tab
23_RZ_J_S30_L003_ReadsPerGene.out.tab 4_RZ_P_S43_L001_ReadsPerGene.out.tab
23_RZ_J_S30_L004_ReadsPerGene.out.tab 4_RZ_P_S43_L002_ReadsPerGene.out.tab
24_MA_J_S23_L001_ReadsPerGene.out.tab 4_RZ_P_S43_L003_ReadsPerGene.out.tab
24_MA_J_S23_L002_ReadsPerGene.out.tab 4_RZ_P_S43_L004_ReadsPerGene.out.tab
24_MA_J_S23_L003_ReadsPerGene.out.tab 4_TOT_P_S51_L001_ReadsPerGene.out.tab
24_MA_J_S23_L004_ReadsPerGene.out.tab 4_TOT_P_S51_L002_ReadsPerGene.out.tab
24_RZ_J_S31_L001_ReadsPerGene.out.tab 4_TOT_P_S51_L003_ReadsPerGene.out.tab
24_RZ_J_S31_L002_ReadsPerGene.out.tab 4_TOT_P_S51_L004_ReadsPerGene.out.tab
24_RZ_J_S31_L003_ReadsPerGene.out.tab 9_MA_C_S2_L001_ReadsPerGene.out.tab
24_RZ_J_S31_L004_ReadsPerGene.out.tab 9_MA_C_S2_L002_ReadsPerGene.out.tab
26_MA_C_S8_L001_ReadsPerGene.out.tab 9_MA_C_S2_L003_ReadsPerGene.out.tab
26_MA_C_S8_L002_ReadsPerGene.out.tab 9_MA_C_S2_L004_ReadsPerGene.out.tab
26_MA_C_S8_L003_ReadsPerGene.out.tab 9_RZ_C_S10_L001_ReadsPerGene.out.tab
26_MA_C_S8_L004_ReadsPerGene.out.tab 9_RZ_C_S10_L002_ReadsPerGene.out.tab
26_RZ_C_S16_L001_ReadsPerGene.out.tab 9_RZ_C_S10_L003_ReadsPerGene.out.tab
26_RZ_C_S16_L002_ReadsPerGene.out.tab 9_RZ_C_S10_L004_ReadsPerGene.out.tab
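Judging from the listing, each file name ends in `_L001` to `_L004` followed by `_ReadsPerGene.out.tab`, so every library should contribute four lane files. The sketch below shows one way to check this by counting files per library. It is a sketch under that naming assumption, and it builds a small temporary directory of empty demo files so that it can run anywhere; in the notebook, `"$DATDIR"` would take the place of `"$DEMO_DATDIR"`.

```shell
# Demo directory with empty files that follow the naming pattern listed above.
DEMO_DATDIR="$(mktemp -d)"
for lib in 2_MA_C_S1 9_MA_C_S2 2_RZ_C_S9; do
    for lane in L001 L002 L003 L004; do
        touch "${DEMO_DATDIR}/${lib}_${lane}_ReadsPerGene.out.tab"
    done
done

# Strip the lane and suffix to get the library name, then count files per library.
lanes_per_lib=$(ls "$DEMO_DATDIR" \
    | sed -E 's/_L[0-9]{3}_ReadsPerGene\.out\.tab$//' \
    | sort | uniq -c)
echo "$lanes_per_lib"

# Number of distinct libraries.
n_libs=$(echo "$lanes_per_lib" | wc -l | tr -d ' ')
echo "libraries: $n_libs"

# Libraries that do not have exactly four lane files (empty if all are complete).
incomplete=$(echo "$lanes_per_lib" | awk '$1 != 4 { print $2 }')
echo "incomplete: ${incomplete:-none}"
```

Run against the real directory, a complete data set would be expected to report 51 libraries and no incomplete ones, matching the 204 files counted above.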