# About the notebook

**This notebook sets up the pilot count analysis: we create the folders for the output and check that the files we need are in place.**

In [1]:
set -u

# About the purpose of the analysis

**This series of tutorials covers the following steps:**
1. create the count matrix from the STAR output
1. use the DESeq package to normalize the count matrix
1. perform pathway analysis

# Create directories

Below is an illustration of our folder structure.
```
scratch
└── bioinf_intro
    └── analysis_output
        ├── out -> the folder to store all our output data files
        └── img -> the folder to store all our images
```

Set the path of each folder.

In [2]:
CURDIR="$HOME/work/scratch/analysis_output"
OUTDIR="${CURDIR}/out"
IMGDIR="${CURDIR}/img"

Create the directories.

In [3]:
mkdir -p "$OUTDIR"
mkdir -p "$IMGDIR"

Check that the directories were created correctly.

In [4]:
ls "$CURDIR"

[0m[01;34mimg[0m [01;34mout[0m


# Check all the files we need

The files we need throughout this series of tutorials are the count matrix and the metadata, which you have been using during Cliburn's lecture.

In [5]:
DATDIR="/data/hts2018_pilot/star_counts"
METADTFILE="$HOME/work/HTS2018-notebooks/josh/info/2018_pilot_metadata_anon.tsv"
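
Because every later step depends on these inputs, it can help to fail early if one of them is absent. A minimal sketch, assuming only standard shell tests; `require_file` is a hypothetical name introduced here for illustration.

```bash
# Hypothetical helper (an assumption, not from the course notebooks):
# succeed only if the file exists, is readable, and is not empty.
require_file() {
    if [ -r "$1" ] && [ -s "$1" ]; then
        echo "found: $1"
    else
        echo "missing, unreadable, or empty: $1" >&2
        return 1
    fi
}
```

For example, `require_file "$METADTFILE"` would report the metadata file before we try to read it.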

The metadata of the samples in the pilot data:

In [6]:
head "$METADTFILE"

Label	RNA_sample_num	Media	Strain	Replicate	experiment_person	libprep_person	enrichment_method	RIN	concentration_fold_difference	i7 index	i5 index	i5 primer	i7 primer	library#
2_MA_C	2	YPD	H99	2	expA	prepA	MA	10	1.34	ATTACTCG	AGGCTATA	i501	i701	1
9_MA_C	9	YPD	mar1d	3	expA	prepA	MA	10	2.23	ATTACTCG	GCCTCTAT	i502	i701	2
10_MA_C	10	YPD	mar1d	4	expA	prepA	MA	9.9	4.37	ATTACTCG	AGGATAGG	i503	i701	3
14_MA_C	14	TC	H99	2	expA	prepA	MA	10	1.57	ATTACTCG	TCAGAGCC	i504	i701	4
15_MA_C	15	TC	H99	3	expA	prepA	MA	9.9	2.85	ATTACTCG	CTTCGCCT	i505	i701	5
21_MA_C	21	TC	mar1d	3	expA	prepA	MA	10	1.81	ATTACTCG	TAAGATTA	i506	i701	6
22_MA_C	22	TC	mar1d	4	expA	prepA	MA	9.9	2.01	ATTACTCG	ACGTCCTG	i507	i701	7
26_MA_C	26	YPD	H99	8	expB	prepA	MA	10	2.76	ATTACTCG	GTCAGTAC	i508	i701	8
2_RZ_C	2	YPD	H99	2	expA	prepA	RZ	10	1.34	TCCGGAGA	AGGCTATA	i501	i702	9
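
The header row is wide, so the column names are hard to read in the raw output. The sketch below numbers them one per line, assuming the file is tab-separated as shown above; `list_columns` is a hypothetical helper.

```bash
# Hypothetical helper (an assumption, not from the course notebooks):
# print each column name of a tab-separated file with its position.
list_columns() {
    head -n 1 "$1" | awk -F '\t' '{ for (i = 1; i <= NF; i++) print i ": " $i }'
}
```

Calling `list_columns "$METADTFILE"` gives the positions needed for tools such as `cut -f`, which select columns by number.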


The count files output by the STAR alignment, which was explained in Josh's lecture:

In [7]:
# There should be 204 files in this directory
ls "$DATDIR" | wc -l

204
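
The total alone does not show whether any single file is missing. Since the file names carry a lane tag such as `L001`, one further check is to count the files per lane; if every library was sequenced on the same lanes, the counts should match. This is a sketch under that assumption, and `count_per_lane` is a hypothetical helper.

```bash
# Hypothetical helper (an assumption, not from the course notebooks):
# count STAR count files per lane tag (L001, L002, ...) in a directory.
# Prints one line per lane: the lane tag, a tab, and the number of files.
count_per_lane() {
    find "$1" -maxdepth 1 -name '*_ReadsPerGene.out.tab' \
        | sed -E 's/.*_(L[0-9]{3})_ReadsPerGene\.out\.tab$/\1/' \
        | sort | uniq -c | awk '{ print $2 "\t" $1 }'
}
```

Running `count_per_lane "$DATDIR"` and comparing the numbers across lanes is a quick way to spot an incomplete set of files.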


In [8]:
ls "$DATDIR"

10_MA_C_S3_L001_ReadsPerGene.out.tab 26_RZ_C_S16_L003_ReadsPerGene.out.tab
10_MA_C_S3_L002_ReadsPerGene.out.tab 26_RZ_C_S16_L004_ReadsPerGene.out.tab
10_MA_C_S3_L003_ReadsPerGene.out.tab 27_MA_P_S38_L001_ReadsPerGene.out.tab
10_MA_C_S3_L004_ReadsPerGene.out.tab 27_MA_P_S38_L002_ReadsPerGene.out.tab
10_RZ_C_S11_L001_ReadsPerGene.out.tab 27_MA_P_S38_L003_ReadsPerGene.out.tab
10_RZ_C_S11_L002_ReadsPerGene.out.tab 27_MA_P_S38_L004_ReadsPerGene.out.tab
10_RZ_C_S11_L003_ReadsPerGene.out.tab 27_RZ_P_S46_L001_ReadsPerGene.out.tab
10_RZ_C_S11_L004_ReadsPerGene.out.tab 27_RZ_P_S46_L002_ReadsPerGene.out.tab
11_MA_J_S20_L001_ReadsPerGene.out.tab 27_RZ_P_S46_L003_ReadsPerGene.out.tab
11_MA_J_S20_L002_ReadsPerGene.out.tab 27_RZ_P_S46_L004_ReadsPerGene.out.tab
11_MA_J_S20_L003_ReadsPerGene.out.tab 2_MA_C_S1_L001_ReadsPerGene.out.tab
11_MA_J_S20_L004_ReadsPerGene.out.tab 2_MA_C_S1_L002_ReadsPerGene.out.tab
11_RZ_J_S28_L001_ReadsPerGene.out.tab 2_MA_C_S1_L003_ReadsPerGene.out.tab
11_RZ_J_S28_L002_Reads