Working with Multiple Samples¶
Let’s kick it up another notch - we have six pilot samples, let’s run our analysis on all of them!
Shell Variables¶
Assign the variables in this notebook.
In [1]:
# We used these in previous notebooks
CUROUT=$HOME/work/scratch/2015_output
DEMUX=$CUROUT/demux_fastqs
CURDATA=/data/HTS_2015_data/runs_combined
INFO=$HOME/work/myinfo
TRIMMED=$CUROUT/trimmed_fastqs
GENOME_DIR=$CUROUT/genome
TH_DIR=$CUROUT/th_dir
COUNT_DIR=$CUROUT/counts
# We used these in the last notebook
ACCESSION="GCA_000010245.1_ASM1024v1"
PREFIX=${ACCESSION}_genomic
GFF=${PREFIX}.gff
FA=${PREFIX}.fa
Making New Directories¶
No new directories are necessary, we are using directories that were already created for the single-end pipeline.
In [2]:
mkdir -p $TRIMMED $GENOME_DIR $TH_DIR $COUNT_DIR
A Brief journey into for loops¶
Most of the steps in our pipeline aren’t so simple. To apply our
pipeline to multiple sample files, we need to change things in multiple
places. For example, just to run Tophat, we need to change things in
four places between each run, the output directory name (twice) and the
name of the FASTQ. Doing this by hand is not only tedious, but error
prone. Doing almost the same thing repeatedly is something that
computers are very good at, and people are very bad at, so let’s get the
computer to do the hard work. Because the Unix shell is almost magical
(it is a full fledged programming language), we can do this. We will use
a for loop
. This is analogous to how you would teach a child to set
the table: “FOR each place at the table, put a plate . . ., At the shell
you phrase it like this:
for PERSON in Alice Bob Carol Dave Eve
do
put plate at PERSON's place
put napkin at PERSON's place
put fork at PERSON's place
put spoon at PERSON's place
put knife at PERSON's place
done
Here is a real example:
In [3]:
for SAMPLE in A B C D E F
do
echo XXXXX_${SAMPLE}_XXXXX
done
XXXXX_A_XXXXX
XXXXX_B_XXXXX
XXXXX_C_XXXXX
XXXXX_D_XXXXX
XXXXX_E_XXXXX
XXXXX_F_XXXXX
the do
and done
are essential - do
needs to be before the
“loop body” (what is going to be repeated) and done
needs to be
after it.
So let’s try something almost useful:
In [4]:
for SAMPLE in 8A_pilot
do
echo $SAMPLE
done
8A_pilot
Let’s run fastq-mcf in a loop:¶
In [5]:
for SAMPLE in 8A_pilot
do
echo $SAMPLE
fastq-mcf $INFO/neb_adapters.fasta \
$DEMUX/r1.${SAMPLE}.fq.gz \
$DEMUX/r2.${SAMPLE}.fq.gz \
-q 20 -x 0.5 \
-o $TRIMMED/r1.${SAMPLE}.trim.fastq.gz \
-o $TRIMMED/r2.${SAMPLE}.trim.fastq.gz
done
8A_pilot
Error opening adapter file '/Users/cliburn/work/myinfo/neb_adapters.fasta': No such file or directory
Now let’s do the same thing for tophat¶
In [6]:
for SAMPLE in 8A_pilot
do
echo $SAMPLE
mkdir -p $TH_DIR/${SAMPLE}
tophat2 -G $GENOME_DIR/$GFF \
--library-type fr-firststrand \
--output-dir $TH_DIR/${SAMPLE} \
--max-intron-length 5 \
--min-intron-length 4 \
--transcriptome-max-hits 1 \
--max-multihits 1 \
--no-coverage-search \
--no-novel-juncs \
--no-sort-bam \
$GENOME_DIR/$PREFIX \
$TRIMMED/r1.${SAMPLE}.trim.fastq.gz \
$TRIMMED/r2.${SAMPLE}.trim.fastq.gz
done
8A_pilot
readlink: illegal option -- f
usage: readlink [-n] [file ...]
[2017-08-08 09:09:54] Beginning TopHat run (v2.1.1)
-----------------------------------------------
[2017-08-08 09:09:54] Checking for Bowtie
Bowtie version: 2.3.2.0
Error: cannot find transcript file /Users/cliburn/work/scratch/2015_output/genome/GCA_000010245.1_ASM1024v1_genomic.gff
And now for htseq-count!¶
In [7]:
for SAMPLE in 8A_pilot
do
echo $SAMPLE
samtools sort -n $TH_DIR/${SAMPLE}/accepted_hits.bam \
-o $TH_DIR/${SAMPLE}/accepted_hits.name.bam
htseq-count --quiet --order=name --format=bam --stranded=reverse \
--type=gene --idattr=Name \
$TH_DIR/${SAMPLE}/accepted_hits.name.bam \
$GENOME_DIR/$GFF > $COUNT_DIR/${SAMPLE}.csv
done
8A_pilot
[E::hts_open_format] fail to open file '/Users/cliburn/work/scratch/2015_output/th_dir/8A_pilot/accepted_hits.bam'
samtools sort: can't open "/Users/cliburn/work/scratch/2015_output/th_dir/8A_pilot/accepted_hits.bam": No such file or directory
[Errno 2] No such file or directory: '/Users/cliburn/work/scratch/2015_output/th_dir/8A_pilot/accepted_hits.name.bam'
[Exception type: FileNotFoundError, raised in count.py:42]
In [8]:
ls $COUNT_DIR
tail $COUNT_DIR/*
7A_pilot.csv 7C_pilot.csv 8B_pilot.csv r1_8A_pilot.tsv
7B_pilot.csv 8A_pilot.csv 8C_pilot.csv
==> /Users/cliburn/work/scratch/2015_output/counts/7A_pilot.csv <==
==> /Users/cliburn/work/scratch/2015_output/counts/7B_pilot.csv <==
==> /Users/cliburn/work/scratch/2015_output/counts/7C_pilot.csv <==
==> /Users/cliburn/work/scratch/2015_output/counts/8A_pilot.csv <==
==> /Users/cliburn/work/scratch/2015_output/counts/8B_pilot.csv <==
==> /Users/cliburn/work/scratch/2015_output/counts/8C_pilot.csv <==
==> /Users/cliburn/work/scratch/2015_output/counts/r1_8A_pilot.tsv <==