Working with Multiple Samples

Let’s kick it up another notch - we have six pilot samples, let’s run our analysis on all of them!

Shell Variables

Assign the variables in this notebook.

In [1]:
# We used these in previous notebooks
CUROUT=$HOME/work/scratch/2015_output
DEMUX=$CUROUT/demux_fastqs
CURDATA=/data/HTS_2015_data/runs_combined
INFO=$HOME/work/myinfo
TRIMMED=$CUROUT/trimmed_fastqs
GENOME_DIR=$CUROUT/genome
TH_DIR=$CUROUT/th_dir
COUNT_DIR=$CUROUT/counts

# We used these in the last notebook
ACCESSION="GCA_000010245.1_ASM1024v1"
PREFIX=${ACCESSION}_genomic
GFF=${PREFIX}.gff
FA=${PREFIX}.fa

Making New Directories

No new directories are necessary, we are using directories that were already created for the single-end pipeline.

In [2]:
mkdir -p $TRIMMED $GENOME_DIR $TH_DIR $COUNT_DIR

A Brief journey into for loops

Most of the steps in our pipeline aren’t so simple. To apply our pipeline to multiple sample files, we need to change things in multiple places. For example, just to run Tophat, we need to change things in four places between each run, the output directory name (twice) and the name of the FASTQ. Doing this by hand is not only tedious, but error prone. Doing almost the same thing repeatedly is something that computers are very good at, and people are very bad at, so let’s get the computer to do the hard work. Because the Unix shell is almost magical (it is a full fledged programming language), we can do this. We will use a for loop. This is analogous to how you would teach a child to set the table: “FOR each place at the table, put a plate . . ., At the shell you phrase it like this:

for PERSON in Alice Bob Carol Dave Eve
do
put plate at PERSON's place
put napkin at PERSON's place
put fork at PERSON's place
put spoon at PERSON's place
put knife at PERSON's place
done

Here is a real example:

In [3]:
for SAMPLE in A B C D E F
    do
       echo XXXXX_${SAMPLE}_XXXXX
    done
XXXXX_A_XXXXX
XXXXX_B_XXXXX
XXXXX_C_XXXXX
XXXXX_D_XXXXX
XXXXX_E_XXXXX
XXXXX_F_XXXXX

the do and done are essential - do needs to be before the “loop body” (what is going to be repeated) and done needs to be after it.

So let’s try something almost useful:

In [4]:
for SAMPLE in 8A_pilot
    do
        echo $SAMPLE
    done
8A_pilot

Let’s run fastq-mcf in a loop:

In [5]:
for SAMPLE in 8A_pilot
    do
        echo $SAMPLE
        fastq-mcf $INFO/neb_adapters.fasta \
            $DEMUX/r1.${SAMPLE}.fq.gz \
            $DEMUX/r2.${SAMPLE}.fq.gz \
            -q 20 -x 0.5 \
            -o $TRIMMED/r1.${SAMPLE}.trim.fastq.gz \
            -o $TRIMMED/r2.${SAMPLE}.trim.fastq.gz
    done
8A_pilot
Error opening adapter file '/Users/cliburn/work/myinfo/neb_adapters.fasta': No such file or directory

Now let’s do the same thing for tophat

In [6]:
for SAMPLE in 8A_pilot
    do
        echo $SAMPLE
        mkdir -p $TH_DIR/${SAMPLE}
        tophat2 -G $GENOME_DIR/$GFF \
            --library-type fr-firststrand \
            --output-dir $TH_DIR/${SAMPLE} \
            --max-intron-length 5 \
            --min-intron-length 4 \
            --transcriptome-max-hits 1 \
            --max-multihits 1 \
            --no-coverage-search \
            --no-novel-juncs \
            --no-sort-bam \
            $GENOME_DIR/$PREFIX \
            $TRIMMED/r1.${SAMPLE}.trim.fastq.gz \
            $TRIMMED/r2.${SAMPLE}.trim.fastq.gz
    done
8A_pilot
readlink: illegal option -- f
usage: readlink [-n] [file ...]

[2017-08-08 09:09:54] Beginning TopHat run (v2.1.1)
-----------------------------------------------
[2017-08-08 09:09:54] Checking for Bowtie
                  Bowtie version:        2.3.2.0
Error: cannot find transcript file /Users/cliburn/work/scratch/2015_output/genome/GCA_000010245.1_ASM1024v1_genomic.gff

And now for htseq-count!

In [7]:
for SAMPLE in 8A_pilot
    do
        echo $SAMPLE
        samtools sort -n $TH_DIR/${SAMPLE}/accepted_hits.bam \
            -o $TH_DIR/${SAMPLE}/accepted_hits.name.bam

        htseq-count --quiet --order=name --format=bam --stranded=reverse \
            --type=gene --idattr=Name  \
            $TH_DIR/${SAMPLE}/accepted_hits.name.bam \
            $GENOME_DIR/$GFF > $COUNT_DIR/${SAMPLE}.csv
    done
8A_pilot
[E::hts_open_format] fail to open file '/Users/cliburn/work/scratch/2015_output/th_dir/8A_pilot/accepted_hits.bam'
samtools sort: can't open "/Users/cliburn/work/scratch/2015_output/th_dir/8A_pilot/accepted_hits.bam": No such file or directory
  [Errno 2] No such file or directory: '/Users/cliburn/work/scratch/2015_output/th_dir/8A_pilot/accepted_hits.name.bam'
  [Exception type: FileNotFoundError, raised in count.py:42]

In [8]:
ls $COUNT_DIR
tail $COUNT_DIR/*
7A_pilot.csv    7C_pilot.csv    8B_pilot.csv    r1_8A_pilot.tsv
7B_pilot.csv    8A_pilot.csv    8C_pilot.csv
==> /Users/cliburn/work/scratch/2015_output/counts/7A_pilot.csv <==

==> /Users/cliburn/work/scratch/2015_output/counts/7B_pilot.csv <==

==> /Users/cliburn/work/scratch/2015_output/counts/7C_pilot.csv <==

==> /Users/cliburn/work/scratch/2015_output/counts/8A_pilot.csv <==

==> /Users/cliburn/work/scratch/2015_output/counts/8B_pilot.csv <==

==> /Users/cliburn/work/scratch/2015_output/counts/8C_pilot.csv <==

==> /Users/cliburn/work/scratch/2015_output/counts/r1_8A_pilot.tsv <==