In [1]:
# time this notebook
SECONDS=0

Working with Multiple Samples

Let’s kick it up another notch - we have six samples, let’s run our analysis on all of them!

Shell Variables

Assign the variables in this notebook.

In [2]:
CUROUT=$HOME/work/scratch/2015_output_all_data
DEMUX=$CUROUT/demux_fastqs
CURDATA=/data/HTS_2015_data/runs_combined
INFO=$HOME/work/myinfo
TRIMMED=$CUROUT/trimmed_fastqs
GENOME_DIR=$CUROUT/genome
TH_DIR=$CUROUT/th_dir
COUNT_DIR=$CUROUT/counts
QC=$CUROUT/qc_output
ADAPTERS=$INFO/neb_adapters.fasta

ACCESSION="GCA_000010245.1_ASM1024v1"
PREFIX=${ACCESSION}_genomic
GFF=${PREFIX}.gff
FNA=${PREFIX}.fna
FA=${PREFIX}.fa

Making New Directories

No new directories are necessary, we are using directories that were already created for the single-end pipeline.

In [3]:
mkdir -p $QC $TRIMMED $GENOME_DIR $TH_DIR $COUNT_DIR

Quality Control

First let’s run fastqc on everything. This is very easy, we can just give it all the FASTQ files on the command line, and it runs on all of them. We can use the wildcard * to do this simply.

In [4]:
# fastqc --threads 4 --quiet --outdir $QC $DEMUX/*.fq.gz

Prepare Reference Genome and Annotation

In [5]:
FIRST_PART="ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Escherichia_coli/latest_assembly_versions"
for CUR in $GFF $FNA ; do
    rsync rsync://${FIRST_PART}/${ACCESSION}/${CUR}.gz ${GENOME_DIR}
done

Warning Notice!

You are accessing a U.S. Government information system which includes this
computer, network, and all attached devices. This system is for
Government-authorized use only. Unauthorized use of this system may result in
disciplinary action and civil and criminal penalties. System users have no
expectation of privacy regarding any communications or data processed by this
system. At any time, the government may monitor, record, or seize any
communication or data transiting or stored on this information system.

-------------------------------------------------------------------------------

Welcome to the NCBI rsync server.



Warning Notice!

You are accessing a U.S. Government information system which includes this
computer, network, and all attached devices. This system is for
Government-authorized use only. Unauthorized use of this system may result in
disciplinary action and civil and criminal penalties. System users have no
expectation of privacy regarding any communications or data processed by this
system. At any time, the government may monitor, record, or seize any
communication or data transiting or stored on this information system.

-------------------------------------------------------------------------------

Welcome to the NCBI rsync server.


In [6]:
gunzip ${GENOME_DIR}/${GFF}.gz | cat
gunzip ${GENOME_DIR}/${FNA}.gz | cat
mv ${GENOME_DIR}/${FNA} ${GENOME_DIR}/${FA} # tophat wants fasta file to be named *.fa
In [7]:
bowtie2-build $GENOME_DIR/$FA $GENOME_DIR/$PREFIX
Settings:
  Output files: "/Users/cliburn/work/scratch/2015_output_all_data/genome/GCA_000010245.1_ASM1024v1_genomic.*.bt2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  /Users/cliburn/work/scratch/2015_output_all_data/genome/GCA_000010245.1_ASM1024v1_genomic.fa
Building a SMALL index
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 1161583
Using parameters --bmax 871188 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 871188 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
  Building sPrime
  Building sPrimeOrder
  V-Sorting samples
  V-Sorting samples time: 00:00:00
  Allocating rank array
  Ranking v-sort output
  Ranking v-sort output time: 00:00:00
  Invoking Larsson-Sadakane on ranks
  Invoking Larsson-Sadakane on ranks time: 00:00:00
  Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
  (Using difference cover)
  Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
  Splitting and merging time: 00:00:00
Avg bucket size: 4.64633e+06 (target: 871187)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering Ebwt loop
Getting block 1 of 1
  No samples; assembling all-inclusive block
  Sorting block of length 4646332 for bucket 1
  (Using difference cover)
  Sorting block time: 00:00:01
Returning block of 4646333 for bucket 1
Exited Ebwt loop
fchr[A]: 0
fchr[C]: 1142182
fchr[G]: 2321261
fchr[T]: 3502499
fchr[$]: 4646332
Exiting Ebwt::buildToDisk()
Returning from initFromVector
Wrote 5743338 bytes to primary EBWT file: /Users/cliburn/work/scratch/2015_output_all_data/genome/GCA_000010245.1_ASM1024v1_genomic.1.bt2
Wrote 1161588 bytes to secondary EBWT file: /Users/cliburn/work/scratch/2015_output_all_data/genome/GCA_000010245.1_ASM1024v1_genomic.2.bt2
Re-opening _in1 and _in2 as input streams
Returning from Ebwt constructor
Headers:
    len: 4646332
    bwtLen: 4646333
    sz: 1161583
    bwtSz: 1161584
    lineRate: 6
    offRate: 4
    offMask: 0xfffffff0
    ftabChars: 10
    eftabLen: 20
    eftabSz: 80
    ftabLen: 1048577
    ftabSz: 4194308
    offsLen: 290396
    offsSz: 1161584
    lineSz: 64
    sideSz: 64
    sideBwtSz: 48
    sideBwtLen: 192
    numSides: 24200
    numLines: 24200
    ebwtTotLen: 1548800
    ebwtTotSz: 1548800
    color: 0
    reverse: 0
Total time for call to driver() for forward index: 00:00:01
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
  Time to reverse reference sequence: 00:00:00
bmax according to bmaxDivN setting: 1161583
Using parameters --bmax 871188 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 871188 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
  Building sPrime
  Building sPrimeOrder
  V-Sorting samples
  V-Sorting samples time: 00:00:00
  Allocating rank array
  Ranking v-sort output
  Ranking v-sort output time: 00:00:00
  Invoking Larsson-Sadakane on ranks
  Invoking Larsson-Sadakane on ranks time: 00:00:00
  Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
  (Using difference cover)
  Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
  Splitting and merging time: 00:00:00
Avg bucket size: 4.64633e+06 (target: 871187)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering Ebwt loop
Getting block 1 of 1
  No samples; assembling all-inclusive block
  Sorting block of length 4646332 for bucket 1
  (Using difference cover)
  Sorting block time: 00:00:01
Returning block of 4646333 for bucket 1
Exited Ebwt loop
fchr[A]: 0
fchr[C]: 1142182
fchr[G]: 2321261
fchr[T]: 3502499
fchr[$]: 4646332
Exiting Ebwt::buildToDisk()
Returning from initFromVector
Wrote 5743338 bytes to primary EBWT file: /Users/cliburn/work/scratch/2015_output_all_data/genome/GCA_000010245.1_ASM1024v1_genomic.rev.1.bt2
Wrote 1161588 bytes to secondary EBWT file: /Users/cliburn/work/scratch/2015_output_all_data/genome/GCA_000010245.1_ASM1024v1_genomic.rev.2.bt2
Re-opening _in1 and _in2 as input streams
Returning from Ebwt constructor
Headers:
    len: 4646332
    bwtLen: 4646333
    sz: 1161583
    bwtSz: 1161584
    lineRate: 6
    offRate: 4
    offMask: 0xfffffff0
    ftabChars: 10
    eftabLen: 20
    eftabSz: 80
    ftabLen: 1048577
    ftabSz: 4194308
    offsLen: 290396
    offsSz: 1161584
    lineSz: 64
    sideSz: 64
    sideBwtSz: 48
    sideBwtLen: 192
    numSides: 24200
    numLines: 24200
    ebwtTotLen: 1548800
    ebwtTotSz: 1548800
    color: 0
    reverse: 1
Total time for backward call to driver() for mirror index: 00:00:01

Looping over all the samples

Now we can put all of the previous commands into one big loop. This is probably a good time for copying and pasting. But we will make a small changes - We need to generate a list of samples that the loop will iterate over

Generating a list of sample names

Since we have “standardized” file names (they have a regular structure) we can use some bash tools to extract the sample names from the names of the demultiplexed files

In [8]:
# ls -1 $DEMUX/i1.[78][ABC]_*.fq.gz | wc
for FASTQ in $DEMUX/i1.[78][ABC]_*.fq.gz
    do
        # echo $FASTQ
        FASTQ=${FASTQ##*/} # strip directory from file path
        # echo $FASTQ
        FASTQ="${FASTQ%.fq.gz}" # strip .fq.gz file extension
        # echo $FASTQ
        SAMPLE=${FASTQ##i1.} # strip "i1." prefix
        echo $SAMPLE

        # filename="${filename%.*}"
    done
[78][ABC]_*

Running the full pipeline on all the samples

In [9]:
# for SAMPLE in 7A_pilot 7B_pilot 7C_pilot 8A_pilot 8B_pilot 8C_pilot
#     do
for FASTQ in $DEMUX/i1.[78][ABC]_*.fq.gz
    do
        # echo $FASTQ
        FASTQ=${FASTQ##*/} # strip directory from file path
        # echo $FASTQ
        FASTQ="${FASTQ%.fq.gz}" # strip .fq.gz file extension
        # echo $FASTQ
        SAMPLE=${FASTQ##i1.} # strip "i1." prefix
        echo $SAMPLE
        fastq-mcf $ADAPTERS \
            $DEMUX/r1.${SAMPLE}.fq.gz \
            $DEMUX/r2.${SAMPLE}.fq.gz \
            -q 20 -x 0.5 \
            -o $TRIMMED/r1.${SAMPLE}.trim.fq.gz \
            -o $TRIMMED/r2.${SAMPLE}.trim.fq.gz

        mkdir -p $TH_DIR/${SAMPLE}
        tophat2 -G $GENOME_DIR/$GFF \
            --library-type fr-firststrand \
            --output-dir $TH_DIR/${SAMPLE} \
            --max-intron-length 5 \
            --min-intron-length 4 \
            --transcriptome-max-hits 1 \
            --max-multihits 1 \
            --no-coverage-search \
            --no-novel-juncs \
            --no-sort-bam \
            $GENOME_DIR/$PREFIX \
            $TRIMMED/r1.${SAMPLE}.trim.fq.gz \
            $TRIMMED/r2.${SAMPLE}.trim.fq.gz

        # ln $TH_DIR/${SAMPLE}/accepted_hits.bam $TH_DIR/${SAMPLE}.bam
        # samtools index $TH_DIR/${SAMPLE}.bam

        samtools sort -n $TH_DIR/${SAMPLE}/accepted_hits.bam \
            -o $TH_DIR/${SAMPLE}/accepted_hits.name.bam

        htseq-count --quiet --order=name --format=bam --stranded=reverse --type=gene \
            --idattr=Name  $TH_DIR/${SAMPLE}/accepted_hits.name.bam \
            $GENOME_DIR/$GFF > $COUNT_DIR/${SAMPLE}.csv
    done
[78][ABC]_*
Error opening adapter file '/Users/cliburn/work/myinfo/neb_adapters.fasta': No such file or directory
readlink: illegal option -- f
usage: readlink [-n] [file ...]

[2017-08-08 09:09:33] Beginning TopHat run (v2.1.1)
-----------------------------------------------
[2017-08-08 09:09:33] Checking for Bowtie
                  Bowtie version:        2.3.2.0
[2017-08-08 09:09:33] Checking for Bowtie index files (genome)..
[2017-08-08 09:09:33] Checking for reference FASTA file
[2017-08-08 09:09:33] Generating SAM header for /Users/cliburn/work/scratch/2015_output_all_data/genome/GCA_000010245.1_ASM1024v1_genomic
Error: could not open pipe gzip -cd < /Users/cliburn/work/scratch/2015_output_all_data/trimmed_fastqs/r1.[78][ABC]_*.trim.fq.gz
[E::hts_open_format] fail to open file '/Users/cliburn/work/scratch/2015_output_all_data/th_dir/[78][ABC]_*/accepted_hits.bam'
samtools sort: can't open "/Users/cliburn/work/scratch/2015_output_all_data/th_dir/[78][ABC]_*/accepted_hits.bam": No such file or directory
  [Errno 2] No such file or directory: '/Users/cliburn/work/scratch/2015_output_all_data/th_dir/[78][ABC]_*/accepted_hits.name.bam'
  [Exception type: FileNotFoundError, raised in count.py:42]

In [10]:
ls $TH_DIR
[78][ABC]_*
In [11]:
ls -ltr $COUNT_DIR
total 0
-rw-r--r--  1 cliburn  staff  0 Aug  8 09:09 [78][ABC]_*.csv
In [12]:
echo "The time it took to run this notebooks was:"
echo "$(($SECONDS/3600)) Hours, $(($SECONDS%3600/60)) Minutes, $(($SECONDS%60)) Seconds"
The time it took to run this notebooks was:
0 Hours, 0 Minutes, 10 Seconds