In [1]:
# time this notebook
SECONDS=0
Working with Multiple Samples¶
Let’s kick it up another notch - we have six samples, let’s run our analysis on all of them!
Shell Variables¶
Assign the variables in this notebook.
In [2]:
CUROUT=$HOME/work/scratch/2015_output_all_data
DEMUX=$CUROUT/demux_fastqs
CURDATA=/data/HTS_2015_data/runs_combined
INFO=$HOME/work/myinfo
TRIMMED=$CUROUT/trimmed_fastqs
GENOME_DIR=$CUROUT/genome
TH_DIR=$CUROUT/th_dir
COUNT_DIR=$CUROUT/counts
QC=$CUROUT/qc_output
ADAPTERS=$INFO/neb_adapters.fasta
ACCESSION="GCA_000010245.1_ASM1024v1"
PREFIX=${ACCESSION}_genomic
GFF=${PREFIX}.gff
FNA=${PREFIX}.fna
FA=${PREFIX}.fa
Making New Directories¶
No new directories are necessary, we are using directories that were already created for the single-end pipeline.
In [3]:
mkdir -p $QC $TRIMMED $GENOME_DIR $TH_DIR $COUNT_DIR
Quality Control¶
First let’s run fastqc on everything. This is very easy, we can just
give it all the FASTQ files on the command line, and it runs on all of
them. We can use the wildcard *
to do this simply.
In [4]:
# fastqc --threads 4 --quiet --outdir $QC $DEMUX/*.fq.gz
Prepare Reference Genome and Annotation¶
In [5]:
FIRST_PART="ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Escherichia_coli/latest_assembly_versions"
for CUR in $GFF $FNA ; do
rsync rsync://${FIRST_PART}/${ACCESSION}/${CUR}.gz ${GENOME_DIR}
done
Warning Notice!
You are accessing a U.S. Government information system which includes this
computer, network, and all attached devices. This system is for
Government-authorized use only. Unauthorized use of this system may result in
disciplinary action and civil and criminal penalties. System users have no
expectation of privacy regarding any communications or data processed by this
system. At any time, the government may monitor, record, or seize any
communication or data transiting or stored on this information system.
-------------------------------------------------------------------------------
Welcome to the NCBI rsync server.
Warning Notice!
You are accessing a U.S. Government information system which includes this
computer, network, and all attached devices. This system is for
Government-authorized use only. Unauthorized use of this system may result in
disciplinary action and civil and criminal penalties. System users have no
expectation of privacy regarding any communications or data processed by this
system. At any time, the government may monitor, record, or seize any
communication or data transiting or stored on this information system.
-------------------------------------------------------------------------------
Welcome to the NCBI rsync server.
In [6]:
gunzip ${GENOME_DIR}/${GFF}.gz | cat
gunzip ${GENOME_DIR}/${FNA}.gz | cat
mv ${GENOME_DIR}/${FNA} ${GENOME_DIR}/${FA} # tophat wants fasta file to be named *.fa
In [7]:
bowtie2-build $GENOME_DIR/$FA $GENOME_DIR/$PREFIX
Settings:
Output files: "/Users/cliburn/work/scratch/2015_output_all_data/genome/GCA_000010245.1_ASM1024v1_genomic.*.bt2"
Line rate: 6 (line is 64 bytes)
Lines per side: 1 (side is 64 bytes)
Offset rate: 4 (one in 16)
FTable chars: 10
Strings: unpacked
Max bucket size: default
Max bucket size, sqrt multiplier: default
Max bucket size, len divisor: 4
Difference-cover sample period: 1024
Endianness: little
Actual local endianness: little
Sanity checking: disabled
Assertions: disabled
Random seed: 0
Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
/Users/cliburn/work/scratch/2015_output_all_data/genome/GCA_000010245.1_ASM1024v1_genomic.fa
Building a SMALL index
Reading reference sizes
Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 1161583
Using parameters --bmax 871188 --dcv 1024
Doing ahead-of-time memory usage test
Passed! Constructing with these parameters: --bmax 871188 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
Building sPrime
Building sPrimeOrder
V-Sorting samples
V-Sorting samples time: 00:00:00
Allocating rank array
Ranking v-sort output
Ranking v-sort output time: 00:00:00
Invoking Larsson-Sadakane on ranks
Invoking Larsson-Sadakane on ranks time: 00:00:00
Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
(Using difference cover)
Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
Splitting and merging time: 00:00:00
Avg bucket size: 4.64633e+06 (target: 871187)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering Ebwt loop
Getting block 1 of 1
No samples; assembling all-inclusive block
Sorting block of length 4646332 for bucket 1
(Using difference cover)
Sorting block time: 00:00:01
Returning block of 4646333 for bucket 1
Exited Ebwt loop
fchr[A]: 0
fchr[C]: 1142182
fchr[G]: 2321261
fchr[T]: 3502499
fchr[$]: 4646332
Exiting Ebwt::buildToDisk()
Returning from initFromVector
Wrote 5743338 bytes to primary EBWT file: /Users/cliburn/work/scratch/2015_output_all_data/genome/GCA_000010245.1_ASM1024v1_genomic.1.bt2
Wrote 1161588 bytes to secondary EBWT file: /Users/cliburn/work/scratch/2015_output_all_data/genome/GCA_000010245.1_ASM1024v1_genomic.2.bt2
Re-opening _in1 and _in2 as input streams
Returning from Ebwt constructor
Headers:
len: 4646332
bwtLen: 4646333
sz: 1161583
bwtSz: 1161584
lineRate: 6
offRate: 4
offMask: 0xfffffff0
ftabChars: 10
eftabLen: 20
eftabSz: 80
ftabLen: 1048577
ftabSz: 4194308
offsLen: 290396
offsSz: 1161584
lineSz: 64
sideSz: 64
sideBwtSz: 48
sideBwtLen: 192
numSides: 24200
numLines: 24200
ebwtTotLen: 1548800
ebwtTotSz: 1548800
color: 0
reverse: 0
Total time for call to driver() for forward index: 00:00:01
Reading reference sizes
Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
Time to join reference sequences: 00:00:00
Time to reverse reference sequence: 00:00:00
bmax according to bmaxDivN setting: 1161583
Using parameters --bmax 871188 --dcv 1024
Doing ahead-of-time memory usage test
Passed! Constructing with these parameters: --bmax 871188 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
Building sPrime
Building sPrimeOrder
V-Sorting samples
V-Sorting samples time: 00:00:00
Allocating rank array
Ranking v-sort output
Ranking v-sort output time: 00:00:00
Invoking Larsson-Sadakane on ranks
Invoking Larsson-Sadakane on ranks time: 00:00:00
Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
(Using difference cover)
Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
Splitting and merging time: 00:00:00
Avg bucket size: 4.64633e+06 (target: 871187)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering Ebwt loop
Getting block 1 of 1
No samples; assembling all-inclusive block
Sorting block of length 4646332 for bucket 1
(Using difference cover)
Sorting block time: 00:00:01
Returning block of 4646333 for bucket 1
Exited Ebwt loop
fchr[A]: 0
fchr[C]: 1142182
fchr[G]: 2321261
fchr[T]: 3502499
fchr[$]: 4646332
Exiting Ebwt::buildToDisk()
Returning from initFromVector
Wrote 5743338 bytes to primary EBWT file: /Users/cliburn/work/scratch/2015_output_all_data/genome/GCA_000010245.1_ASM1024v1_genomic.rev.1.bt2
Wrote 1161588 bytes to secondary EBWT file: /Users/cliburn/work/scratch/2015_output_all_data/genome/GCA_000010245.1_ASM1024v1_genomic.rev.2.bt2
Re-opening _in1 and _in2 as input streams
Returning from Ebwt constructor
Headers:
len: 4646332
bwtLen: 4646333
sz: 1161583
bwtSz: 1161584
lineRate: 6
offRate: 4
offMask: 0xfffffff0
ftabChars: 10
eftabLen: 20
eftabSz: 80
ftabLen: 1048577
ftabSz: 4194308
offsLen: 290396
offsSz: 1161584
lineSz: 64
sideSz: 64
sideBwtSz: 48
sideBwtLen: 192
numSides: 24200
numLines: 24200
ebwtTotLen: 1548800
ebwtTotSz: 1548800
color: 0
reverse: 1
Total time for backward call to driver() for mirror index: 00:00:01
Looping over all the samples¶
Now we can put all of the previous commands into one big loop. This is probably a good time for copying and pasting. But we will make a small changes - We need to generate a list of samples that the loop will iterate over
Generating a list of sample names¶
Since we have “standardized” file names (they have a regular structure) we can use some bash tools to extract the sample names from the names of the demultiplexed files
In [8]:
# ls -1 $DEMUX/i1.[78][ABC]_*.fq.gz | wc
for FASTQ in $DEMUX/i1.[78][ABC]_*.fq.gz
do
# echo $FASTQ
FASTQ=${FASTQ##*/} # strip directory from file path
# echo $FASTQ
FASTQ="${FASTQ%.fq.gz}" # strip .fq.gz file extension
# echo $FASTQ
SAMPLE=${FASTQ##i1.} # strip "i1." prefix
echo $SAMPLE
# filename="${filename%.*}"
done
[78][ABC]_*
Running the full pipeline on all the samples¶
In [9]:
# for SAMPLE in 7A_pilot 7B_pilot 7C_pilot 8A_pilot 8B_pilot 8C_pilot
# do
for FASTQ in $DEMUX/i1.[78][ABC]_*.fq.gz
do
# echo $FASTQ
FASTQ=${FASTQ##*/} # strip directory from file path
# echo $FASTQ
FASTQ="${FASTQ%.fq.gz}" # strip .fq.gz file extension
# echo $FASTQ
SAMPLE=${FASTQ##i1.} # strip "i1." prefix
echo $SAMPLE
fastq-mcf $ADAPTERS \
$DEMUX/r1.${SAMPLE}.fq.gz \
$DEMUX/r2.${SAMPLE}.fq.gz \
-q 20 -x 0.5 \
-o $TRIMMED/r1.${SAMPLE}.trim.fq.gz \
-o $TRIMMED/r2.${SAMPLE}.trim.fq.gz
mkdir -p $TH_DIR/${SAMPLE}
tophat2 -G $GENOME_DIR/$GFF \
--library-type fr-firststrand \
--output-dir $TH_DIR/${SAMPLE} \
--max-intron-length 5 \
--min-intron-length 4 \
--transcriptome-max-hits 1 \
--max-multihits 1 \
--no-coverage-search \
--no-novel-juncs \
--no-sort-bam \
$GENOME_DIR/$PREFIX \
$TRIMMED/r1.${SAMPLE}.trim.fq.gz \
$TRIMMED/r2.${SAMPLE}.trim.fq.gz
# ln $TH_DIR/${SAMPLE}/accepted_hits.bam $TH_DIR/${SAMPLE}.bam
# samtools index $TH_DIR/${SAMPLE}.bam
samtools sort -n $TH_DIR/${SAMPLE}/accepted_hits.bam \
-o $TH_DIR/${SAMPLE}/accepted_hits.name.bam
htseq-count --quiet --order=name --format=bam --stranded=reverse --type=gene \
--idattr=Name $TH_DIR/${SAMPLE}/accepted_hits.name.bam \
$GENOME_DIR/$GFF > $COUNT_DIR/${SAMPLE}.csv
done
[78][ABC]_*
Error opening adapter file '/Users/cliburn/work/myinfo/neb_adapters.fasta': No such file or directory
readlink: illegal option -- f
usage: readlink [-n] [file ...]
[2017-08-08 09:09:33] Beginning TopHat run (v2.1.1)
-----------------------------------------------
[2017-08-08 09:09:33] Checking for Bowtie
Bowtie version: 2.3.2.0
[2017-08-08 09:09:33] Checking for Bowtie index files (genome)..
[2017-08-08 09:09:33] Checking for reference FASTA file
[2017-08-08 09:09:33] Generating SAM header for /Users/cliburn/work/scratch/2015_output_all_data/genome/GCA_000010245.1_ASM1024v1_genomic
Error: could not open pipe gzip -cd < /Users/cliburn/work/scratch/2015_output_all_data/trimmed_fastqs/r1.[78][ABC]_*.trim.fq.gz
[E::hts_open_format] fail to open file '/Users/cliburn/work/scratch/2015_output_all_data/th_dir/[78][ABC]_*/accepted_hits.bam'
samtools sort: can't open "/Users/cliburn/work/scratch/2015_output_all_data/th_dir/[78][ABC]_*/accepted_hits.bam": No such file or directory
[Errno 2] No such file or directory: '/Users/cliburn/work/scratch/2015_output_all_data/th_dir/[78][ABC]_*/accepted_hits.name.bam'
[Exception type: FileNotFoundError, raised in count.py:42]
In [10]:
ls $TH_DIR
[78][ABC]_*
In [11]:
ls -ltr $COUNT_DIR
total 0
-rw-r--r-- 1 cliburn staff 0 Aug 8 09:09 [78][ABC]_*.csv
In [12]:
echo "The time it took to run this notebooks was:"
echo "$(($SECONDS/3600)) Hours, $(($SECONDS%3600/60)) Minutes, $(($SECONDS%60)) Seconds"
The time it took to run this notebooks was:
0 Hours, 0 Minutes, 10 Seconds