Working with Paired-Reads¶
So far we have only been working with a single read file, but we have paired-end read data. If we want to use both reads, we need to do things a little bit different at some of the steps.
Shell Variables¶
Assign the variables in this notebook.
In [1]:
# We used these in the last notebook
CUROUT=$HOME/work/scratch/2015_output
DEMUX=$CUROUT/demux_fastqs
CURDATA=/data/HTS_2015_data/runs_combined
TRIMMED=$CUROUT/trimmed_fastqs
INFO=$HOME/work/myinfo
GENOME_DIR=$CUROUT/genome
TH_DIR=$CUROUT/th_dir
COUNT_DIR=$CUROUT/counts
ACCESSION="GCA_000010245.1_ASM1024v1"
PREFIX=${ACCESSION}_genomic
GFF=${PREFIX}.gff
FA=${PREFIX}.fa
Making New Directories¶
No new directories are necessary, we are using directories that were already created for the single-end pipeline.
In [2]:
mkdir -p $TRIMMED $GENOME_DIR $TH_DIR $COUNT_DIR
Running fastq-mcf on Paired Data¶
It only takes two minor changes to run fastq-mcf on paired data, we need to tell it to also load the read 2 file, and also what to call the trimmed output from this file.
- neb_adapters.fasta
- r1.8A_pilot.fq.gz
- r2.8A_pilot.fq.gz : NEW for paired-data
- -q 20
- -x 0.5
- -o r1.8A_pilot.trim.fastq.gz
- -o r2.8A_pilot.trim.fastq.gz : NEW for paired-data
In [3]:
fastq-mcf $INFO/neb_adapters.fasta \
$DEMUX/r1.8A_pilot.fq.gz \
$DEMUX/r2.8A_pilot.fq.gz \
-q 20 -x 0.5 \
-o $TRIMMED/r1.8A_pilot.trim.fastq.gz \
-o $TRIMMED/r2.8A_pilot.trim.fastq.gz
Error opening adapter file '/Users/cliburn/work/myinfo/neb_adapters.fasta': No such file or directory
Note: Now that, since we are now including the reverse reads, contamination with the Universal adapter is now observed
In [4]:
ls $TRIMMED
Running Tophat on Paired Data¶
As with fastq-mcf, running Tophat on Paired Data on requires a minor change - adding the R2 file as an input; of course we will want to save the results to a different output directory.
In [5]:
CUR_DIR=$TH_DIR/paired_8A_pilot
R1_FASTQ=$TRIMMED/r1.8A_pilot.trim.fastq.gz
R2_FASTQ=$TRIMMED/r2.8A_pilot.trim.fastq.gz
mkdir -p $CUR_DIR
ls $TH_DIR
7A_pilot 7C_pilot 8B_pilot paired_8A_pilot
7B_pilot 8A_pilot 8C_pilot r1_8A_pilot
In [6]:
tophat2 -G $GENOME_DIR/$GFF \
--library-type fr-firststrand \
--output-dir $CUR_DIR \
--max-intron-length 5 \
--min-intron-length 4 \
--transcriptome-max-hits 1 \
--max-multihits 1 \
--no-coverage-search \
--no-novel-juncs \
--no-sort-bam \
$GENOME_DIR/$PREFIX \
$R1_FASTQ \
$R2_FASTQ
readlink: illegal option -- f
usage: readlink [-n] [file ...]
[2017-08-08 09:10:12] Beginning TopHat run (v2.1.1)
-----------------------------------------------
[2017-08-08 09:10:12] Checking for Bowtie
Bowtie version: 2.3.2.0
[2017-08-08 09:10:12] Checking for Bowtie index files (genome)..
[2017-08-08 09:10:12] Checking for reference FASTA file
[2017-08-08 09:10:12] Generating SAM header for /Users/cliburn/work/scratch/2015_output/genome/GCA_000010245.1_ASM1024v1_genomic
Error: could not open pipe gzip -cd < /Users/cliburn/work/scratch/2015_output/trimmed_fastqs/r1.8A_pilot.trim.fastq.gz
In [7]:
cat $CUR_DIR/align_summary.txt
cat: /Users/cliburn/work/scratch/2015_output/th_dir/paired_8A_pilot/align_summary.txt: No such file or directory
In [8]:
samtools flagstat $CUR_DIR/accepted_hits.bam
[E::hts_open_format] fail to open file '/Users/cliburn/work/scratch/2015_output/th_dir/paired_8A_pilot/accepted_hits.bam'
samtools flagstat: Cannot open input file "/Users/cliburn/work/scratch/2015_output/th_dir/paired_8A_pilot/accepted_hits.bam": No such file or directory
Running htseq-count on Paired Data¶
We have to do one thing different to run htseq-count on paired data. The
BAM file output by tophat will contain both reads for each spot, but by
default htseq-count expects the two reads in a pair to be right next to
each other in the BAM file. By default Tophat sorts the BAM by the
read’s position in the genome. We can tell htseq-count to expect this
using the --order=pos
option, but it sometimes doesn’t like tophat’s
sorting, so it is best if we just sort it ourselves by name, and for
good measure we will explicitly tell htseq-count that is what we have
done with --order=pos
.
In [9]:
CUR_COUNT=$COUNT_DIR/paired_8A_pilot.tsv
samtools sort -n $CUR_DIR/accepted_hits.bam \
-o $CUR_DIR/accepted_hits.name.bam
[E::hts_open_format] fail to open file '/Users/cliburn/work/scratch/2015_output/th_dir/paired_8A_pilot/accepted_hits.bam'
samtools sort: can't open "/Users/cliburn/work/scratch/2015_output/th_dir/paired_8A_pilot/accepted_hits.bam": No such file or directory
In [10]:
htseq-count --quiet --order=name --format=bam --stranded=reverse \
--type=gene --idattr=Name \
$CUR_DIR/accepted_hits.name.bam \
$GENOME_DIR/$GFF > $CUR_COUNT
[Errno 2] No such file or directory: '/Users/cliburn/work/scratch/2015_output/th_dir/paired_8A_pilot/accepted_hits.name.bam'
[Exception type: FileNotFoundError, raised in count.py:42]
In [11]:
tail $CUR_COUNT
Sanity Check: Total read counts¶
In [12]:
head $CUR_DIR/align_summary.txt
head: /Users/cliburn/work/scratch/2015_output/th_dir/paired_8A_pilot/align_summary.txt: No such file or directory
Check how many reads map unambiguously to genes . . .
In [13]:
grep -v __ $CUR_COUNT | awk '{s+=$2} END {printf "%.0f", s}'
0