Working with Paired-Reads

So far we have only been working with a single read file, but we have paired-end read data. If we want to use both reads, we need to do things a little bit different at some of the steps.

Shell Variables

Assign the variables in this notebook.

In [1]:
# We used these in the last notebook
CUROUT=$HOME/work/scratch/2015_output
DEMUX=$CUROUT/demux_fastqs
CURDATA=/data/HTS_2015_data/runs_combined
TRIMMED=$CUROUT/trimmed_fastqs
INFO=$HOME/work/myinfo
GENOME_DIR=$CUROUT/genome
TH_DIR=$CUROUT/th_dir
COUNT_DIR=$CUROUT/counts

ACCESSION="GCA_000010245.1_ASM1024v1"
PREFIX=${ACCESSION}_genomic
GFF=${PREFIX}.gff
FA=${PREFIX}.fa

Making New Directories

No new directories are necessary, we are using directories that were already created for the single-end pipeline.

In [2]:
mkdir -p $TRIMMED $GENOME_DIR $TH_DIR $COUNT_DIR

Running fastq-mcf on Paired Data

It only takes two minor changes to run fastq-mcf on paired data, we need to tell it to also load the read 2 file, and also what to call the trimmed output from this file.

  1. neb_adapters.fasta
  2. r1.8A_pilot.fq.gz
  3. r2.8A_pilot.fq.gz : NEW for paired-data
  4. -q 20
  5. -x 0.5
  6. -o r1.8A_pilot.trim.fastq.gz
  7. -o r2.8A_pilot.trim.fastq.gz : NEW for paired-data
In [3]:
fastq-mcf $INFO/neb_adapters.fasta \
    $DEMUX/r1.8A_pilot.fq.gz \
    $DEMUX/r2.8A_pilot.fq.gz \
    -q 20 -x 0.5 \
    -o $TRIMMED/r1.8A_pilot.trim.fastq.gz \
    -o $TRIMMED/r2.8A_pilot.trim.fastq.gz
Error opening adapter file '/Users/cliburn/work/myinfo/neb_adapters.fasta': No such file or directory

Note: Now that, since we are now including the reverse reads, contamination with the Universal adapter is now observed

In [4]:
ls $TRIMMED

Running Tophat on Paired Data

As with fastq-mcf, running Tophat on Paired Data on requires a minor change - adding the R2 file as an input; of course we will want to save the results to a different output directory.

In [5]:
CUR_DIR=$TH_DIR/paired_8A_pilot
R1_FASTQ=$TRIMMED/r1.8A_pilot.trim.fastq.gz
R2_FASTQ=$TRIMMED/r2.8A_pilot.trim.fastq.gz

mkdir -p $CUR_DIR
ls $TH_DIR
7A_pilot        7C_pilot        8B_pilot        paired_8A_pilot
7B_pilot        8A_pilot        8C_pilot        r1_8A_pilot
In [6]:
tophat2 -G $GENOME_DIR/$GFF \
    --library-type fr-firststrand \
    --output-dir $CUR_DIR \
    --max-intron-length 5 \
    --min-intron-length 4 \
    --transcriptome-max-hits 1 \
    --max-multihits 1 \
    --no-coverage-search \
    --no-novel-juncs \
    --no-sort-bam \
    $GENOME_DIR/$PREFIX \
    $R1_FASTQ \
    $R2_FASTQ
readlink: illegal option -- f
usage: readlink [-n] [file ...]

[2017-08-08 09:10:12] Beginning TopHat run (v2.1.1)
-----------------------------------------------
[2017-08-08 09:10:12] Checking for Bowtie
                  Bowtie version:        2.3.2.0
[2017-08-08 09:10:12] Checking for Bowtie index files (genome)..
[2017-08-08 09:10:12] Checking for reference FASTA file
[2017-08-08 09:10:12] Generating SAM header for /Users/cliburn/work/scratch/2015_output/genome/GCA_000010245.1_ASM1024v1_genomic
Error: could not open pipe gzip -cd < /Users/cliburn/work/scratch/2015_output/trimmed_fastqs/r1.8A_pilot.trim.fastq.gz

In [7]:
cat $CUR_DIR/align_summary.txt
cat: /Users/cliburn/work/scratch/2015_output/th_dir/paired_8A_pilot/align_summary.txt: No such file or directory

In [8]:
samtools flagstat $CUR_DIR/accepted_hits.bam
[E::hts_open_format] fail to open file '/Users/cliburn/work/scratch/2015_output/th_dir/paired_8A_pilot/accepted_hits.bam'
samtools flagstat: Cannot open input file "/Users/cliburn/work/scratch/2015_output/th_dir/paired_8A_pilot/accepted_hits.bam": No such file or directory

Running htseq-count on Paired Data

We have to do one thing different to run htseq-count on paired data. The BAM file output by tophat will contain both reads for each spot, but by default htseq-count expects the two reads in a pair to be right next to each other in the BAM file. By default Tophat sorts the BAM by the read’s position in the genome. We can tell htseq-count to expect this using the --order=pos option, but it sometimes doesn’t like tophat’s sorting, so it is best if we just sort it ourselves by name, and for good measure we will explicitly tell htseq-count that is what we have done with --order=pos.

In [9]:
CUR_COUNT=$COUNT_DIR/paired_8A_pilot.tsv

samtools sort -n $CUR_DIR/accepted_hits.bam \
    -o $CUR_DIR/accepted_hits.name.bam
[E::hts_open_format] fail to open file '/Users/cliburn/work/scratch/2015_output/th_dir/paired_8A_pilot/accepted_hits.bam'
samtools sort: can't open "/Users/cliburn/work/scratch/2015_output/th_dir/paired_8A_pilot/accepted_hits.bam": No such file or directory

In [10]:
htseq-count --quiet --order=name --format=bam --stranded=reverse \
    --type=gene --idattr=Name \
    $CUR_DIR/accepted_hits.name.bam \
    $GENOME_DIR/$GFF > $CUR_COUNT
  [Errno 2] No such file or directory: '/Users/cliburn/work/scratch/2015_output/th_dir/paired_8A_pilot/accepted_hits.name.bam'
  [Exception type: FileNotFoundError, raised in count.py:42]

In [11]:
tail $CUR_COUNT

Sanity Check: Total read counts

In [12]:
head $CUR_DIR/align_summary.txt
head: /Users/cliburn/work/scratch/2015_output/th_dir/paired_8A_pilot/align_summary.txt: No such file or directory

Check how many reads map unambiguously to genes . . .

In [13]:
grep -v __ $CUR_COUNT | awk '{s+=$2} END {printf "%.0f", s}'
0