Working with Paired-Reads¶

So far we have only been working with a single read file, but we have paired-end read data. If we want to use both reads, we need to do things a little bit different at some of the steps.

Shell Variables¶

Assign the variables in this notebook.

In [1]:

# We used these in the last notebook
CUROUT=$HOME/work/scratch/2015_output
DEMUX=$CUROUT/demux_fastqs
CURDATA=/data/HTS_2015_data/runs_combined
TRIMMED=$CUROUT/trimmed_fastqs
INFO=$HOME/work/myinfo
GENOME_DIR=$CUROUT/genome
TH_DIR=$CUROUT/th_dir
COUNT_DIR=$CUROUT/counts

ACCESSION="GCA_000010245.1_ASM1024v1"
PREFIX=${ACCESSION}_genomic
GFF=${PREFIX}.gff
FA=${PREFIX}.fa

Making New Directories¶

No new directories are necessary, we are using directories that were already created for the single-end pipeline.

In [2]:

mkdir -p $TRIMMED $GENOME_DIR $TH_DIR $COUNT_DIR

Running fastq-mcf on Paired Data¶

It only takes two minor changes to run fastq-mcf on paired data, we need to tell it to also load the read 2 file, and also what to call the trimmed output from this file.

neb_adapters.fasta
r1.8A_pilot.fq.gz
r2.8A_pilot.fq.gz : NEW for paired-data
-q 20
-x 0.5
-o r1.8A_pilot.trim.fastq.gz
-o r2.8A_pilot.trim.fastq.gz : NEW for paired-data

In [3]:

fastq-mcf $INFO/neb_adapters.fasta \
    $DEMUX/r1.8A_pilot.fq.gz \
    $DEMUX/r2.8A_pilot.fq.gz \
    -q 20 -x 0.5 \
    -o $TRIMMED/r1.8A_pilot.trim.fastq.gz \
    -o $TRIMMED/r2.8A_pilot.trim.fastq.gz

Error opening adapter file '/Users/cliburn/work/myinfo/neb_adapters.fasta': No such file or directory

Note: Now that, since we are now including the reverse reads, contamination with the Universal adapter is now observed

In [4]:

ls $TRIMMED

Running Tophat on Paired Data¶

As with fastq-mcf, running Tophat on Paired Data on requires a minor change - adding the R2 file as an input; of course we will want to save the results to a different output directory.

In [5]:

CUR_DIR=$TH_DIR/paired_8A_pilot
R1_FASTQ=$TRIMMED/r1.8A_pilot.trim.fastq.gz
R2_FASTQ=$TRIMMED/r2.8A_pilot.trim.fastq.gz

mkdir -p $CUR_DIR
ls $TH_DIR

7A_pilot        7C_pilot        8B_pilot        paired_8A_pilot
7B_pilot        8A_pilot        8C_pilot        r1_8A_pilot

In [6]:

tophat2 -G $GENOME_DIR/$GFF \
    --library-type fr-firststrand \
    --output-dir $CUR_DIR \
    --max-intron-length 5 \
    --min-intron-length 4 \
    --transcriptome-max-hits 1 \
    --max-multihits 1 \
    --no-coverage-search \
    --no-novel-juncs \
    --no-sort-bam \
    $GENOME_DIR/$PREFIX \
    $R1_FASTQ \
    $R2_FASTQ

readlink: illegal option -- f
usage: readlink [-n] [file ...]

[2017-08-08 09:10:12] Beginning TopHat run (v2.1.1)
-----------------------------------------------
[2017-08-08 09:10:12] Checking for Bowtie
                  Bowtie version:        2.3.2.0
[2017-08-08 09:10:12] Checking for Bowtie index files (genome)..
[2017-08-08 09:10:12] Checking for reference FASTA file
[2017-08-08 09:10:12] Generating SAM header for /Users/cliburn/work/scratch/2015_output/genome/GCA_000010245.1_ASM1024v1_genomic
Error: could not open pipe gzip -cd < /Users/cliburn/work/scratch/2015_output/trimmed_fastqs/r1.8A_pilot.trim.fastq.gz

In [7]:

cat $CUR_DIR/align_summary.txt

cat: /Users/cliburn/work/scratch/2015_output/th_dir/paired_8A_pilot/align_summary.txt: No such file or directory

In [8]:

samtools flagstat $CUR_DIR/accepted_hits.bam

[E::hts_open_format] fail to open file '/Users/cliburn/work/scratch/2015_output/th_dir/paired_8A_pilot/accepted_hits.bam'
samtools flagstat: Cannot open input file "/Users/cliburn/work/scratch/2015_output/th_dir/paired_8A_pilot/accepted_hits.bam": No such file or directory

Running htseq-count on Paired Data¶

We have to do one thing different to run htseq-count on paired data. The BAM file output by tophat will contain both reads for each spot, but by default htseq-count expects the two reads in a pair to be right next to each other in the BAM file. By default Tophat sorts the BAM by the read’s position in the genome. We can tell htseq-count to expect this using the --order=pos option, but it sometimes doesn’t like tophat’s sorting, so it is best if we just sort it ourselves by name, and for good measure we will explicitly tell htseq-count that is what we have done with --order=pos.

In [9]:

CUR_COUNT=$COUNT_DIR/paired_8A_pilot.tsv

samtools sort -n $CUR_DIR/accepted_hits.bam \
    -o $CUR_DIR/accepted_hits.name.bam

[E::hts_open_format] fail to open file '/Users/cliburn/work/scratch/2015_output/th_dir/paired_8A_pilot/accepted_hits.bam'
samtools sort: can't open "/Users/cliburn/work/scratch/2015_output/th_dir/paired_8A_pilot/accepted_hits.bam": No such file or directory

In [10]:

htseq-count --quiet --order=name --format=bam --stranded=reverse \
    --type=gene --idattr=Name \
    $CUR_DIR/accepted_hits.name.bam \
    $GENOME_DIR/$GFF > $CUR_COUNT

  [Errno 2] No such file or directory: '/Users/cliburn/work/scratch/2015_output/th_dir/paired_8A_pilot/accepted_hits.name.bam'
  [Exception type: FileNotFoundError, raised in count.py:42]

In [11]:

tail $CUR_COUNT

Sanity Check: Total read counts¶

In [12]:

head $CUR_DIR/align_summary.txt

head: /Users/cliburn/work/scratch/2015_output/th_dir/paired_8A_pilot/align_summary.txt: No such file or directory

Check how many reads map unambiguously to genes . . .

In [13]:

grep -v __ $CUR_COUNT | awk '{s+=$2} END {printf "%.0f", s}'

Working with Paired-Reads¶

Shell Variables¶

Making New Directories¶

Running fastq-mcf on Paired Data¶

Running Tophat on Paired Data¶

Running htseq-count on Paired Data¶

Sanity Check: Total read counts¶

Page contents

Previous page

Next page

This Page