Counting Reads

The next step after mapping reads is to count the number of reads that fall within each annotated gene in the genome, so lets set up a count directory.

Shell Variables

Assign the variables in this notebook.

In [1]:
# We used these in the last notebook
CUROUT=$HOME/work/scratch/2015_output
GENOME_DIR=$CUROUT/genome
TH_DIR=$CUROUT/th_dir
ACCESSION="GCA_000010245.1_ASM1024v1"
PREFIX=${ACCESSION}_genomic
GFF=$GENOME_DIR/${PREFIX}.gff

# The following are new for this notebook
COUNT_DIR=$CUROUT/counts

Making New Directories

Make the directories that are new in this notebook

In [2]:
mkdir -p $COUNT_DIR

Now let’s check to be sure that worked. We will run ls and check that these directories now exist in the $CUROUT directory.

In [3]:
ls $CUROUT
counts          demux_fastqs    qc_output       trimmed_fastqs

Counting Reads

We will use htseq-count to do the counting, but first we need to make some decisions, because the htseq-count defaults do not work with some annotation files. Here are the most important commandline options that we need to consider: * –format=: Format of the input data. Possible values are sam (for text SAM files) and bam (for binary BAM files). Default is sam. * –stranded=: whether the data is from a strand-specific assay (default: yes). For stranded=no, a read is considered overlapping with a feature regardless of whether it is mapped to the same or the opposite strand as the feature. For stranded=yes and single-end reads, the read has to be mapped to the same strand as the feature. For paired-end reads, the first read has to be on the same strand and the second read on the opposite strand. For stranded=reverse, these rules are reversed. * –type=: feature type (3rd column in GFF file) to be evaluated, all features of other type are ignored (default, suitable for RNA-Seq analysis using an Ensembl GTF file: exon) * –idattr=: GFF attribute to be used as feature ID. Several GFF lines with the same feature ID will be considered as parts of the same feature. The feature ID is used to identity the counts in the output table. The default, suitable for RNA-Seq analysis using an Ensembl GTF file, is gene_id.

And here is how we will set those options: * –format=bam: Since Tophat generated BAM files for us * –stranded=reverse: The dUTP method that we used for generating a strand-specific library produces reads that are anti-sense, htseq-count considers this to be “reverse”. * –type=gene: * –idattr=ID:

We need to look at the GFF file to understand what exactly the --type and --idattr options are, and why we are setting them this way.

In [4]:
head -20 $GFF | cut -c -55
head: /Users/cliburn/work/scratch/2015_output/genome/GCA_000010245.1_ASM1024v1_genomic.gff: No such file or directory
In [5]:
head -20 $GFF
head: /Users/cliburn/work/scratch/2015_output/genome/GCA_000010245.1_ASM1024v1_genomic.gff: No such file or directory

Check to see if the “Name” attribute is unique

In [6]:
grep -v CDS $GFF | cut -f9 | cut -d";" -f2 | sort | uniq -d
grep: /Users/cliburn/work/scratch/2015_output/genome/GCA_000010245.1_ASM1024v1_genomic.gff: No such file or directory

Running htseq-count

So now we are ready! We run htseq-count using htseq-count ALIGNMENT_FILE GFF_FILE. Here is our command for our test sample:

$

* --format=bam: Since Tophat generated BAM files for us
  • –stranded=reverse: The dUTP method that we used for generating a strand-specific library produces reads that are anti-sense, htseq-count considers this to be “reverse”.
  • –type=gene:
  • –idattr=ID:
In [7]:
CUR_DIR=$TH_DIR/r1_8A_pilot
CUR_COUNT=$COUNT_DIR/r1_8A_pilot.tsv
In [8]:
htseq-count --quiet --format=bam --stranded=reverse --type=gene \
    --idattr=Name \
    $CUR_DIR/accepted_hits.bam \
    $GFF > $CUR_COUNT
  [Errno 2] No such file or directory: '/Users/cliburn/work/scratch/2015_output/th_dir/r1_8A_pilot/accepted_hits.bam'
  [Exception type: FileNotFoundError, raised in count.py:42]

Let’s take a quick peek at the results

In [9]:
head $CUR_COUNT

There’s also some useful information at the end of the file:

In [10]:
tail $CUR_COUNT
In [11]:
cat $CUR_COUNT | awk '{s+=$2} END {printf "%.0f", s}'
0
In [12]:
head $CUR_DIR/align_summary.txt
head: /Users/cliburn/work/scratch/2015_output/th_dir/r1_8A_pilot/align_summary.txt: No such file or directory

An easy way to check how many reads map unambiguously to genes is to use grep to remove those lines with information about problem reads that are at the end of the file, then use our awk command again

In [13]:
grep -v __ $CUR_COUNT | awk '{s+=$2} END {printf "%.0f", s}'
0