Counting Reads¶
The next step after mapping reads is to count the number of reads that fall within each annotated gene in the genome, so lets set up a count directory.
Shell Variables¶
Assign the variables in this notebook.
In [1]:
# We used these in the last notebook
CUROUT=$HOME/work/scratch/2015_output
GENOME_DIR=$CUROUT/genome
TH_DIR=$CUROUT/th_dir
ACCESSION="GCA_000010245.1_ASM1024v1"
PREFIX=${ACCESSION}_genomic
GFF=$GENOME_DIR/${PREFIX}.gff
# The following are new for this notebook
COUNT_DIR=$CUROUT/counts
Making New Directories¶
Make the directories that are new in this notebook
In [2]:
mkdir -p $COUNT_DIR
Now let’s check to be sure that worked. We will run ls
and check
that these directories now exist in the $CUROUT
directory.
In [3]:
ls $CUROUT
counts demux_fastqs qc_output trimmed_fastqs
Counting Reads¶
We will use
htseq-count
to do the counting, but first we need to make some decisions, because
the htseq-count
defaults do not work with some annotation files.
Here are the most important commandline options that we need to
consider: * –format=: Format of the input data. Possible values are
sam (for text SAM files) and bam (for binary BAM files). Default is sam.
* –stranded=: whether the data is from a strand-specific assay
(default: yes). For stranded=no, a read is considered overlapping with a
feature regardless of whether it is mapped to the same or the opposite
strand as the feature. For stranded=yes and single-end reads, the read
has to be mapped to the same strand as the feature. For paired-end
reads, the first read has to be on the same strand and the second read
on the opposite strand. For stranded=reverse, these rules are reversed.
* –type=: feature type (3rd column in GFF file) to be evaluated, all
features of other type are ignored (default, suitable for RNA-Seq
analysis using an Ensembl GTF file: exon) * –idattr=: GFF attribute to
be used as feature ID. Several GFF lines with the same feature ID will
be considered as parts of the same feature. The feature ID is used to
identity the counts in the output table. The default, suitable for
RNA-Seq analysis using an Ensembl GTF file, is gene_id.
And here is how we will set those options: * –format=bam: Since Tophat generated BAM files for us * –stranded=reverse: The dUTP method that we used for generating a strand-specific library produces reads that are anti-sense, htseq-count considers this to be “reverse”. * –type=gene: * –idattr=ID:
We need to look at the GFF file to understand what exactly the
--type
and --idattr
options are, and why we are setting them
this way.
In [4]:
head -20 $GFF | cut -c -55
head: /Users/cliburn/work/scratch/2015_output/genome/GCA_000010245.1_ASM1024v1_genomic.gff: No such file or directory
In [5]:
head -20 $GFF
head: /Users/cliburn/work/scratch/2015_output/genome/GCA_000010245.1_ASM1024v1_genomic.gff: No such file or directory
Check to see if the “Name” attribute is unique¶
In [6]:
grep -v CDS $GFF | cut -f9 | cut -d";" -f2 | sort | uniq -d
grep: /Users/cliburn/work/scratch/2015_output/genome/GCA_000010245.1_ASM1024v1_genomic.gff: No such file or directory
Running htseq-count¶
So now we are ready! We run htseq-count
using
htseq-count ALIGNMENT_FILE GFF_FILE
. Here is our command for our
test sample:
$
* --format=bam: Since Tophat generated BAM files for us
- –stranded=reverse: The dUTP method that we used for generating a strand-specific library produces reads that are anti-sense, htseq-count considers this to be “reverse”.
- –type=gene:
- –idattr=ID:
In [7]:
CUR_DIR=$TH_DIR/r1_8A_pilot
CUR_COUNT=$COUNT_DIR/r1_8A_pilot.tsv
In [8]:
htseq-count --quiet --format=bam --stranded=reverse --type=gene \
--idattr=Name \
$CUR_DIR/accepted_hits.bam \
$GFF > $CUR_COUNT
[Errno 2] No such file or directory: '/Users/cliburn/work/scratch/2015_output/th_dir/r1_8A_pilot/accepted_hits.bam'
[Exception type: FileNotFoundError, raised in count.py:42]
Let’s take a quick peek at the results
In [9]:
head $CUR_COUNT
There’s also some useful information at the end of the file:
In [10]:
tail $CUR_COUNT
In [11]:
cat $CUR_COUNT | awk '{s+=$2} END {printf "%.0f", s}'
0
In [12]:
head $CUR_DIR/align_summary.txt
head: /Users/cliburn/work/scratch/2015_output/th_dir/r1_8A_pilot/align_summary.txt: No such file or directory
An easy way to check how many reads map unambiguously to genes is to use grep to remove those lines with information about problem reads that are at the end of the file, then use our awk command again
In [13]:
grep -v __ $CUR_COUNT | awk '{s+=$2} END {printf "%.0f", s}'
0