Counting Reads¶
The next step after mapping reads is to count the number of reads that fall within each annotated gene in the genome, so lets set up a count directory.
Shell Variables¶
In [1]:
# Source the config script
source bioinf_intro_config.sh
mkdir -p $COUNT_OUT
ls $CUROUT
count_out igv star_out
demo_chmod myinfo stuff_for_igv.tgz
genome qc_output trimmed_fastqs
Counting Reads¶
In [2]:
htseq-count --help
Usage: htseq-count [options] alignment_file gff_file
This script takes an alignment file in SAM/BAM format and a feature file in
GFF format and calculates for each feature the number of reads mapping to it.
See http://www-huber.embl.de/users/anders/HTSeq/doc/count.html for details.
Options:
-h, --help show this help message and exit
-f SAMTYPE, --format=SAMTYPE
type of <alignment_file> data, either 'sam' or 'bam'
(default: sam)
-r ORDER, --order=ORDER
'pos' or 'name'. Sorting order of <alignment_file>
(default: name). Paired-end sequencing data must be
sorted either by position or by read name, and the
sorting order must be specified. Ignored for single-
end data.
-s STRANDED, --stranded=STRANDED
whether the data is from a strand-specific assay.
Specify 'yes', 'no', or 'reverse' (default: yes).
'reverse' means 'yes' with reversed strand
interpretation
-a MINAQUAL, --minaqual=MINAQUAL
skip all reads with alignment quality lower than the
given minimum value (default: 10)
-t FEATURETYPE, --type=FEATURETYPE
feature type (3rd column in GFF file) to be used, all
features of other type are ignored (default, suitable
for Ensembl GTF files: exon)
-i IDATTR, --idattr=IDATTR
GFF attribute to be used as feature ID (default,
suitable for Ensembl GTF files: gene_id)
-m MODE, --mode=MODE mode to handle reads overlapping more than one feature
(choices: union, intersection-strict, intersection-
nonempty; default: union)
-o SAMOUT, --samout=SAMOUT
write out all SAM alignment records into an output SAM
file called SAMOUT, annotating each line with its
feature assignment (as an optional field with tag
'XF')
-q, --quiet suppress progress report
Written by Simon Anders (sanders@fs.tum.de), European Molecular Biology
Laboratory (EMBL). (c) 2010. Released under the terms of the GNU General
Public License v3. Part of the 'HTSeq' framework, version 0.7.2.
We will use
htseq-count
to do the counting, but first we need to make some decisions, because
the htseq-count
defaults do not work with some annotation files.
Here are the most important commandline options that we need to
consider: * –format=: Format of the input data. Possible values are
sam (for text SAM files) and bam (for binary BAM files). Default is sam.
* –stranded=: whether the data is from a strand-specific assay
(default: yes). For stranded=no, a read is considered overlapping with a
feature regardless of whether it is mapped to the same or the opposite
strand as the feature. For stranded=yes and single-end reads, the read
has to be mapped to the same strand as the feature. For paired-end
reads, the first read has to be on the same strand and the second read
on the opposite strand. For stranded=reverse, these rules are reversed.
* –type=: feature type (3rd column in GFF file) to be evaluated, all
features of other type are ignored (default, suitable for RNA-Seq
analysis using an Ensembl GTF file: exon) * –idattr=: GFF attribute to
be used as feature ID. Several GFF lines with the same feature ID will
be considered as parts of the same feature. The feature ID is used to
identity the counts in the output table. The default, suitable for
RNA-Seq analysis using an Ensembl GTF file, is gene_id.
And here is how we will set those options: * –format=bam: Since Tophat generated BAM files for us * –stranded=reverse: The dUTP method that we used for generating a strand-specific library produces reads that are anti-sense, htseq-count considers this to be “reverse”.
We need to look at the GFF file to understand what exactly the
--type
and --idattr
options are, and why we are setting them
this way.
In [3]:
head -20 $GTF
#!genome-build CNA3
#!genome-version CNA3
#!genome-date 2015-11
#!genome-build-accession GCA_000149245.3
#!genebuild-last-updated 2015-11
1 ena gene 100 5645 . - . gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1 ena transcript 100 5645 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1 ena exon 5494 5645 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";
1 ena CDS 5494 5645 . - 0 gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
1 ena start_codon 5643 5645 . - 0 gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1 ena exon 5322 5422 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "2"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-2";
1 ena CDS 5322 5422 . - 1 gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "2"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
1 ena exon 3958 5263 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "3"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-3";
1 ena CDS 3958 5263 . - 2 gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "3"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
1 ena exon 3206 3890 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "4"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-4";
1 ena CDS 3206 3890 . - 1 gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "4"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
1 ena exon 2846 3126 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "5"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-5";
1 ena CDS 2846 3126 . - 0 gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "5"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
1 ena exon 2322 2782 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "6"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-6";
1 ena CDS 2322 2782 . - 1 gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "6"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
In [4]:
head -50 $GTF | cut -c -55
#!genome-build CNA3
#!genome-version CNA3
#!genome-date 2015-11
#!genome-build-accession GCA_000149245.3
#!genebuild-last-updated 2015-11
1 ena gene 100 5645 . - . gene_id "CNAG_04548"; gene_so
1 ena transcript 100 5645 . - . gene_id "CNAG_04548"; t
1 ena exon 5494 5645 . - . gene_id "CNAG_04548"; transc
1 ena CDS 5494 5645 . - 0 gene_id "CNAG_04548"; transcr
1 ena start_codon 5643 5645 . - 0 gene_id "CNAG_04548";
1 ena exon 5322 5422 . - . gene_id "CNAG_04548"; transc
1 ena CDS 5322 5422 . - 1 gene_id "CNAG_04548"; transcr
1 ena exon 3958 5263 . - . gene_id "CNAG_04548"; transc
1 ena CDS 3958 5263 . - 2 gene_id "CNAG_04548"; transcr
1 ena exon 3206 3890 . - . gene_id "CNAG_04548"; transc
1 ena CDS 3206 3890 . - 1 gene_id "CNAG_04548"; transcr
1 ena exon 2846 3126 . - . gene_id "CNAG_04548"; transc
1 ena CDS 2846 3126 . - 0 gene_id "CNAG_04548"; transcr
1 ena exon 2322 2782 . - . gene_id "CNAG_04548"; transc
1 ena CDS 2322 2782 . - 1 gene_id "CNAG_04548"; transcr
1 ena exon 1823 2274 . - . gene_id "CNAG_04548"; transc
1 ena CDS 1823 2274 . - 2 gene_id "CNAG_04548"; transcr
1 ena exon 1556 1767 . - . gene_id "CNAG_04548"; transc
1 ena CDS 1556 1767 . - 0 gene_id "CNAG_04548"; transcr
1 ena exon 100 1497 . - . gene_id "CNAG_04548"; transcr
1 ena CDS 168 1497 . - 1 gene_id "CNAG_04548"; transcri
1 ena three_prime_utr 100 167 . - . gene_id "CNAG_04548
1 ena gene 5928 7982 . - . gene_id "CNAG_07303"; gene_s
1 ena transcript 5928 7982 . - . gene_id "CNAG_07303";
1 ena exon 7685 7982 . - . gene_id "CNAG_07303"; transc
1 ena CDS 7685 7769 . - 0 gene_id "CNAG_07303"; transcr
1 ena start_codon 7767 7769 . - 0 gene_id "CNAG_07303";
1 ena exon 5928 7626 . - . gene_id "CNAG_07303"; transc
1 ena CDS 6209 7626 . - 2 gene_id "CNAG_07303"; transcr
1 ena five_prime_utr 7770 7982 . - . gene_id "CNAG_0730
1 ena three_prime_utr 5928 6208 . - . gene_id "CNAG_073
1 ena transcript 6209 7769 . - . gene_id "CNAG_07303";
1 ena exon 7685 7769 . - . gene_id "CNAG_07303"; transc
1 ena CDS 7685 7769 . - 0 gene_id "CNAG_07303"; transcr
1 ena start_codon 7767 7769 . - 0 gene_id "CNAG_07303";
1 ena exon 6209 7626 . - . gene_id "CNAG_07303"; transc
1 ena CDS 6212 7626 . - 2 gene_id "CNAG_07303"; transcr
1 ena stop_codon 6209 6211 . - 0 gene_id "CNAG_07303";
1 ena gene 8766 9603 . - . gene_id "CNAG_07304"; gene_s
1 ena transcript 8766 9603 . - . gene_id "CNAG_07304";
1 ena exon 9311 9603 . - . gene_id "CNAG_07304"; transc
1 ena CDS 9311 9480 . - 0 gene_id "CNAG_07304"; transcr
1 ena start_codon 9478 9480 . - 0 gene_id "CNAG_07304";
1 ena exon 8766 9246 . - . gene_id "CNAG_07304"; transc
1 ena CDS 9129 9246 . - 1 gene_id "CNAG_07304"; transcr
Running htseq-count¶
So now we are ready! We run htseq-count
using
htseq-count ALIGNMENT_FILE GFF_FILE
. Here is our command for our
test sample:
- –format=bam: Since Tophat generated BAM files for us
- –stranded=reverse: The dUTP method that we used for generating a strand-specific library produces reads that are anti-sense, htseq-count considers this to be “reverse”.
In [5]:
htseq-count --quiet \
--format=bam \
--stranded=reverse \
${STAR_OUT}/27_MA_P_S38_L002_R1_Aligned.out.bam \
$GTF > ${COUNT_OUT}/27_MA_P_S38_L002_R1.tsv
[Errno 2] No such file or directory: '/Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L002_R1_Aligned.out.bam'
[Exception type: FileNotFoundError, raised in count.py:42]
Let’s take a quick peek at the results
In [6]:
head ${COUNT_OUT}/27_MA_P_S38_L002_R1.tsv
There’s also some useful information at the end of the file:
In [7]:
tail ${COUNT_OUT}/27_MA_P_S38_L002_R1.tsv
Sanity Check: Total read counts¶
In [8]:
head ${STAR_OUT}/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab
head: /Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab: No such file or directory
In [9]:
cat ${STAR_OUT}/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab | awk '{s+=$4} END {printf "%.0f", s}'
cat: /Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab: No such file or directory
0
In [10]:
cat ${COUNT_OUT}/27_MA_P_S38_L002_R1.tsv | awk '{s+=$2} END {printf "%.0f", s}'
0
In [11]:
ls ${STAR_OUT}/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab
ls: /Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab: No such file or directory
In [12]:
cat ${STAR_OUT}/27_MA_P_S38_L002_R1_Log.final.out
cat: /Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L002_R1_Log.final.out: No such file or directory
In [13]:
grep "Uniquely mapped reads number" ${STAR_OUT}/27_MA_P_S38_L002_R1_Log.final.out
grep: /Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L002_R1_Log.final.out: No such file or directory
An easy way to check how many reads map unambiguously to genes is to use grep to remove those lines with information about problem reads that are at the end of the file, then use our awk command again
In [14]:
grep __ ${COUNT_OUT}/27_MA_P_S38_L002_R1.tsv
In [15]:
grep -v __ ${COUNT_OUT}/27_MA_P_S38_L002_R1.tsv | awk '{s+=$2} END {printf "%.0f", s}'
0
In [16]:
grep N_ ${STAR_OUT}/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab
grep: /Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab: No such file or directory
In [17]:
grep -v N_ ${STAR_OUT}/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab | awk '{s+=$4} END {printf "%.0f", s}'
grep: /Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab: No such file or directory
0