Counting Reads

The next step after mapping reads is to count the number of reads that fall within each annotated gene in the genome, so lets set up a count directory.

Shell Variables

In [1]:
# Source the config script
source bioinf_intro_config.sh

mkdir -p $COUNT_OUT
ls $CUROUT
count_out               igv                     star_out
demo_chmod              myinfo                  stuff_for_igv.tgz
genome                  qc_output               trimmed_fastqs

Counting Reads

In [2]:
htseq-count --help
Usage: htseq-count [options] alignment_file gff_file

This script takes an alignment file in SAM/BAM format and a feature file in
GFF format and calculates for each feature the number of reads mapping to it.
See http://www-huber.embl.de/users/anders/HTSeq/doc/count.html for details.

Options:
  -h, --help            show this help message and exit
  -f SAMTYPE, --format=SAMTYPE
                        type of <alignment_file> data, either 'sam' or 'bam'
                        (default: sam)
  -r ORDER, --order=ORDER
                        'pos' or 'name'. Sorting order of <alignment_file>
                        (default: name). Paired-end sequencing data must be
                        sorted either by position or by read name, and the
                        sorting order must be specified. Ignored for single-
                        end data.
  -s STRANDED, --stranded=STRANDED
                        whether the data is from a strand-specific assay.
                        Specify 'yes', 'no', or 'reverse' (default: yes).
                        'reverse' means 'yes' with reversed strand
                        interpretation
  -a MINAQUAL, --minaqual=MINAQUAL
                        skip all reads with alignment quality lower than the
                        given minimum value (default: 10)
  -t FEATURETYPE, --type=FEATURETYPE
                        feature type (3rd column in GFF file) to be used, all
                        features of other type are ignored (default, suitable
                        for Ensembl GTF files: exon)
  -i IDATTR, --idattr=IDATTR
                        GFF attribute to be used as feature ID (default,
                        suitable for Ensembl GTF files: gene_id)
  -m MODE, --mode=MODE  mode to handle reads overlapping more than one feature
                        (choices: union, intersection-strict, intersection-
                        nonempty; default: union)
  -o SAMOUT, --samout=SAMOUT
                        write out all SAM alignment records into an output SAM
                        file called SAMOUT, annotating each line with its
                        feature assignment (as an optional field with tag
                        'XF')
  -q, --quiet           suppress progress report

Written by Simon Anders (sanders@fs.tum.de), European Molecular Biology
Laboratory (EMBL). (c) 2010. Released under the terms of the GNU General
Public License v3. Part of the 'HTSeq' framework, version 0.7.2.

We will use htseq-count to do the counting, but first we need to make some decisions, because the htseq-count defaults do not work with some annotation files. Here are the most important commandline options that we need to consider: * –format=: Format of the input data. Possible values are sam (for text SAM files) and bam (for binary BAM files). Default is sam. * –stranded=: whether the data is from a strand-specific assay (default: yes). For stranded=no, a read is considered overlapping with a feature regardless of whether it is mapped to the same or the opposite strand as the feature. For stranded=yes and single-end reads, the read has to be mapped to the same strand as the feature. For paired-end reads, the first read has to be on the same strand and the second read on the opposite strand. For stranded=reverse, these rules are reversed. * –type=: feature type (3rd column in GFF file) to be evaluated, all features of other type are ignored (default, suitable for RNA-Seq analysis using an Ensembl GTF file: exon) * –idattr=: GFF attribute to be used as feature ID. Several GFF lines with the same feature ID will be considered as parts of the same feature. The feature ID is used to identity the counts in the output table. The default, suitable for RNA-Seq analysis using an Ensembl GTF file, is gene_id.

And here is how we will set those options: * –format=bam: Since Tophat generated BAM files for us * –stranded=reverse: The dUTP method that we used for generating a strand-specific library produces reads that are anti-sense, htseq-count considers this to be “reverse”.

We need to look at the GFF file to understand what exactly the --type and --idattr options are, and why we are setting them this way.

In [3]:
head -20 $GTF
#!genome-build CNA3
#!genome-version CNA3
#!genome-date 2015-11
#!genome-build-accession GCA_000149245.3
#!genebuild-last-updated 2015-11
1       ena     gene    100     5645    .       -       .       gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1       ena     transcript      100     5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1       ena     exon    5494    5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";
1       ena     CDS     5494    5645    .       -       0       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
1       ena     start_codon     5643    5645    .       -       0       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1       ena     exon    5322    5422    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "2"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-2";
1       ena     CDS     5322    5422    .       -       1       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "2"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
1       ena     exon    3958    5263    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "3"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-3";
1       ena     CDS     3958    5263    .       -       2       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "3"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
1       ena     exon    3206    3890    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "4"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-4";
1       ena     CDS     3206    3890    .       -       1       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "4"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
1       ena     exon    2846    3126    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "5"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-5";
1       ena     CDS     2846    3126    .       -       0       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "5"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
1       ena     exon    2322    2782    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "6"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-6";
1       ena     CDS     2322    2782    .       -       1       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "6"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
In [4]:
head -50 $GTF | cut -c -55
#!genome-build CNA3
#!genome-version CNA3
#!genome-date 2015-11
#!genome-build-accession GCA_000149245.3
#!genebuild-last-updated 2015-11
1       ena     gene    100     5645    .       -       .       gene_id "CNAG_04548"; gene_so
1       ena     transcript      100     5645    .       -       .       gene_id "CNAG_04548"; t
1       ena     exon    5494    5645    .       -       .       gene_id "CNAG_04548"; transc
1       ena     CDS     5494    5645    .       -       0       gene_id "CNAG_04548"; transcr
1       ena     start_codon     5643    5645    .       -       0       gene_id "CNAG_04548";
1       ena     exon    5322    5422    .       -       .       gene_id "CNAG_04548"; transc
1       ena     CDS     5322    5422    .       -       1       gene_id "CNAG_04548"; transcr
1       ena     exon    3958    5263    .       -       .       gene_id "CNAG_04548"; transc
1       ena     CDS     3958    5263    .       -       2       gene_id "CNAG_04548"; transcr
1       ena     exon    3206    3890    .       -       .       gene_id "CNAG_04548"; transc
1       ena     CDS     3206    3890    .       -       1       gene_id "CNAG_04548"; transcr
1       ena     exon    2846    3126    .       -       .       gene_id "CNAG_04548"; transc
1       ena     CDS     2846    3126    .       -       0       gene_id "CNAG_04548"; transcr
1       ena     exon    2322    2782    .       -       .       gene_id "CNAG_04548"; transc
1       ena     CDS     2322    2782    .       -       1       gene_id "CNAG_04548"; transcr
1       ena     exon    1823    2274    .       -       .       gene_id "CNAG_04548"; transc
1       ena     CDS     1823    2274    .       -       2       gene_id "CNAG_04548"; transcr
1       ena     exon    1556    1767    .       -       .       gene_id "CNAG_04548"; transc
1       ena     CDS     1556    1767    .       -       0       gene_id "CNAG_04548"; transcr
1       ena     exon    100     1497    .       -       .       gene_id "CNAG_04548"; transcr
1       ena     CDS     168     1497    .       -       1       gene_id "CNAG_04548"; transcri
1       ena     three_prime_utr 100     167     .       -       .       gene_id "CNAG_04548
1       ena     gene    5928    7982    .       -       .       gene_id "CNAG_07303"; gene_s
1       ena     transcript      5928    7982    .       -       .       gene_id "CNAG_07303";
1       ena     exon    7685    7982    .       -       .       gene_id "CNAG_07303"; transc
1       ena     CDS     7685    7769    .       -       0       gene_id "CNAG_07303"; transcr
1       ena     start_codon     7767    7769    .       -       0       gene_id "CNAG_07303";
1       ena     exon    5928    7626    .       -       .       gene_id "CNAG_07303"; transc
1       ena     CDS     6209    7626    .       -       2       gene_id "CNAG_07303"; transcr
1       ena     five_prime_utr  7770    7982    .       -       .       gene_id "CNAG_0730
1       ena     three_prime_utr 5928    6208    .       -       .       gene_id "CNAG_073
1       ena     transcript      6209    7769    .       -       .       gene_id "CNAG_07303";
1       ena     exon    7685    7769    .       -       .       gene_id "CNAG_07303"; transc
1       ena     CDS     7685    7769    .       -       0       gene_id "CNAG_07303"; transcr
1       ena     start_codon     7767    7769    .       -       0       gene_id "CNAG_07303";
1       ena     exon    6209    7626    .       -       .       gene_id "CNAG_07303"; transc
1       ena     CDS     6212    7626    .       -       2       gene_id "CNAG_07303"; transcr
1       ena     stop_codon      6209    6211    .       -       0       gene_id "CNAG_07303";
1       ena     gene    8766    9603    .       -       .       gene_id "CNAG_07304"; gene_s
1       ena     transcript      8766    9603    .       -       .       gene_id "CNAG_07304";
1       ena     exon    9311    9603    .       -       .       gene_id "CNAG_07304"; transc
1       ena     CDS     9311    9480    .       -       0       gene_id "CNAG_07304"; transcr
1       ena     start_codon     9478    9480    .       -       0       gene_id "CNAG_07304";
1       ena     exon    8766    9246    .       -       .       gene_id "CNAG_07304"; transc
1       ena     CDS     9129    9246    .       -       1       gene_id "CNAG_07304"; transcr

Running htseq-count

So now we are ready! We run htseq-count using htseq-count ALIGNMENT_FILE GFF_FILE. Here is our command for our test sample:

  • –format=bam: Since Tophat generated BAM files for us
  • –stranded=reverse: The dUTP method that we used for generating a strand-specific library produces reads that are anti-sense, htseq-count considers this to be “reverse”.
In [5]:
htseq-count --quiet \
    --format=bam \
    --stranded=reverse \
    ${STAR_OUT}/27_MA_P_S38_L002_R1_Aligned.out.bam \
    $GTF > ${COUNT_OUT}/27_MA_P_S38_L002_R1.tsv
  [Errno 2] No such file or directory: '/Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L002_R1_Aligned.out.bam'
  [Exception type: FileNotFoundError, raised in count.py:42]

Let’s take a quick peek at the results

In [6]:
head ${COUNT_OUT}/27_MA_P_S38_L002_R1.tsv

There’s also some useful information at the end of the file:

In [7]:
tail ${COUNT_OUT}/27_MA_P_S38_L002_R1.tsv

Sanity Check: Total read counts

In [8]:
head ${STAR_OUT}/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab
head: /Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab: No such file or directory

In [9]:
cat ${STAR_OUT}/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab | awk '{s+=$4} END {printf "%.0f", s}'
cat: /Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab: No such file or directory
0
In [10]:
cat ${COUNT_OUT}/27_MA_P_S38_L002_R1.tsv | awk '{s+=$2} END {printf "%.0f", s}'
0
In [11]:
ls ${STAR_OUT}/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab
ls: /Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab: No such file or directory

In [12]:
cat ${STAR_OUT}/27_MA_P_S38_L002_R1_Log.final.out
cat: /Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L002_R1_Log.final.out: No such file or directory

In [13]:
grep "Uniquely mapped reads number" ${STAR_OUT}/27_MA_P_S38_L002_R1_Log.final.out
grep: /Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L002_R1_Log.final.out: No such file or directory

An easy way to check how many reads map unambiguously to genes is to use grep to remove those lines with information about problem reads that are at the end of the file, then use our awk command again

In [14]:
grep __ ${COUNT_OUT}/27_MA_P_S38_L002_R1.tsv

In [15]:
grep -v __ ${COUNT_OUT}/27_MA_P_S38_L002_R1.tsv | awk '{s+=$2} END {printf "%.0f", s}'
0
In [16]:
grep N_ ${STAR_OUT}/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab
grep: /Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab: No such file or directory

In [17]:
grep -v N_ ${STAR_OUT}/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab | awk '{s+=$4} END {printf "%.0f", s}'
grep: /Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab: No such file or directory
0