Mapping Reads to a Reference Genome

Shell Variables

In [1]:
# Source the config script
source bioinf_intro_config.sh

ls $CUROUT
count_out               igv                     star_out
demo_chmod              myinfo                  stuff_for_igv.tgz
genome                  qc_output               trimmed_fastqs

Mapping with STAR

In [2]:
STAR \
    --runMode alignReads \
    --twopassMode None \
    --genomeDir $GENOME_DIR \
    --readFilesIn $TRIMMED/27_MA_P_S38_L002_R1_001.trim.fastq.gz \
    --readFilesCommand gunzip -c \
    --outFileNamePrefix ${STAR_OUT}/27_MA_P_S38_L002_R1_ \
    --quantMode GeneCounts \
    --outSAMtype BAM Unsorted \
    --outSAMunmapped Within
STAR: Bad Option: --runMode.
Usage:  STAR cmd [options] [-find] file1 ... filen [find expression]

Use     STAR -help
and     STAR -xhelp
to get a list of valid cmds and options.

Use     STAR H=help
to get a list of valid archive header formats.

Use     STAR diffopts=help
to get a list of valid diff options.

We will start with these parameters, but there is an extensive list of command line options detailed in the STAR Manual, it is a good idea to read through and try to understand all of them. We will discuss some more later.

  • –runMode alignReads : map reads
  • –twopassMode : Run one pass or two? If two-pass mode is on, STAR tries to discover novel junctions, then reruns mapping with these added to the annotation
  • –genomeDir : directory containing the genome index
  • –readFilesIn : input FASTQ
  • –readFilesCommand gunzip -c : use “gunzip -c” to uncompress FASTQ on-the-fly, since it is gzipped
  • –outFileNamePrefix : prefix (and path) to use for all output files
  • –quantMode GeneCounts : output a table of read counts per gene
  • –outSAMtype BAM Unsorted : output an unsort BAM file
  • –outSAMunmapped Within : included unmapped reads in the BAM file

STAR Output

So what happened? Let’s take a look …

In [3]:
ls ${STAR_OUT}
In [4]:
head ${STAR_OUT}/27_MA_P_S38_L002_R1*
head: /Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L002_R1*: No such file or directory

STAR generates several files for each FASTQ: * Log.out : lots of details of the run, including all parameters used * Log.final.out : Important summary statistics * ReadsPerGene.out.tab : Count table, the main thing we are interested in * SJ.out.tab : All splice junctions, including ones from the GTF and novel junctions discovered by STAR * Log.progress.out: run statistics updated during run, not so interesting at the end

Let’s take a closer look at Log.final.out

In [5]:
cat ${STAR_OUT}/27_MA_P_S38_L002_R1_Log.final.out
cat: /Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L002_R1_Log.final.out: No such file or directory

Sanity Check: Number of Reads

Are the number of input reads what we expect? Let’s look at how many reads are in the input FASTQ

In [6]:
zcat $TRIMMED/27_MA_P_S38_L002_R1_001.trim.fastq.gz | awk '{s++}END{print s/4}'
zcat: can't stat: /Users/cliburn/work/scratch/bioinf_intro/trimmed_fastqs/27_MA_P_S38_L002_R1_001.trim.fastq.gz (/Users/cliburn/work/scratch/bioinf_intro/trimmed_fastqs/27_MA_P_S38_L002_R1_001.trim.fastq.gz.Z): No such file or directory
0

Sanity Check: Unmapped Reads

It is always a good idea to examine a sample of unmapped reads to figure out what they are. The easiest way to do this is with BLAST. In the past I have discovered that an experiment was contaminated with a different species by BLASTing unmapped reads. In that case there were a large number of unmapped reads, which raised my suspicions.

Even with a high rate of mapped reads, it is worth spending a few minutes to check them out. The simplest thing to do is use samtools to generate a FASTA from the unmapped.bam, grab a few of these sequences, and then BLAST them against the nr database

In [7]:
samtools fasta -f 4 ${STAR_OUT}/27_MA_P_S38_L002_R1_Aligned.out.bam | head -n20
[E::hts_open_format] fail to open file '/Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L002_R1_Aligned.out.bam'
samtools bam2fq: Cannot read file "/Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L002_R1_Aligned.out.bam": No such file or directory