Mapping Reads to a Reference Genome¶
Shell Variables¶
In [1]:
# Source the config script
source bioinf_intro_config.sh
ls $CUROUT
count_out igv star_out
demo_chmod myinfo stuff_for_igv.tgz
genome qc_output trimmed_fastqs
Mapping with STAR¶
In [2]:
STAR \
--runMode alignReads \
--twopassMode None \
--genomeDir $GENOME_DIR \
--readFilesIn $TRIMMED/27_MA_P_S38_L002_R1_001.trim.fastq.gz \
--readFilesCommand gunzip -c \
--outFileNamePrefix ${STAR_OUT}/27_MA_P_S38_L002_R1_ \
--quantMode GeneCounts \
--outSAMtype BAM Unsorted \
--outSAMunmapped Within
STAR: Bad Option: --runMode.
Usage: STAR cmd [options] [-find] file1 ... filen [find expression]
Use STAR -help
and STAR -xhelp
to get a list of valid cmds and options.
Use STAR H=help
to get a list of valid archive header formats.
Use STAR diffopts=help
to get a list of valid diff options.
We will start with these parameters, but there is an extensive list of command line options detailed in the STAR Manual, it is a good idea to read through and try to understand all of them. We will discuss some more later.
- –runMode alignReads : map reads
- –twopassMode : Run one pass or two? If two-pass mode is on, STAR tries to discover novel junctions, then reruns mapping with these added to the annotation
- –genomeDir : directory containing the genome index
- –readFilesIn : input FASTQ
- –readFilesCommand gunzip -c : use “gunzip -c” to uncompress FASTQ on-the-fly, since it is gzipped
- –outFileNamePrefix : prefix (and path) to use for all output files
- –quantMode GeneCounts : output a table of read counts per gene
- –outSAMtype BAM Unsorted : output an unsort BAM file
- –outSAMunmapped Within : included unmapped reads in the BAM file
STAR Output¶
So what happened? Let’s take a look …
In [3]:
ls ${STAR_OUT}
In [4]:
head ${STAR_OUT}/27_MA_P_S38_L002_R1*
head: /Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L002_R1*: No such file or directory
STAR generates several files for each FASTQ: * Log.out : lots of details of the run, including all parameters used * Log.final.out : Important summary statistics * ReadsPerGene.out.tab : Count table, the main thing we are interested in * SJ.out.tab : All splice junctions, including ones from the GTF and novel junctions discovered by STAR * Log.progress.out: run statistics updated during run, not so interesting at the end
Let’s take a closer look at Log.final.out
In [5]:
cat ${STAR_OUT}/27_MA_P_S38_L002_R1_Log.final.out
cat: /Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L002_R1_Log.final.out: No such file or directory
Sanity Check: Number of Reads¶
Are the number of input reads what we expect? Let’s look at how many reads are in the input FASTQ
In [6]:
zcat $TRIMMED/27_MA_P_S38_L002_R1_001.trim.fastq.gz | awk '{s++}END{print s/4}'
zcat: can't stat: /Users/cliburn/work/scratch/bioinf_intro/trimmed_fastqs/27_MA_P_S38_L002_R1_001.trim.fastq.gz (/Users/cliburn/work/scratch/bioinf_intro/trimmed_fastqs/27_MA_P_S38_L002_R1_001.trim.fastq.gz.Z): No such file or directory
0
Sanity Check: Unmapped Reads¶
It is always a good idea to examine a sample of unmapped reads to figure out what they are. The easiest way to do this is with BLAST. In the past I have discovered that an experiment was contaminated with a different species by BLASTing unmapped reads. In that case there were a large number of unmapped reads, which raised my suspicions.
Even with a high rate of mapped reads, it is worth spending a few
minutes to check them out. The simplest thing to do is use samtools
to generate a FASTA from the unmapped.bam, grab a few of these
sequences, and then BLAST them against the nr
database
In [7]:
samtools fasta -f 4 ${STAR_OUT}/27_MA_P_S38_L002_R1_Aligned.out.bam | head -n20
[E::hts_open_format] fail to open file '/Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L002_R1_Aligned.out.bam'
samtools bam2fq: Cannot read file "/Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L002_R1_Aligned.out.bam": No such file or directory