{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Mapping Reads to a Reference Genome\n", "\n", "\n", "## Shell Variables" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Source the config script\n", "source bioinf_intro_config.sh\n", "\n", "ls $CUROUT" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mapping with STAR\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "STAR \\\n", " --runMode alignReads \\\n", " --twopassMode None \\\n", " --genomeDir $GENOME_DIR \\\n", " --readFilesIn $TRIMMED/27_MA_P_S38_L002_R1_001.trim.fastq.gz \\\n", " --readFilesCommand gunzip -c \\\n", " --outFileNamePrefix ${STAR_OUT}/27_MA_P_S38_L002_R1_ \\\n", " --quantMode GeneCounts \\\n", " --outSAMtype BAM Unsorted \\\n", " --outSAMunmapped Within" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will start with these parameters, but there is an extensive list of command line options detailed in the [STAR Manual](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf), it is a good idea to read through and try to understand all of them. We will discuss some more later.\n", "\n", "* --runMode alignReads : map reads \n", "* --twopassMode : Run one pass or two? If two-pass mode is on, STAR tries to discover novel junctions, then reruns mapping with these added to the annotation\n", "* --genomeDir : directory containing the genome index\n", "* --readFilesIn : input FASTQ\n", "* --readFilesCommand gunzip -c : use \"gunzip -c\" to uncompress FASTQ on-the-fly, since it is gzipped\n", "* --outFileNamePrefix : prefix (and path) to use for all output files\n", "* --quantMode GeneCounts : output a table of read counts per gene\n", "* --outSAMtype BAM Unsorted : output an unsort BAM file\n", "* --outSAMunmapped Within : included unmapped reads in the BAM file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### STAR Output\n", "So what happened? Let's take a look . . ." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ls ${STAR_OUT}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "head ${STAR_OUT}/27_MA_P_S38_L002_R1*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "STAR generates several files for each FASTQ:\n", "* Log.out : lots of details of the run, including all parameters used\n", "* Log.final.out : Important summary statistics\n", "* ReadsPerGene.out.tab : Count table, the main thing we are interested in\n", "* SJ.out.tab : All splice junctions, including ones from the GTF and novel junctions discovered by STAR\n", "* Log.progress.out: run statistics updated during run, not so interesting at the end\n", "\n", "Let's take a closer look at Log.final.out" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cat ${STAR_OUT}/27_MA_P_S38_L002_R1_Log.final.out" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Sanity Check: Number of Reads\n", "Are the number of input reads what we expect? Let's look at how many reads are in the input FASTQ" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "zcat $TRIMMED/27_MA_P_S38_L002_R1_001.trim.fastq.gz | awk '{s++}END{print s/4}' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Sanity Check: Unmapped Reads\n", "It is always a good idea to examine a sample of unmapped reads to figure out what they are. The easiest way to do this is with BLAST. In the past I have discovered that an experiment was contaminated with a different species by BLASTing unmapped reads. In that case there were a large number of unmapped reads, which raised my suspicions.\n", "\n", "Even with a high rate of mapped reads, it is worth spending a few minutes to check them out. The simplest thing to do is use `samtools` to generate a FASTA from the unmapped.bam, grab a few of these sequences, and then [BLAST them against the nr database](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "samtools fasta -f 4 ${STAR_OUT}/27_MA_P_S38_L002_R1_Aligned.out.bam | head -n20" ] } ], "metadata": { "kernelspec": { "display_name": "Bash", "language": "bash", "name": "bash" }, "language_info": { "codemirror_mode": "shell", "file_extension": ".sh", "mimetype": "text/x-sh", "name": "bash" } }, "nbformat": 4, "nbformat_minor": 1 }