{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Working with Multiple Samples\n", "Let's kick it up another notch - we have six pilot samples, let's run our analysis on all of them! \n", "\n", "## Shell Variables\n", "Assign the variables in this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# We used these in previous notebooks\n", "CUROUT=$HOME/work/scratch/2015_output\n", "DEMUX=$CUROUT/demux_fastqs\n", "CURDATA=/data/HTS_2015_data/runs_combined\n", "INFO=$HOME/work/myinfo\n", "TRIMMED=$CUROUT/trimmed_fastqs\n", "GENOME_DIR=$CUROUT/genome\n", "TH_DIR=$CUROUT/th_dir\n", "COUNT_DIR=$CUROUT/counts\n", "\n", "# We used these in the last notebook\n", "ACCESSION=\"GCA_000010245.1_ASM1024v1\"\n", "PREFIX=${ACCESSION}_genomic\n", "GFF=${PREFIX}.gff\n", "FA=${PREFIX}.fa" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Making New Directories\n", "No new directories are necessary, we are using directories that were already created for the single-end pipeline." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "mkdir -p $TRIMMED $GENOME_DIR $TH_DIR $COUNT_DIR" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A Brief journey into for loops\n", "Most of the steps in our pipeline aren't so simple. To apply our pipeline to multiple sample files, we need to change things in multiple places. For example, just to run Tophat, we need to change things in four places between each run, the output directory name (twice) and the name of the FASTQ. Doing this by hand is not only tedious, but error prone. Doing almost the same thing repeatedly is something that computers are very good at, and people are very bad at, so let's get the computer to do the hard work. Because the Unix shell is almost magical (it is a full fledged programming language), we can do this. We will use a `for loop`. This is analogous to how you would teach a child to set the table: \"FOR each place at the table, put a plate . . .,\n", "At the shell you phrase it like this:\n", "\n", " for PERSON in Alice Bob Carol Dave Eve\n", " do\n", " put plate at PERSON's place\n", " put napkin at PERSON's place\n", " put fork at PERSON's place\n", " put spoon at PERSON's place\n", " put knife at PERSON's place\n", " done\n", "\n", "Here is a real example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for SAMPLE in A B C D E F\n", " do\n", " echo XXXXX_${SAMPLE}_XXXXX \n", " done" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "the `do` and `done` are essential - `do` needs to be before the \"loop body\" (what is going to be repeated) and `done` needs to be after it.\n", "\n", "So let's try something almost useful:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for SAMPLE in 8A_pilot\n", " do\n", " echo $SAMPLE\n", " done" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's run fastq-mcf in a loop:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for SAMPLE in 8A_pilot\n", " do\n", " echo $SAMPLE\n", " fastq-mcf $INFO/neb_adapters.fasta \\\n", " $DEMUX/r1.${SAMPLE}.fq.gz \\\n", " $DEMUX/r2.${SAMPLE}.fq.gz \\\n", " -q 20 -x 0.5 \\\n", " -o $TRIMMED/r1.${SAMPLE}.trim.fastq.gz \\\n", " -o $TRIMMED/r2.${SAMPLE}.trim.fastq.gz\n", " done" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Now let's do the same thing for tophat" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for SAMPLE in 8A_pilot\n", " do\n", " echo $SAMPLE\n", " mkdir -p $TH_DIR/${SAMPLE}\n", " tophat2 -G $GENOME_DIR/$GFF \\\n", " --library-type fr-firststrand \\\n", " --output-dir $TH_DIR/${SAMPLE} \\\n", " --max-intron-length 5 \\\n", " --min-intron-length 4 \\\n", " --transcriptome-max-hits 1 \\\n", " --max-multihits 1 \\\n", " --no-coverage-search \\\n", " --no-novel-juncs \\\n", " --no-sort-bam \\\n", " $GENOME_DIR/$PREFIX \\\n", " $TRIMMED/r1.${SAMPLE}.trim.fastq.gz \\\n", " $TRIMMED/r2.${SAMPLE}.trim.fastq.gz\n", " done" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### And now for htseq-count!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for SAMPLE in 8A_pilot\n", " do\n", " echo $SAMPLE\n", " samtools sort -n $TH_DIR/${SAMPLE}/accepted_hits.bam \\\n", " -o $TH_DIR/${SAMPLE}/accepted_hits.name.bam\n", "\n", " htseq-count --quiet --order=name --format=bam --stranded=reverse \\\n", " --type=gene --idattr=Name \\\n", " $TH_DIR/${SAMPLE}/accepted_hits.name.bam \\\n", " $GENOME_DIR/$GFF > $COUNT_DIR/${SAMPLE}.csv\n", " done" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ls $COUNT_DIR\n", "tail $COUNT_DIR/*" ] } ], "metadata": { "kernelspec": { "display_name": "Bash", "language": "bash", "name": "bash" }, "language_info": { "codemirror_mode": "shell", "file_extension": ".sh", "mimetype": "text/x-sh", "name": "bash" } }, "nbformat": 4, "nbformat_minor": 0 }