Making Generic Commands

Let’s let put the Bash shell to work for us by being smart with our variables

Shell Variables

Assign the variables in this notebook.

In [1]:
source bioinf_intro_config.sh
mkdir -p $TRIMMED $STAR_OUT

Using a FASTQ variable

Most of the commands in our pipeline are complicated! To apply our pipeline to different FASTQs, we need to change things in multiple places. For example, just to run trimming with fastq-mcf, we need to change things in two places between each run: the input FASTQ and the output FASTQ. If we were doing this with paired-end reads, we would have to change four things. Doing this by hand is not only tedious, but error prone. Doing almost the same thing repeatedly is something that people are bad at, but computers are very good at! So let’s get the computer to do the hard work. Because the Bash shell is almost magical (it is a full fledged programming language), we can do this easily.

In [2]:
FASTQ="A"
echo "RUNNING FASTQ: ~~~~${FASTQ}~~~~"
echo "TRIMING: ${FASTQ}"
echo "MAPPING: ${FASTQ}"
RUNNING FASTQ: ~~~~A~~~~
TRIMING: A
MAPPING: A

The for loop in Bash is conceptually the same as in any other programming language, although the syntax may be different. The do and done are essential - do needs to be before the “loop body” (what is going to be repeated) and done needs to be after it.

So let’s try something almost useful:

In [3]:
FASTQ="27_MA_P_S38_L002_R1"
echo "RUNNING FASTQ: ~~~~${FASTQ}~~~~"
echo "TRIMING: ${FASTQ}"
echo "MAPPING: ${FASTQ}"
RUNNING FASTQ: ~~~~27_MA_P_S38_L002_R1~~~~
TRIMING: 27_MA_P_S38_L002_R1
MAPPING: 27_MA_P_S38_L002_R1

Now for the real thing …

Let’s run fastq-mcf:

In [4]:
FASTQ="27_MA_P_S38_L002_R1"
echo "TRIMING: $FASTQ"
fastq-mcf $MYINFO/neb_e7600_adapters.fasta \
    $RAW_FASTQS/${FASTQ}_001.fastq.gz \
        -q 20 -x 0.5 \
        -o $TRIMMED/${FASTQ}_001.trim.fastq.gz
TRIMING: 27_MA_P_S38_L002_R1
Command Line: /Users/cliburn/work/scratch/bioinf_intro/myinfo/neb_e7600_adapters.fasta /data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L002_R1_001.fastq.gz -q 20 -x 0.5 -o /Users/cliburn/work/scratch/bioinf_intro/trimmed_fastqs/27_MA_P_S38_L002_R1_001.trim.fastq.gz
Scale used: 2.2
gunzip: can't stat: /data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L002_R1_001.fastq.gz (/data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L002_R1_001.fastq.gz.gz): No such file or directory
Phred: 64
No records in file /data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L002_R1_001.fastq.gz

Now let’s do the same thing for STAR

In [5]:
FASTQ="27_MA_P_S38_L002_R1"
echo "MAPPING: $FASTQ"
STAR \
    --runMode alignReads \
    --twopassMode None \
    --genomeDir $GENOME_DIR \
    --readFilesIn $TRIMMED/${FASTQ}_001.trim.fastq.gz \
    --readFilesCommand gunzip -c \
    --outFileNamePrefix ${STAR_OUT}/${FASTQ}_ \
    --quantMode GeneCounts \
    --outSAMtype None
MAPPING: 27_MA_P_S38_L002_R1
STAR: Bad Option: --runMode.
Usage:  STAR cmd [options] [-find] file1 ... filen [find expression]

Use     STAR -help
and     STAR -xhelp
to get a list of valid cmds and options.

Use     STAR H=help
to get a list of valid archive header formats.

Use     STAR diffopts=help
to get a list of valid diff options.

And let’s check the result

In [6]:
ls ${STAR_OUT}
In [7]:
head ${STAR_OUT}/${FASTQ}_ReadsPerGene.out.tab
head: /Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L002_R1_ReadsPerGene.out.tab: No such file or directory